Processing of iterative operations

Publication No.: 1042392    Published: 2020-10-09

Note: this technology, "Processing of iterative operations", was created by Nicholas Andrew Fister, Srinivas Vemuri and David Raymond Lutz on 2020-03-16. Abstract: The application discloses processing of iterative operations. An apparatus has processing circuitry to perform an iterative operation in response to decoding of an iterative operation instruction by an instruction decoder, the iterative operation comprising at least two iterations of processing, where an iteration depends on an operand produced in a previous iteration. Initial information generation circuitry performs an initial portion of the processing for a given iteration to generate initial information. Result generation circuitry performs the remainder of the processing for the given iteration to generate a result value using the initial information. For iterations other than the final iteration, forwarding circuitry forwards the result value as an operand for the next iteration of the iterative operation. The initial information generation circuitry begins performing the initial portion of the next iteration in parallel with the result generation circuitry completing the remainder of the current iteration, to improve performance.

1. An apparatus, comprising:

an instruction decoder for decoding program instructions; and

processing circuitry for performing, in response to decoding of an iterative operation instruction by the instruction decoder, an iterative operation comprising a plurality of processing iterations for which subsequent iterations other than a first iteration depend on operands produced in a previous iteration;

the processing circuitry includes:

initial information generation circuitry to perform an initial portion of the processing for a given iteration to generate initial information;

result generation circuitry to perform a remainder of the processing for the given iteration to generate a result value using the initial information determined by the initial information generation circuitry; and

forwarding circuitry to forward the result value produced in the given iteration as an operand for a next iteration of the iterative operation when the given iteration is an iteration other than a final iteration; wherein:

when the result generation circuitry performs the remaining portion of a given iteration other than the final iteration, the initial information generation circuitry is configured to begin performing the initial portion of the next iteration in parallel with the result generation circuitry completing the remaining portion of the given iteration.

2. The apparatus of claim 1, wherein a time delay between the initial information generation circuitry beginning the initial portion of the given iteration and the initial information generation circuitry beginning the initial portion of the next iteration is less than a time delay between the initial information generation circuitry beginning the initial portion of the given iteration and the result generation circuitry completing the remaining portion of the given iteration.

3. The apparatus of claim 1, wherein the processing circuitry is capable of operating according to a clock signal having a clock period that is shorter than a combined time delay of the initial portion and the remaining portion.

4. The apparatus of claim 1, wherein the initial information generation circuitry is configured to perform an initial portion of the given iteration in a clock cycle that is earlier than a clock cycle in which the result generation circuitry begins performing a remaining portion of the given iteration.

5. The apparatus of claim 1, wherein the result generation circuitry is configured to perform a remaining portion of the given iteration in a single clock cycle.

6. The apparatus of claim 1, wherein subsequent iterations other than the first iteration are dependent on:

a forwarded operand forwarded by the forwarding circuitry based on the result value from the previous iteration; and

a second operand independent of a result of the previous iteration.

7. The apparatus of claim 6, wherein the initial information comprises information derived from the second operand.

8. The apparatus of claim 1, wherein subsequent iterations other than the first iteration depend on forwarded operands forwarded by the forwarding circuitry based on result values from previous iterations, and the initial information comprises information derived from early information about the forwarded operands, the early information being available before the forwarded operands have been fully determined by the result generation circuitry for the previous iterations.

9. The apparatus of claim 8, wherein the forwarded operand is a floating-point operand having a sign, an exponent, and a significand; and

the early information comprises information derived from at least one of the sign and the exponent of the forwarded operand.

10. The apparatus of claim 1, wherein the result value of the given iteration comprises a rounded floating-point value obtained by adding floating-point operands.

11. The apparatus of claim 10, wherein the initial information comprises at least one of:

an indication of whether one of the floating-point operands is sub-normal;

an indication of whether one of the floating-point operands is infinite;

an indication of whether one of the floating-point operands is a not-a-number (NaN) value;

at least a portion of an exponent difference value indicating a difference between the exponents of the floating-point operands;

an indication of whether a fractional portion of a significand of one of the floating-point operands is zero;

an indication of whether the floating-point operands are like-signed operands or unlike-signed operands.

12. The apparatus of claim 1, wherein the iterative operation instruction identifies a scalar floating-point operand and a vector data value comprising a plurality of data elements, each data element specifying a floating-point operand;

each valid iteration of the iterative operation includes adding a first floating-point operand to a second floating-point operand;

for a first valid iteration of the iterative operation, the first floating-point operand is the scalar floating-point operand and the second floating-point operand is the floating-point operand specified in a first valid data element of the vector data value; and

for a subsequent valid iteration of the iterative operation, the first floating-point operand is a forwarded result of the addition in a previous valid iteration, and the second floating-point operand is the floating-point operand specified in a subsequent valid data element of the vector data value.

13. The apparatus of claim 12, wherein the processing circuitry is configured to determine which iterations are valid iterations based on predicate information that specifies which data elements of the vector data value are valid data elements.

14. The apparatus of claim 12, wherein the forwarding circuitry comprises a forwarding loop signal path to provide the forwarded result of the addition in the previous valid iteration as the first floating-point operand of the subsequent valid iteration;

the apparatus comprises a multiplexer to select when to inject the scalar floating-point operand as the first floating-point operand; and

the multiplexer is located outside the forwarding loop signal path.

15. The apparatus of claim 14, wherein the forwarding loop signal path comprises OR circuitry to provide the first floating-point operand to be processed in a given iteration;

the OR circuitry is configured to combine a first input with a second input according to a bitwise OR operation;

in an iteration other than a first valid iteration, the first input comprises the forwarded result of the addition performed by the result generation circuitry in the previous valid iteration; and

the second input comprises an output of the multiplexer.

16. The apparatus of claim 15, wherein, in a final iteration of the iterative operation for a first iterative operation instruction, the processing circuitry is configured to provide an all-zero value on the forwarding loop signal path to be used as the first input of the OR circuitry for a first valid iteration of a subsequent iterative operation instruction.

17. The apparatus of claim 15, wherein the multiplexer is configured to select the scalar floating-point operand, or a forwarded result from an earlier iterative operation instruction, in a first valid iteration, and to select an all-zero value in one or more subsequent valid iterations.

18. An apparatus, comprising:

means for decoding program instructions; and

means for processing, in response to decoding of an iterative operation instruction by an instruction decoder, to perform an iterative operation comprising a plurality of processing iterations for which iterations other than a first iteration depend on operands produced in a previous iteration;

the means for processing comprises:

an initial information generation module for generating initial information by performing an initial portion of the processing for a given iteration;

a result value generation module for generating a result value by performing a remaining part of the processing of the given iteration, the result value being generated using the initial information determined by the initial information generation module; and

a forwarding module, configured to forward, when the given iteration is an iteration other than a final iteration, a result value generated in the given iteration as an operand of a next iteration of the iterative operation; wherein:

when the result value generation module performs the remaining portion of a given iteration other than the final iteration, the initial information generation module is configured to begin performing the initial portion of the next iteration in parallel with the result value generation module completing the remaining portion of the given iteration.

19. A method of data processing, comprising:

decoding an iterative operation instruction using an instruction decoder; and

in response to decoding of the iterative operation instruction, performing an iterative operation comprising a plurality of processing iterations for which iterations other than a first iteration depend on operands produced in a previous iteration;

wherein the iterative operation is performed using processing circuitry comprising:

initial information generation circuitry to perform an initial portion of the processing for a given iteration to generate initial information;

result generation circuitry to perform a remainder of the processing for the given iteration to generate a result value using the initial information determined by the initial information generation circuitry; and

forwarding circuitry to forward a result value produced in the given iteration as an operand for a next iteration of the iterative operation when the given iteration is an iteration other than a final iteration; wherein:

when the result generation circuitry performs the remaining portion of a given iteration other than the final iteration, the initial information generation circuitry begins performing the initial portion of the next iteration in parallel with the result generation circuitry completing the remaining portion of the given iteration.

Technical Field

The present technology relates to the field of data processing.

Background

A data processing apparatus such as a microprocessor may have: an instruction decoder for decoding program instructions; and processing circuitry to perform processing operations in response to decoding of the program instructions. Challenges may exist in designing processing circuit logic so that operations for a given instruction may be completed within timing constraints imposed by the clock frequency that is desired to be supported.

Disclosure of Invention

At least some examples provide an apparatus comprising:

an instruction decoder for decoding program instructions; and

processing circuitry for, in response to decoding of an iterative operation instruction by an instruction decoder, performing an iterative operation comprising a plurality of processing iterations for which subsequent iterations other than a first iteration depend on operands produced in a previous iteration;

the processing circuitry includes:

initial information generation circuitry to perform an initial portion of the processing for a given iteration to generate initial information;

result generation circuitry to perform a remainder of the processing for a given iteration to generate a result value using initial information determined by the initial information generation circuitry; and

forwarding circuitry for forwarding, when the given iteration is an iteration other than the final iteration, a result value produced in the given iteration as an operand for a next iteration of the iterative operation; wherein:

when the result generation circuitry performs the remaining portion of a given iteration other than the final iteration, the initial information generation circuitry is configured to begin performing the initial portion of the next iteration in parallel with the result generation circuitry completing the remaining portion of the given iteration.

At least some examples provide an apparatus comprising:

means for decoding program instructions; and

means for processing, in response to decoding of an iterative operation instruction by an instruction decoder, to perform an iterative operation comprising a plurality of processing iterations for which iterations other than a first iteration depend on operands produced in a previous iteration;

the module for processing includes:

means for generating initial information by performing an initial portion of the processing for a given iteration;

means for generating a result value by performing a remainder of the processing for a given iteration, the result value being generated using the initial information determined by the means for generating initial information; and

means for forwarding, when the given iteration is an iteration other than the final iteration, a result value produced in the given iteration as an operand for a next iteration of the iterative operation; wherein:

when the means for generating a result value performs the remaining portion of a given iteration other than the final iteration, the means for generating initial information is configured to begin performing the initial portion of the next iteration in parallel with the means for generating a result value completing the remaining portion of the given iteration.

At least some examples provide a data processing method comprising:

decoding an iterative operation instruction using an instruction decoder; and

in response to decoding of the iterative operation instruction, performing an iterative operation comprising a plurality of processing iterations for which iterations other than a first iteration depend on operands produced in a previous iteration;

wherein the iterative operation is performed using processing circuitry, the processing circuitry comprising:

initial information generation circuitry to perform an initial portion of the processing for a given iteration to generate initial information;

result generation circuitry to perform a remainder of the processing for a given iteration to generate a result value using initial information determined by the initial information generation circuitry; and

forwarding circuitry for forwarding, when the given iteration is an iteration other than the final iteration, a result value produced in the given iteration as an operand for a next iteration of the iterative operation; wherein:

when the result generation circuitry performs the remaining portion of a given iteration other than the final iteration, the initial information generation circuitry begins performing the initial portion of the next iteration in parallel with the result generation circuitry completing the remaining portion of the given iteration.

Further aspects, features and advantages of the present technology will become apparent from the following description of examples, which is to be read in connection with the accompanying drawings.

Drawings

Fig. 1 schematically illustrates an example of a data processing apparatus;

FIG. 2 illustrates an example of a floating-point reduction operation, which is an example of an iterative operation;

FIG. 3 shows an example of a portion of a floating-point addition unit for adding two floating-point operands;

FIG. 4 illustrates an exemplary timing diagram showing, for comparison, a method in which there is no overlap in the processing of successive iterations of an iterative operation;

FIG. 5 illustrates a portion of processing circuitry that supports iterative operations, where an initial portion of processing for a given iteration may be performed in parallel with a remaining portion of completing a previous iteration;

FIG. 6 shows a timing diagram illustrating an example of performing an iterative operation using the processing circuitry of FIG. 5;

FIG. 7 illustrates a floating point reduction unit for performing the floating point reduction operation of FIG. 2; and

fig. 8 is a flow chart illustrating a method of performing an iterative operation.

Detailed Description

The processing circuitry may perform data processing operations in response to program instructions decoded by the instruction decoder. When designing the circuit logic of the processing circuitry, there may be a constant design challenge in balancing two aspects: on the one hand, the desire to support certain relatively complex instructions which may require circuit logic with greater logic depth, and which therefore incur greater processing delay as signals pass through that logic; and, on the other hand, the desire to increase the clock frequency as far as possible to improve performance. Relatively complex operations requiring greater logic depth may be split across multiple processing cycles, but if a greater number of cycles is used, processing workloads dependent on such operations may be slowed down. This can be a particular problem for iterative operation instructions that require two or more iterations of processing, since if each iteration must be split over a larger number of cycles, the corresponding performance impact is incurred multiple times for the same instruction. Another approach to this design challenge may be to reduce the maximum supported clock frequency, allowing more time for more complex operations to complete within a certain number of clock cycles; but in that case the reduced clock frequency harms the performance of other types of operations which require less logic depth and could have completed in time at a faster clock frequency.

The techniques discussed below improve the per-iteration performance of an iterative operation instruction which, when decoded by the instruction decoder, controls the processing circuitry to perform an iterative operation comprising two or more iterations of processing, for which subsequent iterations other than a first iteration of the iterative operation depend on an operand generated in a previous iteration.

The processing circuitry includes initial information generation circuitry to generate initial information by performing an initial portion of the processing for a given iteration of the iterative operation. The result generation circuitry performs the remainder of the processing for a given iteration to generate a result value using the initial information determined by the initial information generation circuitry. When the given iteration is an iteration other than the final iteration of the iterative operation, the forwarding circuitry forwards the result value produced in the given iteration as an operand for the next iteration of the iterative operation.

For an iterative operation in which the result of an earlier iteration is forwarded as an operand for a subsequent iteration, one might assume that it is not possible to start the next iteration until the result of the previous iteration has been generated. Counter-intuitively, however, the inventors recognized that the initial information generation circuitry can begin performing the initial portion of the processing for the next iteration in parallel with the result generation circuitry completing the remainder of the current iteration. Because the per-iteration latency can then be less than the combined latency of the initial and remaining portions of the processing, performance is improved, and the iterative operation instruction can be executed on processing circuitry supporting a higher clock frequency than would otherwise be possible. This provides a performance benefit not only for the iterative operation instruction itself, but also for other instructions, which likewise benefit from the higher clock frequency.

The time delay between the initial information generation circuitry beginning the initial portion of a given iteration and it beginning the initial portion of the next iteration may be less than the time delay between the start of the initial portion of the given iteration and the result generation circuitry completing the remaining portion of the given iteration. This means that the processing circuitry is capable of operating according to a clock signal having a clock period shorter than the combined time delay of the initial portion and the remaining portion. The delay from the start of the initial portion of a given iteration to the start of the initial portion of the next iteration may correspond to N clock cycles, where N is 1 or greater; but because the initial portion of the next iteration overlaps the completion of the remainder of the current iteration, the overall processing of a given iteration (the initial portion plus the remainder) actually has more than N cycles in which to complete, so that more aggressive scaling of the clock frequency becomes possible.

The initial information generation circuitry may perform the initial portion of a given iteration in a clock cycle earlier than the clock cycle in which the result generation circuitry begins performing the remaining portion of that iteration. Thus, for the first iteration of a given iterative operation instruction, the initial portion of the first iteration may be performed in a setup clock cycle, and the remaining portion of that iteration may then start in the next clock cycle, with the initial portion of the next iteration being performed in parallel with the end of the remaining portion of the given iteration.

This technique may be applied to operations that require two or more clock cycles for the remainder of the processing, i.e., where N is 2 or greater in the example discussed above. However, it is particularly useful for operations whose logic depth is slightly deeper than can be accommodated in a single clock cycle, since the technique then allows an operation with too much logic for one cycle to still proceed at one cycle per iteration, which can be extremely beneficial for performance. In particular, for implementations in which N = 1 (i.e., the initial portion of one iteration begins one cycle after the initial portion of the previous iteration), the total number of clock cycles used to generate the result may be M + 1 (one setup cycle plus M per-iteration result generation cycles), where M is the number of iterations in the iterative operation.
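
To make the cycle accounting concrete, here is a minimal sketch (illustrative only, not from the patent) comparing the two schedules, under the stated assumptions that the initial portion and the remaining portion each fit in one clock cycle and that N = 1:

```python
# Illustrative cycle-count model: 1-cycle initial portion, 1-cycle remainder.

def cycles_without_overlap(m: int) -> int:
    # Fully serial: each iteration spends one cycle on its initial portion
    # and one cycle on its remaining portion.
    return 2 * m

def cycles_with_overlap(m: int) -> int:
    # One setup cycle performs the initial portion of iteration 0; after that,
    # the initial portion of iteration i+1 overlaps the remainder of
    # iteration i, so each iteration adds only one cycle: M + 1 in total.
    return 1 + m

for m in (2, 4, 8):
    print(m, cycles_without_overlap(m), cycles_with_overlap(m))
# M = 4: 8 cycles fully serial vs 5 (= M + 1) with the overlap described above
```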

The initial portion of the processing for a given iteration may include generating information about the operands to be processed in that iteration. For example, some initial processing of the operands may be needed to determine which logic paths within the result generation circuitry are required, or to generate configuration signals for configuring certain logic elements of the result generation circuitry (such as determining the shift amount for a shifter, e.g., a partial or full exponent difference for significand alignment).

One might ask how an iterative operation involving dependencies between iterations (and thus requiring the result of one iteration to be forwarded as an operand for the next) can possibly have the initial portion of the next iteration performed in parallel with the end of the previous iteration. However, the inventors recognized several opportunities for parallelization between iterations, despite the operand for the next iteration being forwarded based on the result of the previous iteration.

In one example, a subsequent iteration other than the first iteration may depend not only on a forwarded operand forwarded by the forwarding circuitry based on the result value from the previous iteration, but also on a second operand which is independent of the result of the previous iteration. In this case, the initial information determined by the initial information generation circuitry may include information derived from the second operand. Because the second operand does not depend on the result of the previous iteration, it is available before that result has been generated. The initial information generation circuitry can therefore begin deriving information from the second operand before the remainder of the processing of the previous iteration has completed, so that the logic depth of the remainder of the processing can be shallower, as it need not include the logic for determining the initial information from the second operand.

Alternatively, or in addition to deriving initial information from the independent second operand, another approach is for the initial information to include information derived from early information about the forwarded operand, which is available before the forwarded operand has been fully determined by the result generation circuitry for the previous iteration. For example, some partial information about the operand to be forwarded to the next iteration may be available early, so some initial information for the next iteration of the iterative operation can be derived from that early information in parallel with the ongoing computation of the complete value of the forwarded operand in the previous iteration.

For example, the forwarded operand may be a floating-point operand having a sign, an exponent, and a significand. In general, the sign and/or exponent of a floating-point result may be available earlier than the significand, because computing the significand may require a greater logic depth than computing the exponent or sign. The early information about the forwarded operand may therefore include information derived from at least one of the sign and the exponent of the forwarded operand. Thus, before the complete forwarded operand (including the significand) has been computed for the previous iteration, initial information about the operands to be processed in the next iteration can be pre-computed to speed up the processing of the next iteration.

Where there is also a second operand which is independent of the result of the previous iteration, the early information about the forwarded operand can be compared with information derived from the second operand to provide further initial information. For example, the early-available exponent of the forwarded operand may be compared with the exponent of the second, independent operand to generate at least one bit of an exponent difference value, which can be useful for speeding up the floating-point addition in the next iteration.

The present technique may be particularly useful for iterative operations in which the result value of a given iteration comprises a rounded floating-point value obtained by adding floating-point operands. Floating-point addition can be relatively slow and require relatively deep logic, because adding floating-point operands may require initial analysis of the operands' exponents, alignment of the operands' significands based on the difference between the exponents, addition of the aligned significands, and rounding to generate the rounded floating-point result. Floating-point addition therefore tends to impose some of the most stringent timing constraints when designing the processing logic of a given processor. For iterative operations in which each iteration requires a floating-point addition, the technique discussed above can be very useful for improving performance by reducing the per-iteration latency, achieved by overlapping part of the initial information computation for one iteration with the remainder of producing the rounded floating-point value in the previous iteration.
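
As a rough illustration of the align/add/round sequence just described, the following sketch models a simplified far-path addition of two normal operands with 11-bit significands (an assumption chosen to match the HP format described later); it truncates shifted-out bits rather than implementing a full rounding mode, and ignores special values, so it is not a complete adder:

```python
# Simplified far-path sketch: compare exponents, align the smaller significand,
# add, and renormalize on carry-out. Significands include the leading 1 bit.

def fp_add_far_path(exp_a: int, sig_a: int, exp_b: int, sig_b: int,
                    sig_bits: int = 11):
    if exp_a < exp_b:                 # make operand a the one with larger exponent
        exp_a, sig_a, exp_b, sig_b = exp_b, sig_b, exp_a, sig_a
    shift = exp_a - exp_b             # the exponent difference drives the shifter
    total = sig_a + (sig_b >> shift)  # align then add (shifted-out bits dropped)
    exp = exp_a
    if total >> sig_bits:             # carry out of the significand width
        total >>= 1                   # renormalize
        exp += 1
    return exp, total

# 1.0 x 2^3 + 1.0 x 2^1 -> significand 1.01 (binary), exponent 3
print(fp_add_far_path(3, 0b10000000000, 1, 0b10000000000))
```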

For such an iterative operation involving floating-point addition, the initial information may include various different types of information about the floating-point operands to be added. For example, the initial information may include any one or more of:

an indication of whether one of the floating-point operands is sub-normal;

an indication of whether one of the floating-point operands is infinite;

an indication of whether one of the floating-point operands is a not-a-number (NaN) value;

at least a portion of an exponent difference value indicating a difference between the exponents of the floating-point operands;

an indication of whether a fractional portion of a significand of one of the floating point operands is zero;

an indication of whether the floating-point operands are like-signed operands or unlike-signed operands.

All of these types of information can be used to control the floating-point addition. By computing any of these types of information in parallel with the end of the previous iteration, the overall logic depth of the remainder is reduced, and performance is improved by enabling a higher clock frequency, as illustrated by the sketch below.
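
As an illustration, here is a sketch of how such initial information could be derived from two half-precision operands (a software model only; the bit-field positions follow the HP format described in the "Floating point representation" section below, and the field names are this sketch's own, not the patent's):

```python
# Hedged sketch: deriving the listed initial information from two IEEE 754
# half-precision operands given as 16-bit integers (sign 15, exponent 14:10,
# fraction 9:0).

def initial_info(a: int, b: int) -> dict:
    def fields(x):
        return (x >> 15) & 1, (x >> 10) & 0x1F, x & 0x3FF  # sign, exp, frac

    sa, ea, fa = fields(a)
    sb, eb, fb = fields(b)
    return {
        "a_subnormal": ea == 0 and fa != 0,
        "b_subnormal": eb == 0 and fb != 0,
        "a_infinite": ea == 0x1F and fa == 0,
        "b_infinite": eb == 0x1F and fb == 0,
        "a_nan": ea == 0x1F and fa != 0,
        "b_nan": eb == 0x1F and fb != 0,
        "exp_diff": ea - eb,       # at least some bits of the exponent difference
        "a_frac_zero": fa == 0,
        "b_frac_zero": fb == 0,
        "like_signed": sa == sb,   # selects LSA vs USA handling
    }

print(initial_info(0x3C00, 0xC000))  # 1.0 and -2.0: unlike-signed, exp_diff -1
```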

Although the techniques of this disclosure may be used for any iterative operation involving floating-point addition, they may be particularly useful for floating-point reduction operations. For such an operation, the iterative operation instruction may identify a scalar floating-point operand and a vector data value comprising a plurality of data elements, each specifying a floating-point operand. For this instruction, each valid iteration (active iteration) of the iterative operation may include adding a first floating-point operand to a second floating-point operand. For the first valid iteration, the first floating-point operand is the scalar floating-point operand and the second floating-point operand is the floating-point operand specified in the first valid data element of the vector data value. For a subsequent valid iteration, the first floating-point operand is the forwarded result of the addition in the previous valid iteration, and the second floating-point operand is the floating-point operand specified in the corresponding subsequent valid data element of the vector data value. Predicate information may be provided to specify which data elements of the vector data value are valid. Each iteration takes the corresponding element of the vector as its second floating-point operand: if that element is marked as valid by the predicate information, the corresponding iteration is a valid iteration, whereas an iteration for which the predicate information indicates that the corresponding data element is invalid is treated as a non-valid iteration. For non-valid iterations, flip-flops in the processing logic may have their enable signals deasserted to prevent them from being updated based on the invalid vector elements.

A processing system may support vector operations, for which multiple lanes of processing (which may also be referred to as Single Instruction Multiple Data (SIMD) processing) are performed in response to a single instruction, each lane using a respective data element of at least one vector operand. Some SIMD implementations may provide hardware for processing multiple vector lanes in parallel, while other circuit implementations may execute at least some of the vector lanes sequentially, one after another or in blocks. Either way, compared to equivalent scalar code requiring a separate instruction per lane of processing, using vector processing can increase code density and improve performance by reducing the overhead of fetching and decoding instructions.

One use of such vector instructions may be to perform several lanes of floating-point processing which accumulate results into respective elements of a vector, where ultimately it is desired to add all of the elements of the vector together to produce a scalar result; in that case the vector reduction operation discussed above is useful. Including the scalar operand in the addition is useful because it means that a vector reduction instruction can add all the elements of its vector operand onto the scalar result produced by a previous vector reduction instruction acting on a different vector operand, so that by executing several vector reduction instructions in succession, elements from multiple vectors can be accumulated together.

However, since floating-point addition is non-associative, it may be important for the additions of the respective vector elements to be performed in a predefined order, to ensure that the result of adding the elements is predictable and repeatable. The iterations which add each successive element of the vector to the scalar operand, or to the result of an earlier iteration, may therefore need to be performed sequentially. This means that the performance of a single iteration is important, since any slow-down of a single iteration affects the overall instruction multiple times over. Moreover, because each iteration involves a floating-point addition, which is relatively slow compared to other types of processing operations, the timing constraints can be relatively strict. The techniques discussed above can be very useful for such vector reduction operations, allowing a smaller logic depth per cycle (since some initial information for the next cycle is computed in parallel with the result of the earlier cycle), improving the performance of the system as a whole and supporting higher clock frequencies.
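
The non-associativity that forces this sequential ordering is easy to demonstrate with ordinary double-precision arithmetic:

```python
# Floating-point addition is non-associative: a different summation order
# can change the rounded result.

a, b, c = 1.0, 2.0**53, -(2.0**53)
left_to_right = (a + b) + c        # 1.0 is lost when added to 2**53
reassociated = a + (b + c)         # b + c cancels exactly, then 1.0 survives
print(left_to_right, reassociated) # 0.0 vs 1.0
assert left_to_right != reassociated
```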

One issue with this vector reduction operation is that for the first valid iteration the scalar floating-point operand is injected as the first floating-point operand, whereas for subsequent cycles the first floating-point operand is the operand forwarded from the result of the previous valid iteration. One might assume that this requires a multiplexer to select when to inject the scalar floating-point operand as the first floating-point operand, and that the multiplexer should be placed on the forwarding loop signal path which provides the forwarded result of the addition in the previous valid iteration as the first floating-point operand input for the subsequent valid iteration. In practice, however, a multiplexer can incur a relatively large logic depth, and the delay it adds to the forwarding loop signal path may be enough to make it impractical to perform a single iteration of the vector reduction operation in a single cycle when clocked at a higher clock frequency.

Instead, as discussed further below, the multiplexer for selecting when to inject the scalar floating-point operand may be located outside the forwarding loop signal path which provides the forwarded result of the addition in the previous valid iteration as the first floating-point operand of the subsequent valid iteration. The forwarded result can then be input to the next iteration without passing through the multiplexer, and the multiplexer drops out of the critical timing path through the forwarding loop. Instead, the forwarding loop signal path may include OR circuitry which provides the first floating-point operand to be processed in a given iteration. The OR circuitry may combine a first input with a second input according to a bitwise OR operation. In iterations other than the first valid iteration, the first input of the OR circuitry may be the forwarded result of the addition performed in the previous valid iteration. The second input may be the output of a multiplexer which selects between the scalar operand, an all-zero value, and optionally other values, such as the forwarded result from the final iteration of a previous instance of the vector reduction (iterative operation) instruction. Since OR circuitry typically has a shallower logic depth than a multiplexer, the latency along the forwarding loop signal path is shorter, which improves performance and makes it feasible to perform a single iteration of the floating-point vector reduction operation per cycle at a higher clock frequency.

To be able to use OR circuit logic instead of a multiplexer on the critical timing path, the result generation circuitry and the multiplexer can each ensure that, in a cycle in which it is not intended to provide the value to be used as the first floating-point operand, it outputs an all-zero value, so that when ORed with the output from the other path the OR result is correct. Thus, in the final iteration of the iterative operation for a first iterative operation instruction, the processing circuitry may provide an all-zero value on the forwarding loop signal path, to be used as the first input of the OR circuitry for the first valid iteration of a subsequent iterative operation instruction. Conversely, the multiplexer may select the scalar floating-point operand (or the forwarded result from the final iteration of an earlier iterative operation instruction) in the first valid iteration of a given iterative operation instruction, and an all-zero value in the one or more subsequent valid iterations. Hence, in each iteration, one of the multiplexer output and the output on the forwarding loop signal path is all zeros, while the other is non-zero (unless the result of a valid addition iteration returned by the forwarding circuitry happens to be a true zero); it is not possible for both outputs to be non-zero at the same time. In this way, a multiplexer on the critical timing path can be avoided and replaced with OR circuitry, improving performance and reducing latency.
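
The following bit-level sketch (again illustrative, not RTL) shows why the OR works: in every cycle exactly one of the two sources drives a non-zero value, so a bitwise OR reproduces what a multiplexer would have selected. The example bit patterns are arbitrary HP values chosen for illustration:

```python
# OR-based operand injection: the forwarding path and the multiplexer output
# are combined with a bitwise OR, with the inactive source driving all zeros.

def select_opa(forwarded: int, mux_out: int) -> int:
    # forwarded: all-zero except in iterations after the first valid one;
    # mux_out:   the scalar operand in the first valid iteration, else all-zero.
    return forwarded | mux_out

# First valid iteration: forwarding path drives zeros, multiplexer injects Sc.
assert select_opa(0x0000, 0x3C00) == 0x3C00
# Subsequent iteration: multiplexer drives zeros, the forwarded result passes.
assert select_opa(0x4200, 0x0000) == 0x4200
```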

Another way to avoid the latency penalty of a multiplexer on the forwarding loop signal path is to inject the scalar floating-point operand into the output of the forwarding circuitry during a setup processing cycle, via a signal path with a shorter logic depth than the path through the result generation circuitry used to generate the actual floating-point addition result. During the setup cycle, the output of the forwarding circuitry (which in subsequent cycles provides the floating-point addition result of a given iteration) instead outputs the scalar floating-point operand, so that in the following clock cycle the result generation circuitry can perform the remainder of the processing using the scalar floating-point operand as the first floating-point operand of the first iteration of the loop. This approach can remove both the OR circuitry and the multiplexer from the critical timing path of the forwarding loop.

Floating point representation

Floating-point (FP) is a useful way of approximating real numbers using a small number of bits. The IEEE 754-2008 FP standard provides multiple formats for FP numbers, including binary 64 (also known as double precision, or DP), binary 32 (also known as single precision, or SP), and binary 16 (also known as half precision, or HP). The numbers 64, 32, and 16 refer to the number of bits required for each format.

FP numbers are quite similar to the "scientific notation" taught in science classes, where instead of negative two million we write -2.0 × 10^6. The parts of this number are the sign (negative in this case), the significand (2.0), the base of the exponent (10), and the exponent (6). All of these parts have analogues in FP numbers, although there are differences, the most important of which is that the constituent parts are stored as binary numbers, and the base of the exponent is always 2.

More precisely, an FP number consists of a sign bit, some number of biased exponent bits, and some number of fraction bits. Some FP formats of interest provide the following:

symbol Index of refraction Score of Index offset
DP[63:0] 63 62:52(11 bit) 51:0(52 bit) 1023
SP[31:0] 31 30:23(8 bit) 22:0(23 bit) 127
HP[15:0] 15 14:10(5 position) 9:0(10 bit) 15

The sign of a negative number is 1 and the sign of a positive number is 0. Each number (including zero) has a sign.

The exponent bias means that the true exponent differs from the exponent stored in the number. For example, the biased SP exponent is 8 bits long and ranges from 0 to 255. Exponents 0 and 255 are special cases, but all other exponents have a bias of 127, meaning that the true exponent is 127 less than the biased exponent. The smallest biased exponent is 1, corresponding to a true exponent of -126. The largest biased exponent is 254, corresponding to a true exponent of 127. HP and DP exponents work the same way, with the biases indicated in the table above.

SP exponent 255 (or DP exponent 2047, or HP exponent 31) is reserved for infinities and a special symbol called NaN (not a number). An infinity (which can be positive or negative) has a zero fraction. Any number with exponent 255 and a non-zero fraction is a NaN. Infinity provides a saturation value, so it actually means something like "this computation resulted in a number that is bigger than what we can represent in this format". NaNs are returned for operations that are not mathematically defined on the real numbers, such as division by zero or taking the square root of a negative number.

In any format, exponent zero is reserved for subnormal numbers and zeros. A normal number represents the value: (-1)^sign × 1.fraction × 2^e, where e is the true exponent computed from the biased exponent. The term 1.fraction is called the significand, and the 1 is not stored as part of the FP number but is instead inferred from the exponent. All exponents except zero and the maximum exponent indicate a significand of the form 1.fraction. Exponent zero indicates a significand of the form 0.fraction, and a true exponent equal to 1 - bias for the given format. Such a number is called subnormal (historically these numbers were called denormal, but modern usage prefers the term subnormal). Numbers with both exponent and fraction equal to zero are zeros.
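
A small sketch decoding HP bit patterns according to these rules may help; this is a plain software model of the format, with the interface (returning a Python float) chosen for illustration:

```python
# Decode a binary16 (HP) bit pattern: (-1)^sign * 1.fraction * 2^e for normal
# numbers, (-1)^sign * 0.fraction * 2^(1-15) for subnormals, plus specials.

def decode_hp(bits: int) -> float:
    sign = (bits >> 15) & 1
    exp = (bits >> 10) & 0x1F
    frac = bits & 0x3FF
    if exp == 0x1F:                                     # max exponent: specials
        if frac == 0:
            return float("-inf") if sign else float("inf")
        return float("nan")                             # sign of NaN ignored here
    if exp == 0:
        value = (frac / 1024.0) * 2.0 ** (1 - 15)       # subnormal: 0.fraction
    else:
        value = (1 + frac / 1024.0) * 2.0 ** (exp - 15) # normal: 1.fraction
    return -value if sign else value

# 0x3C00 -> 1.0, 0xBA00 -> -0.75 (i.e. -1.1 x 2^-1), 0xFC00 -> -inf
print(decode_hp(0x3C00), decode_hp(0xBA00), decode_hp(0xFC00))
```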

The following table has some example numbers in the HP format. The entries are in binary, with "_" characters added to increase readability. Note that the subnormal entry (fourth row of the table, with zero exponent) produces a different significand than the normal entry in the preceding row.

Sign   5-bit exponent   10-bit fraction   11-bit significand   Value
0      01111            00_0000_0000      100_0000_0000        1.0 × 2^0
1      01110            10_0000_0000      110_0000_0000        -1.1 × 2^-1
0      00001            10_0000_0000      110_0000_0000        1.1 × 2^-14
0      00000            10_0000_0000      010_0000_0000        0.1 × 2^-14
1      11111            00_0000_0000      -                    -infinity
0      11111            00_1111_0011      -                    NaN

Much of the complexity of FP implementations is due to subnormals, which are therefore often handled by microcode or software. Processing circuitry that handles subnormals in hardware can speed up these operations by a factor of 10 to 100 compared to a software or microcode implementation.

Integer, fixed point, floating point

The FP way of handling signs is called sign-magnitude, and it is different from the usual way integers are stored in computers (two's complement). In sign-magnitude representation, the positive and negative versions of the same number differ only in the sign bit. A 4-bit sign-magnitude integer, consisting of a sign bit and 3 magnitude bits, would represent plus one and minus one as:

+1=0001

-1=1001

In two's complement representation, an n-bit integer i is represented by the low-order n bits of the binary (n+1)-bit value 2^n + i, so a 4-bit two's complement integer would represent plus one and minus one as:

+1=0001

-1=1111

The two's complement format is common in practice for signed integers because it simplifies computer arithmetic.
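
A quick sketch of the two 4-bit encodings described above:

```python
# Contrast the two encodings for small signed integers.

def sign_magnitude_4bit(i: int) -> int:
    return (0b1000 | -i) if i < 0 else i   # sign bit plus 3 magnitude bits

def twos_complement_4bit(i: int) -> int:
    return i & 0xF                         # low 4 bits of 2**4 + i

print(format(sign_magnitude_4bit(1), "04b"))    # 0001
print(format(sign_magnitude_4bit(-1), "04b"))   # 1001
print(format(twos_complement_4bit(-1), "04b"))  # 1111
```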

Fixed-point numbers look exactly like integers, but actually represent values that have a certain number of fractional bits. Sensor data is often in fixed-point format, and a great deal of fixed-point software was written before the wide adoption of FP. Fixed-point numbers are cumbersome to work with because the programmer has to keep track of the "binary point", i.e., the separator between the integer and fractional parts of the number, and also has to constantly shift the number to keep the bits in the correct positions. FP numbers do not have this difficulty, so it is desirable to be able to convert between fixed-point numbers and FP numbers. Being able to convert means that we can still use fixed-point software and data, but we are not limited to fixed point when writing new software.

Rounding of FP numbers

The IEEE 754 standard requires most FP operations to be computed as if the operation were done with unbounded range and precision, and then rounded to fit into an FP number. If the computation exactly matches an FP number, then that value is always returned, but usually the computation results in a value that lies between two consecutive floating-point numbers. Rounding is the process of choosing which of the two consecutive numbers should be returned.

There are a number of ways of rounding, called rounding modes; six of them are as follows:

Mode   Definition
RNE    Round to nearest, ties to even: pick the closest value, or if both values are equally close, the even value
RNA    Round to nearest, ties away: pick the closest value, or if both values are equally close, the value farthest from zero
RZ     Round to zero: pick the value closest to zero
RP     Round toward plus infinity: pick the value closest to positive infinity
RM     Round toward minus infinity: pick the value closest to negative infinity
RX     Round to odd: pick the odd value

The definition does not tell us how to round in any practical way. One common implementation is to do the operation, look at the truncated value (i.e., the value that fits into the FP format) as well as all of the remaining bits, and then adjust the truncated value if certain conditions hold. These computations are all based on:

L (least)    the least significant bit of the truncated value

G (guard)    the next most significant bit (i.e., the first bit not included in the truncation)

S (sticky)   the logical OR of all remaining bits that are not part of the truncation

Given these three values and the truncated value, we can compute the correctly rounded value according to the following table (Table 1):

TABLE 1

Mode   Change to the truncated value
RNE    Increment if (L & G) | (G & S)
RNA    Increment if G
RZ     None
RP     Increment if positive & (G | S)
RM     Increment if negative & (G | S)
RX     Set L if G | S

For example, consider multiplying two 4-bit significands and then rounding to a 4-bit significand.

sig1 = 1011 (decimal 11)

sig2 = 0111 (decimal 7)

Multiplying yields:

sig1 × sig2 = 1001_101 (decimal 77)
                 L Gss

The least significant bit of the truncated 4-bit result is labeled L, the next bit is G, and S is the logical OR of the remaining bits labeled s (i.e., S = 0 | 1 = 1). To round, we adjust the 4-bit result (1001) according to the rounding mode and the computations in the table above. So, for example, in RNA rounding, G is set, so we return 1001 + 1 = 1010. For RX rounding, G | S is true, so we set L to 1 (it is already 1, so nothing changes in this case) and return 1001.
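
The worked example can be checked with a short sketch implementing Table 1 (the function name and interface are this sketch's own; RNE/RNA/RZ/RP/RM/RX follow the table's definitions, and at least one bit is assumed to be truncated):

```python
# Round a product to a narrower significand using L, G, and S per Table 1.

def round_sig(full: int, width: int, total_bits: int, mode: str,
              negative: bool = False) -> int:
    drop = total_bits - width                       # bits truncated away (>= 1)
    trunc = full >> drop                            # the truncated value
    L = trunc & 1
    G = (full >> (drop - 1)) & 1                    # first bit not kept
    S = int((full & ((1 << (drop - 1)) - 1)) != 0)  # OR of the remaining bits
    if mode == "RNE" and ((L & G) | (G & S)):
        trunc += 1
    elif mode == "RNA" and G:
        trunc += 1
    elif mode == "RP" and not negative and (G | S):
        trunc += 1
    elif mode == "RM" and negative and (G | S):
        trunc += 1
    elif mode == "RX" and (G | S):
        trunc |= 1                                  # set L
    return trunc                                    # RZ: no change

product = 0b1011 * 0b0111                           # 1001_101 (decimal 77)
print(format(round_sig(product, 4, 7, "RNA"), "04b"))  # 1010, as in the text
print(format(round_sig(product, 4, 7, "RX"), "04b"))   # 1001
```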

Examples of the invention

Fig. 1 schematically illustrates an example of a data processing apparatus 2. The data processing apparatus has a processing pipeline 4 comprising a number of pipeline stages. In this example, the pipeline stages include: a fetch stage 6 for fetching instructions from an instruction cache 8; a decode stage 10 for decoding the fetched program instructions to generate micro-operations to be processed by the remaining stages of the pipeline; an issue stage 12 for checking whether the operands required for the micro-operations are available in a register file 14 and issuing a micro-operation for execution once the required operands for that micro-operation are available; an execute stage 16 for performing data processing operations corresponding to the micro-operations by processing operands read from the register file 14 to generate result values; and a writeback stage 18 for writing the results of the processing back to the register file 14. It will be appreciated that this is merely one example of a possible pipeline architecture, and other systems may have additional stages or a different configuration of stages. For example, in an out-of-order processor, an additional register renaming stage could be included for mapping architectural registers specified by program instructions or micro-operations to physical register specifiers identifying physical registers in the register file 14.

The execute stage 16 includes a number of processing units for performing different classes of processing operations. For example, the execution units may include: a scalar arithmetic/logic unit (ALU) 20 for performing arithmetic or logical operations on scalar operands read from a scalar register file 21; a floating-point unit 22 for performing operations on floating-point values; a branch unit 24 for evaluating the outcome of branch operations and adjusting the program counter, which represents the current point of execution, accordingly; and a load/store unit 28 for performing load/store operations to access data in a memory system 8, 30, 32, 34. In this example, the memory system includes a level-one data cache 30, the level-one instruction cache 8, a shared level-two cache 32, and main system memory 34. It will be appreciated that this is just one example of a possible memory hierarchy, and other arrangements of caches can be provided. The specific types of processing unit 20-28 shown in the execute stage 16 are just one example, and other implementations may have a different set of processing units or could include multiple instances of the same type of processing unit, so that multiple micro-operations of the same type can be handled in parallel. It will be appreciated that Fig. 1 is merely a simplified representation of some components of a possible processor pipeline architecture, and the processor may include many other elements not illustrated for conciseness, such as branch prediction mechanisms or address translation or memory management mechanisms.

Floating point unit 22 supports processing operations performed on operands represented in a floating point representation. The floating-point operations may include operations performed on scalar floating-point operands stored within scalar registers, or on floating-point operands of corresponding data elements of vector operands stored in vector registers. Some operations may also handle both scalar and vector operands.

One type of floating-point operation that may be supported by the floating-point unit 22 is a floating-point vector reduction operation; this operation may be controlled by a vector reduction instruction FADDA as shown in Fig. 2. The FADDA instruction is an example of the iterative operation instruction discussed above. The FADDA instruction specifies a scalar destination register Rd, a vector source register Za, a scalar source register Rb, and a predicate register Pg. The scalar source register Rb contains a scalar floating-point operand Sc, which comprises a sign, a significand, and an exponent as discussed above. Similarly, each element V of the vector source register Za contains a floating-point operand, again comprising a sign, a significand, and an exponent. The predicate register Pg includes a series of predicate indications, each indicating whether a corresponding element of the source vector register Za is valid or invalid. In this example, vector element V[1] is indicated as invalid because its corresponding predicate flag in position 1 of the predicate register is 0, while the other vector elements of Za are all valid because their predicate flags are 1.

The lower part of Fig. 2 shows the operation performed for the FADDA instruction. In the first iteration (iteration 0), the scalar operand Sc in register Rb is added to the first valid element V[0] of vector register Za, using a floating-point addition including rounding and normalization, to produce the iteration 0 result Sc + V[0]. The next iteration (iteration 1) would normally add the result of iteration 0 to vector element V[1], but since element V[1] is masked out as invalid by the predicate, the output of iteration 0 is retained as the output of iteration 1 without adding in vector element V[1]. In each of the subsequent iterations 2 and 3 shown in Fig. 2, a successive valid element of the vector (V[2], then V[3]) is added to the result of the previous iteration, again using a floating-point addition including rounding and normalization, until the final result is produced; in this example, the final result corresponds to the sum of the scalar operand Sc and the three valid elements V[0], V[2], and V[3]. This final result is stored to the destination register Rd. In practice, it may be relatively common to execute several FADDA instructions back to back, with different source vector registers Za, with the destination register Rd of one FADDA instruction becoming the source scalar register Rb of the next, so as to add together vector elements spanning multiple vectors.
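
A reference model of this behavior in ordinary Python floats (rather than the architectural FP formats and rounding) may clarify the semantics; the element values below are made up for illustration, since Fig. 2 does not give numeric values:

```python
# Reference model of the predicated FADDA reduction semantics.

def fadda(scalar, vector, predicate):
    acc = scalar
    for element, active in zip(vector, predicate):
        if active:                  # invalid elements are simply skipped
            acc = acc + element     # each addition rounds before the next one
    return acc

# The Fig. 2 pattern: V[1] is masked out by the predicate.
result = fadda(1.5, [0.25, 99.0, -2.0, 8.0], [1, 0, 1, 1])
print(result)  # Sc + V[0] + V[2] + V[3] = 7.75
```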

Because floating-point addition is non-associative, to give predictable results the iterations must be performed in an order corresponding to the order of the vector elements in the source vector, and the iterations cannot be performed in parallel. This type of operation is therefore relatively sensitive to any increase in the latency of the floating-point addition of a single iteration, since any added latency is incurred multiple times, once per iteration. It may therefore be desirable to provide techniques for reducing the per-iteration latency of this instruction.

For comparison, Fig. 3 shows an example of processing logic which could be provided within the floating-point unit 22 to support this type of floating-point addition. Fig. 3 illustrates the logic for a single iteration; the forwarding of the operand back for another iteration is not shown, for conciseness. In general, the addition may involve first computing information about the operands opa, opb to be added, using logic 30. For example, the exponent difference between the operands' exponents may be computed, as well as information about whether the exponent/significand have certain special values indicating that a floating-point operand is subnormal, infinite, NaN, etc., as discussed above. This computed information may then be used to configure how the add/round circuitry 32 or special value handling circuitry 34 processes the operands to generate the result.

Depending on the computed information about the operands, the add/round circuitry 32 may include near-path logic 36 and far-path logic 38 for handling different types of processing. A floating-point addition may be classified as either a like-signed addition (LSA) or an unlike-signed addition (USA). An LSA is performed when an addition operates on two operands having the same sign, or when a subtraction operates on values having different signs (which is effectively an addition); a USA is an addition of floating-point operands having different signs, or a subtraction of operands having the same sign. The far-path logic 38 may be designed to handle all LSAs, and those USAs operating on operands whose exponents differ by more than 1. The far-path logic 38 may include processing for aligning the operands' significands based on the exponent difference, adding the aligned significands, and rounding the result. The near-path logic 36, on the other hand, may handle USAs whose exponents are the same or differ by only 1, in which case logic may be needed for adjusting the exponent based on the number of leading zeros in the addition result, and for handling potentially subnormal outputs.
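
A sketch of this classification and the resulting path choice (a software model; the function names are this sketch's own, not the patent's):

```python
# Classify like-signed vs unlike-signed additions and pick the near/far path.

def is_lsa(sign_a: int, sign_b: int, subtract: bool) -> bool:
    # Effective addition: add with equal signs, or subtract with unequal signs.
    return (sign_a == sign_b) != subtract

def use_near_path(sign_a: int, exp_a: int, sign_b: int, exp_b: int,
                  subtract: bool = False) -> bool:
    # The near path handles only USAs whose exponents are equal or differ by 1;
    # the far path handles all LSAs and the remaining USAs.
    usa = not is_lsa(sign_a, sign_b, subtract)
    return usa and abs(exp_a - exp_b) <= 1

print(use_near_path(0, 15, 1, 15))   # True: USA with equal exponents
print(use_near_path(0, 15, 0, 12))   # False: LSA goes to the far path
```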

Thus, in general, a floating-point addition may require some information about the operands to be calculated in order to decide how to process them: for example, to select whether the near path 36 or the far path 38 is required; to configure other portions of the operation, such as the shift amount required for alignment in the far path 38; or to decide whether special value handling 34, producing a special floating-point value such as infinity or NaN, is required.

As shown in FIG. 4, if a floating-point addition is included in an iterative operation such as the vector reduction operation of FIG. 2, one might expect that in each cycle of the iterative operation it would be necessary first to compute information about the operands to be processed in that cycle, and then to add the operands based on that computed information. The logic 30 for calculating information about the operands and the subsequent addition circuitry 32 may each require a certain logic depth, so that the minimum clock period C1 (and hence the maximum clock frequency) which can be supported by the processing logic may be constrained by the combined logic depth of the circuit logic 30, 32.

FIG. 5 illustrates an alternative approach. FIG. 5 illustrates a portion of the floating-point unit 22 used to support processing of the floating-point vector reduction shown in FIG. 2. The floating-point vector reduction logic includes two stages, stage 0 and stage 1. Stage 1 includes most of the circuit logic for performing the floating-point addition (e.g., the near/far path logic). For both the near-path and far-path logic (which are not distinguished in FIG. 5 for simplicity), the addition circuit logic may include: significand processing logic 50, which generates the significand of the result of adding the two floating-point operands opa, opb; and exponent and sign processing logic 52, which generates the exponent and sign of the result of adding the operands opa, opb. These are located in execute stage 1.

Selection circuitry 54 is provided to select which operand is provided as the first floating-point operand opa to the significand processing logic 50 and the exponent/sign processing logic 52. In the initial cycle of the iterative floating-point reduction operation, the value from the scalar register Rb may be selected as operand opa, while in a subsequent cycle the floating-point result from the previous iteration may be used, forwarded on the forwarding path 56 of the forwarding circuitry. The other operand opb may be selected from among the vector elements of the source vector Za of the FADDA instruction.

As shown in FIG. 5, the floating-point unit includes initial information generation circuitry 60, which includes a portion 60-0 within stage 0 and a portion 60-1 within stage 1. In parallel with the significand processing logic 50 completing the generation of the addition result of the previous iteration, the initial information generation circuitry 60 generates initial information about the operands opa, opb to be processed in the next iteration of the FADDA instruction. This exploits two properties of the FADDA instruction which mean that, although an iteration of the iterative operation requires the result produced in the previous iteration, it is still possible to overlap the processing of successive iterations so that they are performed partially in parallel. First, the vector elements within the vector register Za are typically available from the start of processing of the instruction and do not depend on the addition output of an earlier cycle of the operation, so the stage 0 initial information generation circuitry 60-0 may pre-compute initial information about them in parallel with the addition logic 50, 52 generating the forwarded operand opa to be used in the next cycle.

Second, in general the exponent and sign processing logic 52 may have a shorter logic depth than the significand processing logic 50. For example, the significand processing logic 50 may require relatively slow operations such as shifts for alignment and carry-propagate addition, while the exponent and sign may be generated by combinational logic or by selecting between the exponents and signs of the input operands, and thus may be available earlier. Hence, before the complete forwarded operand on path 56 is available, early information 58 about the exponent and sign of the value to be used as operand opa for the next iteration may be obtained. Thus, before the full value of opa has been generated, the stage 1 initial information generation circuitry 60-1 may pre-compute information about the operand opa to be used in the next cycle. For example, this stage 1 initial information generation circuitry 60-1 may pre-compute information (e.g., all or part of the exponent difference) based on a comparison of the early information for operand opa of the next iteration with the already-known full value of operand opb for the next iteration.

By moving some of the pre-computation of information about the operands into an earlier cycle, the logic depth within the addition logic 50, 52 for a given cycle can be shorter, so that a higher clock frequency can be used. FIG. 6 shows a timing diagram of how this is implemented. As shown in FIG. 6, the clock period C2 in FIG. 6 may be reduced relative to the clock period C1 of FIG. 4, because the initial portion 70 of processing used to generate the initial information about the operands for a given iteration now overlaps with the completion of the remaining portion 72 used to generate the result of the previous iteration.

Thus, in iteration 0, the initial portion 70 of the processing is first performed in a setup clock cycle to calculate initial information about the operands opa[0], opb[0] to be processed in iteration 0; then, in the next clock cycle (cycle 0), the remaining portion 72 processes the input operands opa[0], opb[0] to generate a result, which is forwarded for use as operand opa[1] of iteration 1. Meanwhile, in parallel with the end of the remaining portion 72 of iteration 0 that produces the result opa[1], the initial portion 70 of iteration 1 is performed in clock cycle 0, based on the value of opb[1], which is known in advance, and on the early information 58 returned about the exponent and sign of opa[1]. Similarly, for iterations 2 and 3, the initial portion 70 begins before the remaining portion of the previous iteration ends.

This therefore means that the FADDA instruction can be supported on a system using a clock signal whose clock period C2 is shorter than the combined delay L of the initial portion 70 and the remaining portion 72 of a given iteration of the iterative operation. This enables more aggressive scaling of the clock frequency, improving performance not only for the FADDA instruction itself but also for all other instructions supported by the processing circuitry, since when the processing circuitry is clocked at a higher rate, those other operations are also accelerated.

FIG. 7 shows an example of a floating-point reduction unit that may be provided within the floating-point unit 22 discussed above. As shown on the left-hand side of FIG. 7, when executing the FADDA instruction discussed above, the floating-point reduction unit receives its input values from the scalar register Rb, the vector source register Za, and the predicate register Pg respectively. As the scalar operand from Rb and the vector elements el_1 to el_3 (other than the first vector element el_0) may be required in later clock cycles when subsequent iterations of the FADDA operation are performed, the scalar operand is held within an internal register 100 of the floating-point reduction unit and the vector elements el_1 to el_3 are held within internal registers 102 of the floating-point reduction unit, in case the corresponding source registers Rb, Za have been overwritten by then. The internal register 100 is provided for holding the scalar operand because the first valid element of the vector may be one of the elements el_1 to el_3 (if el_0 is marked invalid by the predicate Pg), in which case the scalar value will be needed in a subsequent iteration.

Enable logic 103 is provided to generate enable signals which allow flip-flops in the second execution stage 104 to be updated based on values from the first execution stage 106. In iteration X of the FADDA operation, if the predicate indicates that element el_X is invalid, the enable logic 103 deasserts the flip-flop enable signal to ensure that the output flip-flop 108 of the second execution stage 104 retains its previous value and is not updated, effectively providing the masking of the iteration shown for iteration 1 in the example of FIG. 2. Preventing the flip-flop update also saves power on invalid iterations of the iterative operation. The first execution stage 106 comprises a multiplexer 110, which selects which of the elements el_0 to el_3 of the input vector from register Za is provided as operand opb in the current iteration. The multiplexer is controlled by an element counter 112, which cycles through the values 0, 1, 2, 3 to select the next element each cycle. The other operand opa of the addition is selected by the selection circuitry 54, which is implemented using multiplexer 54-0 and OR gate 54-1, as described in more detail below.

As in FIG. 5, the addition logic 50, 52 is provided in the second execution stage 104, and initial information generation circuitry 60 is provided comprising a portion 60-0 within the first stage 106 and a portion 60-1 within the second stage 104. In this example, the information about the operands opa, opb of the next iteration which is computed in advance by the initial information generation circuitry 60 includes an indication of whether the exponent of each operand opa, opb is 0, which is captured in flip-flops 116. It may be useful to pre-compute whether the exponent is zero, as this may allow early configuration of the logic within the floating-point adders 50, 52. For example, this may allow a determination of whether a floating-point operand is subnormal, which may require the exponent to be adjusted from 0 to 1 before processing. By incurring the logic delay associated with determining whether the exponent is 0 for one iteration in parallel with the calculation of the floating-point result of the previous iteration, latency within the second execution stage is reduced, making it more practical to fit the logic depth of the floating-point adders 50, 52 within a single clock cycle.
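The pre-computed flag itself is a simple test on the biased exponent field. A minimal C sketch, assuming IEEE-754 double-precision encoding (function names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

static bool exponent_is_zero(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return ((u >> 52) & 0x7FF) == 0;     /* biased exponent field all zeros */
}

/* Knowing this flag a cycle early lets the adder be configured up front:
 * a zero exponent means the value is subnormal (or zero), so the implicit
 * significand bit is 0 and the effective exponent is treated as 1. */
static int effective_exponent(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    int e = (int)((u >> 52) & 0x7FF);
    return exponent_is_zero(d) ? 1 : e;  /* adjust 0 -> 1 for subnormals */
}
```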

FIG. 7 illustrates the forwarding path 56 used as forwarding circuitry to forward the floating-point result produced by the floating-point adder in one iteration back as the input operand opa for the next iteration. While one might think this would require a forwarding multiplexer to select between the forwarded result on the forwarding path 56 and the scalar input provided from register Rb, such a multiplexer could introduce significant additional delay. Thus, to speed up processing, the forwarding path 56 instead provides one input to OR gate 54-1, whose other input is the output of multiplexer 54-0. OR gate 54-1 simply performs a bitwise OR of its two inputs and provides the result as operand opa (including exponent and fraction) to the floating-point adders 50, 52. The OR gate has a much shorter logic delay than a multiplexer, and therefore improves performance.

When an OR gate is used, if either input is non-zero, the non-zero bits of that input will affect the output of OR gate 54-1. Thus, to ensure that only one of the OR gate inputs is non-zero at a time, either the loop forwarding path 56 or the output of the multiplexer 54-0 should provide an all-zero value. A flip-flop 120 is provided which supplies a signal that forces the value provided on the forwarding path 56 to be cleared to 0 in the final iteration of processing of a given floating-point reduction instruction FADDA. This flip-flop 120 is set in the clock cycle in which the element counter 112 equals the value "done" corresponding to the element position at the most significant end of the vector (e.g., "done" = 3 for the 4-element example of FIG. 2). Thus, in the final iteration of one floating-point vector reduction operation, the value output on the loop forwarding path 56 is all zeros, meaning that when the floating-point reduction unit begins the iterative operation of the next floating-point vector reduction instruction, in the first cycle of that next instruction the all-zeros on the loop forwarding path 56 will not interfere with the value output by multiplexer 54-0, which may then be selected based on the scalar value.

In any cycle other than the first valid iteration of the iterative operation, the opa multiplexer 54-0 selects the all-zeros value 122 to be provided to OR gate 54-1, so that this input does not interfere with the forwarded result on forwarding path 56 provided from the addition of the previous iteration. In the first valid iteration for a given floating-point vector reduction instruction, the opa multiplexer 54-0 selects one of: (1) the output 124 from the scalar source register Rb itself; (2) the saved scalar operand stored in the internal register 100 of the floating-point reduction unit; or (3) a floating-point operand provided on the instruction-to-instruction forwarding path 126, which supplies the result 108 of one FADDA instruction as a potential input to be used as the scalar operand of the next instruction, avoiding the delay of writing the result to a register and then reading it out again. Thus, if forwarding between instructions is required, the value on the internal instruction-to-instruction forwarding path 126 may be selected. If forwarding between instructions is not required (because the previously executed instruction was not another floating-point vector reduction instruction, or because the destination register Rd of the previous floating-point vector reduction instruction differs from the source register Rb of the current one), then the value 124 from the original scalar register Rb is selected in the first iteration (corresponding to element el_0) if that iteration is valid; or, if the first valid iteration is one of the subsequent iterations 1, 2, 3 because iteration 0 is marked invalid by the predicate, the value from internal register 100 is selected in that first valid iteration. Once the first valid iteration has been processed, for all remaining iterations the opa multiplexer 54-0 selects the all-zeros value 122 to avoid interfering with the iteration-to-iteration forwarding on the loop signal path 56.
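The following behavioral C sketch models this selection scheme: multiplexer 54-0 outputs all-zeros except in the first valid iteration, and flip-flop 120 forces the forwarded value to all-zeros in the final iteration, so at most one OR-gate input is ever non-zero and the bitwise OR behaves exactly like a multiplexer. The struct/function names and the uint64_t packing of sign, exponent and fraction are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t rb_value;        /* output 124 from scalar source register Rb */
    uint64_t saved_scalar;    /* internal register 100 */
    uint64_t inst_forward;    /* instruction-to-instruction path 126 */
} opa_sources_t;

/* Models multiplexer 54-0: all-zeros (value 122) outside the first valid
 * iteration; otherwise one of the three scalar sources. */
static uint64_t opa_mux(const opa_sources_t *src, bool first_valid_iter,
                        bool use_saved, bool use_inst_forward)
{
    if (!first_valid_iter) return 0;            /* all-zeros value 122 */
    if (use_inst_forward)  return src->inst_forward;
    if (use_saved)         return src->saved_scalar;
    return src->rb_value;
}

/* Models OR gate 54-1: correct only because, by construction, at most one
 * of the two inputs is non-zero in any given cycle. The forwarded value is
 * forced to all-zeros in the final iteration of the previous instruction
 * (flip-flop 120), so a new instruction's first valid iteration sees only
 * the mux output. */
static uint64_t select_opa(uint64_t forwarded, const opa_sources_t *src,
                           bool first_valid_iter, bool use_saved,
                           bool use_inst_forward)
{
    return forwarded | opa_mux(src, first_valid_iter, use_saved, use_inst_forward);
}
```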

Thus, since multiplexer 54-0 lies outside the critical timing path along loop 56, performance is improved, enabling more aggressive scaling of the clock frequency.

FIG. 7 shows an example in which the output of multiplexer 54-0 passes through OR circuitry 54-1, which combines the forwarded operand from forwarding path 56 with the scalar output of multiplexer 54-0. Yet another approach is to inject the scalar value for the initial iteration via a special, lightly loaded path during the setup clock cycle, so that the scalar value appears at the output of the forwarding circuitry 56; as well as removing multiplexer 54-0 from the loop, OR gate 54-1 would then be removed entirely from the single-cycle forwarding return path 56.

FIG. 7 illustrates an example in which the initial information calculated about the operands for the next cycle is an indication of whether the exponent of either operand opa, opb is zero. This allows the significand to be constructed earlier. Also, in FIG. 7, the exponent value may be returned earlier, and for a given iteration the processing of the exponent by the floating-point adders 50, 52 may be skewed relative to the processing of the significand, so that exponent processing begins before significand processing.

However, it is possible to determine other initial information about the operands opa, opb for the next iteration in parallel with completing the remaining portion of the processing which generates the floating-point addition result of the previous iteration. For example, it is possible to pre-compute the low-order bit, or a set of low-order bits, of the exponent difference between the exponents of operands opa, opb, or even to compute the full exponent differences expa-expb and/or expb-expa. In some implementations it may not be known in advance which operand has the larger exponent, so the exponent differences expa-expb and expb-expa may both be computed in parallel to allow earlier processing. The exponent difference may be needed to determine the number of bit positions by which the significand of the smaller operand should be shifted to align it with the other significand, and may also be used to generate other information such as rounding status.
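A short sketch of this speculative dual difference, under the assumption of IEEE-754 double-precision operands (names are illustrative): both differences are formed before it is known which is needed, and the non-negative one is selected late as the alignment shift count.

```c
#include <stdint.h>
#include <string.h>

static int exp_field(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return (int)((u >> 52) & 0x7FF);
}

/* Both differences are computed speculatively in parallel; the
 * non-negative one gives the number of bit positions by which the smaller
 * operand's significand must be shifted for alignment. */
static int alignment_shift(double opa, double opb)
{
    int d_ab = exp_field(opa) - exp_field(opb);  /* expa - expb */
    int d_ba = exp_field(opb) - exp_field(opa);  /* expb - expa */
    return d_ab >= 0 ? d_ab : d_ba;              /* early compute, late select */
}
```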

Similarly, the signs of the operands may be pre-computed earlier, and may thus control whether the processing path for handling an LSA or a USA is used; e.g., they may control the selection between the near-path logic and far-path logic discussed above. Furthermore, the pre-calculation logic 60 may calculate information about whether a given floating-point operand is an infinity or NaN, which may control whether the special handling logic for such special floating-point values is used.

Another example may pre-compute whether operand opb is equal to 0 (which may be determined when both the biased exponent and the fraction are 0); this may help speed up processing, because the more complex processing signal paths involving the addition may be omitted if the operand is known to be 0. If opb is 0 for a given iteration, the flip-flop enable signal for register 108 may be deasserted to prevent the result of the previous iteration from being updated, since the previous result plus 0 is still the previous result.
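As a sketch, a double-precision value is (plus or minus) zero exactly when its biased exponent and fraction fields are both zero, so the test and the resulting enable suppression might look as follows; the names are illustrative, and the enable condition simply mirrors the source's claim that adding zero leaves the previous result unchanged.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

static bool is_fp_zero(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return (u & 0x7FFFFFFFFFFFFFFFull) == 0;   /* ignore the sign bit */
}

static bool result_flipflop_enable(double opb, bool element_valid)
{
    /* Keep the previous result when the element is masked out, or when
     * opb is zero so that adding it cannot change the accumulated value. */
    return element_valid && !is_fp_zero(opb);
}
```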

Thus, various items of information about the operands opa and/or opb can be pre-calculated, and by moving this logic off the critical path of a single iteration, the processing of each iteration is sped up, enabling the use of higher clock frequencies.

FIG. 8 is a flow chart illustrating a method of performing an iterative operation. At step 202, in a setup cycle, the initial information generation circuitry 60 determines initial information about the operands to be processed in iteration n of the iterative operation, where n = 0 for the initial setup cycle.

In the subsequent clock cycle, at step 204, the result generation circuitry 50, 52 generates the result value of iteration n, and in parallel with completing this remaining portion of the iteration, at step 206 the initial information generation circuitry 60 performs the initial portion of the processing of the next iteration to determine initial information about the operands to be processed in the next iteration n+1.

At step 208, it is determined whether iteration n is the final iteration of the iterative operation; if so, at step 210 the result of iteration n produced at step 204 is output as the final result of the instruction. However, if there is still another iteration to be processed, then at step 212 the forwarding circuitry 56 forwards the result of iteration n as an input for the next iteration n+1, and n is incremented. Thus, in the next clock cycle, steps 204, 206 are performed again, carrying out the remaining portion of iteration n+1 in parallel with the initial portion of iteration n+2.
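The overlapped schedule that results can be visualized with a trivial C sketch which prints, for each clock cycle, which portions of which iterations execute together (ITERS = 4 matches the example of FIG. 2; the program is purely illustrative):

```c
#include <stdio.h>

#define ITERS 4

int main(void)
{
    /* Setup cycle: only the initial portion of iteration 0 runs. */
    printf("setup : initial(0)\n");

    /* Each later cycle pairs remaining(n) with initial(n+1). */
    for (int n = 0; n < ITERS; n++) {
        if (n < ITERS - 1)
            printf("cycle%d: remaining(%d) || initial(%d)\n", n, n, n + 1);
        else
            printf("cycle%d: remaining(%d) -> final result\n", n, n);
    }
    return 0;
}
```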

Thus, with this technique, the latency associated with computing the initial information can be hidden in the shadow of the latency of generating the result, improving performance.

In this application, the word "configured to" is used to indicate that an element of an apparatus has a configuration capable of performing a defined operation. In this context, "configuration" means an arrangement or manner of interconnecting hardware or software. For example, the apparatus may have dedicated hardware providing defined operations, or a processor or other processing device that may be programmed to perform the functions. "configured to" does not mean that the device element needs to be changed in any way to provide the defined operation.

Although illustrative embodiments of the present invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope and spirit of the invention as defined by the appended claims.
