Processor having read shifter and control method using the same

Document No.: 95762 | Published: 2021-10-12

Abstract: This technology, "Processor having read shifter and control method using the same", was designed by 陈丁明 on 2020-08-24. The invention provides a processor having a read shifter and a control method using the processor. The processor includes a register set, a read shifter, a decode unit, and a plurality of functional units. The register set includes a read port. The read shifter includes a plurality of shifter entries and is configured to shift out one shifter entry among the plurality of shifter entries each clock cycle. Each of the plurality of shifter entries is associated with a clock cycle, and each of the plurality of shifter entries includes a read value indicating the availability of the read port of the register set for a read operation in that clock cycle. The decode unit is coupled to the read shifter and configured to decode and issue instructions based on the read values contained in the plurality of shifter entries of the read shifter. The plurality of functional units are coupled to the decode unit and the register set and configured to execute the instructions issued by the decode unit and to perform read operations on the read port of the register set.

1. A microprocessor, comprising:

a register bank including a read port;

a read shifter comprising a plurality of shifter entries, the read shifter configured to shift out a shifter entry among the plurality of shifter entries each clock cycle, wherein each of the plurality of shifter entries is associated with a clock cycle and each of the plurality of shifter entries comprises a read value, wherein the read value is to indicate whether the read port of the register set is available for a read operation in the clock cycle;

a decode unit coupled to the read shifter, the decode unit configured to decode an instruction and issue the instruction based on the read values contained in the plurality of shifter entries of the read shifter; and

at least one functional unit coupled to the decode unit and the register bank, the at least one functional unit configured to execute the instruction issued by the decode unit and to use data read out from the read port of the register bank.

2. The microprocessor of claim 1, wherein the read shifter is configured to shift out a bottom shifter entry of the read shifter every clock cycle, wherein the shifted-out value of the shifter controls the read operation of the register bank.

3. The microprocessor of claim 1, wherein

when the read value is a first predetermined value, the read port of the register set is unavailable for the read operation in the clock cycle; and

when the read value is a second predetermined value, the read port of the register set is available for the read operation in the clock cycle.

4. The microprocessor of claim 3, wherein

when a first instruction to be issued is configured to perform a first read operation in a first clock cycle, and the read shifter indicates that the read port of the register bank is unavailable in the first clock cycle, the decode unit is configured to stall the issuance of the first instruction, and

when the first instruction to be issued is configured to perform the first read operation in the first clock cycle, and the read shifter indicates that the read port of the register bank is available in the first clock cycle, the decode unit is configured to issue the first instruction and set the read value of the shifter entry associated with the first clock cycle to the first predetermined value.

5. The microprocessor of claim 4, wherein

when the decode unit is configured to stall the issuance of the first instruction in the first clock cycle, the decode unit is configured to determine the availability of the read port of the register bank in a second clock cycle that is immediately subsequent to the first clock cycle.

6. The microprocessor of claim 1, further comprising:

an instruction fetch unit coupled to the decode unit, the instruction fetch unit configured to provide the instruction to the decode unit; and

an execution queue coupled to the decode unit and the at least one functional unit, the execution queue configured to store issued instructions into a plurality of queue entries of the execution queue and to provide the issued instructions stored in the execution queue to the at least one functional unit.

7. The microprocessor of claim 1, wherein

the register set further includes a plurality of registers having a plurality of register addresses, and

each of the plurality of shifter entries of the read shifter further comprises:

a register address value configured to store a register address among the plurality of register addresses for the read operation of the instruction.

8. The microprocessor of claim 1, wherein each of the plurality of shifter entries of the read shifter further comprises:

a forwarding control bit configured to control a forwarding operation of the at least one functional unit, and

a functional unit value configured to indicate which functional unit is to forward the result data.

9. The microprocessor of claim 1, wherein

the read shifter is implemented using a rotating buffer, and the rotating buffer includes:

a read pointer configured to indicate a buffer address to read out an element from the rotating buffer, wherein the readout value controls the read operation of the register bank.

10. The microprocessor of claim 1, wherein the read shifter stalls shifting out of the shifter entry when the read values of the plurality of shifter entries are all a second predetermined value.

11. The microprocessor of claim 1, wherein

the read shifter includes a plurality of first read shifters, each of the plurality of first read shifters corresponding to one of the plurality of first read ports.

12. The microprocessor of claim 11, wherein

the instruction is a vector instruction including a plurality of micro-operations, and

the plurality of micro-operations of the vector instruction are controlled by a same first read shifter among the plurality of first read shifters.

13. The microprocessor of claim 11, wherein

the functional unit includes a plurality of functional units, each of the plurality of functional units being configured to perform a predetermined operation, and

one of the plurality of first read shifters is dedicated to the functional unit configured to perform the predetermined operation.

14. The microprocessor of claim 13, wherein

one of the plurality of first read shifters is configured to be dedicated to the functional unit configured to perform the predetermined operation only when there is a valid operation in that functional unit.

15. A method of scheduling read ports of a register set in a microprocessor, comprising:

determining a first clock cycle based on a read time of an instruction to be issued;

determining availability of the read port in the first clock cycle based on a read value of a read shifter, wherein the read value is included in a shifter entry among a plurality of shifter entries and is shifted out of the plurality of shifter entries each clock cycle;

decoding and issuing the instruction when the read value of the read shifter indicates that the read port is available for the first clock cycle; and

stalling the instruction when the read value of the read shifter indicates that the read port is unavailable for the first clock cycle.

16. The method of scheduling read ports for register sets in a microprocessor of claim 15, further comprising:

accessing a read value of a shifter entry of the read shifter each clock cycle, wherein the read value of the shifter controls the read operation of the register bank.

17. The method of scheduling read ports for register sets in a microprocessor of claim 15, wherein

when the read value is a first predetermined value, the read port of the register set is available for the read operation in the first clock cycle; and

when the read value is a second predetermined value, the read port of the register set is unavailable for the read operation in the first clock cycle.

18. The method of scheduling read ports for register sets in a microprocessor of claim 15, further comprising:

setting the read value to a first predetermined value when the instruction is decoded and issued;

determining whether the read port of the register set is available in a second clock cycle following the first clock cycle; and

decoding and issuing the instruction when the read value of the read shifter indicates that the read port is available in the second clock cycle.

19. The method of scheduling read ports of a register set in a microprocessor of claim 15, wherein each of the plurality of shifter entries of the read shifter further comprises:

a register address value configured to store a register address among the plurality of register addresses for the read operation of the instruction.

20. The method of scheduling read ports for register sets in a microprocessor of claim 15, wherein

the instruction is a vector instruction including a plurality of micro-operations, and

the micro-operation is included in a shifter entry of the read shifter.

21. The method of scheduling read ports for register sets in a microprocessor of claim 15, wherein

the read shifter includes a plurality of first read shifters, each of the plurality of first read shifters corresponding to one of the plurality of first read ports,

the instruction is a vector instruction including a plurality of micro-operations, and

the plurality of micro-operations of the vector instruction are included in a same first read shifter among the plurality of first read shifters.

22. The method of scheduling read ports for register sets in a microprocessor of claim 15, further comprising:

determining whether the read values of the plurality of shifter entries are all a first predetermined value; and

stalling shifting out of the shifter entry when the read values of the plurality of shifter entries are all a second predetermined value.

23. The method of scheduling read ports for register sets in a microprocessor of claim 15, further comprising:

controlling a forwarding operation of at least one functional unit of the microprocessor according to forwarding control bits stored in each shifter entry of the read shifter, wherein each of the plurality of shifter entries of the read shifter further comprises:

a forwarding control bit configured to control a forwarding operation of the at least one functional unit, and

a functional unit value configured to indicate which functional unit is to forward the result data.

24. A data processing system comprising:

a microprocessor, wherein the microprocessor comprises:

a register bank including a read port;

a read shifter comprising a plurality of shifter entries, the read shifter configured to shift out a shifter entry among the plurality of shifter entries each clock cycle, wherein each of the plurality of shifter entries is associated with a clock cycle and each of the plurality of shifter entries comprises a read value, wherein the read value is to indicate whether the read port of the register set is available for a read operation in the clock cycle;

an instruction fetch unit configured to provide an instruction;

a decode unit coupled to the read shifter and the instruction fetch unit, the decode unit configured to decode the instruction based on the read values included in the plurality of shifter entries of the read shifter and to issue the instruction;

an execution queue coupled to the decode unit, the execution queue configured to store issued instructions in a plurality of queue entries of the execution queue;

at least one functional unit coupled to the execution queue and the register bank, the at least one functional unit configured to execute the issued instruction and use data read from the read port of the register bank;

a main memory coupled to the microprocessor;

a bus bridge coupled to the microprocessor; and

an input/output device coupled to the bus bridge.

25. The data processing system of claim 24, wherein the read shifter is configured to shift out a bottom shifter entry of the read shifter every clock cycle, wherein the shifted-out value of the shifter controls the read operation of the register bank.

26. The data processing system of claim 24, wherein

when the read value is a first predetermined value, the read port of the register set is unavailable for the read operation in the clock cycle; and

the read port of the register set is available for the read operation in the clock cycle when the read value is a second predetermined value.

27. The data processing system of claim 26, wherein

when a first instruction to be issued is configured to perform a first read operation in a first clock cycle and the read shifter indicates that the read port of the register bank is unavailable in the first clock cycle, the decode unit is configured to stall the issuance of the first instruction, and

when the first instruction to be issued is configured to perform the first read operation in the first clock cycle and the read shifter indicates that the read port of the register bank is available in the first clock cycle, the decode unit is configured to issue the first instruction and set the read value of the shifter entry associated with the first clock cycle to the first predetermined value.

28. The data processing system of claim 24, wherein the read shifter stalls shifting out of the shifter entry when the read values of the plurality of shifter entries are all a second predetermined value.

29. The data processing system of claim 27, wherein each of the plurality of shifter entries of the read shifter further comprises:

a register address value configured to store a register address among the plurality of register addresses for the read operation of the instruction; and

a forwarding control bit configured to control a forwarding operation of the at least one functional unit.

Technical Field

The present invention relates generally to a processor architecture, and more particularly, to a processor including a read shifter and a control method for controlling an instruction pipeline of the processor.

Background

One way to improve the performance of a processor is to use an instruction pipeline, where several instructions can be executed in parallel. Since the number of read ports included in a register set of a processor is limited, it is complicated to allocate a plurality of instructions issued in parallel to the read ports of the register set. For example, when several instructions to be executed in an instruction pipeline require more read ports than are available in the same clock cycle, a conflict may occur, resulting in stalling and interlocking of instructions in the instruction pipeline. Thus, the performance of the processor is degraded. The problem is made worse when a vector instruction containing multiple micro-operations is executed in the instruction pipeline.

With the demand for improving the performance of processors, there is an increasing need for processors and methods of controlling instruction pipelines that can effectively prevent stalling and interlocking of instructions.

Disclosure of Invention

In one embodiment, a microprocessor includes a register set, a read shifter, a decode unit, and a functional unit. The register set includes a read port. The read shifter includes a plurality of shifter entries and is configured to shift out a shifter entry among the plurality of shifter entries each clock cycle, wherein each of the plurality of shifter entries is associated with a clock cycle and each of the plurality of shifter entries includes a read value indicating availability of the read port of the register set for a read operation in the clock cycle. The decode unit is coupled to the read shifter and configured to decode an instruction and issue the instruction based on the read values contained in the plurality of shifter entries of the read shifter. The read shifter provides control to the functional unit for reading data from the register set. The functional unit is coupled to the decode unit and the register set and configured to execute the instruction issued by the decode unit and to use the data read out from the register set, the functional unit being controlled by the read shifter.

In one embodiment, read ports of a register set in a microprocessor are scheduled using at least the following steps. A first clock cycle is determined based on a read time of an instruction to be issued. The availability of the read port in the first clock cycle is determined based on a read value of the read shifter, wherein the read value is included in a shifter entry among a plurality of shifter entries for a read latency, and the plurality of shifter entries are shifted out each clock cycle. When the read value of the read shifter indicates that the read port is available in the first clock cycle, the instruction is decoded and issued. The instruction is stalled when the read value of the read shifter indicates that the read port is unavailable in the first clock cycle.

Drawings

Aspects of the invention are best understood from the following detailed description when read with the accompanying drawing figures. It should be noted that, according to standard practice in the industry, various features are not drawn to scale. In fact, the dimensions of the various features may be arbitrarily increased or reduced for clarity of discussion.

FIG. 1 is a block diagram illustrating a computer processing system according to some embodiments;

FIG. 2 is a block diagram illustrating a processor according to some embodiments;

FIGS. 3A-3B illustrate a read shifter according to some embodiments;

FIG. 3C illustrates a block diagram of a rotating buffer, according to some embodiments;

FIG. 4 is a flow diagram illustrating the issuance of a vector instruction based on a read shifter according to some embodiments;

FIG. 5 is a flow diagram illustrating the issuance of a vector instruction based on a read shifter according to some alternative embodiments;

FIG. 6 is a flow diagram illustrating the issuance of a vector instruction in the K-th clock cycle in accordance with some embodiments;

FIG. 7 is a flow diagram illustrating the issuance of vector instructions according to some alternative embodiments;

FIG. 8 is a flow diagram illustrating a method of controlling an instruction pipeline according to some embodiments.

Description of the reference numerals

10: computer processing system;

100: processor;

110, FUNIT_A, FUNIT_B, FUNIT_C: functional unit;

111: instruction;

120, 320: issue/decode unit;

121: signal;

130, 330a, 330b_0, 330b_1, 330b_2, 330b_3, rdx, rdy: read shifter;

140: register set;

150, A, B, C: execution queue;

151: issued instruction;

160: functional unit module;

170a, 170b: operand;

200: memory;

300: input/output interface;

330c: rotating buffer;

400: peripheral device;

500: bus;

D1: direction;

E[0], E[k], E[N]: shifter entry;

rdx[k], rdx[N:0], rd0[k], rd1[k], rd2[k], rd3[k]: read value;

RD_P0, RD_PK, RD_Px, RD_P1, RD_P2, RD_P3: read port;

REG_1, REG_M: register;

Rpt: read pointer;

s: step size value;

S410, S420, S430, S440, S450, S460, S510, S520, S530, S540, S550, S560, S610, S620, S630, S640, S650, S660, S710, S720, S730, S740, S750, S760, S810, S820, S830, S840: step;

V1, V2, V3, V4: vector instruction;

vrx[k], vr0[k], vr1[k], vr2[k], vr3[k]: register address value.

Detailed Description

The following disclosure provides many different embodiments, or examples, for implementing different features of the invention. Specific examples of components and arrangements are described below to simplify the present disclosure. Of course, these are merely examples and are not intended to be limiting. For example, in the following description, formation of a first feature over or on a second feature may include embodiments in which the first feature is formed in direct contact with the second feature, and may also include embodiments in which additional features may be formed between the first and second features, such that the first and second features may not be in direct contact. Additionally, the present disclosure may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.

Furthermore, spatially relative terms such as "below …," "below …," "lower," "above …," "upper," and the like may be used herein for ease of description to describe one element or feature's relationship to another element or feature as illustrated in the figures. Spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. The device may be otherwise oriented (rotated 90 degrees or at other orientations) and the spatially relative descriptors used herein interpreted accordingly.

Fig. 1 illustrates a computer processing system 10, which computer processing system 10 may include a processor 100, a memory 200, an input/output (I/O) interface 300, a peripheral device 400, and a bus 500, according to some embodiments. Bus 500 may allow for bi-directional communication among the components of computer processing system 10. Some embodiments of the invention may use more, fewer, or different components than those shown in fig. 1. As examples, the computer processing system 10 may also include a timer, a serial peripheral interface, a digital-to-analog converter, an analog-to-digital converter, a display driver, various types of memory, and any other suitable component.

In some embodiments, processor 100 is configured to execute instructions using an instruction pipeline, where execution of instructions is broken into stages including an instruction fetch stage, an instruction decode stage, an instruction execution stage, and a writeback stage. The processor 100 may include caches, such as a data cache and an instruction cache, that have relatively high access rates. The data cache of processor 100 may be a multi-level data cache, which may include an L1 data cache, an L2 data cache, and an L3 data cache. The L1 data cache, the L2 data cache, and the L3 data cache may be located internal or external to the processor 100. In some embodiments, computer processing system 10 may include multiple processors, and any number of processors may be the same as processor 100 or may be different from processor 100.

The memory 200 is configured to store program code for instructions and data required to execute the instructions. Memory 200 may include volatile memory, non-volatile memory, or a combination thereof. For example, the memory 200 may include at least one of a Random Access Memory (RAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Electrically Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), and a flash memory.

I/O interface 300 is configured to couple input devices and output devices to computer processing system 10 via bus 500. Computer processing system 10 may receive data from input devices and send data to output devices through I/O interface 300. The I/O interface 300 may include at least one of a serial interface and a parallel interface in a wired or wireless connection. The peripheral devices 400 may include a keyboard, mouse, sensors, signal receivers, monitors, and any other suitable devices.

Fig. 2 illustrates a block diagram of a processor 100, according to some embodiments. The processor 100 may include an instruction fetch unit 110, an issue/decode unit 120, a read shifter 130, a register set 140, an execution queue 150, and a functional unit module 160. Instruction fetch unit 110 is configured to provide instructions 111 for execution by processor 100. In some embodiments, the instruction 111 may be a vector instruction that includes multiple micro-operations with no operand, one operand, two operands, or any other number of operands.

The register set 140 may include a plurality of registers REG_1 to REG_M configured to store data of the register set 140. Registers REG_1 through REG_M may store operands or data for executing instructions. The number of registers REG_1 to REG_M may vary according to the architecture of the processor 100. In some embodiments, register set 140 further includes at least one read port RD_P0 to RD_PK from which data stored in registers REG_1 to REG_M is read in a read operation. In some embodiments, register set 140 may further include at least one write port (not shown) for writing data into register set 140.

The read shifter 130 may include a plurality of shifter entries that shift every clock cycle. Each shifter entry of read shifter 130 is associated with a clock cycle (e.g., the k-th clock cycle, where k is a positive integer), and each shifter entry includes a read value (e.g., read value rdx[k]) that indicates availability of a read port of register set 140 in the clock cycle. Read value rdx[k] may be a multi-bit value or a single-bit value indicating a logical value of "1" or a logical value of "0". For example, when read value rdx[k] is a logical value of "1," it indicates that the read port is not available for a read operation in the k-th clock cycle. In other words, another read operation is already scheduled to use the read port at the k-th clock cycle. Alternatively, when read value rdx[k] is a logical value of "0," it indicates that the read port is available for a read operation in the k-th clock cycle. In some embodiments, the read value of the read shifter 130 is set upon issuance of an instruction to be executed by the processor 100.
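The behavior described above can be illustrated with a minimal software sketch. It is not the hardware design of the embodiments: the class name `ReadShifter`, its method names, and the choice to shift in a "0" at the top entry are assumptions made only for illustration. Index k of the queue stands for read value rdx[k].

```python
from collections import deque


class ReadShifter:
    """Hypothetical software model of one read shifter (one per read port).

    rdx[k] == 1 means the read port is busy in the k-th future clock cycle,
    rdx[k] == 0 means it is available.
    """

    def __init__(self, num_entries):
        # All entries start as "0" (port available in every future cycle).
        self.rdx = deque([0] * num_entries)

    def is_available(self, k):
        # The port can be used in cycle k only if rdx[k] is "0".
        return self.rdx[k] == 0

    def shift(self):
        # One clock cycle passes: the bottom entry is shifted out and an
        # empty entry is shifted in at the top (assumption of this sketch).
        bottom = self.rdx.popleft()
        self.rdx.append(0)
        return bottom
```

In this sketch, an entry marked busy at index 2 reaches the bottom after two shifts and is shifted out on the third, which mirrors how each entry tracks one future clock cycle.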

Issue/decode unit 120 is coupled to instruction fetch unit 110 and read shifter 130 and is configured to decode and issue instruction 111 based on a read value stored in a shifter entry of read shifter 130. In some embodiments, when instruction 111 is provided to the issue/decode unit 120, the issue/decode unit 120 may determine whether to stall or issue an instruction 111 that includes a read operation in the k-th clock cycle based on the read value rdx[k] of the read shifter 130. When read value rdx[k] is "0," issue/decode unit 120 decodes instruction 111 and issues instruction 111 to execution queue 150. After the instruction 111 is issued, the read value rdx[k] of the read shifter 130 is set to "1" to prevent a conflict over the read port in the k-th clock cycle. When read value rdx[k] of read shifter 130 is "1," the read port is not available for the read operation of instruction 111 in the k-th clock cycle, so issue/decode unit 120 stalls the issue of instruction 111 to avoid a conflict at the read port. Issue/decode unit 120 may then check the availability of the read port again in the next clock cycle (e.g., the (k+1)-th clock cycle) by checking read value rdx[k] of read shifter 130. In some embodiments, issue/decode unit 120 communicates with read shifter 130 via signal 121.
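The issue-or-stall decision described above can be sketched as a single check-and-reserve step. This is only an illustration under stated assumptions: the function name `try_issue` is hypothetical, and the read shifter is modeled as a plain Python list whose index k holds rdx[k].

```python
def try_issue(rdx, k):
    """Decide whether an instruction that reads in cycle k may issue.

    rdx is a list of read values (hypothetical model of the read shifter).
    Returns True and reserves the port if rdx[k] is 0; returns False
    (stall) and leaves the shifter unchanged if rdx[k] is 1.
    """
    if rdx[k] == 1:
        return False  # port busy in cycle k: stall the instruction
    rdx[k] = 1        # reserve the port so later instructions see it busy
    return True
```

For example, with `rdx = [0, 0, 0, 0]`, a first call `try_issue(rdx, 2)` issues and marks entry 2 busy, and a second call with the same k stalls.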

In some embodiments, the execution queue 150 is coupled to the issue/decode unit 120 by a bus and is configured to arrange the issued instructions in a queue. Execution queue 150 may provide issued instruction 151 to functional unit module 160 for execution.

Functional unit module 160 may include a plurality of functional units FUNIT_A, FUNIT_B, and FUNIT_C configured to execute issued instructions provided by execution queue 150. In some embodiments, the functional unit module 160 may include an Arithmetic Logic Unit (ALU), an Address Generation Unit (AGU), a floating-point unit (FPU), a load-store unit (LSU), a Branch Execution Unit (BEU), and other functional units. In some embodiments, each functional unit has its own execution queue. In FIG. 2, execution queue 150 is composed of three execution queues A, B, and C, one for each of functional units FUNIT_A, FUNIT_B, and FUNIT_C, respectively. In some embodiments, an instruction 151 to be executed by a functional unit of the processor 100 has at least one operand obtained by a read operation using a read port of the register set 140. In some embodiments, the number of read ports required by instruction 151 is equal to the number of possible source operands for instruction 151. For example, an instruction 151 having two operands 170a and 170b requires two read ports of the register set 140. In some embodiments, instruction 151 is a vector instruction that includes multiple micro-operations, and each micro-operation has the same read port assigned to the same source operand of instruction 151. The control mechanism is simplified by assigning the same read port to the same source operand of each micro-operation of the same instruction.
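Because an instruction with two source operands needs two read ports, a natural way to model the check is to consult one read shifter per required port. The sketch below assumes an all-or-nothing reservation, which the text does not state explicitly; the function name `try_issue_all_ports` and the list-based shifter model are likewise hypothetical.

```python
def try_issue_all_ports(port_shifters, k):
    """Issue only if every required read port is free in cycle k.

    port_shifters holds one list of read values per source operand
    (hypothetical model; index k of each list is that port's rdx[k]).
    Assumption of this sketch: either all ports are reserved or none.
    """
    if any(rdx[k] == 1 for rdx in port_shifters):
        return False      # at least one port is busy: stall
    for rdx in port_shifters:
        rdx[k] = 1        # reserve every port for cycle k
    return True
```

Checking all ports before reserving any avoids leaving a partial reservation behind when the instruction has to stall.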

In some embodiments, read ports RD_P0 through RD_PK include a plurality of common read ports and a dedicated read port, where a common read port is shared by all functional units or a group of functional units of functional unit module 160, and the dedicated read port is dedicated to a functional unit having an unknown read time. In some embodiments, the dedicated read port may be changed to a common read port based on some indication. For example, processor 100 may further include a load and store unit (not shown) that may access register set 140 directly through the dedicated read port by providing a valid signal and a register address. The load and store unit of processor 100 may provide an indication of when a register read is to be performed and an indication of when no store instruction is present in execution queue 150. The dedicated read port may be controlled to act as a common read port based on these two indications. The read shifter 130 may include store bits (not shown) that indicate pending store instructions. Completing all store instructions in the load and store unit of processor 100 clears the store bits in read shifter 130, at which point the dedicated read port becomes available to the other functional units of functional unit module 160. The control mechanism of processor 100 is simplified because instructions with an unknown read time (e.g., store instructions) may be assigned to the dedicated read port. In addition, since the dedicated read port can be shared in some cases, the flexibility of the control mechanism is improved.

FIG. 3A illustrates a read shifter 330a associated with a read port RD_Px (as shown in register set 140 of FIG. 2) according to some embodiments. The read port RD_Px may be any one of the read ports RD_P0 to RD_PK. The read shifter 330a may include a plurality of shifter entries E[0] to E[N] that are shifted in the direction D1 each clock cycle. In some embodiments, the number of shifter entries in the read shifter 330a is determined based on the maximum number of instructions in the pipeline. Shifter entry E[0] is referred to as the bottom shifter entry and shifter entry E[N] is referred to as the top shifter entry. The read shifter 330a may shift out the value of the bottom shifter entry E[0] every clock cycle. In some embodiments, shifter entry E[k] among shifter entries E[0] through E[N] is associated with the k-th clock cycle, and shifter entry E[k] may include a read value rdx[k] and a register address value vrx[k]. For example, in FIG. 3A, shifter entry E[0] includes a read value rdx[0] of "1" and a register address value vrx[0] of "v5". Shifter entry E[N] contains a read value rdx[N] of "0" and a blank register address value vrx[N]. In some embodiments, if there are no valid read values (e.g., when read values rdx[N:0] are all "0"), the read shifter does not shift at all. In some embodiments, the read time of a store instruction is not known until later during execution. A store instruction in the store execution pipeline may send a request to the read shifter to use the dedicated read port. A valid read value rdx[k] in the read shifter has higher priority in using the read port, so the store instruction must keep asserting the read request until the read value rdx[k] is "0" in order to get valid store data from the read port. Since store instructions have a lower performance impact on the microprocessor, delayed reading of store data is not an issue.

Read value rdx[k] of read shifter 330a may indicate the availability of read port RD_Px in the k-th clock cycle, and the register address value vrx[k] of read shifter 330a may indicate the address of the register from which the data is read. For example, when read value rdx[k] of read shifter 330a is a first predetermined value (e.g., "1"), read port RD_Px is busy in the k-th clock cycle. When read value rdx[k] of read shifter 330a is a second predetermined value (e.g., "0"), read port RD_Px is available for use in the k-th clock cycle.
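
The entry layout and the per-cycle shift described above can be modeled with a minimal Python sketch. The names ShifterEntry and ReadShifter and the method names are hypothetical. The sketch assumes "1" means busy and "0" means available, as in the example values, and it does not model the optional embodiment in which the shifter stops shifting when all read values are "0".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ShifterEntry:
    rd: int = 0               # read value: 1 = port busy, 0 = port available
    vr: Optional[str] = None  # register address value, None when blank


class ReadShifter:
    """Hypothetical model of one read shifter tied to one read port."""

    def __init__(self, n_entries: int) -> None:
        self.entries: List[ShifterEntry] = [ShifterEntry() for _ in range(n_entries)]

    def is_available(self, k: int) -> bool:
        """True when the read port is free in the k-th clock cycle."""
        return self.entries[k].rd == 0

    def reserve(self, k: int, reg: str) -> None:
        """Mark the port busy in the k-th cycle for a read of register reg."""
        self.entries[k] = ShifterEntry(1, reg)

    def shift(self) -> ShifterEntry:
        """Shift out the bottom entry E[0] and add a blank top entry."""
        bottom = self.entries.pop(0)
        self.entries.append(ShifterEntry())
        return bottom


shifter = ReadShifter(8)
shifter.reserve(0, "v5")
shifter.reserve(2, "v7")
first = shifter.shift()  # the entry for the current cycle drives the read port
```

After one shift, the reservation that was at index 2 sits at index 1, so the same index always means "that many cycles from now".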

In some embodiments, issue/decode unit 320 may control the issue of vector instructions based on the read values stored in shifter entries E[0] to E[N] of read shifter 330a. For example, if a vector instruction to be issued includes a read operation in the k-th clock cycle using read port RD_Px, issue/decode unit 320 may stall or issue the vector instruction based on read value rdx[k] in read shifter 330a. When the read value rdx[k] in read shifter 330a is "0", issue/decode unit 320 may issue the vector instruction. When the read value rdx[k] in the read shifter 330a is "1", issue/decode unit 320 may stall the vector instruction and check the read value rdx[k] again in the next clock cycle (which then corresponds to the original (k+1)-th clock cycle). Because the read shifter shifts every clock cycle, the stalled instruction checks the same entry position rdx[k] each clock cycle until there is no longer a conflict, at which point the instruction is sent to the execution queue 150 in FIG. 2. In this way, a conflict in the use of read port RD_Px is prevented by simply checking the read values stored in read shifter 330a. The pipeline control mechanism is simplified by using the read shifter 330a. In some embodiments, the number of entries in the read shifter is based on the number of instructions that can be in the execution queue and the functional units at one time. For example, if the read shifter has 32 entries, the read shifter can track read port availability for the instructions in the execution queue 150 and the functional units 160 up to 32 cycles into the future.
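
The issue-or-stall decision for a single read in the k-th cycle can be sketched in Python as follows. The read values are modeled as a plain list of integers, and the function names try_issue and shift are hypothetical. The loop stands in for successive clock cycles.

```python
from typing import List


def try_issue(rdx: List[int], k: int) -> bool:
    """Issue if the port is free in cycle k and reserve it, otherwise stall."""
    if rdx[k] == 1:
        return False  # conflict: stall and retry in the next cycle
    rdx[k] = 1        # reserve the port for this instruction
    return True


def shift(rdx: List[int]) -> None:
    """One clock cycle: drop the bottom entry and add a free top entry."""
    rdx.pop(0)
    rdx.append(0)


rdx = [1, 0, 1, 1, 0, 0, 0, 0]
stalls = 0
while not try_issue(rdx, 2):
    shift(rdx)   # the same index k now refers to one cycle later
    stalls += 1
```

With these illustrative values the instruction stalls for two cycles, because entries 2 and 3 are initially busy, and then issues.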

In some embodiments, each of shifter entries E[0] through E[N] of read shifter 330a may further store forwarding information (e.g., forwarding bits) configured to control the forwarding operations of functional units FUNIT_A, FUNIT_B, and FUNIT_C in functional unit module 160. For example, when the forwarding information of a shifter entry indicates that forwarding is allowed, functional unit module 160 or logic of microprocessor 100 may be configured to forward result data from the functional unit of a previous instruction to the functional unit of a subsequent instruction. The functional unit of the previous instruction may be the same as or different from the functional unit of the subsequent instruction. Thus, the functional unit of the subsequent instruction may obtain the result data directly from the functional unit of the previous instruction rather than from register set 140. The performance of microprocessor 100 is improved because, with data forwarding, the functional unit of the subsequent instruction may obtain the result data one clock cycle earlier than it would without forwarding. Since the forwarding information of the issued instruction is contained in the read shifter 130, all register set read port control is centralized in the read shifter 130, which simplifies the pipeline control mechanism of microprocessor 100.

FIG. 3B illustrates read shifters 330b_0 through 330b_3 associated with read ports RD_P0 through RD_P3 (as shown in register set 140 of FIG. 2), according to some embodiments. Each of read shifters 330b_0 through 330b_3 in FIG. 3B has a structure and components similar to those of the read shifter 330a shown in FIG. 3A. Thus, each of read shifters 330b_0 through 330b_3 may include a plurality of shifter entries E[0] through E[N] that are shifted in the direction D1 each clock cycle. Each shifter entry E[k] of read shifter 330b_0 may include a read value rd0[k] and a register address value vr0[k]; each shifter entry E[k] of read shifter 330b_1 may include a read value rd1[k] and a register address value vr1[k]; each shifter entry E[k] of read shifter 330b_2 may include a read value rd2[k] and a register address value vr2[k]; and each shifter entry E[k] of read shifter 330b_3 may include a read value rd3[k] and a register address value vr3[k]. The availability of read ports RD_P0 through RD_P3 may be determined based on shifter entries E[0] through E[N] of read shifters 330b_0 through 330b_3, respectively. In some embodiments, the issue/decode unit 320 controls the issue of vector instructions based on the values stored in read shifters 330b_0 through 330b_3. The issue/decode unit 320 may check the read values stored in read shifters 330b_0 through 330b_3 to determine whether to issue or stall a vector instruction and to determine which read port among read ports RD_P0 through RD_P3 to use during execution of the vector instruction. The number of read ports is chosen to minimize the number of read port conflicts for optimal performance, power, and area for the application.
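
Choosing a read port from several read shifters can be sketched in Python as follows. The text does not state a selection policy, so picking the lowest-numbered free port is an assumption, and the function name select_port is hypothetical.

```python
from typing import List, Optional


def select_port(read_values: List[List[int]], k: int) -> Optional[int]:
    """Return the index of a read port that is free in cycle k, or None.

    read_values[p][k] models rdp[k] for port RD_Pp (1 = busy, 0 = free).
    Assumption: the lowest-numbered free port is chosen.
    """
    for port, rd in enumerate(read_values):
        if rd[k] == 0:
            return port
    return None  # every port is busy in cycle k: the instruction stalls


read_values = [
    [0, 1, 0, 1],  # rd0
    [0, 1, 1, 1],  # rd1
    [1, 0, 0, 1],  # rd2
]
port_for_cycle_1 = select_port(read_values, 1)
port_for_cycle_2 = select_port(read_values, 2)
port_for_cycle_3 = select_port(read_values, 3)
```

A result of None corresponds to the stall case, in which the check is repeated in the next clock cycle.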

In some embodiments, the same read port is used for all micro-operations of the same vector instruction. In some embodiments, two read ports are used for vector instructions that require two operands. In some embodiments, at least one read port among read ports RD_P0 through RD_P3 is a dedicated read port and the remaining read ports are common read ports. The dedicated read port is dedicated to vector instructions that contain store instructions whose read time is unknown. In some embodiments, the read port RD_P3 associated with read shifter 330b_3 is a dedicated read port and the other read ports RD_P0 through RD_P2 associated with read shifters 330b_0 through 330b_2 are common read ports.

In some embodiments, the read shifter 330a and the read shifters 330b_0 through 330b_3 shown in FIG. 3A and FIG. 3B may be implemented by a rotating buffer to save power. FIG. 3C is a block diagram illustrating a rotating buffer 330c according to some embodiments. The rotating buffer 330c may include a plurality of buffer entries E[0] to E[N] configured to store the values of a read shifter. In some embodiments, the rotating buffer 330c includes a read pointer Rpt, where the read pointer Rpt may indicate the address of the buffer entry from which the stored value is read out, the address corresponding to read shifter entry E[0] in FIG. 3A. As Rpt is incremented, entry E[0] becomes entry E[N] and its rdx value is reset to "0" to indicate that E[N] is currently available. In some embodiments, one buffer entry of the rotating buffer 330c is read out every clock cycle, and the read pointer Rpt may move every clock cycle to indicate the address of the buffer entry to be shifted out. In some embodiments, Rpt is not incremented if rdx[N:0] are all "0", which indicates that no data will be read from the register bank at any time in the future.

The shift operation on the read shifter is power intensive because all shifter entries must be updated with new values every clock cycle. When the rotating buffer 330c is used as a read shifter, the read pointer Rpt removes the need to update all shifter entries of the read shifter in each clock cycle. Thus, the power consumption of a read shifter implemented by the rotating buffer is reduced.
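
The rotating buffer can be sketched in Python as follows. Logical entry E[k] is mapped to the physical slot (Rpt + k) modulo the buffer size, so advancing one cycle touches a single slot instead of rewriting every entry. The class name RotatingReadShifter and its methods are hypothetical, and the optional embodiment in which Rpt is held when all read values are "0" is not modeled.

```python
from typing import List


class RotatingReadShifter:
    """Hypothetical read shifter backed by a rotating buffer with pointer Rpt."""

    def __init__(self, size: int) -> None:
        self.rd: List[int] = [0] * size
        self.rpt = 0  # read pointer: physical slot of logical entry E[0]

    def _slot(self, k: int) -> int:
        return (self.rpt + k) % len(self.rd)

    def peek(self, k: int) -> int:
        """Read value of logical entry E[k]."""
        return self.rd[self._slot(k)]

    def reserve(self, k: int) -> None:
        """Mark the port busy in logical entry E[k]."""
        self.rd[self._slot(k)] = 1

    def advance(self) -> int:
        """One clock cycle: read out E[0], recycle its slot as E[N], move Rpt."""
        out = self.rd[self.rpt]
        self.rd[self.rpt] = 0  # the slot becomes the new, available top entry
        self.rpt = (self.rpt + 1) % len(self.rd)
        return out


buf = RotatingReadShifter(4)
buf.reserve(0)
buf.reserve(2)
out0 = buf.advance()
moved = buf.peek(1)     # the reservation formerly at E[2]
recycled = buf.peek(3)  # the former E[0], now the top entry
```

Only one slot and the pointer change per cycle in this model, which is the source of the power saving described above.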

FIG. 4 is a flow diagram illustrating the issue of a vector instruction V1 based on a read shifter (e.g., read shifter 130 in FIG. 2), according to some embodiments. The vector instruction V1 may include m micro-operations, each having one source operand. One micro-operation is performed in each clock cycle of the m cycles, and in each clock cycle a source operand is read from the register bank 140 of FIG. 2. In some embodiments, operands are read out of a register bank (e.g., register bank 140 in FIG. 2) through read port RD_Px. In step S410, vector instruction V1 is provided to an issue/decode unit (e.g., issue/decode unit 120 in FIG. 2). In step S420, the issue/decode unit may determine whether any of the read values rdx[m:1] is a first predetermined value (e.g., "1"). When none of the read values rdx[m:1] is the first predetermined value, in steps S450 and S460, the vector instruction V1 is issued and all the read values rdx[m:1] are set to the first predetermined value (e.g., "1"). When at least one of the read values rdx[m:1] is the first predetermined value (e.g., "1"), the issue/decode unit stalls the vector instruction V1 in steps S430 and S440, and checks the read values rdx[m:1] again in the next clock cycle to determine whether the vector instruction V1 is issued or stalled in that clock cycle. The issue/decode unit checks the same read values rdx[m:1] every clock cycle until the read port becomes available, and then issues the instruction to the execution queue 150. Instructions are sent from execution queue 150 to functional units 160 each cycle through bus 151 of FIG. 2.
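
The flow of FIG. 4 can be sketched in Python as follows. The read values are modeled as a list in which index i stands for rdx[i], and the function names try_issue_vector and shift are hypothetical. The starting values are illustrative only.

```python
from typing import List


def try_issue_vector(rdx: List[int], m: int) -> bool:
    """Check rdx[m:1]; issue and reserve all m entries, or stall."""
    window = range(1, m + 1)
    if any(rdx[i] == 1 for i in window):
        return False  # steps S430/S440: stall, retry next cycle
    for i in window:
        rdx[i] = 1    # steps S450/S460: issue and mark the port busy
    return True


def shift(rdx: List[int]) -> None:
    """One clock cycle: drop the bottom entry and add a free top entry."""
    rdx.pop(0)
    rdx.append(0)


rdx = [0, 0, 0, 1, 0, 0, 0, 0]
m = 3
stalls = 0
while not try_issue_vector(rdx, m):
    shift(rdx)
    stalls += 1
```

Here the single busy entry has to shift out of the window rdx[3:1] before the three micro-operations can reserve consecutive cycles.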

FIG. 5 is a flow diagram illustrating the issue, based on read shifters, of a vector instruction V2 having multiple micro-operations, each of which has multiple source operands, according to some embodiments. In some embodiments, each source operand is read from a read port of a register bank (e.g., register bank 140 in FIG. 2). Thus, when the vector instruction V2 has m micro-operations and two operands, two read shifters (e.g., read shifter rdx and read shifter rdy) associated with two read ports of the register set are used to control the issue of the vector instruction V2.

In step S510, a vector instruction V2 is provided. In step S520, the issue/decode unit may determine whether any of the read values rdx[m:1] and the read values rdy[m:1] is a first predetermined value (e.g., "1"). In other words, the issue/decode unit may check the read values in the m consecutive shifter entries of both read shifter rdx and read shifter rdy. When none of the read values rdx[m:1] and rdy[m:1] is the first predetermined value, in steps S550 and S560, the vector instruction V2 is issued and all the read values rdx[m:1] and rdy[m:1] are set to the first predetermined value (e.g., "1"). When at least one of the read values rdx[m:1] and rdy[m:1] is the first predetermined value, in steps S530 and S540, the issue/decode unit stalls the vector instruction V2 and checks the read values rdx[m:1] and rdy[m:1] again in the next clock cycle to determine whether the vector instruction V2 is issued or stalled in that clock cycle. The issue/decode unit checks the same read values rdx[m:1] and rdy[m:1] every clock cycle until both read ports become available, and then issues the instruction to the execution queue 150.
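
The two-operand check of FIG. 5 can be sketched in Python as follows. Both read shifters are modeled as lists of read values, and the function names are hypothetical. The instruction issues only when the same m entries are free in both shifters.

```python
from typing import List


def try_issue_two_operands(rdx: List[int], rdy: List[int], m: int) -> bool:
    """Check rdx[m:1] and rdy[m:1] together; reserve both or stall."""
    window = range(1, m + 1)
    if any(rdx[i] == 1 or rdy[i] == 1 for i in window):
        return False
    for i in window:
        rdx[i] = 1
        rdy[i] = 1
    return True


def shift(rd: List[int]) -> None:
    """One clock cycle: drop the bottom entry and add a free top entry."""
    rd.pop(0)
    rd.append(0)


rdx = [0, 0, 0, 0, 0, 0]
rdy = [0, 0, 1, 0, 0, 0]
results = []
for _ in range(3):  # three successive clock cycles
    issued = try_issue_two_operands(rdx, rdy, 2)
    results.append(issued)
    if issued:
        break
    shift(rdx)
    shift(rdy)
```

In this example read shifter rdx is free throughout, yet the instruction still waits two cycles because the second port is busy.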

FIG. 6 illustrates the issue of a vector instruction V3 with a read time in the k-th clock cycle based on a read shifter (e.g., read shifter rdx), according to some embodiments. In step S610, the vector instruction V3 may have m micro-operations, where each micro-operation has one source operand. Vector instruction V3 may be issued by issue/decode unit 120 to execution queue 150 of FIG. 2, but is sent from execution queue 150 to functional units 160 of FIG. 2 in the k-th clock cycle, when read shifter 130 controls the read ports of register set 140 to provide source operand data to functional units 160. In step S620, the issue/decode unit may determine whether any of the read values rdx[(m+k):(1+k)] of the plurality of consecutive shifter entries of the read shifter rdx is a first predetermined value (e.g., "1"). When none of the read values rdx[(m+k):(1+k)] is the first predetermined value, in steps S650 and S660, the vector instruction V3 is issued and the read values rdx[(m+k):(1+k)] are set to the first predetermined value (e.g., "1"). When at least one of the read values rdx[(m+k):(1+k)] is the first predetermined value, the issue/decode unit stalls the vector instruction V3 in steps S630 and S640, and checks the read values rdx[(m+k):(1+k)] again in the next clock cycle to determine whether the vector instruction V3 is issued or stalled in that clock cycle. The issue/decode unit checks the same read values rdx[(m+k):(1+k)] every clock cycle until the read port becomes available, and then issues the instruction to the execution queue 150.
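
The offset window rdx[(m+k):(1+k)] of FIG. 6 can be sketched in Python as follows. The function names window_indices and try_issue_delayed are hypothetical, and the values of m and k are illustrative.

```python
from typing import List


def window_indices(m: int, k: int) -> List[int]:
    """Entry indices 1+k through m+k checked for an instruction delayed by k."""
    return list(range(1 + k, m + k + 1))


def try_issue_delayed(rdx: List[int], m: int, k: int) -> bool:
    """Issue and reserve the delayed window if it is free, otherwise stall."""
    idx = window_indices(m, k)
    if any(rdx[i] == 1 for i in idx):
        return False
    for i in idx:
        rdx[i] = 1
    return True


busy_inside = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]   # entry 4 lies inside the window
busy_outside = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # entry 1 lies before the window
stalled = not try_issue_delayed(busy_inside, 3, 2)
issued = try_issue_delayed(busy_outside, 3, 2)
```

A busy entry before the window does not block the instruction, because its reads only begin after the delay of k cycles.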

FIG. 7 illustrates the issue of a vector instruction V4 based on a read shifter (e.g., read shifter rdx), in accordance with some embodiments. The read cycles of the m micro-operations are not consecutive but are separated by a stride of s. In this case, the read values are rdx[m*s+1, (m-1)*s+1, (m-2)*s+1, …, 2s+1, s+1, 1]. In step S710, the vector instruction V4 may have m micro-operations, where each micro-operation has one source operand. In step S720, the issue/decode unit may determine whether any of the read values rdx[k1, k2, k3, …, km] in the plurality of non-consecutive shifter entries of the read shifter rdx is a first predetermined value (e.g., "1"), where k1, k2, k3, …, km are positive integers. When none of the read values rdx[k1, k2, k3, …, km] is the first predetermined value, in steps S750 and S760, the vector instruction V4 is issued and the read values rdx[k1, k2, k3, …, km] are set to the first predetermined value (e.g., "1"). When at least one of the read values rdx[k1, k2, k3, …, km] is the first predetermined value, the issue/decode unit stalls the vector instruction V4 in steps S730 and S740, and checks the read values rdx[k1, k2, k3, …, km] again in the next clock cycle to determine whether the vector instruction V4 is issued or stalled in that clock cycle. The issue/decode unit may check the read values rdx[k1, k2, k3, …, km] of the read shifter rdx in subsequent clock cycles until the vector instruction V4 is issued. Note that FIG. 7 is the same as FIG. 4 when s is 0.
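
The check over non-consecutive entries rdx[k1, k2, …, km] of FIG. 7 can be sketched in Python as follows. The function name try_issue_at is hypothetical, and the index lists are passed in explicitly as illustrative values rather than derived from the stride.

```python
from typing import List, Sequence


def try_issue_at(rdx: List[int], indices: Sequence[int]) -> bool:
    """Issue and reserve the listed entries if all are free, otherwise stall."""
    if any(rdx[i] == 1 for i in indices):
        return False
    for i in indices:
        rdx[i] = 1
    return True


rdx = [0, 0, 1, 0, 1, 0, 0, 0]
first_issued = try_issue_at(rdx, [1, 3, 5])   # fits between the busy entries
second_issued = try_issue_at(rdx, [2, 4, 6])  # entries 2 and 4 are already busy
```

Entries that fall between the requested indices are ignored, so an instruction with strided reads can interleave with reservations that are already in place.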

FIG. 8 illustrates a method of controlling an instruction pipeline, according to some embodiments. In step S810, a first clock cycle is determined based on the read time of the instruction to be issued. In step S820, the availability of the read port in the first clock cycle is determined based on a read value of the read shifter, wherein the read value is included in a shifter entry among a plurality of shifter entries of the read shifter, and one of the plurality of shifter entries is shifted out each clock cycle. In step S830, when the read value of the read shifter indicates that the read port is available in the first clock cycle, the instruction is decoded and issued. In step S840, the instruction is stalled when the read value of the read shifter indicates that the read port is unavailable in the first clock cycle.

According to some embodiments of the present invention, a processor is presented that includes a register set, a read shifter, a decode unit, and a plurality of functional units. The register set includes a read port. The read shifter includes a plurality of shifter entries and is configured to shift out a shifter entry among the plurality of shifter entries each clock cycle. Each of the plurality of shifter entries is associated with a clock cycle, and each of the plurality of shifter entries includes a read value indicating the availability of the read port of the register set for a read operation in that clock cycle. The decode unit is coupled to the read shifter and configured to decode and issue an instruction based on the read values contained in the plurality of shifter entries of the read shifter. The plurality of functional units are coupled to the decode unit and the register set and configured to execute instructions issued by the decode unit and to perform read operations on the read port of the register set.

According to some embodiments of the invention, a method of controlling an instruction pipeline is presented. The method comprises the following steps: determining a first clock cycle based on a read time of an instruction to be issued; determining availability of a read port in the first clock cycle based on a read value of a read shifter, wherein the read value is included in a shifter entry among a plurality of shifter entries of the read shifter, and one of the plurality of shifter entries is shifted out each clock cycle; decoding and issuing the instruction when the read value of the read shifter indicates that the read port is available in the first clock cycle; and stalling the instruction when the read value of the read shifter indicates that the read port is unavailable in the first clock cycle.

The foregoing has outlined features of several embodiments so that those skilled in the art may better understand the detailed description that follows. Those skilled in the art should appreciate that they may readily use the present disclosure as a basis for designing or modifying other processes and structures for carrying out the same purposes and/or achieving the same advantages of the embodiments introduced herein. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the present disclosure, and that they may make various changes, substitutions, and alterations herein without departing from the spirit and scope of the present disclosure.
