Configurable processor supporting variable-length vector processing and implementation method thereof

Document No.: 189980 | Publication date: 2021-11-02

Reading note: This technology, "Configurable processor supporting variable-length vector processing and implementation method thereof", was designed by 雷国庆, 隋兵才, 王俊辉, 郭维, 郭辉, 邓全, 孙彩霞, 倪晓强, 郑重, 黄立波 and 王永文 on 2021-07-30. Main content: The invention discloses a configurable processor supporting variable-length vector processing and an implementation method thereof. The configurable processor comprises an instruction fetch unit, an instruction decode and rename unit, an instruction sequencing and dispatch unit, an instruction issue and execution unit, a register file, a high-speed instruction and data cache, and a configurable unit for configuring, initially or in real time, an option indicating whether variable-length vector processing is supported. The instruction decode and rename unit and the instruction issue and execution unit both contain reused execution units that handle variable-length as well as fixed-length vector processing. The invention supports variable-length vector processing, reuses the data path of an existing general-purpose microprocessor, minimizes design complexity and implementation cost, balances configurability and flexibility, and provides sufficient configurability without significantly increasing chip area.

1. A configurable processor supporting variable-length vector processing, comprising an instruction fetch unit, an instruction decode and rename unit, an instruction sequencing and dispatch unit, an instruction issue and execution unit, a register file, and a high-speed instruction and data cache, wherein the instruction fetch unit, the instruction decode and rename unit, the instruction sequencing and dispatch unit, the instruction issue and execution unit and the register file are connected in sequence, and the high-speed instruction and data cache is connected to the instruction fetch unit and to the instruction issue and execution unit respectively; the processor further comprises a configurable unit for configuring, initially or in real time, an option indicating whether variable-length vector processing is supported; the instruction decode and rename unit comprises a variable-length vector splitting module which, when the option in the configurable unit is configured to support variable-length vector processing, splits variable-length vector processing operations so that the execution unit for existing fixed-length vector processing is reused; the instruction issue and execution unit comprises a vector issue queue and a vector execution unit for executing vector instructions, and the vector execution unit supports partial or full execution of fixed-length vectors and variable-length vectors and, for a variable-length vector, completes its execution in one or more passes.

2. The configurable processor supporting variable-length vector processing according to claim 1, wherein the configurable unit is further used for configuring, initially or in real time, a branch prediction option, the branch prediction option comprising a static branch prediction mode and a dynamic branch prediction mode; when the instruction fetch unit fetches instructions from the high-speed instruction and data cache, if the branch prediction option is configured as the static branch prediction mode, the instruction fetch unit fetches further instructions to be executed from the high-speed instruction and data cache according to the static prediction address carried in the instruction; if the branch prediction option is configured as the dynamic branch prediction mode, the instruction fetch unit fetches further instructions to be executed from the high-speed instruction and data cache according to the dynamic prediction address carried in the instruction.

3. The configurable processor supporting variable-length vector processing according to claim 2, wherein the configurable unit is further used for configuring, initially or in real time, the fetch width M; when fetching instructions from the high-speed instruction and data cache, the instruction fetch unit fetches M instructions at a time and stores them in a local cache.

4. The configurable processor supporting variable-length vector processing according to claim 3, wherein the configurable unit is further used for configuring, initially or in real time, the decode width N; when decoding, the instruction decode and rename unit fetches N microinstructions at a time from the local pre-decode microinstruction cache for decoding.

5. The configurable processor of claim 4, wherein the configurable unit is further configured to configure an instruction dispatch width P initially or in real time; the instruction sequencing and dispatching unit acquires at most P instructions from the local to-be-dispatched cache for dispatching each time when the instruction dispatching is carried out.

6. The configurable processor supporting variable-length vector processing according to claim 5, wherein the configurable unit is further used for configuring, initially or in real time, out-of-order resource sizes, the out-of-order resource sizes comprising an issue queue size, a reorder buffer size and a register file size; the instruction issue and execution unit stores instructions to be issued according to the issue queue size configured by the configurable unit, and the instruction sequencing and dispatch unit performs reordering according to the reorder buffer size configured by the configurable unit.

7. The configurable processor supporting variable-length vector processing according to claim 6, wherein the instruction issue and execution unit comprises two parts, issue queues and execution units; the issue queues comprise a branch issue queue, a single-cycle integer issue queue, a multi-cycle integer issue queue, a vector/floating-point issue queue and a memory-access issue queue; the execution units comprise a branch execution unit, a single-cycle integer execution unit, a multi-cycle integer execution unit, a vector/floating-point execution unit and a memory-access execution unit; the vector issue queue is the vector/floating-point issue queue, the vector execution unit is the vector/floating-point execution unit, and each type of instruction is selected for issue and executed independently.

8. The configurable processor of claim 7, wherein the register files comprise an integer register file, a vector/floating point register file, a flag bit register file, a predicate register file, and a system register file.

9. A method for implementing the configurable processor supporting variable-length vector processing according to any one of claims 1 to 8, comprising: while an application program is running, detecting in real time whether the application program contains an instruction identifier indicating a variable-length vector processing requirement; if such an identifier exists, configuring, through the configurable unit, the option indicating whether variable-length vector processing is supported to support variable-length vector processing, specifying the maximum supported vector length, and entering a variable-length vector processing mode; otherwise, configuring the option through the configurable unit to not support variable-length vector processing, and entering a conventional fixed-length vector processing mode; in the variable-length vector processing mode, the variable-length vector splitting module in the instruction decode and rename unit splits each variable-length vector processing operation so that it is mapped onto the execution unit for existing fixed-length vector processing, wherein the splitting specifically means that, in the decode stage, the variable-length vector processing operation is split as required according to the relationship between the length L of the variable-length vector and a preset fixed vector length CL, so that the variable-length vector is split into one or more instructions of the fixed vector length CL; and the vector execution unit of the instruction issue and execution unit completes execution of the variable-length vector in one or more passes, thereby reusing the vector execution unit for existing fixed-length vector processing.

10. The method of claim 9, wherein the step of executing the instructions in the variable length vector processing mode comprises:

s1) instruction fetch: the instruction fetch unit fetches M instructions at a time from the high-speed instruction cache according to the configured fetch width M and stores them in a local cache; if static or dynamic branch prediction is configured, the instruction prefetcher fetches further instructions to be executed from the static or dynamic prediction address;

s2) instruction decode and rename: the instruction decode and rename unit first pre-decodes one or more instructions at a time, splitting each instruction into one or more microinstructions according to the microinstruction splitting scheme, assigns a global id to each resulting microinstruction, and stores the microinstructions in a local pre-decode microinstruction cache; it then fetches N microinstructions at a time from the local pre-decode microinstruction cache according to the configured decode width N, decodes them, and renames their register operands;

s3) instruction sequencing and dispatch: the instruction sequencing and dispatch unit takes at most P instructions at a time from the local to-be-dispatched cache according to the configured dispatch width P and dispatches them to the instruction issue and execution unit according to their types; for every microinstruction that has been assigned a unique id, it allocates an entry in the reorder buffer to manage the life-cycle state of that microinstruction; when the microinstruction with a given id finishes in the execution unit, a completion status is returned to the instruction sequencing and dispatch unit and the state of the microinstruction is changed to 'to be committed'; when a run of microinstructions with consecutive ids are all in the 'to be committed' state, they are committed together as a group;

s4) instruction issue and execution: the instruction issue and execution unit handles dispatched instructions differently according to their type: a branch instruction from the instruction sequencing and dispatch unit is buffered in the branch issue queue and, once issued, is executed in the branch execution unit; its execution result comprises the PC address of the next instruction to be executed and an indication of whether the branch prediction failed; a single-cycle instruction from the instruction sequencing and dispatch unit is buffered in the single-cycle integer issue queue and, once issued, is executed in the single-cycle integer execution unit; its execution result comprises modifying a destination integer register or a flag register; a multi-cycle instruction from the instruction sequencing and dispatch unit is buffered in the multi-cycle integer issue queue and, once issued, is executed in the multi-cycle integer execution unit; its execution result comprises modifying a destination integer register, a flag register or a destination system register; a vector/floating-point instruction from the instruction sequencing and dispatch unit is buffered in the vector/floating-point issue queue and, once issued, is executed in the vector/floating-point execution unit; its execution result comprises modifying a destination vector/floating-point register or a flag register; a memory-access instruction from the instruction sequencing and dispatch unit is buffered in the memory-access issue queue and, once issued, is executed in the memory-access execution unit; its execution behavior is either loading data from memory or writing data to memory, and the corresponding results are that data is loaded into a destination register or stored to memory.

Technical Field

The invention relates to a design and implementation technology of a microprocessor, in particular to a configurable processor supporting variable-length vector processing and an implementation method thereof.

Background

Vector processing is one of the main forms of data processing performed by microprocessors today. Mainstream microprocessor vendors have all introduced vector processing instructions and processors aimed at high-performance computing. For example, Intel introduced the advanced vector instruction set AVX-512 in 2016, which is mainly used in server products such as Xeon Phi; AMD earlier released SSE5, an extended instruction set based on the x86 architecture, primarily to strengthen its high-performance computing capability; and ARM successively introduced the Neon and SVE instruction sets, accelerating its move into high-performance computing.

ARM SVE is a typical variable-length vector instruction set aimed at high-performance computing. It supports vector lengths of up to 2048 bits, where the implemented length is an integer multiple of 128 bits. A notable feature of SVE is that programmers can write vector code without knowing the vector length implemented by the hardware, and the compiled executable can run on processors that implement different SVE vector lengths, which is a great convenience for software developers. The SVE registers include vector registers and predicate registers. If the SVE vector register length is VL, the predicate register width is defined as PL = VL/8, so that each bit of a predicate register corresponds to one byte of a vector register. A vector register holds multiple vector elements, each 8, 16, 32 or 64 bits wide; correspondingly, a predicate register holds multiple predicate elements, each 1, 2, 4 or 8 bits wide to match the vector element. The SVE instruction set uses the predicate registers to predicate each element of a vector, which allows flexible processing of elements of different widths in vectors of different lengths.
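The width relationships in the preceding paragraph can be checked with a small calculation. The following is a minimal sketch, not part of the patent; the function names are hypothetical and chosen only for illustration.

```python
# Sketch of the SVE register-width relationships described above.
# All names here are illustrative assumptions, not taken from the patent.
ELEMENT_BITS = (8, 16, 32, 64)


def valid_vector_length(vl_bits: int) -> bool:
    """SVE vector lengths are multiples of 128 bits, from 128 up to 2048."""
    return 128 <= vl_bits <= 2048 and vl_bits % 128 == 0


def predicate_length(vl_bits: int) -> int:
    """One predicate bit per vector byte: PL = VL / 8."""
    if not valid_vector_length(vl_bits):
        raise ValueError(f"unsupported vector length: {vl_bits}")
    return vl_bits // 8


def layout(vl_bits: int, element_bits: int) -> tuple[int, int]:
    """Return (number of elements, predicate bits per element)."""
    if element_bits not in ELEMENT_BITS:
        raise ValueError(f"unsupported element width: {element_bits}")
    predicate_length(vl_bits)  # validates vl_bits
    return vl_bits // element_bits, element_bits // 8
```

For a 512-bit vector, this gives a 64-bit predicate register, and 64-bit elements each map to an 8-bit predicate element.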

A previously disclosed processor supporting variable-length vector processing is the Fujitsu A64FX, which supports a maximum vector length of 512 bits. ARM also announced Neoverse V1, its first core supporting the Scalable Vector Extension (SVE), in September 2020, claiming performance improvements of up to 50% for HPC and ML workloads. Existing processors that support variable-length vector processing tend to focus on high performance and fall short in configurability and flexibility. To balance high-performance and general-purpose processing, it is necessary to design a flexible, configurable processor that supports variable-length vector processing.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a configurable processor supporting variable-length vector processing and an implementation method thereof. The invention supports variable-length vector processing, reuses the data path of an existing general-purpose microprocessor, minimizes design complexity and implementation cost, balances configurability and flexibility, and provides sufficient configurability without significantly increasing chip area.

In order to solve the technical problems, the invention adopts the technical scheme that:

A configurable processor supporting variable-length vector processing comprises an instruction fetch unit, an instruction decode and rename unit, an instruction sequencing and dispatch unit, an instruction issue and execution unit, a register file, and a high-speed instruction and data cache, wherein the instruction fetch unit, the instruction decode and rename unit, the instruction sequencing and dispatch unit, the instruction issue and execution unit and the register file are connected in sequence, and the high-speed instruction and data cache is connected to the instruction fetch unit and to the instruction issue and execution unit respectively; the configurable processor further comprises a configurable unit for configuring, initially or in real time, an option indicating whether variable-length vector processing is supported; the instruction decode and rename unit comprises a variable-length vector splitting module which, when the option in the configurable unit is configured to support variable-length vector processing, splits variable-length vector processing operations so that the execution unit for existing fixed-length vector processing is reused; the instruction issue and execution unit comprises a vector issue queue and a vector execution unit for executing vector instructions, and the vector execution unit supports partial or full execution of fixed-length vectors and variable-length vectors and, for a variable-length vector, completes its execution in one or more passes.

Optionally, the configurable unit is further used for configuring, initially or in real time, a branch prediction option, where the branch prediction option includes two modes, a static branch prediction mode and a dynamic branch prediction mode; when the instruction fetch unit fetches instructions from the high-speed instruction and data cache, if the branch prediction option is configured as the static branch prediction mode, the instruction fetch unit fetches further instructions to be executed from the high-speed instruction and data cache according to the static prediction address carried in the instruction; if the branch prediction option is configured as the dynamic branch prediction mode, the instruction fetch unit fetches further instructions to be executed from the high-speed instruction and data cache according to the dynamic prediction address carried in the instruction.

Optionally, the configurable unit is further configured to initially or in real time configure the fetch width M; when the instruction obtaining unit obtains the instructions from the high-speed instruction and data cache, the instruction obtaining unit obtains M instructions from the high-speed instruction and data cache each time and stores the M instructions into the local cache.

Optionally, the configurable unit is further used for configuring, initially or in real time, the decode width N; when decoding, the instruction decode and rename unit fetches N microinstructions at a time from the local pre-decode microinstruction cache for decoding.

Optionally, the configurable unit is further configured to initially or in real time configure the instruction dispatch width P; the instruction sequencing and dispatching unit acquires at most P instructions from the local to-be-dispatched cache for dispatching each time when the instruction dispatching is carried out.

Optionally, the configurable unit is further used for configuring, initially or in real time, out-of-order resource sizes, where the out-of-order resource sizes include an issue queue size, a reorder buffer size and a register file size; the instruction issue and execution unit stores instructions to be issued according to the issue queue size configured by the configurable unit, and the instruction sequencing and dispatch unit performs reordering according to the reorder buffer size configured by the configurable unit.
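The options handled by the configurable unit (variable-length vector support, branch prediction mode, the widths M, N and P, and the out-of-order resource sizes) can be pictured as a single configuration record. The sketch below is only an illustration under stated assumptions: the field names and all default values are hypothetical, and the actual register layout of the configurable unit (Fig. 2) is not reproduced here.

```python
from dataclasses import dataclass
from enum import Enum


class BranchPrediction(Enum):
    STATIC = "static"
    DYNAMIC = "dynamic"


@dataclass
class ProcessorConfig:
    """Hypothetical software view of the configurable unit's options."""
    variable_length_vectors: bool = False   # variable-length vector support on/off
    max_vector_bits: int = 128              # maximum supported vector length (assumed default)
    branch_prediction: BranchPrediction = BranchPrediction.STATIC
    fetch_width_m: int = 4                  # M: instructions fetched per access (assumed default)
    decode_width_n: int = 4                 # N: microinstructions decoded per step (assumed default)
    dispatch_width_p: int = 4               # P: upper bound on instructions dispatched per step
    issue_queue_size: int = 16              # out-of-order resource sizes (assumed defaults)
    reorder_buffer_size: int = 64
    register_file_size: int = 96

    def validate(self) -> None:
        """Reject non-positive widths and sizes."""
        sizes = {
            "fetch_width_m": self.fetch_width_m,
            "decode_width_n": self.decode_width_n,
            "dispatch_width_p": self.dispatch_width_p,
            "issue_queue_size": self.issue_queue_size,
            "reorder_buffer_size": self.reorder_buffer_size,
            "register_file_size": self.register_file_size,
        }
        for name, value in sizes.items():
            if value <= 0:
                raise ValueError(f"{name} must be positive, got {value}")
        if self.variable_length_vectors and self.max_vector_bits <= 0:
            raise ValueError("max_vector_bits must be positive in variable-length mode")
```

A record of this kind could be written once at start-up or updated while the processor runs, matching the "initially or in real time" wording of the text.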

Optionally, the instruction issue and execution unit includes two parts, issue queues and execution units; the issue queues include a branch issue queue, a single-cycle integer issue queue, a multi-cycle integer issue queue, a vector/floating-point issue queue and a memory-access issue queue; the execution units include a branch execution unit, a single-cycle integer execution unit, a multi-cycle integer execution unit, a vector/floating-point execution unit and a memory-access execution unit; the vector issue queue refers to the vector/floating-point issue queue, the vector execution unit refers to the vector/floating-point execution unit, and each type of instruction is selected for issue and executed independently.
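The per-type queue organization described above can be sketched as one queue per instruction class, each issuing independently of the others. This is a simplified illustration, not the patent's implementation: the class and method names are hypothetical, and each queue issues in FIFO order, whereas a real issue queue would typically select instructions whose operands are ready.

```python
from collections import deque
from enum import Enum, auto


class InstrType(Enum):
    BRANCH = auto()
    INT_SINGLE_CYCLE = auto()
    INT_MULTI_CYCLE = auto()
    VECTOR_FP = auto()
    MEMORY = auto()


class IssueQueues:
    """One bounded queue per instruction type (hypothetical model)."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.queues = {t: deque() for t in InstrType}

    def dispatch(self, instr_type: InstrType, uop_id: int) -> bool:
        """Place a microinstruction in the queue for its type; False if that queue is full."""
        queue = self.queues[instr_type]
        if len(queue) >= self.capacity:
            return False  # the dispatcher would have to stall this microinstruction
        queue.append(uop_id)
        return True

    def issue(self, instr_type: InstrType):
        """Issue the oldest microinstruction of one type, or None if its queue is empty."""
        queue = self.queues[instr_type]
        return queue.popleft() if queue else None
```

Because every type has its own queue, a full memory-access queue does not prevent a vector/floating-point microinstruction from being issued.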

Optionally, the register files include an integer register file, a vector/floating point register file, a flag bit register file, a predicate register file, and a system register file.

In addition, the present invention also provides an implementation method of the foregoing configurable processor supporting variable-length vector processing, including: while an application program is running, detecting in real time whether the application program contains an instruction identifier indicating a variable-length vector processing requirement; if such an identifier exists, configuring, through the configurable unit, the option indicating whether variable-length vector processing is supported to support variable-length vector processing, specifying the maximum supported vector length, and entering a variable-length vector processing mode; otherwise, configuring the option through the configurable unit to not support variable-length vector processing, and entering a conventional fixed-length vector processing mode; in the variable-length vector processing mode, the variable-length vector splitting module in the instruction decode and rename unit splits each variable-length vector processing operation so that it is mapped onto the execution unit for existing fixed-length vector processing, where the splitting specifically means that, in the decode stage, the variable-length vector processing operation is split as required according to the relationship between the length L of the variable-length vector and a preset fixed vector length CL, so that the variable-length vector is split into one or more instructions of the fixed vector length CL; and the vector execution unit of the instruction issue and execution unit completes execution of the variable-length vector in one or more passes, thereby reusing the vector execution unit for existing fixed-length vector processing.
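The splitting rule can be illustrated with a short sketch: an operation on a vector of length L is turned into one or more operations of the fixed length CL. The names below are hypothetical, and the handling of a final slice shorter than CL (recorded as `active_bits`) is an assumption; the text only states that the vector execution unit supports partial or full execution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorMicroOp:
    opcode: str
    offset_bits: int  # start of this slice within the variable-length vector
    active_bits: int  # valid bits in this slice, at most CL (assumed partial-slice handling)


def split_variable_length(opcode: str, length_l: int, fixed_cl: int) -> list[VectorMicroOp]:
    """Split an operation on an L-bit vector into ceil(L / CL) fixed-length operations."""
    if length_l <= 0 or fixed_cl <= 0:
        raise ValueError("L and CL must be positive")
    uops = []
    for offset in range(0, length_l, fixed_cl):
        active = min(fixed_cl, length_l - offset)
        uops.append(VectorMicroOp(opcode, offset, active))
    return uops
```

With L = 512 and CL = 128 the operation becomes four fixed-length operations; with L no larger than CL it stays a single operation, which corresponds to the "one or more" wording above.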

Optionally, the step of executing the instruction in the variable length vector processing mode includes:

s1) instruction fetch: the instruction fetch unit fetches M instructions at a time from the high-speed instruction cache according to the configured fetch width M and stores them in a local cache; if static or dynamic branch prediction is configured, the instruction prefetcher fetches further instructions to be executed from the static or dynamic prediction address;

s2) instruction decode and rename: the instruction decode and rename unit first pre-decodes one or more instructions at a time, splitting each instruction into one or more microinstructions according to the microinstruction splitting scheme, assigns a global id to each resulting microinstruction, and stores the microinstructions in a local pre-decode microinstruction cache; it then fetches N microinstructions at a time from the local pre-decode microinstruction cache according to the configured decode width N, decodes them, and renames their register operands;

s3) instruction sequencing and dispatch: the instruction sequencing and dispatch unit takes at most P instructions at a time from the local to-be-dispatched cache according to the configured dispatch width P and dispatches them to the instruction issue and execution unit according to their types; for every microinstruction that has been assigned a unique id, it allocates an entry in the reorder buffer to manage the life-cycle state of that microinstruction; when the microinstruction with a given id finishes in the execution unit, a completion status is returned to the instruction sequencing and dispatch unit and the state of the microinstruction is changed to 'to be committed'; when a run of microinstructions with consecutive ids are all in the 'to be committed' state, they are committed together as a group;

s4) instruction issue and execution: the instruction issue and execution unit handles dispatched instructions differently according to their type: a branch instruction from the instruction sequencing and dispatch unit is buffered in the branch issue queue and, once issued, is executed in the branch execution unit; its execution result comprises the PC address of the next instruction to be executed and an indication of whether the branch prediction failed; a single-cycle instruction from the instruction sequencing and dispatch unit is buffered in the single-cycle integer issue queue and, once issued, is executed in the single-cycle integer execution unit; its execution result comprises modifying a destination integer register or a flag register; a multi-cycle instruction from the instruction sequencing and dispatch unit is buffered in the multi-cycle integer issue queue and, once issued, is executed in the multi-cycle integer execution unit; its execution result comprises modifying a destination integer register, a flag register or a destination system register; a vector/floating-point instruction from the instruction sequencing and dispatch unit is buffered in the vector/floating-point issue queue and, once issued, is executed in the vector/floating-point execution unit; its execution result comprises modifying a destination vector/floating-point register or a flag register; a memory-access instruction from the instruction sequencing and dispatch unit is buffered in the memory-access issue queue and, once issued, is executed in the memory-access execution unit; its execution behavior is either loading data from memory or writing data to memory, and the corresponding results are that data is loaded into a destination register or stored to memory.
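The life-cycle tracking in step s3 (allocate an entry per microinstruction id, mark it ready once execution finishes, then commit consecutive ready entries as a group) can be sketched as follows. This is a minimal model under stated assumptions: the class name, the state strings and the use of an ordered dictionary are illustrative choices, not details given in the text.

```python
from collections import OrderedDict


class ReorderBuffer:
    """Hypothetical reorder buffer: entries are kept in allocation (program) order."""

    def __init__(self, size: int):
        if size <= 0:
            raise ValueError("size must be positive")
        self.size = size
        self.entries = OrderedDict()  # microinstruction id -> state

    def allocate(self, uop_id: int) -> bool:
        """Reserve an entry at dispatch; False if the buffer is full."""
        if len(self.entries) >= self.size:
            return False
        self.entries[uop_id] = "executing"
        return True

    def complete(self, uop_id: int) -> None:
        """Called when the execution unit reports that the microinstruction finished."""
        if uop_id not in self.entries:
            raise KeyError(f"unknown microinstruction id: {uop_id}")
        self.entries[uop_id] = "to_be_committed"

    def commit_group(self) -> list[int]:
        """Commit the longest run of ready entries at the head, in order."""
        committed = []
        while self.entries:
            uop_id, state = next(iter(self.entries.items()))
            if state != "to_be_committed":
                break  # an older microinstruction is still executing
            self.entries.popitem(last=False)
            committed.append(uop_id)
        return committed
```

Microinstructions may finish out of order, but nothing commits until every older entry is ready, which keeps commit in program order.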

Compared with the prior art, the invention has the following advantages:

1. The configurable processor supporting variable-length vector processing of the invention comprises a configurable unit for configuring, initially or in real time, an option indicating whether variable-length vector processing is supported; the instruction decode and rename unit comprises a variable-length vector splitting module which, when the option in the configurable unit is configured to support variable-length vector processing, splits variable-length vector processing operations so that the execution unit for existing fixed-length vector processing is reused; the instruction issue and execution unit comprises a vector issue queue and a vector execution unit for executing vector instructions, and the vector execution unit supports partial or full execution of fixed-length vectors and variable-length vectors and, for a variable-length vector, completes its execution in one or more passes. Through the added configurable unit, combined with the improvements to the instruction decode and rename unit and the instruction issue and execution unit, the processor can support both fixed-length and variable-length vector processing, meet the application requirements of various scenarios, and further improve the flexibility of the processor, achieving both configurability and flexibility.

2. The invention can realize the reuse of the execution unit of the existing fixed-length vector processing in the instruction decoding and renaming unit and the reuse of the vector execution unit in the instruction transmitting and execution unit, and the reuse can simultaneously support the vector processing with fixed length and variable length.

Drawings

Fig. 1 is a block diagram of a configurable processor supporting variable length vector processing according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of a register configuration structure of a configurable unit according to an embodiment of the present invention.

Fig. 3 is a basic flowchart of an implementation method of the embodiment of the present invention.

FIG. 4 is a diagram illustrating an exemplary instruction fetch unit according to an embodiment of the present invention.

FIG. 5 is a schematic diagram illustrating an exemplary instruction decode and rename unit according to the invention.

FIG. 6 is a schematic diagram illustrating the execution of an instruction sequencing and dispatch unit according to an embodiment of the present invention.

FIG. 7 is a schematic diagram illustrating an exemplary instruction issue and execution unit according to an embodiment of the present invention.

Detailed Description

As shown in Fig. 1 and Fig. 2, the configurable processor supporting variable-length vector processing of this embodiment includes an instruction fetch unit, an instruction decode and rename unit, an instruction sequencing and dispatch unit, an instruction issue and execution unit, a register file, and a high-speed instruction and data cache, where the instruction fetch unit, the instruction decode and rename unit, the instruction sequencing and dispatch unit, the instruction issue and execution unit and the register file are connected in sequence, and the high-speed instruction and data cache is connected to the instruction fetch unit and to the instruction issue and execution unit respectively. It should be noted that the instruction fetch unit, the instruction decode and rename unit, the instruction sequencing and dispatch unit, the instruction issue and execution unit, the register file, and the high-speed instruction and data cache are existing components of a processor. The functions of the units are as follows:

the configurable unit is used for configuring the vector execution mode and the architecture parameter of the processor;

the instruction acquisition unit is used for acquiring an instruction from the high-speed instruction cache and storing the instruction into the local cache so as to be used for the instruction decoding and renaming unit to perform the next processing;

the instruction decoding and renaming unit is used for decoding the instruction and renaming the register type operand, wherein renaming refers to renaming the register operand of each split microinstruction, and the renamed microinstruction is sent to the instruction sequencing and dispatching unit;

the instruction sequencing and dispatching unit is used for dispatching the instructions to the instruction transmitting and executing unit according to the order of instruction decoding; maintaining the instruction state in the out-of-order execution and ensuring the instruction to be submitted in sequence; the instruction sequencing and dispatching unit maintains the state of the instructions executed out of order, and ensures that the instructions of different instruction types can be confirmed and submitted in sequence after the execution of the instructions is completed;

the instruction transmitting and executing unit is used for transmitting and executing instructions.

The register file is used for storing various data required by the instruction transmitting and executing unit and obtained by operation.

The cache instructions and data caches are used to cache instructions and data from the interconnect or external memory.

To implement variable-length vector processing, this embodiment makes the following improvements to the existing processor. The configurable processor comprises a configurable unit for configuring, initially or in real time, the option of whether variable-length vector processing is supported. The instruction decode and rename unit comprises a variable-length vector splitting module which, when that option is configured to support variable-length vector processing, splits each variable-length vector processing operation so that the execution unit for existing fixed-length vector processing can be reused. The instruction issue and execution unit includes a vector issue queue and a vector execution unit for executing vector instructions; the vector execution unit supports partial or full execution of fixed-length and variable-length vectors, and completes execution of a variable-length vector in one or more passes.

During execution, the instruction fetch unit fetches instructions from the instruction cache and sends them to the instruction decode and rename unit, which decodes and renames them and sends them to the instruction sequencing and dispatch unit. The instruction sequencing and dispatch unit dispatches instructions in order to the instruction issue and execution unit and commits, in order, the instructions that completed out of order. The instruction issue and execution unit buffers the instructions to be executed: branch instructions, single-cycle integer instructions, multi-cycle integer instructions, vector/floating-point instructions, and memory access instructions are placed in separate issue queues according to their type. Each issue queue independently selects instructions of its own type and, every cycle, issues those that meet the issue conditions to the corresponding execution unit. The issue queue tracks whether an instruction meets the issue conditions and whether its execution must be cancelled after issue. When an issued instruction completes normally, the instruction sequencing and dispatch unit must be notified so that all instructions eventually commit in order.

Referring to Fig. 2, the configurable unit of this embodiment is further used to configure, initially or in real time, the branch prediction option, which is either a static branch prediction mode or a dynamic branch prediction mode. When the instruction fetch unit fetches instructions from the instruction cache, if the branch prediction option is configured as the static branch prediction mode, it fetches further instructions to be executed according to the static prediction address carried in the instruction; if the option is configured as the dynamic branch prediction mode, it fetches further instructions to be executed according to the dynamic prediction address carried in the instruction.

Referring to Fig. 2, the configurable unit of this embodiment is further used to configure, initially or in real time, the fetch width M. Each time the instruction fetch unit fetches from the instruction cache, it fetches M instructions and stores them in the local buffer.

Referring to Fig. 2, the configurable unit of this embodiment is further used to configure, initially or in real time, the decode width N. When decoding, the instruction decode and rename unit takes N microinstructions at a time from the local pre-decoded microinstruction buffer.

Referring to Fig. 2, the configurable unit of this embodiment is further used to configure, initially or in real time, the instruction dispatch width P. When dispatching, the instruction sequencing and dispatch unit takes at most P instructions at a time from the local to-be-dispatched buffer.

Referring to Fig. 2, the configurable unit of this embodiment is further used to configure, initially or in real time, the size of the out-of-order resources, which comprise the issue queue size, the reorder buffer size, and the register file size. The instruction issue and execution unit stores instructions awaiting issue according to the configured issue queue size, and the instruction sequencing and dispatch unit performs reordering according to the configured reorder buffer size.
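The interplay of the fetch width M, the decode width N, and the dispatch width P can be illustrated with a toy cycle loop. This is only a sketch under stated assumptions, not the patented implementation: the function name `run_front_end`, the shared buffer capacity `buf_cap`, and the simplification that every instruction yields exactly one microinstruction are all introduced here for illustration.

```python
from collections import deque


def run_front_end(num_instructions, M=4, N=4, P=2, buf_cap=8, cycles=50):
    """Toy model: fetch up to M, decode up to N, dispatch up to P per cycle.

    Bounded buffers model back-pressure: a stage moves only as many items
    as the buffer it feeds has room for. (Hypothetical simplification:
    one instruction == one microinstruction.)
    """
    pending = deque(range(num_instructions))   # instructions still in the cache
    fetch_buf = deque()                        # local buffer filled by fetch
    dispatch_buf = deque()                     # to-be-dispatched buffer
    dispatched = []
    trace = []                                 # (fetch_buf, dispatch_buf, dispatched) sizes

    for _ in range(cycles):
        # Dispatch stage: at most P entries leave the to-be-dispatched buffer.
        for _ in range(min(P, len(dispatch_buf))):
            dispatched.append(dispatch_buf.popleft())

        # Decode stage: at most N entries, limited by downstream free space.
        room = buf_cap - len(dispatch_buf)
        for _ in range(min(N, len(fetch_buf), room)):
            dispatch_buf.append(fetch_buf.popleft())

        # Fetch stage: at most M entries, limited by local buffer free space.
        room = buf_cap - len(fetch_buf)
        for _ in range(min(M, len(pending), room)):
            fetch_buf.append(pending.popleft())

        trace.append((len(fetch_buf), len(dispatch_buf), len(dispatched)))
        if len(dispatched) == num_instructions:
            break
    return dispatched, trace


if __name__ == "__main__":
    order, trace = run_front_end(10)
    print(order)
    print(trace)
```

With a dispatch width smaller than the fetch and decode widths, the buffers fill and the earlier stages slow down, which is the kind of width mismatch the configurable parameters let a designer tune.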

As can be seen from the above, the configurable unit of this embodiment can configure all of these options (including the vector execution mode and the architecture parameters) initially or in real time, as required. The vector execution mode comprises a static branch prediction mode and a dynamic branch prediction mode. The architecture parameters comprise the statically and dynamically configurable parameters supported by the processor, such as whether static and dynamic branch prediction are supported, the fetch, decode, and dispatch widths, and the amount of out-of-order resources such as the issue queue size, the reorder buffer size, and the register file size. Depending on the configuration attribute the processor supports for a given parameter, either static or dynamic configuration may be selected. During initial configuration, the basic state of the processor, including the vector execution mode and the architecture parameters, is configured according to the computational requirements, and the processor is then started. If the processor is already executing, the vector execution mode and the architecture parameters can still be modified dynamically as needed. Existing processors support some of these configurations to a greater or lesser extent, but they are not flexible enough. Because the configurations of this embodiment can be set at initialization and modified at run time, the embodiment is more flexible than existing processors.
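A minimal sketch of such a configuration record follows, assuming a software model rather than the register layout of Fig. 2 (which is not reproduced in this text). The class name `ProcessorConfig`, the field names, the default values, and in particular the choice of which fields count as run-time modifiable (`RUNTIME_FIELDS`) are hypothetical; the source states only that each parameter may be statically or dynamically configurable.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ProcessorConfig:
    vl_enabled: bool = False           # option: support variable-length vectors
    max_vector_len: int = 128          # hypothetical maximum supported length
    dynamic_branch_pred: bool = False  # False selects static branch prediction
    fetch_width: int = 4               # M
    decode_width: int = 4              # N
    dispatch_width: int = 4            # P
    issue_queue_size: int = 16         # out-of-order resources
    rob_size: int = 64
    regfile_size: int = 96


# Assumption for illustration: only these fields may change while running.
RUNTIME_FIELDS = {"vl_enabled", "max_vector_len", "dynamic_branch_pred"}


def reconfigure(cfg, running, **changes):
    """Return an updated configuration.

    Before start-up every field may be set; while the processor is running,
    fields outside RUNTIME_FIELDS are rejected.
    """
    if running:
        blocked = set(changes) - RUNTIME_FIELDS
        if blocked:
            raise ValueError(f"not modifiable at run time: {sorted(blocked)}")
    new_cfg = replace(cfg, **changes)
    if min(new_cfg.fetch_width, new_cfg.decode_width, new_cfg.dispatch_width) < 1:
        raise ValueError("fetch, decode and dispatch widths must be positive")
    return new_cfg


if __name__ == "__main__":
    boot = reconfigure(ProcessorConfig(), running=False, rob_size=128)
    live = reconfigure(boot, running=True, vl_enabled=True, max_vector_len=512)
    print(live)
```

Returning a new immutable record instead of mutating in place makes it easy to compare the configuration before and after a run-time change.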

Referring to Fig. 1 and Fig. 7, the instruction issue and execution unit in this embodiment includes issue queues and execution units. The issue queues comprise a branch issue queue, a single-cycle integer issue queue, a multi-cycle integer issue queue, a vector/floating-point issue queue, and a memory access issue queue. The execution units comprise a branch execution unit, a single-cycle integer execution unit, a multi-cycle integer execution unit, a vector/floating-point execution unit, and a memory access execution unit. The vector issue queue mentioned above is the vector/floating-point issue queue, and the vector execution unit is the vector/floating-point execution unit. Each type of instruction is issued and executed independently. The division into branch, single-cycle integer, multi-cycle integer, vector/floating-point, and memory access units follows the differences in operand type, execution latency, and the like. Instructions are issued and executed out of order, following the principle that the oldest ready microinstruction is issued and executed first.
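The "oldest ready first" principle can be sketched as a selection over one issue queue. The function name `select_oldest_ready`, the dictionary layout of a microinstruction, and the use of a monotonically increasing `id` as the age (no wrap-around handling) are assumptions made for this illustration only.

```python
def select_oldest_ready(queue, ready_regs, width=1):
    """Pick up to `width` microinstructions to issue from one issue queue.

    A microinstruction is ready when all of its source registers are in
    `ready_regs`. Among the ready ones, the smallest id (oldest) wins.
    Selected entries are removed from the queue.
    """
    ready = [u for u in queue if all(src in ready_regs for src in u["srcs"])]
    ready.sort(key=lambda u: u["id"])      # oldest first
    chosen = ready[:width]
    for u in chosen:
        queue.remove(u)
    return chosen


if __name__ == "__main__":
    vector_queue = [
        {"id": 3, "srcs": ["p1"]},
        {"id": 1, "srcs": ["p9"]},         # oldest, but p9 is not ready yet
        {"id": 2, "srcs": ["p1", "p2"]},
    ]
    issued = select_oldest_ready(vector_queue, ready_regs={"p1", "p2"})
    print([u["id"] for u in issued])
```

In this usage, microinstruction 1 is the oldest but is skipped because one of its sources is not ready, so microinstruction 2 is selected.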

Referring to Fig. 1 and Fig. 7, the register file in this embodiment is divided, according to the type of operand, into an integer register file, a vector/floating-point register file, a flag register file, a predicate register file, and a system register file. The appropriate register file is read as required when an instruction is issued and written as required when its execution finishes.

As shown in Fig. 3, this embodiment further provides an implementation method for the foregoing configurable processor supporting variable-length vector processing. While an application program is running, the method detects in real time whether the program contains an instruction identifier indicating a variable-length vector processing requirement. If such an identifier exists, the option in the configurable unit is configured to support variable-length vector processing, the maximum supported vector length is specified, and the processor enters the variable-length vector processing mode; otherwise, the option is configured not to support variable-length vector processing, and the processor enters the conventional fixed-length vector processing mode. In the variable-length vector processing mode, the variable-length vector splitting module in the instruction decode and rename unit splits each variable-length vector processing operation so that it can reuse the execution unit for existing fixed-length vector processing. Specifically, in the decoding stage the operation is split as required according to the relationship between the length L of the variable-length vector and a preset fixed vector length CL, yielding one or more instructions of the fixed vector length CL. The vector execution unit of the instruction issue and execution unit then completes execution of the variable-length vector in one or more passes, thereby reusing the vector execution unit for existing fixed-length vector processing.
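The splitting rule can be sketched as follows. The source specifies only that the split depends on the relationship between L and CL; the function name `split_vector_op`, the micro-operation fields, the assumption that L and CL are expressed in the same unit, and the handling of a final partial chunk through an `active` count are all hypothetical choices made for illustration.

```python
def split_vector_op(opcode, length, cl):
    """Split one variable-length vector operation into fixed-length pieces.

    `length` is the variable vector length L and `cl` the fixed vector
    length CL (same unit assumed). Each piece covers at most `cl` units;
    the last piece may be only partially active.
    """
    if length <= 0 or cl <= 0:
        raise ValueError("length and cl must be positive")
    micro_ops = []
    for offset in range(0, length, cl):
        active = min(cl, length - offset)   # units actually used in this piece
        micro_ops.append({"opcode": opcode, "offset": offset, "active": active})
    return micro_ops


if __name__ == "__main__":
    for uop in split_vector_op("vadd", length=10, cl=4):
        print(uop)
```

Under these assumptions an operation with L no greater than CL produces a single piece, and a longer one produces ceil(L / CL) pieces that the fixed-length execution unit can process one after another.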

The instruction identifier indicating a variable-length vector processing requirement can be declared in the program and then generated when the program is compiled, and its format can be specified as needed. Whether a variable-length vector processing requirement exists depends on the computational needs of the application program. Once such a requirement is detected, the processor is configured to support variable-length vector processing and the maximum supported vector length is specified, so that the appropriate mode is selected automatically.

During execution, the instruction fetch unit fetches instructions from the instruction cache and sends them to the instruction decode and rename unit, which decodes and renames them and sends them to the instruction sequencing and dispatch unit. The instruction sequencing and dispatch unit dispatches instructions in order to the instruction issue and execution unit and commits, in order, the instructions that completed out of order. The instruction issue and execution unit buffers the instructions to be executed: branch instructions, single-cycle integer instructions, multi-cycle integer instructions, vector/floating-point instructions, and memory access instructions are placed in separate issue queues according to their type. Each issue queue independently selects instructions of its own type and, every cycle, issues those that meet the issue conditions to the corresponding execution unit. The issue queue tracks whether an instruction meets the issue conditions and whether its execution must be cancelled after issue. When an issued instruction completes normally, the instruction sequencing and dispatch unit must be notified so that all instructions eventually commit in order. In this embodiment, executing an instruction in the variable-length vector processing mode includes the following steps:

S1) instruction fetch: according to the configured fetch width M, the instruction fetch unit fetches M instructions from the instruction cache each time and stores them in the local buffer; for example, if the fetch width M is 4, four instructions are fetched each time; if static or dynamic branch prediction is configured, the instruction fetch unit fetches further instructions to be executed from the static or dynamic prediction address, as shown in Fig. 4; it should be noted that the fetch rate is also limited by the decode rate: if the decode buffer is full or nearly full, fetching must be slowed or stopped to prevent buffer overflow and fetch errors, which can be achieved by having the instruction fetch unit monitor the decode buffer status of the instruction decode and rename unit in real time;

S2) instruction decode and rename: as shown in Fig. 5, the instruction decode and rename unit first pre-decodes one or more instructions at a time according to the microinstruction splitting scheme, splitting each instruction into one or more microinstructions, assigns a global id to each resulting microinstruction, and stores the microinstructions in the local pre-decoded microinstruction buffer; then, according to the configured decode width N, it takes N microinstructions at a time from that buffer for decoding and renames the register operands in them; for example, if the decode width N is 4, four microinstructions are taken each time; the microinstruction decode rate depends on the rate of back-end instruction dispatch: if the dispatch buffer is full or nearly full, decoding is slowed or even stalled; to support variable-length vector processing on top of fixed-length vector processing, the variable-length vector processing operation is split as required in the decoding stage according to the relationship between the target vector length L and the fixed vector length CL, so that the execution unit for existing fixed-length vector processing is reused;

S3) instruction sequencing and dispatch: as shown in Fig. 6, according to the configured dispatch width P, the instruction sequencing and dispatch unit takes at most P instructions at a time from the local to-be-dispatched buffer and dispatches them to the instruction issue and execution unit based on their type; for every microinstruction that has been assigned a unique id, it allocates one entry in the reorder buffer to manage the life-cycle state of that microinstruction; when the microinstruction corresponding to an id finishes in the execution unit, a completion status is returned to the instruction sequencing and dispatch unit and the state of the microinstruction is changed to "to be committed"; when several microinstructions with consecutive ids are all in the "to be committed" state, they are committed together as a group; it should be noted that at most P microinstructions are sent to the instruction issue and execution unit each time, and whether dispatch of certain types of microinstructions is blocked is determined by the instruction type and by how full the corresponding issue queue is; in addition, load balancing across symmetric execution units may be taken into account during dispatch as needed;

S4) instruction issue and execution: as shown in Fig. 7, the instruction issue and execution unit handles dispatched instructions differently according to their type. A branch instruction from the instruction sequencing and dispatch unit is buffered in the branch issue queue and, after issue, executed in the branch execution unit; its execution result comprises the PC address of the next instruction to be executed and whether the branch prediction failed. A single-cycle instruction is buffered in the single-cycle issue queue and, after issue, executed in the single-cycle integer execution unit; its execution result comprises modifying the destination integer register or the flag register. A multi-cycle instruction is buffered in the multi-cycle issue queue and, after issue, executed in the multi-cycle execution unit; its execution result comprises modifying the destination integer register, the flag register, or the destination system register. A vector/floating-point instruction is buffered in the vector/floating-point issue queue and, after issue, executed in the vector/floating-point execution unit; its execution result comprises modifying the destination vector/floating-point register or the flag register. A memory access instruction is buffered in the memory access issue queue and, after issue, executed in the memory access execution unit; it either loads data from memory or writes data to memory, and its execution result is, respectively, data loaded into the destination register or data stored to memory. The vector/floating-point execution unit supports fixed-length vector processing operations and may also support some or all variable-length vector processing.
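The dispatch, completion, and commit behaviour of steps S3 and S4 can be sketched with a small software model. Everything named here is a hypothetical illustration rather than the structure of the patented hardware: the class `Backend`, the instruction type labels in `QUEUE_OF`, the state names, and the decision to remove a microinstruction from its issue queue at completion instead of modelling issue separately.

```python
from collections import deque

# Hypothetical mapping from instruction type to the issue queue it uses.
QUEUE_OF = {
    "branch": "branch",
    "int1": "single_cycle",
    "intN": "multi_cycle",
    "vec": "vector_fp",
    "fp": "vector_fp",
    "mem": "memory",
}


class Backend:
    def __init__(self, rob_size=8):
        self.rob = deque()                 # reorder buffer, in program order
        self.rob_size = rob_size
        self.queues = {name: [] for name in set(QUEUE_OF.values())}

    def dispatch(self, uop_id, kind):
        """Allocate a reorder-buffer entry and route the uop by its type."""
        if len(self.rob) >= self.rob_size:
            return False                   # back-pressure: no free entry
        queue = QUEUE_OF[kind]
        self.rob.append({"id": uop_id, "queue": queue, "state": "in_flight"})
        self.queues[queue].append(uop_id)
        return True

    def complete(self, uop_id):
        """An execution unit reports completion, possibly out of order."""
        for entry in self.rob:
            if entry["id"] == uop_id and entry["state"] == "in_flight":
                self.queues[entry["queue"]].remove(uop_id)
                entry["state"] = "to_be_committed"

    def commit(self):
        """Commit the run of completed entries at the head, as one group."""
        group = []
        while self.rob and self.rob[0]["state"] == "to_be_committed":
            group.append(self.rob.popleft()["id"])
        return group


if __name__ == "__main__":
    b = Backend()
    for uop_id, kind in [(0, "branch"), (1, "vec"), (2, "mem")]:
        b.dispatch(uop_id, kind)
    b.complete(2)
    print(b.commit())   # uop 2 finished first but cannot commit yet
    b.complete(0)
    print(b.commit())
    b.complete(1)
    print(b.commit())
```

The usage shows a younger microinstruction finishing first yet waiting in the reorder buffer until all older ones are complete, after which the consecutive completed entries commit together.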

In addition, this embodiment may adopt a mechanism for processing and executing predicate operands as needed, reusing the existing single-cycle and multi-cycle integer execution data paths and thereby reducing hardware implementation cost.

In summary, this embodiment provides a configurable processor supporting variable-length vector processing, which includes a configurable unit, an instruction fetch unit, an instruction decode and rename unit, an instruction sequencing and dispatch unit, an instruction issue and execution unit, a register file, and instruction and data caches. This embodiment also provides an implementation method for the configurable processor. In this method, the basic state of the processor, including the vector execution mode and the architecture parameters, is configured according to the computational requirements and the processor is started; if the processor is already executing, the vector execution mode and the architecture parameters can be modified dynamically as needed. The instruction fetch unit fetches instructions from the instruction cache; the instruction decode and rename unit decodes instructions and renames their registers; the instruction sequencing and dispatch unit dispatches instructions in order to the instruction issue and execution unit and commits, in order, the instructions that executed out of order; the instruction issue and execution unit buffers the instructions to be executed and, every cycle, issues those that meet the issue conditions to the corresponding execution unit; the register file stores instruction execution results and the source register operands required for instruction execution; and the instruction and data caches cache instructions and data from the interconnect or external memory.

The configurable processor and implementation method of the present invention support both fixed-length and variable-length vector processing, which further improves the flexibility of the processor.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code. The present application is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application, wherein instructions executed by a processor of a computer or other programmable data processing apparatus create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on it, producing a computer-implemented process, such that the instructions executed on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

The above description is only a preferred embodiment of the present invention. The protection scope of the present invention is not limited to the above embodiment; all technical solutions that fall within the idea of the present invention belong to its protection scope. It should be noted that modifications and refinements made by those skilled in the art without departing from the principle of the invention are also considered to be within the protection scope of the invention.
