Block-based processor core composition register

Document No.: 303613 | Publication date: 2021-11-26

Reading note: This technology, "Block-based processor core composition register," was designed and created by D. C. Burger and A. L. Smith on 2016-09-13. Abstract: Embodiments of the present disclosure relate to block-based processor core composition registers. Systems, apparatuses, and methods related to block-based processor core composition registers are disclosed. In one example of the disclosed technology, a processor may include a plurality of block-based processor cores for executing a program including a plurality of instruction blocks. A respective block-based processor core may include one or more sharable resources and a programmable composition control register. The programmable composition control register may be used to configure which of the one or more shareable resources are shared with other processor cores of the plurality of processor cores.

1. A processor, comprising:

a plurality of physical processor cores for executing a program comprising a plurality of instruction blocks, a given core of the physical processor cores configured to execute the instruction blocks of a given thread of the program, the given core comprising a data cache configurable to be powered up and sharable between two or more of the plurality of physical processor cores at the same time that a second resource of the given core is powered down;

a programmable composition control register in the given core comprising respective bits for: (i) enabling the data cache incorporated in the given core to be powered up and shareable with one or more other cores of the physical processor cores, (ii) enabling the second resource of the given core to be powered up, and (iii) enabling the second resource of the given core to be available for sharing, the programmable composition control register adapted to be programmed at runtime of the program; and

a programmable composition topology register comprising a single bit for each of the given core of the physical processor cores, a second core, and the other cores, the single bits collectively indicating which of the physical processor cores are grouped together;

wherein the data cache of the given core is shared with the second core while the second resource of the given core is powered down, based on a first encoding of the programmable composition control register and an enabling of the single bits for the given core and the second core;

wherein the data cache of the given core is shared with the second core while the second resource of the given core is powered up and the given core and the second core are grouped into a logical core, based on a second encoding of the programmable composition control register and an enabling of the single bits for the given core and the second core; and

wherein the data cache of the given core is shared with the second core while the second resource of the given core is powered up and the given core is excluded from being grouped with the second core, based on a third encoding of the programmable composition control register and an enabling of the single bits for the given core and the second core.

2. The processor of claim 1, wherein the programmable composition topology register is programmable, at least in part, by executing a store instruction, of an instruction block executing on any of the plurality of physical processor cores, to a memory-mapped address corresponding to the programmable composition topology register.

3. The processor of claim 2, wherein a value of the programmable composition topology register is updated during a commit phase of the instruction block containing the store instruction to the memory-mapped address corresponding to the programmable composition topology register.

4. The processor of claim 1, wherein the processor further comprises one or more additional programmable composition control registers for a shareable register file or a shareable execution unit.

5. A processor, comprising:

a plurality of processor cores for executing a program comprising a plurality of instruction blocks, a given core of the processor cores comprising a plurality of shareable resources including a data cache and additional resources;

one or more programmable registers in the given core including respective bits for: (i) configuring the data cache to be powered up and enabled to be shared with other cores of the processor cores, (ii) controlling powering down of the additional resources, and (iii) controlling whether the given core is to be composed with one or more of the other cores into a logical processor, the one or more programmable registers being programmable during execution of the program; and

a composition topology register comprising a single bit for each of the given core of the processor cores, a second core, and the other cores, the single bits collectively indicating which of the processor cores are grouped together;

wherein the data cache of the given core is shared with the second core while the additional resources of the given core are powered down, based on a first encoding of the programmable registers and an enabling of the single bits for the given core and the second core;

wherein the data cache of the given core is shared with the second core while the additional resources of the given core are powered up and the given core and the second core are grouped into a logical processor core, based on a second encoding of the programmable registers and an enabling of the single bits for the given core and the second core; and

wherein the data cache of the given core is shared with the second core while the additional resources of the given core are powered up and the given core is excluded from being grouped with the second core, based on a third encoding of the programmable registers and an enabling of the single bits for the given core and the second core.

6. The processor of claim 5, wherein the logical processor core is configured to execute a single thread of the program.

7. The processor of claim 5, wherein:

the data cache comprises a first level data cache; and

with the third encoding of the programmable registers, the given core is configured to execute a first thread of the program and the second core is configured to execute a second thread of the program.

8. The processor of claim 5, wherein the plurality of shareable resources includes the given core in its entirety.

9. The processor of claim 5, wherein the one or more programmable registers are programmable to disable particular ones of the plurality of shareable resources during operation of the given core.

10. The processor of claim 5, wherein the data cache has a private default state and a second resource of the plurality of shareable resources has a shared default state.

11. The processor of claim 5, wherein the one or more programmable registers are further configured to enable sharing of a second resource of the plurality of shareable resources with a second core of the processor cores, the second core external to the logical processor core.

12. The processor of claim 5, wherein the plurality of shareable resources comprise execution units.

13. The processor of claim 5, wherein the plurality of shareable resources comprises a register file.

14. A method of executing a program on a processor comprising a plurality of processor cores, each processor core comprising one or more sharable resources, the method comprising:

configuring a first core of the processor cores to operate according to a control register of the first core, respective bits of the control register being programmed to control whether: (i) a given sharable memory resource of the first core is powered up and enabled for sharing with a second core of the processor cores, (ii) a second resource of the first core is powered up, and (iii) the second resource of the first core is available for sharing;

executing instructions of an instruction block on the second core using the given sharable memory resource of the first core;

configuring a composition topology register to have a single bit for each of the first core of the processor cores, the second core of the processor cores, and the other cores, the single bits collectively indicating which of the processor cores are grouped together;

in response to a first encoding of the control register and an enabling of the single bits for the first core and the second core: sharing the given shareable memory resource of the first core with the second core while the second resource of the first core is powered down;

in response to a second encoding of the control register and an enabling of the single bits for the first core and the second core: sharing the given shareable memory resource of the first core with the second core while the second resource of the first core is powered up and the first and second cores are grouped into a logical core; and

in response to a third encoding of the control register and an enabling of the single bits for the first core and the second core: sharing the given shareable memory resource of the first core with the second core while the second resource of the first core is powered up and the first core is excluded from being grouped with the second core;

wherein the given sharable memory resource is a data cache.

15. The method of claim 14, wherein the given sharable memory resource and a corresponding sharable memory resource of the second core form a first bank and a second bank, respectively, of a logical cache.

16. The method of claim 15, wherein a memory address comprises an index field, and the index field is used to determine whether the memory address is mapped to the first bank or the second bank.

Technical Field

Embodiments of the present disclosure relate to block-based processor core composition registers.

Background

Microprocessors have benefited from continuing gains in transistor count, integrated circuit cost, manufacturing capital, clock frequency, and energy efficiency due to continued transistor scaling predicted by Moore's law, with little change in the associated processor Instruction Set Architectures (ISAs). However, the benefits realized from the photolithographic scaling that has driven the semiconductor industry over the past 40 years are slowing or even reversing. Reduced Instruction Set Computing (RISC) architectures have been the dominant paradigm in processor design for many years. Out-of-order superscalar implementations have not demonstrated sustained improvement in area or performance. Thus, there is ample opportunity for processor ISA improvements that extend performance gains.

Disclosure of Invention

Methods, apparatus, and computer-readable storage devices are disclosed for assembling one or more resources of a processor core into a larger logical processing core. The processor core implements a block-based processor instruction set architecture (BB-ISA). The described techniques and tools can potentially improve processor performance and may be implemented separately from one another, or in various combinations with one another. As will be described more fully below, the described techniques and tools may be implemented in: a digital signal processor, a microprocessor, an Application Specific Integrated Circuit (ASIC), a soft processor (e.g., a microprocessor core implemented in a Field Programmable Gate Array (FPGA) using reconfigurable logic), programmable logic, or other suitable logic circuitry. As will be readily apparent to one of ordinary skill in the art, the disclosed techniques may be implemented in a variety of computing platforms, including but not limited to servers, mainframes, cell phones, smartphones, PDAs, handheld devices, handheld computers, touch screen tablet devices, tablet computers, wearable computers, and laptop computers.

In some examples of the disclosed technology, a processor may include a plurality of block-based processor cores for executing a program including a plurality of instruction blocks. A respective block-based processor core may include a programmable composition control register and one or more sharable resources. The programmable composition control register may be used to configure which of the one or more shareable resources are shared with other processor cores of the plurality of processor cores.
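As a purely illustrative sketch in C (the disclosure does not define bit positions, field names, or encodings, so everything below is an assumption), a per-core composition control register of this kind might be modeled as a small set of flag bits:

```c
/* Hypothetical sketch only: the bit positions and names below are
 * invented for illustration; the disclosure specifies no encoding. */
#include <stdint.h>

enum {
    CC_DCACHE_POWER_UP  = 1u << 0, /* data cache powered up              */
    CC_DCACHE_SHAREABLE = 1u << 1, /* data cache shareable with others   */
    CC_RES2_POWER_UP    = 1u << 2, /* second resource powered up         */
    CC_RES2_SHAREABLE   = 1u << 3, /* second resource available to share */
    CC_COMPOSE          = 1u << 4, /* compose core into a logical core   */
};

/* One possible encoding: the data cache is powered up and shared while
 * the second resource remains powered down. */
static const uint32_t example_encoding =
    CC_DCACHE_POWER_UP | CC_DCACHE_SHAREABLE;
```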

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. The foregoing and other objects, features, and advantages of the disclosed subject matter will become more apparent from the following detailed description, which proceeds with reference to the accompanying figures.

Drawings

FIG. 1 illustrates a block-based processor including multiple processor cores as may be used in some examples of the disclosed technology.

FIG. 2 illustrates a block-based processor core as may be used in some examples of the disclosed technology.

FIG. 3 illustrates a plurality of instruction blocks in accordance with certain examples of the disclosed technology.

FIG. 4 illustrates portions of source code and corresponding instruction blocks.

FIG. 5 illustrates block-based processor headers and instructions as may be used in some examples of the disclosed technology.

FIG. 6 is a flow diagram illustrating an example of the progression of the state of a processor core in a block-based processor.

FIG. 7 is a diagram illustrating an example system including a processor having multiple block-based processor cores.

FIG. 8 is a diagram illustrating an example of a block-based processor core including shareable and non-shareable resources.

FIG. 9 is a diagram illustrating example control registers of a block-based processor core.

FIGS. 10-12 illustrate examples of multicore processors configured such that resources are shared among physical processor cores.

FIGS. 13 and 14 are flowcharts illustrating example methods of programming a processor to execute threads using logical processors and/or logical resources as may be performed in some examples of the disclosed technology.

FIGS. 15-17 are flowcharts illustrating example methods of configuring and/or executing threads of a program on a logical processor and/or logical resources as may be used in some examples of the disclosed technology.

FIG. 18 is a block diagram illustrating a suitable computing environment for implementing some embodiments of the disclosed technology.

Detailed Description

I. General Considerations

The present disclosure is set forth in the context of representative embodiments that are not intended to be limiting in any way.

As used in this application, the singular forms "a", "an" and "the" include the plural forms unless the context clearly dictates otherwise. Furthermore, the term "comprising" means "including". Moreover, the term "coupled" encompasses mechanical, electrical, magnetic, optical, and other practical ways of coupling or linking items together, and does not exclude the presence of intermediate elements between coupled items. Additionally, as used herein, the term "and/or" means any one or combination of items in the phrase.

The systems, methods, and apparatus described herein should not be construed as limiting in any way. Rather, the present disclosure is directed to all novel and non-obvious features and aspects of the various disclosed embodiments, alone and in various combinations and subcombinations with one another. The disclosed systems, methods, and apparatus are not limited to any specific aspect or feature or combination thereof, nor do the disclosed systems and methods require that any one or more specific advantages be present or problems be solved. Furthermore, any features or aspects of the disclosed embodiments can be used in various combinations and subcombinations with one another.

Although the operations of some of the disclosed methods are described in a particular, sequential order for convenient presentation, it should be understood that this manner of description encompasses rearrangement, unless a particular ordering is required by specific language set forth below. For example, operations described sequentially may in some cases be rearranged or performed in parallel. Moreover, for the sake of simplicity, the attached figures may not show the various ways in which the disclosed content and methods can be used in conjunction with other content and methods. Moreover, the description sometimes uses terms like "generating," "displaying," "receiving," "transmitting," "authenticating," "performing," and "initiating" to describe the disclosed methods. These terms are high-level descriptions of the actual operations performed. The actual operations corresponding to these terms will vary depending on the particular implementation and are readily discernible by one of ordinary skill in the art.

Theory of operation, scientific principles, or other theoretical descriptions presented herein with reference to the apparatus or methods of the present disclosure have been provided for purposes of better understanding and are not intended to be limiting in scope. The apparatus and methods in the appended claims are not limited to those apparatus and methods implemented in the manner described by such theories of operation.

Any of the disclosed methods may be implemented as computer-executable instructions stored on one or more computer-readable media (e.g., computer-readable media such as one or more optical media discs, volatile memory components such as DRAM or SRAM, or non-volatile memory components such as hard drives) and executed on a computer (e.g., any commercially available computer, including smart phones or other mobile devices that include computing hardware). Any of the computer-executable instructions for implementing the disclosed techniques, as well as any data created and used during implementation of the disclosed embodiments, can be stored on one or more computer-readable media (e.g., computer-readable storage media). The computer-executable instructions may be, for example, a dedicated software application or a portion of a software application that is accessed or downloaded via a web browser or other software application, such as a remote computing application. Such software can be executed, for example, on a single local computer (e.g., with a general-purpose and/or block-based processor executing on any suitable commercially-available computer) or in a network environment (e.g., via the internet, a wide-area network, a local-area network, a client-server network (such as a cloud computing network), or other such network) using one or more network computers.

For clarity, only certain selected aspects of the software-based implementation are described. Other details well known in the art are omitted. For example, it should be understood that the disclosed technology is not limited to any particular computer language or program. For example, the disclosed techniques may be implemented in C, C++, JAVA, or any other suitable programming language. Likewise, the disclosed technology is not limited to any particular computer or type of hardware. Certain details of suitable computers and hardware are well known and need not be set forth in detail in this disclosure.

Further, any of the software-based embodiments (including, for example, computer-executable instructions for causing a computer to perform any of the disclosed methods) may be uploaded, downloaded, or remotely accessed through suitable communication means. Such suitable communication means include, for example, the internet, the world wide web, an intranet, software applications, cable (including fiber optic cable), magnetic communication, electromagnetic communication (including RF, microwave, and infrared communication), electronic communication, or other such communication means.

II. Brief Introduction to the Disclosed Technology

Superscalar out-of-order microarchitectures employ a large amount of circuit resources to rename registers, schedule instructions in dataflow order, clean up after misspeculation, and retire results in order for precise exceptions. This includes expensive, energy-consuming circuitry such as deep, many-ported register files, many-ported Content Addressable Memories (CAMs) for dataflow instruction scheduling wakeup, and many-wide bus multiplexers and bypass networks, all of which are resource intensive. For example, FPGA-based implementations of multi-read, multi-write RAMs typically require a mix of replication, multi-cycle operation, clock doubling, bank interleaving, live-value tables, and other expensive techniques.

The disclosed techniques may achieve energy efficiency and/or performance enhancements by applying techniques including high instruction-level parallelism (ILP) and out-of-order (OoO) superscalar execution, while avoiding substantial complexity and overhead in both processor hardware and associated software. In some examples of the disclosed technology, a block-based processor including multiple processor cores uses an Explicit Data Graph Execution (EDGE) ISA designed for area- and energy-efficient, high-ILP execution. In some examples, use of the EDGE architecture and an associated compiler finesses away much of the register renaming, CAMs, and complexity. In some examples, respective cores of a block-based processor may store or cache fetched and decoded instructions that may be repeatedly executed, and the fetched and decoded instructions may be reused to potentially achieve reduced power and/or increased performance.

In some examples of the disclosed technology, the EDGE ISA may eliminate the need for one or more complex architectural features, including register renaming, data flow analysis, mis-speculation recovery, and in-order retirement, while supporting mainstream programming languages (such as C and C++). In some examples of the disclosed technology, a block-based processor executes multiple (two or more) instructions as atomic blocks. Block-based instructions may be used to express the semantics of program data streams and/or instruction streams in a more explicit manner, which allows for improved compiler and processor performance. In some examples of the disclosed technology, an explicit data graph execution instruction set architecture (EDGE ISA) includes information about program control flow that can be used to improve detection of inappropriate control flow instructions, thereby increasing performance, saving memory resources, and/or saving energy.

In some examples of the disclosed technology, instructions organized within instruction blocks are atomically fetched, executed, and committed. Intermediate results produced by instructions in the atomic instruction block are locally buffered until the instruction block is committed. When an instruction block is committed, updates to the visible architectural state resulting from executing instructions of the instruction block are visible to other instruction blocks. Instructions within the block are executed in dataflow order, which reduces or eliminates using register renaming and provides power efficient OoO execution. Compilers may be used to explicitly encode data dependencies through the ISA, which reduces or eliminates the burden on processor core control logic to rediscover dependencies at runtime. With predicated execution, intra-block branches may be converted to dataflow instructions, and dependencies other than memory dependencies may be limited to direct data dependencies. The disclosed target form encoding techniques allow instructions within a block to pass their operands directly via an operand buffer, which reduces access to power-hungry multi-port physical register files.
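As a rough illustration of the target-form encoding idea described above, an instruction can name the consuming instruction's operand slot instead of a destination register. The field names and widths in this C sketch are invented, not taken from the ISA:

```c
/* Rough model of target-form encoding; fields and widths are invented. */
#include <stdint.h>

typedef enum {
    SLOT_LEFT_OPERAND,  /* result goes to the target's LOP buffer  */
    SLOT_RIGHT_OPERAND, /* result goes to the target's ROP buffer  */
    SLOT_PREDICATE      /* result resolves the target's predicate  */
} target_slot_t;

typedef struct {
    uint8_t       opcode;
    uint8_t       target_index; /* consumer's index within the block */
    target_slot_t target_slot;  /* which operand buffer receives it  */
} target_form_instr_t;

/* Example: an ADD whose result feeds the left operand of instruction 7,
 * so the value moves through an operand buffer, not a register file. */
static const target_form_instr_t add_example = { 0x01, 7, SLOT_LEFT_OPERAND };
```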

Between instruction blocks, instructions may communicate using visible architectural state such as memory and registers. Thus, by utilizing a hybrid dataflow execution model, the EDGE architecture can still support imperative programming languages and sequential memory semantics, while desirably also enjoying the benefits of out-of-order execution with near in-order power efficiency and complexity.

Blocks of instructions of a single-threaded or multi-threaded program may be executed in parallel using a multi-core processor that may be dynamically adapted to a given workload. For example, the physical cores may be combined into a larger logical processor, the logical processor may be divided into smaller logical processors or physical cores, and/or resources may be shared among the physical cores. A logical processor may include one or more physical cores that coordinate with each other to execute blocks of instructions within threads of a program. By dynamically adjusting the resource sharing of the composite and physical processor cores of the logical processor, programs may execute faster and/or use less energy than processors that do not have the ability to dynamically adapt to a given workload.

As one of ordinary skill in the art will readily appreciate, a range of implementations of the disclosed techniques is possible with various area, performance, and power tradeoffs.

III. Example Block-Based Processor

Fig. 1 is a block diagram 10 of a block-based processor 100 as may be implemented in some examples of the disclosed technology. Processor 100 is configured to execute atomic instruction blocks according to an Instruction Set Architecture (ISA), which describes several aspects of processor operations, including a register model, several defined operations performed by block-based instructions, a memory model, interrupts, and other architectural features. The block-based processor includes a plurality of processor cores 110, including processor core 111.

As shown in FIG. 1, the processor cores are connected to each other via a core interconnect 120. Core interconnect 120 carries data and control signals between individual ones of cores 110, memory interface 140, and input/output (I/O) interface 145. Core interconnect 120 may transmit and receive signals using electrical, optical, magnetic, or other suitable communication techniques, and may provide communication connections according to several different topological arrangements depending on the particular desired configuration. For example, core interconnect 120 may have a crossbar, bus, point-to-point bus, or other suitable topology. In some examples, any one of cores 110 may be connected to any one of the other cores, while in other examples, some cores are connected to only a subset of the other cores. For example, each core may be connected to only the nearest 4, 8, or 20 neighboring cores. Core interconnect 120 may be used to transfer input/output data to and from the cores, and to transfer control signals and other information signals to and from the cores. For example, each of the cores 110 may receive and transmit a semaphore indicating the execution state of instructions currently being executed by each of the respective cores. In some examples, core interconnect 120 is implemented as wires connecting cores 110 and the memory system, while in other examples, the core interconnect may include circuitry, switches, and/or routing components for multiplexing data signals on the interconnect wire(s), including active signal drivers and repeaters, or other suitable circuitry. In some examples of the disclosed technology, signals within the processor 100 and/or to/from the processor 100 are not limited to full-swing electrical digital signals; rather, the processor may be configured to include differential signals, pulsed signals, or other suitable signals for communicating data and control signals.

In the example of FIG. 1, memory interface 140 of the processor includes interface logic used to connect to additional memory (e.g., memory located on another integrated circuit besides processor 100). As shown in FIG. 1, external memory system 150 includes an L2 cache 152 and main memory 155. In some examples, the L2 cache may be implemented using static RAM (SRAM), and the main memory 155 may be implemented using dynamic RAM (DRAM). In some examples, memory system 150 is included on the same integrated circuit as the other components of processor 100. In some examples, memory interface 140 includes a Direct Memory Access (DMA) controller that allows blocks of data in memory to be transferred without using the register file(s) and/or processor 100. In some examples, memory interface 140 may include a Memory Management Unit (MMU) to manage and allocate virtual memory, extending the available main memory 155.

The I/O interface 145 includes circuitry for receiving and sending input signals and output signals to other components, such as hardware interrupts, system control signals, peripheral interfaces, coprocessor control and/or data signals (e.g., signals for a graphics processing unit, floating point coprocessor, physical processing unit, digital signal processor, or other co-processing component), clock signals, semaphores, or other suitable I/O signals. The I/O signals may be synchronous or asynchronous. In some examples, all or part of the I/O interface is implemented using memory mapped I/O technology in conjunction with memory interface 140.

The block-based processor 100 may also include a control unit 160. The control unit may communicate with processing cores 110, I/O interface 145, and memory interface 140 via core interconnect 120 or a sideband interconnect (not shown). Control unit 160 oversees the operation of processor 100. Operations that may be performed by the control unit 160 may include allocation and de-allocation of cores for performing instruction processing; control of input data and output data between any of the cores, register files, memory interface 140, and/or I/O interface 145; a modification to the execution flow; and verifying target location(s) of branch instructions, instruction headers, and other changes in control flow. The control unit 160 may also handle hardware interrupts and control the reading and writing of special system registers (e.g., program counters stored in one or more register files). In some examples of the disclosed technology, control unit 160 is implemented at least in part using one or more of processor cores 110, while in other examples, control unit 160 is implemented using a non-block-based processor core (e.g., a general-purpose RISC processing core coupled to memory). In some examples, control unit 160 is implemented at least in part using one or more of: a hardwired finite state machine, programmable microcode, programmable gate array, or other suitable control circuitry. In alternative examples, the control unit functions may be performed by one or more of cores 110.

The control unit 160 includes a scheduler for allocating instruction blocks to the processor cores 110. As used herein, scheduler allocation involves hardware used to direct the operation of instruction blocks, including initiating instruction block mapping, fetching, decoding, execution, committing, aborting, idling, and refreshing instruction blocks. In some examples, the hardware receives signals generated using computer-executable instructions to direct the operation of the instruction scheduler. Processor cores 110 are assigned to instruction blocks during instruction block mapping. The recited stages of instruction operation are for illustrative purposes, and in some examples of the disclosed technology, certain operations may be combined, omitted, separated into multiple operations, or have additional operations added.

Block-based processor 100 also includes a clock generator 170 that distributes one or more clock signals to various components within the processor (e.g., cores 110, interconnects 120, memory interface 140, and I/O interface 145). In some examples of the disclosed technology, all components share a common clock, while in other examples, different components use different clocks (e.g., clock signals having different clock frequencies). In some examples, a portion of the clock is gated to allow power savings when some of the processor components are not in use. In some examples, the clock signal is generated using a Phase Locked Loop (PLL) to generate a signal having a fixed constant frequency and duty cycle. The circuits receiving the clock signal may be triggered on a single edge (e.g., a rising edge), while in other examples at least some of the circuits in the receiving circuit are triggered by rising and falling clock edges. In some examples, the clock signal may be transmitted optically or wirelessly.

IV. Example Block-Based Processor Core

FIG. 2 is a block diagram 200 further detailing an example microarchitecture for block-based processor 100 (and in particular, an instance of one of the block-based processor cores, processor core 111) as may be used in certain examples of the disclosed technology. For ease of explanation, exemplary block-based processor core 111 is illustrated with five stages: instruction fetch (IF), decode (DC), operand fetch, execute (EX), and memory/data access (LS). However, one of ordinary skill in the art will readily appreciate that the illustrated microarchitecture may be modified (such as by adding or removing stages, adding or removing units that perform operations, and other implementation details) to suit a particular application for the block-based processor.

In some examples of the disclosed technology, processor core 111 may be used to execute and commit instruction blocks of a program. An instruction block is an atomic set of block-based processor instructions that includes an instruction block header and a plurality of instructions. As will be discussed further below, the instruction block header may include information describing an execution mode of the instruction block and information that may be used to further define semantics of one or more of the plurality of instructions in the instruction block. Depending on the particular ISA and processor hardware used, instruction block headers may also be used during execution of instructions to improve performance of executing instruction blocks by, for example, allowing early fetching of instructions and/or data, improved branch prediction, speculative execution, improved energy efficiency, and improved code compactness.

The instructions of the instruction block may be data flow instructions that explicitly encode a relationship between producer-consumer instructions of the instruction block. In particular, instructions may communicate results directly to the target instruction through an operand buffer reserved only for the target instruction. The intermediate results stored in the operand buffers are typically not visible to cores external to the execution core, because the block atomic execution model only passes final results between instruction blocks. When an instruction block is committed, the final result resulting from executing the instructions of the atomic instruction block becomes visible outside the execution core. Thus, the visible architectural state generated by each instruction block may appear as a single transaction outside of the execution core, and intermediate results are typically not observable outside of the execution core.

As shown in fig. 2, processor core 111 includes a control unit 205, and control unit 205 may receive control signals from other cores and generate control signals to regulate core operations and schedule instruction streams within the cores using instruction scheduler 206. The control unit 205 may include state access logic 207 to check the core state and/or configure the operating mode of the processor core 111. Control unit 205 may include execution control logic 208 to generate control signals during one or more modes of operation of processor core 111. Operations that may be performed by control unit 205 and/or instruction scheduler 206 may include allocation and de-allocation of cores for performing instruction processing; control of input data and output data between any of the cores, register file, memory interface 140, and/or I/O interface 145. The control unit 205 may also handle hardware interrupts and control the reading and writing of special system registers, such as program counters stored in one or more register files. In other examples of the disclosed technology, control unit 205 and/or instruction scheduler 206 are implemented using a non-block-based processing core (e.g., a general purpose RISC processing core coupled to memory). In some examples, control unit 205, instruction scheduler 206, state access logic 207, and/or execution control logic 208 are implemented, at least in part, using one or more of: hard-wired finite state machines, programmable microcode, programmable gate arrays, or other suitable control circuitry.

The control unit 205 may decode the instruction block header to obtain information about the instruction block. For example, the execution mode of an instruction block may be specified in the instruction block header by various execution flags. The decoded execution mode may be stored in a register of execution control logic 208. Based on the execution mode, execution control logic 208 may generate control signals to regulate core operations and schedule instruction flow within core 111, such as by using instruction scheduler 206. For example, during a default execution mode, execution control logic 208 may order instructions of one or more instruction blocks executing on one or more instruction windows (e.g., 210, 211) of processor core 111. In particular, each instruction may be ordered by instruction fetch, decode, operand fetch, execute, and memory/data access stages, such that instructions of an instruction block may be pipelined and executed in parallel. The instructions are ready for execution when their operands are available, and the instruction scheduler 206 may select the order in which the instructions are executed.

State access logic 207 may include interfaces for other cores and/or processor-level control units (such as control unit 160 of fig. 1) to communicate with core 111 and access the state of core 111. For example, state access logic 207 may be connected to a core interconnect (such as core interconnect 120 of fig. 1), and other cores may communicate via control signals, messages, read and write registers, and so forth.

State access logic 207 may include control state registers or other logic to modify and/or check the mode and/or state of an instruction block and/or the core state. As examples, the core state may indicate whether an instruction block is mapped to core 111 or an instruction window of core 111 (e.g., instruction windows 210, 211), whether an instruction block resides on core 111, whether an instruction block is executing on core 111, whether an instruction block is ready to commit, whether an instruction block is performing a commit, and whether an instruction block is idle. As another example, the state of an instruction block may include a flag or flags indicating that the instruction block is the oldest instruction block executing and a flag indicating that the instruction block is executing speculatively. As another example, the state of the processor core may include a power state, such as a power state for individual resources of the processor core or for the entire processor core.

A Control Status Register (CSR) may be mapped to a unique memory location reserved for use by a block-based processor. For example, the CSR of control unit 160 (fig. 1) may be assigned to a first address range, the CSR of memory interface 140 (fig. 1) may be assigned to a second address range, the first processor core may be assigned to a third address range, the second processor core may be assigned to a fourth address range, and so on. In one embodiment, the CSR may be accessed using general purpose memory load and store instructions of a block-based processor. Additionally or alternatively, the CSR may be accessed using specific read and write instructions for the CSR (e.g., instructions for read/write CSR have different instructions than memory load and store instructions). Thus, one core may check the configuration status of a different core by reading from an address corresponding to the CSR of the different core. Similarly, one core may modify the configuration state of a different core by writing to an address corresponding to the CSR of the different core. The CSR may be protected or controlled so that only trusted software components can read and write to the CSR. For example, a CSR can only be accessed from privileged or kernel mode. Additionally or alternatively, the CSRs may be accessed by shifting commands into state access logic 207 through a serial scan chain, such as through a Joint Test Action Group (JTAG) test access port and boundary scan logic. In this manner, one core may check the state access logic 207 of different cores, and one core may modify the state access logic 207 or mode of different cores.
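A minimal C sketch of the memory-mapped CSR access described above follows; the base address, per-core stride, and register offset are assumptions for illustration, not values from the disclosure:

```c
/* Hypothetical sketch of memory-mapped CSR access; the base address,
 * per-core stride, and register offset below are assumed, not
 * specified by the text. */
#include <stdint.h>

#define CSR_BASE        0xF0000000u /* assumed start of CSR space  */
#define CSR_CORE_STRIDE 0x00001000u /* assumed per-core range size */
#define CSR_COMPOSE     0x00000010u /* assumed register offset     */

static inline volatile uint32_t *csr_addr(unsigned core, uint32_t offset)
{
    return (volatile uint32_t *)(uintptr_t)
        (CSR_BASE + core * CSR_CORE_STRIDE + offset);
}

/* One core checking and modifying another core's configuration using
 * ordinary loads and stores to the mapped address. */
static void enable_sharing_on(unsigned core, uint32_t share_bits)
{
    uint32_t v = *csr_addr(core, CSR_COMPOSE); /* read current state */
    *csr_addr(core, CSR_COMPOSE) = v | share_bits;
}
```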

As further described herein, state access logic 207 may include registers or other logic to configure and/or reconfigure the core to operate in different operating modes. For example, state access logic 207 may include a CSR-writable composition control register that controls which resources of a processor core may be shared with other processor cores. The composition control register may also specify whether the processor core is to be composed as part of a logical processor. The composition control register may also control the power state of the processor core. As another example, state access logic 207 may include a CSR-writable composition topology register that controls which processor cores may be grouped with other processor cores.
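Under the reading, consistent with the claims above but not a specified layout, that the composition topology register holds one bit per physical core, a grouping check might look like this sketch:

```c
/* Sketch assuming a one-bit-per-core layout; not a specified format. */
#include <stdint.h>

typedef uint32_t compose_topology_t; /* supports up to 32 cores */

/* True if both cores' bits are set, i.e., the two physical cores are
 * marked as grouped into the same logical processor. */
static inline int cores_grouped(compose_topology_t topo,
                                unsigned core_a, unsigned core_b)
{
    return ((topo >> core_a) & 1u) && ((topo >> core_b) & 1u);
}
```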

Each of instruction windows 210 and 211 may receive instructions and data from one or more of input ports 220, 221, and 222 (which are connected to the interconnect bus) and instruction cache 227 (which is in turn connected to instruction decoders 228 and 229). Additional control signals may also be received on additional input ports 225. Each of instruction decoders 228 and 229 decodes instructions for use in an instruction block and stores the decoded instructions within memory banks 215 and 216 located in each respective instruction window 210 and 211.

Processor core 111 also includes a register file 230 coupled to an L1 (first level) cache 235. Register file 230 stores data for registers defined in the block-based processor architecture and may have one or more read ports and one or more write ports. For example, a register file may include two or more write ports for storing data in the register file, and multiple read ports for reading data from individual registers within the register file. In some examples, a single instruction window (e.g., instruction window 210) may access only one port of the register file at a time, while in other examples, instruction window 210 may access one read port and one write port, or may access two or more read ports and/or write ports simultaneously. In some examples, register file 230 may include 64 registers, each holding a 32-bit word of data. (Unless otherwise specified, this application refers to 32 bits of data as a word.) In some examples, some of the registers within register file 230 may be allocated for special purposes. For example, some of the registers may be dedicated as system registers; examples include registers that store a constant value (e.g., an all-zero word), a program counter (PC) or counters indicating the current address of the program thread being executed, the number of physical cores, the number of logical cores, a core allocation topology, core control flags, a processor topology, or other suitable dedicated purposes. In some examples, there are multiple program counter registers, one for each program thread, to allow concurrent execution of multiple execution threads across one or more processor cores and/or processors. In some examples, the program counter is implemented as a designated memory location rather than a register in a register file. In some examples, use of the system registers may be restricted by the operating system or other supervisory computer instructions. In some examples, register file 230 is implemented as an array of flip-flops, while in other examples, the register file may be implemented using latches, SRAM, or other forms of memory storage. The ISA specification for a given processor (e.g., processor 100) specifies how registers within register file 230 are defined and used.

In some examples, processor 100 includes a global register file shared by multiple processor cores. In some examples, individual register files associated with processor cores may be combined to form larger files statically or dynamically, depending on the processor ISA and configuration.

As shown in FIG. 2, the memory store 215 of the instruction window 210 includes a number of decoded instructions 241, a Left Operand (LOP) buffer 242, a Right Operand (ROP) buffer 243, and an instruction scoreboard 245. In some examples of the disclosed technology, each instruction of an instruction block is decomposed into a row of decoded instructions, left and right operands, and scoreboard data, as shown in FIG. 2. Decoded instructions 241 may include partially or fully decoded versions of instructions stored as bit-level control signals. Operand buffers 242 and 243 store operands (e.g., register values received from register file 230, data received from memory, immediate operands encoded within the instruction, operands computed by earlier-issued instructions, or other operand values) until their corresponding decoded instructions are ready for execution. Instruction operands are read from operand buffers 242 and 243, not the register file.
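A rough C model of a single row of the memory store just described (one decoded instruction together with its operand buffers and scoreboard bits) is sketched below; the field widths and names are illustrative only:

```c
/* Illustrative model of one instruction-window row; widths invented. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t decoded;    /* bit-level control signals for the instruction */
    uint32_t lop;        /* Left Operand (LOP) buffer                     */
    uint32_t rop;        /* Right Operand (ROP) buffer                    */
    bool     lop_ready;  /* scoreboard: left operand has arrived          */
    bool     rop_ready;  /* scoreboard: right operand has arrived         */
    bool     pred_ready; /* scoreboard: predicate resolved (if any)       */
} window_row_t;

/* An instruction is ready once every input it actually uses is
 * available; unused inputs would be initialized as already ready. */
static bool row_ready(const window_row_t *row)
{
    return row->lop_ready && row->rop_ready && row->pred_ready;
}
```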

The memory store 216 of the second instruction window 211 stores similar instruction information (decoded instructions, operands, and scoreboard) as the memory store 215, but is not shown in FIG. 2 for simplicity. Instruction blocks may be executed by the second instruction window 211 concurrently or sequentially with respect to the first instruction window, subject to ISA constraints and as directed by the control unit 205.

In some examples of the disclosed technology, the front-end pipeline stages IF and DC may be decoupled from the back-end pipeline stages (IS, EX, LS). In one embodiment, the control unit may fetch and decode two instructions per clock cycle into each of instruction windows 210 and 211. In alternative embodiments, the control unit may fetch and decode one, four, or another number of instructions per clock cycle into a corresponding number of instruction windows. The control unit 205 provides instruction window dataflow scheduling logic that uses the scoreboard 245 to monitor the ready state of the inputs of each decoded instruction (e.g., each respective instruction's predicate(s) and operand(s)). When all of the inputs for a particular decoded instruction are ready, the instruction is ready to issue. Control logic 205 then initiates execution of one or more next instructions (e.g., the lowest-numbered ready instruction) each cycle, and its decoded instruction and input operands are sent to one or more of functional units 260 for execution. The decoded instructions may also encode a number of ready events. The scheduler in control logic 205 accepts these and/or events from other sources and updates the ready state of other instructions in the window. Thus, execution proceeds, starting with the processor core 111's ready zero-input instructions, continuing with instructions that target the zero-input instructions, and so on.

Decoded instructions 241 need not be executed in the same order in which they are arranged within memory store 215 of instruction window 210. Instead, the instruction scoreboard 245 is used to track dependencies of the decoded instructions and, when the dependencies have been met, the associated individual decoded instruction is scheduled for execution. For example, a reference to a respective instruction may be pushed onto a ready queue when the dependencies have been satisfied for that instruction, and instructions may be scheduled from the ready queue in first-in-first-out (FIFO) order. The information stored in the scoreboard 245 may include, but is not limited to, the associated instruction's execution predicate (such as whether the instruction is waiting for a predicate bit to be computed and whether the instruction executes if the predicate bit is true or false), the availability of operands to the instruction, or other preconditions required before executing the associated individual instruction.

In one embodiment, the scoreboard 245 may include: a decoded ready state, which is initialized by instruction decoder 228; and an active ready state, which is initialized by the control unit 205 during execution of the instructions. For example, the decoded ready state may encode whether the corresponding instruction has been decoded, is awaiting a predicate and/or some operand(s) (perhaps via a broadcast channel), or is immediately ready to issue. The active ready state may encode whether the corresponding instruction is awaiting a predicate and/or some operand(s), is ready to issue, or has already issued. The decoded ready state may be cleared upon a block reset or a block refresh. Upon branching to a new instruction block, both the decoded ready state and the active ready state are cleared (a block or core reset). However, when an instruction block is re-executed on the core, such as when it branches back to itself (a block refresh), only the active ready state is cleared. A block refresh may occur immediately (when an instruction block branches to itself) or after execution of a number of other intervening instruction blocks. The decoded ready state of the instruction block can thus be preserved so that it is not necessary to re-fetch and decode the block's instructions. Hence, block refresh can be used to save time and energy in loops and other repetitive program structures.
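A minimal sketch of the two ready states and the reset/refresh distinction just described, with an invented structure and block size:

```c
/* Minimal sketch; the structure, names, and block size are invented. */
#include <stdint.h>
#include <string.h>

#define MAX_BLOCK_INSTRS 128

typedef struct {
    uint8_t decoded_ready[MAX_BLOCK_INSTRS]; /* set once by the decoder */
    uint8_t active_ready[MAX_BLOCK_INSTRS];  /* set while executing     */
} scoreboard_t;

/* Branching to a new block clears both states, so the new block's
 * instructions must be fetched and decoded. */
static void block_reset(scoreboard_t *sb)
{
    memset(sb, 0, sizeof *sb);
}

/* A block refresh (the block branches back to itself) clears only the
 * active ready state, preserving the decoded instructions so loops
 * avoid re-fetch and re-decode. */
static void block_refresh(scoreboard_t *sb)
{
    memset(sb->active_ready, 0, sizeof sb->active_ready);
}
```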

The number of instructions stored in each instruction window typically corresponds to the number of instructions within an instruction block. In some examples, the number of instructions within an instruction block may be 32, 64, 128, 1024, or another number of instructions. In some examples of the disclosed technology, instruction blocks are allocated across multiple instruction windows within a processor core. In some examples, the instruction windows 210, 211 may be logically partitioned such that multiple instruction blocks may be executed on a single processor core. For example, one, two, four, or another number of instruction blocks may be executed on one core. The respective instruction blocks may be executed concurrently with each other or sequentially.

Instructions may be allocated and scheduled using control unit 205 located within processor core 111. The control unit 205 orchestrates fetching of instructions from memory, decoding of the instructions, execution of instructions once they have been loaded into a respective instruction window, data flow into/out of the processor core 111, and control signals input and output by the processor core. For example, the control unit 205 may include the ready queue, as described above, for use in scheduling instructions. Instructions stored in memory stores 215 and 216 located in each respective instruction window 210 and 211 may be executed atomically. Thus, updates to the visible architectural state (such as register file 230 and memory) affected by the executed instructions may be buffered locally within the core until the instructions are committed. The control unit 205 may determine when instructions are ready to be committed, sequence the commit logic, and issue a commit signal. For example, the commit phase for an instruction block may begin when all register writes are buffered, all writes to memory are buffered, and a branch target is computed. The instruction block may be committed when the updates to the visible architectural state are complete. For example, an instruction block may be committed when the register writes are written to the register file, the stores are sent to the load/store unit or memory controller, and the commit signal is generated. The control unit 205 also controls, at least in part, the allocation of the functional units 260 to each of the respective instruction windows.
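The commit condition just described can be summarized in a small predicate; the flag names in this sketch are assumptions:

```c
/* Sketch of the commit-readiness check; flag names are assumptions. */
#include <stdbool.h>

typedef struct {
    bool register_writes_buffered; /* all register writes buffered  */
    bool memory_writes_buffered;   /* all stores to memory buffered */
    bool branch_target_computed;   /* next block's target is known  */
} block_commit_state_t;

static bool ready_to_commit(const block_commit_state_t *s)
{
    return s->register_writes_buffered &&
           s->memory_writes_buffered &&
           s->branch_target_computed;
}
```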

As shown in fig. 2, a first router 250 having a number of execution pipeline registers 255 is used to route data from either of instruction windows 210 and 211 to one or more of functional units 260, which may include, but are not limited to, integer ALUs (arithmetic logic units) (e.g., integer ALUs 264 and 265), floating point units (e.g., floating point ALU 267), shift/rotate logic (e.g., barrel shifter 268), or other suitable execution units, which may include graphics functions, physical functions, and other mathematical operations. Data from functional unit 260 may then be routed through second router 270 to outputs 290, 291, and 292, routed back to an operand buffer (e.g., LOP buffer 242 and/or ROP buffer 243), or fed back to another functional unit, depending on the requirements for a particular instruction to be executed. The second router 270 may include: a load/store queue 275 that may be used to issue memory instructions; a data cache 277 that stores data being output from the cores to memory; and a load/store pipeline register 278.

The core also includes a control output 295 for indicating when execution of all instructions, for example, of one or more of instruction windows 210 or 211, has completed. When execution of an instruction block is complete, the instruction block is designated as "committed" and signals from control output 295 may in turn be used by other cores within block-based processor 100 and/or by control unit 160 to initiate scheduling, fetching, and execution of other instruction blocks. Both the first router 250 and the second router 270 may send data back to the instruction (e.g., as operands for other instructions within the instruction block).

As one of ordinary skill in the art will readily appreciate, the components within an individual core are not limited to those shown in fig. 2, but may vary depending on the requirements of a particular application. For example, a core may have fewer or more instruction windows, a single instruction decoder may be shared by two or more instruction windows, and the number and type of functional units used may vary depending on the particular target application for the block-based processor. Other considerations that apply when selecting and allocating resources with an instruction core include performance requirements, energy usage requirements, integrated circuit chips, processing technology, and/or cost.

As will be readily apparent to one of ordinary skill in the art, tradeoffs in processor performance may be made through the design and allocation of resources within the instruction window (e.g., instruction window 210) and control logic 205 of the processor core 110. The area, clock cycles, capabilities, and limitations substantially determine the performance of the individual cores 110 and the throughput of the block-based processor cores 110.

Instruction scheduler 206 may have diverse functionality. In some higher-performance examples, the instruction scheduler is highly concurrent. For example, each cycle the decoder(s) write the decoded ready state of instructions and the decoded instructions into one or more instruction windows and select the next instruction to issue, and, in response, the back end sends ready events: either target ready events targeting a specific instruction's input slot (predicate, left operand, right operand, etc.), or broadcast ready events targeting all instructions. The per-instruction ready state bits, together with the decoded ready state, can be used to determine that the instruction is ready to issue.

In some examples, instruction scheduler 206 is implemented using a storage device (e.g., a first-in-first-out (FIFO) queue, a Content Addressable Memory (CAM)) that stores data indicative of information used to schedule execution of instruction blocks in accordance with the disclosed techniques. For example, data regarding instruction dependencies, passing of control, speculation, branch prediction, and/or data loads and stores are arranged in storage to facilitate determinations of mapping instruction blocks into processor cores. For example, instruction block dependencies may be associated with tags that are stored in a FIFO or CAM and subsequently accessed by selection logic used to map instruction blocks to one or more processor cores. In some examples, instruction scheduler 206 is implemented using a general purpose processor coupled to a memory configured to store data for scheduling instruction blocks. In some examples, instruction scheduler 206 is implemented using a special-purpose processor or using a block-based processor core coupled to memory. In some examples, instruction scheduler 206 is implemented as a finite state machine coupled to a memory. In some examples, an operating system executing on a processor (e.g., a general purpose processor or block-based processor core) generates priorities, predicates, and other data that can be used, at least in part, to schedule instruction blocks with instruction scheduler 206. As those of ordinary skill in the art will readily appreciate, other circuit configurations implemented in integrated circuits, programmable logic, or other suitable logic may be used to implement the hardware for instruction scheduler 206.

In some cases, the scheduler 206 accepts events for target instructions that have not yet been decoded, and must also inhibit the reissue of already-issued ready instructions. Instructions may be non-predicated, or predicated (based on a true or false condition). A predicated instruction does not become ready until it is targeted by another instruction's predicate result and that result matches the predicate condition. If the predicate result does not match, the instruction never issues. In some examples, predicated instructions may be issued and executed speculatively. In some examples, the processor may subsequently check that the speculatively issued and executed instructions were correctly speculated. In some examples, a mis-speculatively issued instruction and the transitive closure of instructions in the block that consume its outputs may be re-executed, or the side effects of the mis-speculation may be annulled. In some examples, discovery of a mis-speculated instruction results in complete rollback and re-execution of the entire instruction block.

V. Example Instruction Block Stream

Turning now to diagram 300 of FIG. 3, a portion 310 of a block-based instruction stream is illustrated, including a number of variable-length instruction blocks 311-315 (A-E). The instruction stream may be used to implement a user application, a system service, or any other suitable use. In the example shown in fig. 3, each instruction block begins with an instruction header, which is followed by a varying number of instructions. For example, the instruction block 311 includes a header 320 and twenty instructions 321. The particular instruction header 320 illustrated includes a number of data fields that control, in part, execution of the instructions within the instruction block, and that also allow for improved performance enhancement techniques including, for example, branch prediction, speculative execution, lazy evaluation, and/or other techniques. The instruction header 320 also includes an ID bit indicating that the header is an instruction header and not an instruction. The instruction header 320 also includes an indication of the instruction block size. The instruction block size may be expressed in chunks of instructions larger than one, for example, the number of 4-instruction chunks contained within the instruction block. In other words, the block size is shifted by 4 bits in order to compress the header space allocated to specifying the instruction block size. Thus, a size value of 0 indicates a minimally sized instruction block, which is a block header followed by four instructions. In some examples, the instruction block size is expressed as a number of bytes, a number of words, a number of n-word chunks, an address offset, or using another suitable expression for describing the size of the instruction block. In some examples, the instruction block size is indicated by a terminating bit pattern in the instruction block header and/or footer.
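
As a concrete reading of the size encoding just described, where the size field counts 4-instruction chunks and a value of 0 denotes a header followed by four instructions, the following C sketch decodes the field. The function name and the unsigned type are illustrative assumptions.

    /* Hypothetical decode of the header's size field: the field counts
       4-instruction chunks, so a field value of 0 denotes the smallest
       block (an instruction header followed by four instructions). */
    static unsigned block_instruction_count(unsigned size_field)
    {
        return (size_field + 1) * 4;
    }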

The instruction block header 320 may also include an execution flag indicating a particular instruction execution requirement. For example, branch prediction or memory dependency prediction may be disabled for certain instruction blocks, depending on the particular application.

In some examples of the disclosed technology, the instruction header 320 includes one or more identification bits indicating that the encoded data is an instruction header. For example, in some block-based processor ISAs, a single ID bit in the least significant bit space is always set to the binary value 1 to indicate the start of a valid instruction block. In other examples, different bit encodings may be used for the identification bit(s). In some examples, instruction header 320 includes information indicating the particular version of the ISA for which the associated instruction block is encoded.

The instruction block header may also include a number of block exit types for use in, for example, branch prediction, control flow determination, and/or bad jump detection. The exit type may indicate the type of a branch instruction, for example: a sequential branch instruction, which points to the next contiguous instruction block in memory; an offset instruction, which is a branch to another instruction block at a memory address calculated relative to an offset; a subroutine call; or a subroutine return. By encoding the branch exit types in the instruction header, the branch predictor may begin operation at least partially before branch instructions within the same instruction block have been fetched and/or decoded.

Instruction block header 320 also includes a store mask that identifies the load-store queue identifiers assigned to store operations. The instruction block header may also include a write mask that identifies the global register(s) to which the associated instruction block will write. The associated register file must receive a write to each entry before the instruction block can complete. In some examples, a block-based processor architecture may include not only scalar instructions but also Single Instruction Multiple Data (SIMD) instructions, which allow operations on a larger number of data operands within a single instruction.

VI. Example Block Instruction Target Encoding

FIG. 4 is a diagram 400 depicting an example of two portions 410 and 415 of C language source code and their corresponding instruction blocks 420 and 425 (in assembly language), illustrating how block-based instructions may explicitly encode their targets. High-level C language source code may be converted to low-level assembly language and machine code by a compiler that targets a block-based processor. The high-level language may abstract away many details of the underlying computer architecture so that a programmer may focus on the functionality of the program. In contrast, the machine code encodes the program according to the target computer's ISA so that it can be executed on the target computer using the computer's hardware resources. Assembly language is a human-readable form of machine code.

In the following example, the assembly language instructions use the following nomenclature: "I[<number>]" specifies the number of the instruction within the instruction block, where numbering begins at zero for the instruction following the instruction header and the instruction number is incremented for each successive instruction; the operation of the instruction (such as READ, ADDI, DIV, etc.) follows the instruction number; an optional value (such as the immediate value 1) or a reference to a register (such as R0 for register 0) follows the operation; and optional targets that are to receive the results of the instruction follow the value and/or operation. Each target may be another instruction, a broadcast channel to other instructions, or a register that can be visible to another instruction block when the instruction block is committed. An example of an instruction target is T[1R], which targets the right operand of instruction 1. An example of a register target is W[R0], where the target is written to register 0.

In diagram 400, the first two READ instructions 430 and 431 of instruction block 420 target the right (T[2R]) and left (T[2L]) operands, respectively, of the ADD instruction 432. In the illustrated ISA, the READ instruction is the only instruction that reads from the global register file; however, any instruction may target the global register file. When the ADD instruction 432 receives the results of both register reads, it will become ready and execute.

When the TLEI (test-less-than-equals-immediate) instruction 433 receives its single input operand from the ADD, it will become ready and execute. The test then produces a predicate operand that is broadcast on channel one (B[1P]) to all instructions listening on the broadcast channel, which in this example are the two predicated branch instructions (BRO_T 434 and BRO_F 435). The branch that receives the matching predicate will fire.

A dependency graph 440 for the instruction block 420 is also illustrated, as an instruction node array 450 and its corresponding operand targets 455 and 456. This illustrates the correspondence between the block instructions 420, the corresponding instruction window entries, and the underlying dataflow graph represented by the instructions. Here, the decoded instructions READ 430 and READ 431 are ready to issue, as they have no input dependencies. As they issue and execute, the values read from registers R6 and R7 are written into the right and left operand buffers of ADD 432, marking the left and right operands of ADD 432 "ready." As a result, the ADD 432 instruction becomes ready, issues to an ALU, executes, and the sum is written to the left operand of TLEI 433.
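
Although the source portions 410 and 415 are not reproduced here, C code of roughly the following shape would produce the dataflow just described: two register reads feeding an add, a test against a small immediate, and a two-way predicated branch. This is a hypothetical reconstruction for illustration only, not the actual source shown in FIG. 4.

    /* Hypothetical C fragment matching the dataflow of block 420: two
       reads (R6, R7) feed an ADD, whose sum feeds a test against an
       immediate; the resulting predicate selects between the BRO_T and
       BRO_F paths. The immediate value 1 is assumed. */
    int example(int r6, int r7)
    {
        int sum = r6 + r7;       /* READ, READ -> ADD            */
        if (sum <= 1) {          /* TLEI broadcasts predicate     */
            return 1;            /* BRO_T path (predicate true)   */
        } else {
            return 0;            /* BRO_F path (predicate false)  */
        }
    }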

By comparison, a conventional out-of-order RISC or CISC processor would build the dependency graph at runtime, at the cost of additional hardware complexity, power, area, and reduced clock frequency and performance. However, the dependency graph is statically known at compile time, and an EDGE compiler can directly encode the producer-consumer relationships between the instructions through the ISA, freeing the microarchitecture from rediscovering them dynamically. This can potentially enable a simpler microarchitecture, reducing area and power while boosting frequency and performance.

VII. Example Block-Based Instruction Formats

FIG. 5 is a diagram illustrating generalized examples of instruction formats for an instruction header 510, a generic instruction 520, and a branch instruction 530. Each of the instruction header and the instructions is labeled according to the number of bits. For example, the instruction header 510 includes four 32-bit words and is labeled from its least significant bit (lsb) (bit 0) up to its most significant bit (msb) (bit 127). As shown, the instruction header includes a write mask field, a store mask field, a number of exit type fields, a number of execution flag fields (X flags), an instruction block size field, and instruction header ID bits (the least significant bits of the instruction header).
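
One way to picture the 128-bit header is as a packed set of the fields named above, ordered from the least significant bits upward. The C sketch below lists the fields in that order; the individual field widths are not given in the text, so the struct is illustrative only and deliberately not bit-accurate.

    #include <stdint.h>

    /* Illustrative listing of the instruction header 510 fields, from
       least significant bits upward. Only the field names and their
       ordering come from the text; the widths are placeholders. */
    typedef struct {
        uint32_t id_bits;     /* instruction header ID bits (lsb end) */
        uint32_t size;        /* instruction block size field         */
        uint32_t exec_flags;  /* execution flags (X flags)            */
        uint32_t exit_types;  /* block exit type fields               */
        uint32_t store_mask;  /* store mask                           */
        uint32_t write_mask;  /* global register write mask           */
    } block_header_t;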

The execution flag fields may indicate particular instruction execution modes. For example, a "suppress branch predictor" flag may be used, when set, to suppress branch prediction for the instruction block. As another example, a "suppress memory dependency prediction" flag may be used, when set, to suppress memory dependency prediction for the instruction block. As another example, a "break after block" flag may be used to halt the instruction thread and raise an interrupt when the instruction block is committed. As another example, a "break before block" flag may be used to halt the instruction thread and raise an interrupt when the instruction block header is decoded, before the instructions of the instruction block are executed.

The exit type field includes data that may be used to indicate the type of control flow and/or synchronization instructions encoded within the instruction block. For example, the exit type field may indicate that the instruction block includes one or more of: a sequential branch instruction, an offset branch instruction, an indirect branch instruction, a call instruction, a return instruction, and/or an interrupt instruction. In some examples, a branch instruction may be any control flow instruction used to pass control flow between instruction blocks, including relative and/or absolute addresses, and using either a conditional or unconditional predicate. In addition to determining implicit control flow instructions, the exit type field may be used for branch prediction and speculative execution. In some examples, up to six exit types may be encoded in the exit type field, and correspondence between the field and the corresponding explicit or implicit control flow instructions may be determined by, for example, examining the control flow instructions in the instruction block.

The illustrated generic instruction 520 is stored as a 32-bit word and includes an opcode field, a predicate field, a broadcast ID field (BID), a first target field (T1), and a second target field (T2). For instructions with more consumers than target fields, the compiler may build a fanout tree using move instructions, or it may assign high-fanout instructions to broadcasts. Broadcasts support sending an operand over a lightweight network to any number of consumer instructions in the core. The broadcast identifier may be encoded in the generic instruction 520.

Although the generic instruction format outlined by the generic instruction 520 may represent some or all of the instructions processed by a block-based processor, those skilled in the art will readily appreciate that, even for a particular example of an ISA, one or more of the instruction fields may deviate from the generic format for particular instructions. The opcode field specifies the length or width of the instruction 520 and the operation(s) performed by the instruction 520, such as memory read/write, register load/store, add, subtract, multiply, divide, shift, rotate, system operation, or other suitable instruction. The predicate field specifies the condition under which the instruction will execute. For example, the predicate field may specify the value "true," and the instruction will only execute if a corresponding condition flag matches the specified predicate value. In some examples, the predicate field specifies, at least in part, the value that is used to compare the predicate, while in other examples, execution is predicated on a flag set by a previous instruction (e.g., a preceding instruction in the instruction block). In some examples, the predicate field may specify that the instruction will always, or never, be executed. Thus, use of the predicate field may allow for denser object code, improved energy efficiency, and improved processor performance by reducing the number of branch instructions.
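
The predicate field's issue rule can be sketched as a small check: a non-predicated instruction always satisfies it, while a predicated instruction issues only when the arriving predicate result matches the condition encoded in the field. The enum and names below are assumptions made for illustration.

    #include <stdbool.h>

    typedef enum { PRED_NONE, PRED_TRUE, PRED_FALSE } predicate_t;

    /* A predicated instruction's predicate slot is satisfied only when
       the arriving predicate result matches the encoded condition. */
    static bool predicate_satisfied(predicate_t field, bool result)
    {
        if (field == PRED_NONE)
            return true;           /* not predicated                 */
        if (field == PRED_TRUE)
            return result;         /* executes on a true predicate   */
        return !result;            /* executes on a false predicate  */
    }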

The target fields T1 and T2 specify the instructions to which the results of the block-based instruction are sent. For example, an ADD instruction at instruction slot 5 may specify that the result of its computation will be sent to the instructions at slots 3 and 10. Depending on the particular instruction and ISA, one or both of the illustrated target fields may be replaced by other information; for example, the first target field T1 may be replaced by an immediate operand, an additional opcode, a specification of two targets, and so forth.

The branch instruction 530 includes an opcode field, a predicate field, a broadcast ID field (BID), and an offset field. The opcode and predicate fields are similar in format and function to those described for the generic instruction. The offset may be expressed in units of four instructions, thus extending the range of memory addresses over which a branch may be executed. The predication shown with the generic instruction 520 and the branch instruction 530 may be used to avoid additional branches within an instruction block. For example, execution of a particular instruction may be predicated on the result of a previous instruction (e.g., a comparison of two operands). If the predicate is false, the instruction will not commit the values calculated by that particular instruction. If the predicate value does not match the required predicate, the instruction does not issue. For example, a BRO_F (predicated-false) instruction will issue only if it is sent a false predicate value.
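
Since the generic and branch instructions are 32-bit (4-byte) words and the offset field counts units of four instructions, each offset step spans 16 bytes. The following sketch computes a branch target under those assumptions; the signed offset and the function name are illustrative.

    #include <stdint.h>

    /* Hypothetical branch target computation: instructions are 32-bit
       (4-byte) words and the offset field is expressed in units of four
       instructions, so each offset step covers 16 bytes. */
    static uint64_t branch_target(uint64_t block_address, int32_t offset)
    {
        return block_address + (int64_t)offset * 4 /* instructions */
                                               * 4 /* bytes each   */;
    }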

It should be readily understood that, as used herein, the term "branch instruction" is not limited to changing program execution to a relative memory location, but also includes jumps to absolute or symbolic memory locations, subroutine calls and returns, and other instructions that may modify the execution flow. In some examples, the execution flow is modified by changing the value of a system register (e.g., a program counter PC or an instruction pointer), while in other examples the execution flow may be changed by modifying a value stored at a designated location in memory. In some examples, a jump-register branch instruction is used to jump to a memory location stored in a register. In some examples, subroutine calls and returns are implemented using jump-and-link and jump-register instructions, respectively.

VIII. Example States of a Processor Core

FIG. 6 is a flow diagram illustrating an example of a progression of states 600 of a processor core of a block-based processor. A block-based processor includes a plurality of processor cores that are collectively used to run or execute a software program. A program may be written in a variety of high-level languages and then compiled for the block-based processor using a compiler that targets the block-based processor. The compiler may emit code that, when run or executed on the block-based processor, will perform the functionality specified by the high-level program. The compiled code may be stored in a computer-readable memory accessible by the block-based processor. The compiled code may include a stream of instructions grouped into a series of instruction blocks. During execution, one or more of the instruction blocks may be executed by the block-based processor to perform the functionality of the program. Typically, a program will include more instruction blocks than can be executed on the cores at any one time. Thus, blocks of the program are mapped to respective cores, the cores perform the work specified by the blocks, and then the blocks on the respective cores are replaced with different blocks until the program is complete. Some of the instruction blocks may be executed more than once (such as during a loop or a subroutine of the program). An "instance" of an instruction block may be created each time the instruction block is to be executed. Thus, each repetition of an instruction block may use a different instance of the instruction block. As the program runs, the respective instruction blocks may be mapped to and executed on the processor cores based on architectural constraints, available hardware resources, and the dynamic flow of the program. During execution of the program, the respective processor cores may transition through the progression of states 600, such that one core may be in one state and another core may be in a different state.

At state 605, the state of the respective processor core may be unmapped. An unmapped processor core is a core that is not currently assigned to execute an instance of an instruction block. For example, the processor cores may be unmapped before the program begins execution on the block-based computer. As another example, a processor core may be unmapped after a program begins execution but not all cores are being used. In particular, instruction blocks of a program are executed at least in part according to the dynamic flow of the program. Some portions of the program may flow serially or sequentially as a whole (such as when a subsequent instruction block depends on a result from an earlier instruction block). Other portions of the program may have more parallel flow, such as when multiple blocks of instructions may be executed simultaneously without using the results of other blocks executed in parallel. Fewer cores may be used to execute programs during more sequential flows of programs and more cores may be used to execute programs during more parallel flows of programs.

At state 610, the state of the respective processor core may be mapped. A mapped processor core is a core that is currently assigned to execute an instance of an instruction block or that is reserved as part of a logical resource. An instruction block is in flight when it is mapped to a particular processor core. An in-flight instruction block is a block that is targeted to a particular core of the block-based processor, and the block will execute speculatively or non-speculatively on that particular processor core. In particular, the in-flight instruction blocks correspond to the instruction blocks mapped to processor cores in states 610-650. A block executes non-speculatively when it is known during the mapping of the block that the program will use the work provided by executing the instruction block. A block executes speculatively when it is not known during the mapping of the block whether the program will or will not use the work provided by executing the instruction block. Executing a block speculatively can potentially increase performance (such as when the speculative block is started earlier than it would have been had it been started after, or when, it was known that the work of the block would be used). However, executing speculatively can potentially increase the energy used when executing the program (such as when the speculative work is not used by the program).

Block-based processors include a limited number of homogeneous or heterogeneous processor cores. A typical program may include more instruction blocks than can fit onto a processor core. Thus, the respective instruction block of the program will typically share the processor core with other instruction blocks of the program. In other words, a given core may execute instructions in multiple different instruction blocks during execution of a program. Having a limited number of processor cores also means that the execution of the program can be stopped or delayed when all processor cores are busy executing a block of instructions and no new cores are available for dispatch. When a processor core becomes available, instances of instruction blocks may be mapped to the processor core.

The instruction block scheduler may assign which instruction block is to be executed on which processor core and when the instruction block is to be executed. The mapping may be based on various factors, such as target energy to be used for execution, the number and configuration of processor cores, current and/or previous use of processor cores, dynamic flow of programs, whether speculative execution is enabled, the confidence level at which speculative blocks are to be executed, and other factors. Instances of instruction blocks may be mapped to currently available processor cores (such as when no instruction block is currently executing thereon). In one embodiment, instances of an instruction block may be mapped to a processor core that is currently busy (such as when the core is executing a different instance of the instruction block), and instances of subsequent mappings may begin when the earlier mapped instances are completed.

At state 620, the state of the respective processor core may be fetch. For example, the IF pipeline stage of the processor core may be active during the fetch state. Fetching an instruction block may include transferring the block from memory (such as an L1 cache, an L2 cache, or main memory) to the processor core and reading instructions from the local buffers of the processor core so that the instructions can be decoded. For example, the instructions of the instruction block may be loaded into an instruction cache, buffer, or registers of the processor core. Multiple instructions of the instruction block may be fetched in parallel (e.g., at the same time) during the same clock cycle. The fetch state may be multiple cycles long and may overlap with the decode (630) and execute (640) states when the processor core is pipelined.

When instructions of the instruction block are loaded onto the processor core, the instruction block is resident on the processor core. The instruction block is partially resident when some, but not all, of the instructions of the instruction block are loaded. The instruction block is fully resident when all of the instructions of the instruction block are loaded. The instruction block will be resident on the processor core until the processor core is reset or a different instruction block is fetched onto the processor core. In particular, an instruction block is resident on the processor core when the core is in states 620-670.

At state 630, the state of the respective processor core may be decode. For example, the DC pipeline stage of the processor core may be active during the decode state. During the decode state, the instructions of the instruction block are being decoded so that they can be stored in the memory store of the instruction window of the processor core. In particular, the instructions may be translated from relatively compact machine code into a less compact representation that can be used to control the hardware resources of the processor core. The decode state may be multiple cycles long and may overlap with the fetch (620) and execute (640) states when the processor core is pipelined. After an instruction of the instruction block is decoded, it may be executed when all of the dependencies of the instruction are satisfied.

At state 640, the state of the respective processor core may be execute. During the execute state, instructions of the instruction block are being executed. In particular, the EX and/or LS pipeline stages of the processor core may be active during the execute state. The instruction block may be executed speculatively or non-speculatively. A speculative block may execute to completion, or it may be terminated prior to completion (such as when it is determined that the work performed by the speculative block will not be used). When an instruction block is terminated, the processor may transition to the abort state. A speculative block may complete when it is determined that the work of the block will be used and, for example, all register writes are buffered, all writes to memory are buffered, and a branch target is calculated. A non-speculative block may execute to completion when, for example, all register writes are buffered, all writes to memory are buffered, and a branch target is calculated. The execute state may be multiple cycles long and may overlap with the fetch (620) and decode (630) states when the processor core is pipelined. When the instruction block is complete, the processor may transition to the commit state.

At state 650, the state of the respective processor core may be commit or abort. During commit, the work of the instructions of the instruction block may be committed atomically so that other blocks may use the work of the instructions. In particular, the commit state may include a commit phase in which locally buffered architectural state is written to architectural state that is visible to or accessible by the other processor cores. When the visible architectural state is updated, a commit signal may be issued and the processor core may be released so that another instruction block may be executed on the processor core. During the abort state, the pipeline of the core may be halted to reduce dynamic power dissipation. In some applications, the core may be power gated to reduce static power dissipation. At the conclusion of the commit/abort state, the processor core may receive a new instruction block to be executed on the processor core, the core may be refreshed, the core may be idled, or the core may be reset.

At state 660, it may be determined whether the instruction block resident on the processor core can be refreshed. As used herein, an instruction block refresh or a processor core refresh means enabling the processor core to re-execute one or more instruction blocks resident on the processor core. In one embodiment, refreshing the core may include resetting the active-ready state for one or more instruction blocks. Re-executing an instruction block on the same processor core may be desirable when the instruction block is part of a loop or a repeated subroutine, or when a speculative block was terminated and is to be re-executed. The decision to refresh may be made by the processor core itself (contiguous reuse) or from outside the processor core (non-contiguous reuse). For example, the decision to refresh may come from another processor core or from a control core performing instruction block scheduling. There can be a potential energy saving when an instruction block is refreshed on a core that has already executed the instructions, as opposed to executing the instruction block on a different core. Energy is used to fetch and decode the instructions of an instruction block, but a refreshed block can save most of the energy used in the fetch and decode states by bypassing those states. In particular, a refreshed block can restart at the execute state (640) because the instructions have already been fetched and decoded by the core. When a block is refreshed, the decoded instructions and the decoded ready state can be maintained while the active ready state is cleared. The decision to refresh an instruction block can occur as part of the commit operation or at a later time. If the instruction block is not refreshed, the processor core can be idled.

At state 670, the state of the respective processor core may be idle. The performance and power consumption of the block-based processor can potentially be adjusted or traded off based on the number of processor cores that are active at a given time. For example, if the speculation misprediction rate is high, performing speculative work on cores running in parallel may increase power rather than increase the speed of the computation. As another example, allocating new instruction blocks to processors immediately after committing or aborting an earlier-executed instruction block may increase the number of processors executing in parallel, but may reduce the opportunity to reuse instruction blocks that were resident on the processor cores. Reuse may be increased when a cache or pool of idle processor cores is maintained. For example, when a processor core commits a commonly used instruction block, the processor core can be placed in the idle pool so that the core can be refreshed the next time the same instruction block is to be executed. As described above, refreshing the processor core can save the time and energy used to fetch and decode the resident instruction block. The instruction blocks/processor cores to place in the idle cache can be determined based on a static analysis performed by the compiler or a dynamic analysis performed by the instruction block scheduler. For example, a compiler hint indicating potential reuse of an instruction block can be placed in the header of the block, and the instruction block scheduler can use the hint to determine whether the block will be idled or reallocated to a different instruction block after the instruction block is committed. When idle, the processor core can be placed in a low-power state to reduce, for example, dynamic power consumption.

At state 680, it may be determined whether the instruction block resident on the idle processor core can be refreshed. If the core is to be refreshed, the block refresh signal can be asserted and the core can transition to the execute state (640). If the core is not going to be refreshed, the block reset signal can be asserted and the core can transition to the unmapped state (605). When the core is reset, the core can be placed in a pool with the other unmapped cores so that the instruction block scheduler can allocate a new instruction block to the core.
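
The state progression of FIG. 6 can be summarized as a simple state machine over the numbered states, with the refresh decisions (660 and 680) selecting among re-execution, idling, and reset. The enum values and the transition helper below are a simplified sketch, not the processor's actual control logic.

    #include <stdbool.h>

    /* States of the progression 600, using the figure's numbering. */
    typedef enum {
        CORE_UNMAPPED = 605,
        CORE_MAPPED   = 610,
        CORE_FETCH    = 620,
        CORE_DECODE   = 630,
        CORE_EXECUTE  = 640,
        CORE_COMMIT_OR_ABORT = 650,
        CORE_IDLE     = 670,
    } core_state_t;

    /* After commit/abort (or while idle), a core either refreshes and
       re-enters execute with fetch/decode bypassed, idles, or is reset
       to the unmapped state for reallocation. */
    static core_state_t next_state(bool refresh, bool idle)
    {
        if (refresh) return CORE_EXECUTE;   /* block refresh signal */
        if (idle)    return CORE_IDLE;
        return CORE_UNMAPPED;               /* block reset signal   */
    }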

IX. Example Multi-Core Processor Architecture

A multi-core processor may include a plurality of block-based physical processor cores, where each physical core includes one or more sharable resources. The physical processor cores correspond directly to manufactured processor cores of the processor. FIG. 7 illustrates an example system including a processor having multiple physical processor cores. FIG. 8 illustrates an example of a block-based processor core including sharable and non-sharable resources. FIG. 9 illustrates example control registers of a block-based processor core. For example, the control registers may be used to select which resources of a physical core are shared between cores and which physical cores are composed into a logical core. FIGS. 10-12 illustrate various examples of multi-core processors configured so that resources are shared among physical processor cores.

As described herein, a multi-core processor may be dynamically adapted to a given workload by: sharing resources among physical cores, combining physical cores into larger logical cores, and/or splitting logical cores into smaller logical or physical cores. The logical cores may include one or more physical cores that coordinate with each other to execute blocks of instructions within threads of a program. A thread is a control unit in a program in which instruction blocks are ordered according to the control flow of the thread. A thread may include one or more instruction blocks of a program. A thread may include a thread identifier to distinguish it from other threads, a program counter that references non-speculative instruction blocks of the thread, a logical register file to pass values between instruction blocks of the thread, and a stack local to the thread to store data (such as activation records). Programs may be multithreaded in that each thread may operate independently of the other threads. Thus, different threads may execute on different respective logical cores.

Multiple physical cores may be merged together at runtime to form a larger, more powerful logical core. As a specific example, the serial portion of the workload may be handled by composing multiple physical cores into a larger logical processor that performs like an aggressive superscalar architecture. Alternatively, when sufficient thread-level parallelism is available, the same large logical processor may be broken up or split into multiple logical processors, so each physical processor can work independently and execute instruction blocks from independent threads. Merging physical cores together is referred to as composing logical cores, while splitting logical cores is referred to as decomposing logical cores.

A logical core may interleave accesses to registers and memory across its physical cores to give the logical core the combined computational resources of all of the composed physical cores. For example, a logical core containing two physical cores may use one additional bit of the address to select between the two physical caches of the cores, effectively doubling the L1 cache capacity. The register files may be similarly interleaved.
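
For the two-core example above, a single additional address bit selects between the two physical caches. The sketch below generalizes bank selection to any power-of-two number of banks; the parameter names are assumptions made for illustration.

    #include <stdint.h>

    /* Bank selection for a logical cache interleaved across a power-of-
       two number of physical caches: low-order line-address bits pick
       the bank. With num_banks == 2, one address bit selects between
       the two physical caches, doubling the effective L1 capacity. */
    static unsigned cache_bank(uint64_t address, unsigned line_bytes,
                               unsigned num_banks)
    {
        return (unsigned)((address / line_bytes) % num_banks);
    }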

Each instruction block of a thread is mapped to a single physical processor core. When composed, the architecture can use the additional cores of the logical processor to execute speculative instruction blocks of the thread. Speculation can potentially improve performance on serial workloads by performing work earlier than it would be performed using only non-speculative execution. When a non-speculative block commits, it may send a commit signal, along with the exit branch address, to all of the other cores in the logical processor. Speculative blocks on the correct path may continue execution, while blocks on the not-taken path may be aborted.

Composing a logical core may be desirable when the overhead of changing the configuration is less than the performance gain of the more efficient configuration. Composition is always performed at block boundaries and can be initiated by the runtime system. The runtime system may be distributed among the physical cores or may execute on a global control unit of the processor. The physical cores and the sharable resources of the physical cores can be composed in a variety of different manners, with each manner providing different trade-offs in overhead and efficiency.

Full composition changes the number of physical cores in the logical core and changes the register file and cache mappings. Dirty cache lines may be written out to main memory lazily. The logical registers and cache locations are evenly distributed across the physical cores. Cache lines are mapped to banks, where each bank corresponds to the cache of a particular physical core. Thus, a full composition may include a larger logical cache that is the sum of the cache capacities of all of the physical cores of the logical core. Fig. 10 illustrates an example of full composition.

Fast composition adds additional physical cores to the logical processor but retains the same L1 data cache and register mappings, and it does not write out dirty cache lines to main memory during composition. This results in a logical processor having a smaller data cache than would be possible with full composition, but it enables accesses to data already in the caches to still hit after composition. Fast composition may be desirable for short bursts of activity where the additional execution units are useful, but where the overhead of reconfiguring the caches may be greater than the overhead saved by a larger, more efficient cache configuration.

Resource composition adds additional physical resources to the logical core without adding all of the capabilities of the additional physical cores. For example, a logical cache may be composed using multiple physical core caches. In particular, a thread may use a data set that exceeds the cache size of the physical processor core. By composing a logical cache using the caches of neighboring physical cores to increase the cache size available to the thread, the cache hit rate of the thread can potentially be improved. The non-cache resources of the neighboring physical cores can be powered down to conserve energy. Figs. 11 and 12 illustrate examples of resource composition.

Decomposition removes physical cores or resources from a logical processor, and the removed cores or resources can be powered down to conserve energy. Execution can continue on the remaining physical cores. Decomposition may include flushing the dirty lines of each cache being discarded from the logical processor and updating the cache mappings. Alternatively, dirty cache lines in the remaining physical cores can be written back when the cache lines are evicted.

Turning to fig. 7, a computing system 700 may include a processor 705, a memory 770, an input device 750, and an output device 740. The processor 705 may include a control unit 710, a plurality of physical block-based (BB) processor cores 720-727, an input/output (I/O) interface 730, and an optional memory controller and/or level-two (L2) cache 760. The control unit 710 includes logic for managing execution of the processor, such as code for executing all or a portion of a runtime system 712. The I/O interface 730 can be used to connect the processor 705 to various input devices (such as the input device 750) and various output devices (such as the output device 740). The I/O interface 730 may include one or more I/O bridges using one or more communication protocols, such as Peripheral Component Interconnect (PCI), PCI-Express, Universal Serial Bus (USB), Accelerated Graphics Port (AGP), and the like. In some examples, the control unit 710, the memory controller and L2 cache 760, and the I/O interface 730 are implemented, at least in part, using one or more of: hard-wired finite state machines, programmable microcode, programmable gate arrays, or other suitable control circuitry. In some examples, the control unit 710, the memory controller and L2 cache 760, and the I/O interface 730 are implemented, at least in part, using a processor (e.g., an on-chip processor executing control code) and/or an external computer (e.g., an off-chip processor executing control code and communicating with the processor 705 via a communication interface (not shown)).

The memory 770 may include the stored instructions of a program. For example, the program may include the instruction blocks A-E. The processor 705 may communicate with the memory 770 via the memory controller and L2 cache 760 to retrieve or load all or a portion of the program onto one or more of the physical processor cores 720-727. In particular, the memory controller and L2 cache 760 can be used to generate control signals for communicating with the memory 770 and to provide temporary storage for information coming from or going to the memory 770. Alternatively, the physical processor cores 720-727 may communicate with the memory 770 directly.

The control unit 710 includes logic for managing execution of the processor, such as code for executing all or a portion of the runtime system 712. For example, the runtime system 712 may include code for: allocating and organizing memory for the instructions of a program; allocating, deallocating, and organizing memory for the data of a program; allocating, deallocating, and scheduling the physical BB processor cores 720-727 for executing the program; allocating, deallocating, and scheduling the sharable resources of the physical BB processor cores 720-727; and providing an interface between the program and the operating system. For example, the runtime system 712 may include an instruction block scheduler for allocating the physical BB processor cores 720-727 so that one or more programs can be executed on the processor 705. For example, the instruction block scheduler may allocate instruction blocks of a program to one or more of the physical BB processor cores 720-727 that are idle. As the program is being executed, instruction blocks of the program may be allocated to the physical BB processor cores 720-727, so only a portion of the instruction blocks of the program may be resident on the physical BB processor cores 720-727 at any given time.

As a specific example, a short program may include a single control thread that contains instruction blocks A-E. The instruction block scheduler may allocate a single logical processor to execute a thread. A logical processor may include a single physical processor core or multiple physical processor cores. Multiple instruction blocks may be scheduled to a given physical processor core. For example, a physical processor core may have storage for up to 128 decoded instructions, which may be further divided into instruction block slots or instruction windows having storage for up to 32 decoded instructions. Thus, a given physical processor core may execute one to four instruction blocks sequentially or simultaneously, with different instruction blocks being executed using different respective instruction windows of the given physical processor core.

A thread executes according to the control flow of the thread. For example, a thread including the instruction blocks A-E may have a dependency graph 780. The control flow through the thread may differ depending on the input to the program. Thus, during one run of the program the instruction blocks ABDE may be executed, and during another run of the program the instruction blocks ACDE may be executed. In this example, only one of the instruction blocks B and C will be executed during a given run of the program. Non-speculative execution of the thread may include executing the instruction blocks of the thread serially according to the control flow, where a later instruction block of the thread is not begun until the next instruction block of the thread is known. Thus, when the thread is being executed only non-speculatively, instruction block B or C will not begin execution until instruction block A is committed or until the target address from block A (e.g., the address of block B or C) is known. However, the thread can potentially be sped up by speculatively executing one or more of the blocks B and C before the target address from block A is known. For example, only block B, only block C, or both blocks B and C may begin executing speculatively before the target address from block A is known. Blocks D and E may also begin executing speculatively before the target address from block A is known. Speculative execution of the instruction blocks may occur on the same physical core (e.g., in different instruction windows) or on different physical cores that are part of the logical processor executing the thread. As one example, instruction block A may be executed non-speculatively on the physical processor core 720 while instruction blocks B, D, and E are executed speculatively on the physical processor cores 721, 722, and 723, respectively. When the target address from block A becomes known, the blocks on the correct path may continue execution while the blocks on the not-taken path may be aborted.

The logical processor for executing a thread of the program may include one or more of the physical processor cores 720-727. The physical processor cores 720-727 have a layout or orientation within the processor 705. For example, the processor 705 may be a single integrated circuit (IC), and the physical processor cores 720-727 may be fabricated over an area of the IC, where the respective physical processor cores 720-727 do not overlap one another. As shown, the physical processor cores 720-727 may be arranged in a 2 × 4 array having two columns and four rows. However, a processor may include various numbers of physical BB processor cores arranged in various topologies. Although a 2 × 4 array is chosen for purposes of illustration, 1 × 4, 2 × 2, 4 × 4, 8 × 8, 4 × 8, and so forth are possible design choices for the physical BB processor cores. The physical cores of a logical core may coordinate with each other via control signals running between the physical cores. It is therefore desirable for the physical cores selected for a logical core to be physically close to each other, so that the control signals can propagate between the cores in less time than if the cores were physically farther apart.

As described herein, the particular physical processor cores 720-727 to be assigned to a given logical core may be assigned independently of their physical proximity, or based on the topology of the processor 705 and additional criteria. As one example, the runtime system 712 may compose a logical processor using any two free physical cores, such as the physical processor cores 720 and 727. The physical processor cores 720 and 727 may be physically adjacent or physically separate (as shown). As another example, the physical processor cores that can be part of a given logical processor may be constrained based on the topology of the processor 705 and the number of physical processor cores of the logical processor. As a specific example, any logical processor consisting of two cores may be composed only of physical processor cores within a row, such as illustrated by the dashed lines 791-794. Additionally or alternatively, any logical processor consisting of four cores may be composed only of physical processor cores within a quad, such as illustrated by the dashed lines 795-796. Additionally or alternatively, a logical processor may be composed of all of the physical processor cores, such as illustrated by the dashed line 797. It should be noted that the physical processor cores composing a logical processor can be constrained in a variety of different ways. For example, the pairs and/or quads of physical processor cores could instead be grouped by column.
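
For the illustrated 2 × 4 arrangement, such topology constraints reduce to simple index arithmetic. The sketch below assumes the cores 720-727 are numbered 0-7 row-major (two per row, four per quad); that numbering convention is an assumption made for illustration.

    #include <stdbool.h>

    /* Hypothetical legality checks for the 2 x 4 topology: a two-core
       logical processor must stay within a row, and a four-core logical
       processor must stay within a quad. Cores are assumed numbered
       row-major, 0 through 7. */
    static bool same_row(unsigned a, unsigned b)  { return a / 2 == b / 2; }
    static bool same_quad(unsigned a, unsigned b) { return a / 4 == b / 4; }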

Some of the resources of an individual physical processor core may be shared by other physical processor cores within the logical processor. For example, each of the physical processor cores may include a level one (L1) data cache, and all physical processor cores of the logical core may share the L1 data cache, such that the logical processor effectively has a larger L1 data cache than any individual physical processor core. Additionally or alternatively, some resources of individual physical processor cores may be shared by other physical processor cores external to the logical processor. Additionally or alternatively, other resources of the individual physical processor cores may be private and not shared by other physical processor cores external to the individual physical processor cores.

Fig. 8 illustrates an example of a physical block-based processor core 800 that includes sharable resources and non-sharable resources. The processor core 800 may include control logic 810, execution units 820, a register file 830, instruction windows 840-841, a load-store queue 850, and an L1 data cache 860. In one embodiment, the sharable resources may include the execution units 820, the register file 830, and the L1 data cache 860. The non-sharable resources may include the control logic 810, the instruction windows 840-841, and the load-store queue 850. In alternative embodiments, the sharable and non-sharable resources may be partitioned differently. In some examples, the control logic 810, the execution units 820, the register file 830, the instruction windows 840-841, the load-store queue 850, and the L1 data cache 860 are implemented, at least in part, using one or more of: hard-wired finite state machines, programmable microcode, programmable gate arrays, or other suitable control circuitry.

The control logic 810 may include logic for controlling the operation of the physical block-based processor core 800. For example, the control logic 810 may include header decode logic 811, an instruction scheduler 812, configuration state 813, a branch predictor 814, retirement logic 815, core interface logic 816, and power management logic 818. The header decode logic 811 can be used to decode instruction block headers for instruction blocks to be executed on the processor core 800. The instruction scheduler 812 can be used to schedule the instructions of instruction blocks executing on the processor core 800. For example, instructions may be scheduled in dataflow order as the operands become available for the respective instructions. The configuration state 813 can include registers for programming the processor core 800 to be associated with a given logical processor and/or to share one or more sharable resources with another processor core, such as described with reference to fig. 9. The registers can be memory-mapped into the visible architectural state of the processor core 800, so that the values of the registers are changed only during the commit phase of an instruction block writing to the registers. In particular, the value of a register can be changed at the commit phase of an instruction block executing on a different processor core than the processor core 800.
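
Because the configuration registers are memory-mapped, they can be programmed by an ordinary store whose effect becomes visible when the writing instruction block commits. The address constant and the helper below are invented for illustration; only the memory-mapped, commit-time behavior comes from the text.

    #include <stdint.h>

    /* Hypothetical memory-mapped address of a core's composition
       control register; the constant is invented for illustration. */
    #define COMPOSE_CTRL_ADDR ((volatile uint32_t *)0x40000000u)

    /* Per the text, the stored value only becomes architecturally
       visible during the commit phase of the writing instruction block. */
    static void program_composition(uint32_t encoding)
    {
        *COMPOSE_CTRL_ADDR = encoding;
    }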

The branch predictor 814 can predict across instruction blocks by predicting a branch exit address for each instruction block. The branch predictor 814 can use various algorithms to speculate across instruction blocks, such as by using sequential block prediction (e.g., assuming the next sequential instruction block is taken) or look-ahead branch prediction (e.g., using a fetch table buffer). Predicting the branch exit address allows an instruction block to be fetched, and speculative execution to begin, before the current block has completed. The oldest instruction block is marked non-speculative, and its branch exit address can be predicted. If an instruction window is available, the predicted address (corresponding to a predicted instruction block) can be fetched and the predicted instruction block can begin executing on another physical core in the logical processor or on the same physical core (e.g., the processor core 800).

The taken branch address may be resolved before the instruction block has completed. In this case, the non-speculative block can use a taken-branch signal to inform the other cores in the logical processor of the taken address. The oldest instruction block then becomes the non-speculative block. Any blocks that were not correctly speculated can be aborted. The taken-branch signal is different from the commit signal: the taken branch allows the next block to continue speculating and to begin fetching new instruction blocks, but the updated register and memory values are not validated until after the commit signal.

The branch predictor 814 can speculate within an instruction block by predicting predicate values and a control flow path through the instruction block. Three different types of speculation can be used within an instruction block, including predicate speculation, memory speculation, and load speculation. Predicate speculation can be used to predict the value of a predicate. Memory speculation involves a speculative block loading values from the L1 cache that may still be changed by less speculative blocks. Load speculation can include using the load-store queue to allow loads to execute before stores with lower load-store identifiers. In all three cases, a mis-speculation can result in re-execution of the entire instruction block. For example, the valid bits in all of the operand buffers can be reset and the zero-operand instructions can be reloaded.

The retirement logic 815 can monitor the progress of an executing block and manage the commit phase of a non-speculative instruction block. Because of predicated execution, an instruction block may complete without executing all of the instructions in the block. The instruction block may complete when one (and only one) branch has executed and all instructions that modify external state (register writes and stores) have executed. For example, the compiler can encode the register write and store identifiers in the instruction block header so that the retirement logic 815 can identify when all of the register writes and stores are complete. During execution, the instructions of the block do not modify the visible architectural state. Instead, all changes are buffered and commit together during the commit phase. After the processor core 800 enters the commit phase for the instruction block, the logical register file is updated with all of the register writes, and all of the stores in the load-store queue are sent to the logical L1 data cache, beginning with the lowest sequence identifier. As described further below, the logical register file may include the register file 830 and/or register files from other physical processor cores. Similarly, the logical L1 data cache may include the L1 data cache 860 and/or L1 data caches from other physical processor cores. After all of the register writes and stores have committed, the retirement logic 815 can send a commit signal to all of the other physical cores in the same logical processor. Only a non-speculative block can commit; a speculative block waits to commit until it is the oldest block and thus no longer speculative.

The core interface logic 816 provides logic for communicating with the other physical processor cores. For example, the core interface logic 816 can be used to communicate with other physical processor cores within a logical processor using a control interface 817. The control interface 817 may include, for example, a collection of control signals, a bus, or a network-on-chip (NOC). The control interface 817 can include taken-branch signals, target address signals, commit signals, and various other signals for communicating between the physical processor cores.

The power management logic 818 may include logic for dynamically and selectively reducing the power consumption of the processor core 800. For example, the power management logic 818 may include logic for placing one or more components of the processor core 800 into a low-power mode. The low-power mode may include clock gating, voltage scaling, frequency scaling, and/or power gating of the respective components. The power management logic 818 may include logic for waking up components of the processor core 800 that are in a low-power mode. Depending on performance and power targets, various different portions of the processor core 800 can be selected to be placed in a low-power mode. For example, the sharable and non-sharable resources of the processor core 800 can be fabricated on different voltage islands and/or with different clock trees. The non-sharable resources can be on one voltage island and clock tree, and the different sharable resources can have different respective voltage islands and clock trees. As a specific example, the sharable resources may include the execution units 820, the register file 830, and the L1 data cache 860. Each sharable resource can have a different voltage island and/or clock tree, so that each sharable resource can be independently placed in a low-power mode, such as by programming registers of the power management logic 818. Alternatively, a majority of the processor core 800 can be in a low-power mode while one or more sharable components are fully operational. As a specific example, a large portion of the processor core 800 can be in a low-power mode while only the L1 data cache 860 is powered on. The large portion of the processor core 800 can include all of the sharable and non-sharable resources except for the exclusively powered-on resources and the power management logic 818. The power management logic 818 remains powered on even when the rest of the processor core 800 is in a low-power mode, so that the power management logic 818 can respond to wake requests.
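
Since each sharable resource sits on its own voltage island and clock tree, the power management registers can be pictured as a mask with one bit per sharable resource. The bit assignments and the helper below are invented for illustration; only the per-resource independence comes from the text.

    #include <stdint.h>

    /* Hypothetical one-bit-per-resource power mask for the sharable
       resources named above; a set bit keeps the resource powered on. */
    enum {
        PWR_EXEC_UNITS = 1u << 0,   /* execution units 820 */
        PWR_REG_FILE   = 1u << 1,   /* register file 830   */
        PWR_L1_DCACHE  = 1u << 2,   /* L1 data cache 860   */
    };

    /* Keep only the L1 data cache powered while the rest of the core is
       in a low-power mode; the power management logic itself stays awake. */
    static uint32_t dcache_only_mode(void)
    {
        return PWR_L1_DCACHE;
    }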

The L1 data cache 860 can be used to temporarily store program data. The L1 data cache 860 may include a content-addressable memory (CAM) and logic for accessing and managing the contents of the CAM. The L1 data cache 860 can be arranged in lines, where each line corresponds to a contiguous range of addresses in main memory. The L1 data cache 860 is typically smaller than main memory, so the L1 data cache 860 may hold a subset of the data in main memory. The subset of data is tracked by storing a reference to the address associated with each line of data from main memory that is stored in the L1 data cache 860. The address 870 can be divided into different fields to provide a mapping from addresses in memory to the lines stored in the L1 data cache 860. For example, the address 870 can be divided into a tag field 872, an optional index field 874, and an offset field 876. The tag field 872 is stored in the L1 data cache 860 when the contents of memory are loaded into a line of the L1 data cache 860. The tag field references the memory location from which the data was read and/or to which the data will be written back. The offset field 876 is the address of the data within the line. For example, if a cache line is 1024 bits, there are 128 bytes, or thirty-two 32-bit words, in the cache line. The offset field 876 can distinguish between the different bytes and/or words within the cache line when less than the full cache line is referenced. The index field 874 can be used to map a cache line to a bank and/or set of the logical cache. When the logical cache includes only a single direct-mapped cache, the index field 874 is not used because there is only one bank. However, when the logical cache includes multiple physical caches, the index field 874 is used to distinguish between the different banks of the logical cache.
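
With a 1024-bit (128-byte) cache line, the offset field occupies the low 7 address bits, any index bits sit immediately above it, and the tag occupies the remainder. The decomposition below is a sketch: the index width depends on the number of banks in the logical cache (zero index bits for a single direct-mapped bank), and the field packing is otherwise an assumption.

    #include <stdint.h>

    typedef struct {
        uint64_t tag;      /* tag field 872: line's memory location  */
        unsigned index;    /* index field 874: bank within the cache */
        unsigned offset;   /* offset field 876: byte within the line */
    } addr_fields_t;

    /* Split address 870 for 128-byte lines: 7 offset bits, then
       index_bits bank-select bits (0 for a single direct-mapped bank),
       then the tag. index_bits is assumed small (< 32). */
    static addr_fields_t split_address(uint64_t addr, unsigned index_bits)
    {
        addr_fields_t f;
        f.offset = (unsigned)(addr & 0x7F);
        f.index  = (unsigned)((addr >> 7) & ((1u << index_bits) - 1u));
        f.tag    = addr >> (7 + index_bits);
        return f;
    }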

The L1 data cache 860 may be sharable with other physical processor cores. As one example, the L1 data cache 860 may be shared with other physical processor cores of a logical processor. Additionally or alternatively, the L1 data cache 860 may be shared with other physical processor cores that are not part of a logical processor. When the L1 data cache 860 is shared with other physical processor cores, the L1 data cache 860 is part of a larger logical cache. For example, the L1 data cache 860 may be a bank of the logical cache. As described further below, each physical cache may be assigned a bank identifier corresponding to a value of the index field 874. The L1 data cache 860 is used in responding to a memory access when the index field bits of the memory address hit or match the bank identifier of the L1 data cache 860.

The L1 data cache 860 may communicate with other processor cores using a cache interface 862. The cache interface 862 may include read/write enables, address signals, and data signals. For example, the L1 data cache 860 may use the cache interface 862 to provide data to another processor core that issues a request for the data. The L1 data cache 860 is visible architectural state, and is therefore updated when an instruction block commits. In particular, intermediate results of an instruction block may be buffered in load store queue 850 prior to being committed to the L1 data cache 860. In some embodiments, the load store queue 850 may be accessed when the L1 data cache 860 is shared with other processor cores.

The execution units 820 may be used to execute instructions of an instruction block. The execution units 820 may include one or more of an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a barrel shifter, a multiplier, a graphics accelerator, a physics accelerator, and various other logic functions for performing mathematical and/or logical operations of instructions. Decoded instructions of an instruction block and operands of the instructions may be stored in the instruction windows 840-841. An instruction is executed in part by routing the operands of the instruction to the appropriate execution unit 820 and storing the output of the execution unit 820 in the operand buffer of another instruction, the load store queue 850, and/or the register file 830. Under the dataflow execution model of block-based processor core 800, an instruction may potentially issue or execute once the instruction is decoded and all operands are available to the instruction. However, the number of instructions that can be executed in parallel may be limited by the number of execution units 820 available on processor core 800. As a specific example, if there are four ALU instructions ready for execution, but processor core 800 can only access two ALUs, then only two ALU instructions can be issued in the same cycle. However, execution units 820 may be shared among physical processor cores to form a logical group of execution units. The issue logic of instruction scheduler 812 may select an individual execution unit from the logical group of execution units to execute an instruction of the instruction block. For example, execution unit interface 822 may be used to pass control signals, operands, and results between the physical cores. In particular, operands may be routed from the instruction window of a first physical processor core to an execution unit of a second physical processor core using the execution unit interface 822. The execution unit interface 822 may also be used to route results from the execution unit of the second physical processor core back to the first physical processor core. Executing instructions on non-local processor cores may increase latency because of the communication between the physical processor cores, so the instruction scheduler 812 may prioritize executing instructions on the local physical processor core.

Register file 830 is visible architectural state that may be used to store data passed between instruction blocks. The register file 830 may include a buffer or shadow register file for storing intermediate results of an instruction block, and the register file 830 may be updated during the commit phase. The register file 830 may be shared among physical processor cores to form a logical register file. The individual registers of register file 830 are numbered or addressed similarly to locations in memory. Each physical register file may be a bank of the logical register file, and thus an index within a register number may be used to determine which bank is referenced. Register file interface 832 may be used to communicate control signals, addresses, and data to register files on different physical processor cores.

FIG. 9 is a diagram illustrating example control registers of a physical block-based processor core.

The control registers may be used to program which resources of a physical processor core are shared with other physical processor cores, and which physical processor cores may share resources with each other. The control registers are visible architectural state that can be programmed at runtime of the program. In particular, a control register is updated during the commit phase of the instruction block that writes it. The control registers may be accessed in a variety of ways, including one or more of: general memory-mapped loads and stores; specific instructions for reading and writing the control registers; protected (e.g., privileged or kernel mode) memory-mapped loads and stores; and commands scanned in through a test access port. In some examples, the control registers are implemented at least in part using one or more of: flip-flops, latches, register files, Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory, or other suitable memory elements.

The composition control register 900 of a given physical processor core may be used to program which resources of the given physical processor core are shared with other physical processor cores. For example, a Master Control Register (MCR) field 901 may be encoded to indicate the resources that a given physical processor core shares with other physical cores. The MCR field 901 may include multiple bits to enable sharing of various resources, to indicate that a given physical core is part of a logical core, and/or to power down a majority of a given physical processor core or a particular resource of the given physical processor core. One example of an encoding for the MCR field 901 is illustrated in table 910. In this example, the least significant bit (bit 0) may be used to enable sharing of the L1 data cache; bit 1 may be used to enable sharing of the register file; bit 2 may be used to enable sharing of the execution units; bit 3 may be used to indicate that a given physical processor core is part of a larger logical processor core; and the most significant bit (bit 4) may be used to indicate a default power state for a given physical processor core, where a "0" indicates that the default state is a low power mode and a "1" indicates that the default state is powered up. Multiple bits may be set simultaneously to share multiple resources. Specific examples of encodings of the MCR field 901 include: the "00000" encoding may force a majority of a given physical processor core, including the shareable resources of the core, to power down; the "10000" encoding may power up a given physical processor core and keep the shareable resources private to the given physical processor core; the "00001" encoding may enable the L1 data cache to be shared with other physical processor cores and power down the rest of the given physical processor core; the "00010" encoding may enable the register file to be shared with other physical processor cores and power down the rest of the given physical processor core; the "00100" encoding may enable the execution units to be shared with other physical processor cores and power down the rest of the given physical processor core; the "11001" encoding may indicate that a given physical processor core is part of a larger logical processor core and enable the L1 data cache to be shared with other physical processor cores; and the "11000" encoding may indicate that a given physical processor core is part of a larger logical processor core and that the L1 data cache may be powered down. It should be noted that these encodings are chosen for illustrative purposes, and various other encodings with more or fewer bits are possible.
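
As a non-authoritative sketch, the table-910 encodings can be written as C bit masks. The macro names are invented for illustration; only the bit positions and example encodings come from the description above.

    #include <stdint.h>

    /* Bit positions from table 910 (macro names are hypothetical). */
    #define MCR_SHARE_DCACHE   (1u << 0)  /* bit 0: share L1 data cache    */
    #define MCR_SHARE_REGFILE  (1u << 1)  /* bit 1: share register file    */
    #define MCR_SHARE_EXEC     (1u << 2)  /* bit 2: share execution units  */
    #define MCR_LOGICAL_CORE   (1u << 3)  /* bit 3: part of a logical core */
    #define MCR_POWER_UP       (1u << 4)  /* bit 4: default power state    */

    /* Example encodings from the description above. */
    static const uint8_t mcr_all_off        = 0x00;                        /* "00000" */
    static const uint8_t mcr_private_core   = MCR_POWER_UP;                /* "10000" */
    static const uint8_t mcr_cache_only     = MCR_SHARE_DCACHE;            /* "00001" */
    static const uint8_t mcr_logical_shared = MCR_POWER_UP | MCR_LOGICAL_CORE
                                            | MCR_SHARE_DCACHE;            /* "11001" */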

The composition control register 900 may include an optional number of cores field 902 for programming the number of physical processor cores that may share resources with each other. As one example, the number of cores field may be programmed with any number up to the maximum number of cores of the processor. In another example, the number of cores field may include separate bits corresponding to the different numbers of physical cores that may be grouped. As a specific example, physical cores may be grouped only in powers of 2, such that the index field of an address maps fully onto a core. Thus, groupings of one, two, four, and eight physical processor cores may be allowed for a processor that includes eight physical processor cores. For example, the number of cores field 902 may include a first bit to group two physical processor cores, a second bit to group four physical processor cores, and a third bit to group eight physical processor cores. Alternatively, the value programmed in the number of cores field 902 may indicate the power of two of the cores that are grouped, so a value of 3 may indicate that eight cores are grouped together.
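
Under the power-of-two interpretation, decoding the field is a single shift. A minimal sketch, assuming the field holds the exponent (as in the example where a value of 3 denotes eight cores):

    #include <stdint.h>

    /* Decode a number-of-cores field that stores log2 of the group size
     * (hypothetical layout): 0 -> 1 core, 1 -> 2, 2 -> 4, 3 -> 8. */
    static inline uint32_t group_size(uint32_t cores_field) {
        return 1u << cores_field;
    }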

The value programmed in the number of cores field 902 and the relative physical location of a given physical processor core may be used to compose a logical processor and/or logical resource. For example, sharing of resources may be constrained based on the topology of the processors, such as described with reference to FIG. 7. In particular, the grouping of two physical cores may be constrained to one row of the processor, and the grouping of four physical cores may be constrained to a quad of the processor. The MCR field 901 may determine whether the grouping of processor cores corresponds to a logical core and/or a logical resource. For example, if a bit (e.g., bit 3) of MCR field 901 is set, then the given physical processor core is part of a larger logical processor core. However, if the bit of MCR field 901 is clear and another bit corresponding to a sharable resource is set, the resource may be shared without treating the given physical processor core as part of a larger logical processor core.

The MCR fields of different processor cores within a grouping of physical processor cores may be programmed differently so that different processor cores share resources differently. For example, a logical processor may include a first processor core and a second processor core, where the first processor core shares its L1 cache and the second processor core disables its L1 cache.

The composition control register 900 may alternatively be a processor-level register rather than a processor-core-level register. For example, the composition control register 900 may be part of a processor-level control unit. The composition control register 900 may include a separate MCR field for each respective processor core of the processor. For example, the composition control register 900 may include an MCR0 field 901 corresponding to processor core 0, an MCR1 field 903 corresponding to processor core 1, and an MCRn field 904 corresponding to processor core n, where the processor includes n processor cores. The values of the processor-level composition control register 900 may be communicated to the individual processor cores in various ways. For example, the values of the processor-level composition control register 900 may be delivered to the individual processor cores continuously via dedicated control signals. As another example, updates to the processor-level composition control register 900 may be communicated to the individual processor cores via interrupts or messages sent over the network-on-chip.

In one embodiment, a given processor core may include both a composition control register 900 and a composition topology register 920. The composition control register 900 may be used to indicate which resources of the given physical processor core are shared among a set of physical processor cores, and the composition topology register 920 may be used to indicate which physical processor cores are part of the set. For example, the composition topology register 920 may include a plurality of bits 921-923 that may be programmed to logically group the physical processor cores. As a specific example, bits 921-923 may represent a bitmap of the processor cores that are part of a logical group. Bit C0 921 may correspond to a first physical processor core, bit C1 922 may correspond to a second physical processor core, and bit Cn 923 may correspond to an nth physical processor core, where the processor includes n physical processor cores. Thus, if bits C0 921 and C1 922 are the only bits set in the composition topology register 920, then the first and second physical processor cores are part of a logical group. As a particular example, if the MCR fields of the respective composition control registers 900 indicate that the physical processor cores are part of a larger logical processor core, then the first and second physical processor cores are part of a logical processor. Alternatively, if the MCR fields of the respective composition control registers 900 indicate that the physical processor cores only share their L1 caches, then the L1 caches of the first and second physical processor cores form a logical L1 cache, but the first and second physical processor cores are not part of a logical processor.
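
A minimal sketch of building such a bitmap, assuming a register wide enough for the core count (the helper and its name are invented for illustration):

    #include <stdint.h>

    /* Build a composition-topology bitmap: one bit per member core,
     * indexed by physical core number (C0 = bit 0, C1 = bit 1, ...). */
    static inline uint64_t topology_bitmap(const int *cores, int count) {
        uint64_t bitmap = 0;
        for (int i = 0; i < count; i++)
            bitmap |= 1ull << cores[i];
        return bitmap;
    }

    /* Usage: group physical cores 0 and 1, as in the example above. */
    /* int members[] = {0, 1}; uint64_t reg920 = topology_bitmap(members, 2); */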

Bits 921-923 may be programmed such that the selected physical processor cores are sparsely distributed across the topology of the processor. However, it may be desirable for the physical processor cores of a logical group to be placed close together to reduce communication latency between the physical processor cores. A physical processor core may include monitoring logic to analyze the values programmed into the composition topology register 920. As an example, the processor core may assert an interrupt, or deny the programming, for a value that does not meet proximity criteria for the grouping.

The composition topology register 920 may include an optional topology mode setting 924 for indicating how bits 921-923 are to be interpreted. For example, if topology mode setting 924 is programmed with a first value (e.g., "0"), bits 921-923 may be a bitmap of the physical processor cores, as previously described. If topology mode setting 924 is programmed with a second value (e.g., "1"), bits 921-923 may be the number of physical processor cores of the group. The number of physical processor cores, in combination with the relative locations of the respective physical processor cores, may be used to define the grouped physical processor cores. For example, each physical processor core may be grouped only in predefined rows, columns, quads, and so on.

In alternative embodiments, a control register may be dedicated to a single shareable resource and used to program the accessibility of that shareable resource. For example, the shareable resource may be one or more of a cache, a register file, or an execution unit. As a specific example, the cache composition control register 930 may be used to program a grouping of physical processor cores that may share the caches of the respective physical processor cores. The cache composition control register 930 may include a number of data cache banks field 931 for programming how many physical cache banks are used to create the logical cache. In one example, the number of physical cache banks and the relative physical locations of the corresponding physical processor cores within the processor topology are used to determine all of the physical cache banks of the logical cache. For example, each physical cache bank can only be grouped in predefined rows, columns, quads, and so forth. In another example, the logical cache may include a physical cache bank of any physical processor core of the processor. As a specific example, multiple identification fields, such as core identification (core ID) fields 932-935, may be used to identify the individual physical cache banks. For example, a particular core identification field may be programmed with an address or identifier corresponding to a particular physical processor core. In particular, each physical processor core may be identified by a unique number and/or location (e.g., row and column) within the processor. The number of identification fields may vary from 0 to M, where M is the maximum number of cache banks allowed for the logical cache. M may be less than or equal to the number of physical cache banks on the processor.

The position of a core identification field within the cache composition control register 930 may be used to map the index field of an address to a particular physical cache bank of the logical cache. As a specific example, a logical cache may be configured with two physical cache banks by programming the number of data cache banks field 931 with "2"; the core identification field 932 may then correspond to index field value "0" and the core identification field 933 may correspond to index field value "1". As another example, a logical cache may be configured with four physical cache banks by programming the number of data cache banks field 931 with "4"; the core identification field 932 may correspond to index field value "00", the core identification field 933 may correspond to index field value "01", the core identification field 934 may correspond to index field value "10", and the core identification field 935 may correspond to index field value "11". Similarly, additional core identification fields may be added to map more index field bits to the physical cache banks of a larger logical cache.
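
A minimal sketch of the index-to-bank lookup, assuming a software view of register 930 in which bank_count mirrors field 931 and core_id[] mirrors fields 932-935 (the struct and names are hypothetical):

    #include <stdint.h>

    /* Hypothetical software view of cache composition control register 930. */
    struct cache_ccr {
        uint8_t bank_count;   /* number of physical cache banks (e.g., 2 or 4) */
        uint8_t core_id[4];   /* core hosting each bank, indexed by index field */
    };

    /* Map an address's index field value to the core whose L1 bank holds it. */
    static inline uint8_t bank_core(const struct cache_ccr *ccr, uint32_t index_field) {
        return ccr->core_id[index_field % ccr->bank_count];
    }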

Processor topology register 940 may provide topology information about the processor to the runtime system. The runtime system may use the topology information to make decisions about how to group physical processor cores and/or shareable resources into logical cores and/or resources. Processor topology register 940 may use a row field 942 to indicate the number of rows of physical processor cores and a column field 944 to indicate the number of columns of physical processor cores. The number of rows and columns reported by processor topology register 940 may be fixed when the processor is designed and provided by read-only memory. Processor topology register 940 may use a processor type identifier field 946 to indicate a unique identifier for the processor type. For example, during an initialization sequence, the runtime system may read the processor type identifier field 946 to determine which type of processor is executing the runtime system, and based on the value read from the processor type identifier field 946, topology information for the processor may be determined. For example, topology information for different processor types may be stored in tables associated with the different respective processor type identifiers. Thus, topology information can be determined either by reading values hard-coded in processor topology register 940 or, using the unique processor type identifier, in software (e.g., by reading a table).
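
An initialization-time read of register 940 might look like the following sketch; the field packing (rows in bits [7:0], columns in bits [15:8], type ID in bits [31:16]) is an assumption for illustration.

    #include <stdint.h>

    /* Unpack a hypothetical layout of processor topology register 940. */
    static inline void read_topology(uint32_t reg940, unsigned *rows,
                                     unsigned *cols, unsigned *type_id) {
        *rows    = reg940 & 0xffu;           /* row field 942 */
        *cols    = (reg940 >> 8) & 0xffu;    /* column field 944 */
        *type_id = (reg940 >> 16) & 0xffffu; /* processor type identifier field 946 */
    }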

FIGS. 10-12 illustrate examples of multicore processors configured such that resources are shared among physical processor cores. FIG. 10 illustrates an example of multiple logical processors executing multiple threads of a program. FIG. 11 illustrates an example of a logical cache including multiple physical cache banks. FIG. 12 illustrates an example of multiple logical processors executing multiple threads of a program and using a shared logical cache.

In FIG. 10, processor 1000 includes eight physical block-based processor cores 1020, 1030, 1040, 1050, and 1080-1083 arranged in a two-column, four-row array. Processor 1000 is illustrated during execution of a program having multiple threads, including thread 0 and thread 1. Each thread may include one or more instruction blocks that may be executed on a logical processor of processor 1000. For example, thread 0 may be executed on logical processor 1010, which includes physical processor cores 1020 and 1030, and thread 1 may be executed on logical processor 1060, which includes physical processor cores 1040 and 1050. The logical processors may be composed by a runtime system that programs the control registers of the physical block-based processor cores of processor 1000. For example, each logical processor may be configured by programming the respective composition control registers and composition topology registers of its physical processor cores. As a specific example, the logical processor 1010 may be configured in the following manner: programming the composition control registers of the physical processor cores 1020 and 1030 such that each physical processor core is composed as part of a logical processor and such that the L1 data cache is shared; and programming the composition topology registers of the physical processor cores 1020 and 1030 such that the physical processor cores 1020 and 1030 are grouped together. Thus, the physical processor cores 1020 and 1030 may be composed into a logical processor 1010 with a logical cache 1015. The L1 data cache of the physical processor core 1020 may be a first bank of the logical cache 1015 and the L1 data cache of the physical processor core 1030 may be a second bank of the logical cache 1015. Grouping the physical processor cores 1020 and 1030 into the logical processor 1010 may be specified by programming only the number of physical processor cores into the composition topology registers of the physical processor cores 1020 and 1030. Alternatively, the grouping may be specified by programming a bitmap of the physical processor cores into the composition topology registers of the physical processor cores 1020 and 1030.
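
The FIG. 10 configuration could be sketched as a handful of memory-mapped register writes. Everything below — the write helper, the base addresses, and the register offsets — is invented for illustration; the bit encoding reuses the table-910 layout described above.

    #include <stdint.h>

    /* Hypothetical memory-mapped register write. */
    static inline void mmio_write(uintptr_t addr, uint32_t value) {
        *(volatile uint32_t *)addr = value;
    }

    /* Invented base addresses for per-core composition registers. */
    #define CORE_MCR(core)  (0xF0000000u + (uintptr_t)(core) * 0x100u)
    #define CORE_TOPO(core) (0xF0000004u + (uintptr_t)(core) * 0x100u)

    void compose_logical_processor_1010(void) {
        uint32_t mcr  = (1u << 4) | (1u << 3) | (1u << 0); /* powered up, logical core, share L1 */
        uint32_t topo = (1u << 0) | (1u << 1);             /* bitmap: group cores 0 and 1 */
        for (int core = 0; core <= 1; core++) {
            mmio_write(CORE_MCR(core), mcr);
            mmio_write(CORE_TOPO(core), topo);
        }
    }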

Similarly, logical processor 1060 may be configured by: programming the composition control registers of the physical processor cores 1040 and 1050 so that each physical processor core is composed as part of a logical processor and so that the L1 data cache is shared; and programming the composition topology registers of the physical processor cores 1040 and 1050 so that the physical processor cores 1040 and 1050 are grouped together. Thus, physical processor cores 1040 and 1050 may be composed into a logical processor 1060 with a logical cache 1065. The L1 data cache of physical processor core 1040 may be a first bank of logical cache 1065 and the L1 data cache of physical processor core 1050 may be a second bank of logical cache 1065.

After logical processors 1010 and 1060 are configured, threads 0 and 1 may execute independently on the respective logical processors. For example, one instruction block of thread 0 may execute non-speculatively on a physical processor core (e.g., physical processor core 1020), and a different instruction block of thread 0 may execute speculatively on a different physical processor core (e.g., physical processor core 1030). When the non-speculative instruction block is committed, a speculatively executed block may become non-speculative if it was correctly predicted and it is the oldest executing block within the thread. An instruction block executing on one physical processor core may access the logical cache distributed across the two physical processor cores 1020 and 1030. Thus, the logical cache may provide a larger cache than the physical cache of a single physical processor core.

The physical block-based processor cores 1080-1083 may be powered down to save power. For example, each of the physical processor cores 1080-1083 may be powered down by programming the respective composition control register or other power control register of the respective physical processor core. Powering down the physical processor cores 1080-1083 may include placing the physical processor cores 1080-1083 in a low power mode. For example, the physical processor cores 1080-1083 may be clock-gated, clocked at a reduced frequency, powered at a lower voltage, or shut off.

FIG. 11 illustrates an example of a processor 1100 including eight physical block-based processor cores arranged in a two-column, four-row array. Processor 1100 is illustrated during execution of a single thread of a program. The thread may include one or more instruction blocks that may be executed on physical processor core 1120. The physical processor core 1120 may access a logical data cache 1140 that extends beyond the physical processor core 1120. The logical data cache 1140 may be composed by a runtime system that programs the control registers of the physical block-based processor cores of processor 1100. As a specific example, logical data cache 1140 may be configured in the following manner: programming the composition control register of the physical processor core 1120 so that the L1 data cache is shared and the physical processor core is powered up without being composed into a logical processor; programming the composition control register of the physical processor core 1130 so that most of the physical processor core is powered down while the L1 data cache is powered up and shared; and programming the composition topology registers of physical processor cores 1120 and 1130 so that physical processor cores 1120 and 1130 are grouped together. Thus, physical processor cores 1120 and 1130 may be grouped such that physical processor core 1120 may execute the instruction blocks of the thread using logical data cache 1140, which doubles the size of the physical data cache of a single physical processor core. This may result in higher performance, through a lower data cache miss rate, than executing the thread with a single physical processor core, and in lower power than having both physical processor cores 1120 and 1130 execute the thread.

FIG. 12 illustrates an example of a processor 1200 including eight physical block-based processor cores arranged in a two-column, four-row array. Processor 1200 is illustrated during execution of a program having a plurality of threads, including thread 0 and thread 1. Each thread may include one or more instruction blocks that may be executed on a logical processor of processor 1200. For example, thread 0 may be executed on a logical processor that includes physical processor core 1220, and thread 1 may be executed on a logical processor that includes physical processor core 1230. The logical processors and threads may share a logical data cache 1240. This may be beneficial, for example, when different threads read from a common data set. The logical processors and logical cache may be composed by a runtime system that programs the control registers of the physical block-based processor cores of processor 1200. As a specific example, processor 1200 may be configured in the following manner: programming the composition control register of the physical processor core 1220 such that the L1 data cache is shared and the physical processor core 1220 is powered up and not composed into a logical processor; programming the composition control register of physical processor core 1230 such that the L1 data cache is shared and physical processor core 1230 is powered up and not composed into a logical processor; and programming the composition topology registers of physical processor cores 1220 and 1230 such that physical processor cores 1220 and 1230 are grouped together. Thus, physical processor cores 1220 and 1230 may execute independent threads that share logical cache 1240. This may result in higher performance, through a lower miss rate in the data cache, than executing the threads using only the data cache within a single physical processor core.

X. Example method of programming a processor

FIGS. 13 and 14 are flowcharts illustrating example methods of programming a processor to execute threads of a program using logical processors and/or logical resources. For example, the methods may be performed, at least in part, by a runtime system executing software instructions on a control unit of the processor or on a physical block-based processor core of the processor. FIG. 13 is a flow diagram illustrating an example method 1300 for programming a processor to execute threads of a program using shared logical resources. For example, a processor may include a plurality of block-based physical processor cores including a first physical processor core and a second physical processor core. A respective physical processor core may include one or more shareable resources such as caches, register files, execution units, and/or various other logical or mathematical components.

At process block 1310, the composition control register of the first physical processor core may be programmed to configure a given resource of the first physical processor core to be shared with the second physical processor core. For example, the program may include a thread initialization instruction block, and the first physical processor core may be programmed during runtime of the program using instructions of the thread initialization instruction block. The thread initialization instruction block may be executed on any one of the plurality of physical processor cores or on a control unit of the processor. The composition control register may be programmed by executing a store instruction to a memory-mapped address corresponding to the composition control register. System registers such as the composition control registers and composition topology registers are visible architectural state, and thus the values stored in the system registers may only be updated during the commit phase of the instruction block that updates them. Thus, the value to be written to the composition control register by the store instruction may initially be buffered, and the composition control register may be updated during the commit phase of the thread initialization instruction block.
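
A minimal sketch of such a store, assuming a memory-mapped register address (the address and names are hypothetical):

    #include <stdint.h>

    /* Hypothetical memory-mapped address of the first core's
     * composition control register. */
    #define CCR_CORE0 ((volatile uint32_t *)0xF0000000u)

    /* Executed from within a thread initialization instruction block.
     * On a block-based core the stored value is buffered and only
     * becomes architecturally visible when the block commits. */
    void program_composition(uint32_t encoding) {
        *CCR_CORE0 = encoding;   /* e.g., share the L1 data cache: bit 0 set */
    }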

As described above, one or more of the shareable resources may be shared with other physical processor cores to create logical resources. The physical processor cores sharing the logical resources may have different relationships to each other. For example, the first processor core and the second processor core may be combined to form a logical processor core that executes a single thread of the program. Thus, logical resources may be shared within a single thread. As another example, the first processor core and the second processor core may be configured to execute different threads of a program. Thus, logical resources may be shared across different threads.

The composition control register or a power control register of the first physical processor core may be programmed to configure the first physical processor core in a different power mode when the given resource of the first physical processor core is shared with the second physical processor core. For example, the first physical processor core may be in a low power mode. As a specific example, the resources of the first physical processor core may default to a low power mode unless particular resources are shared between the different physical processor cores. Thus, by selecting only the cache of the first physical processor core to be shared, the cache may be powered up in an operational mode while the rest of the first physical processor core is in a low power mode. By powering down portions of the first physical processor core, energy may be saved while executing the program.

At process block 1320, the composition topology register may optionally be programmed to configure a set of physical processor cores to share resources. For example, the set of physical processor cores may include a first physical processor core and a second physical processor core. The set of physical processor cores may be specified by providing a bitmap of the set or by providing the number of physical processor cores in the set. For example, a bitmap may be used to specify physical processor cores from anywhere on the processor, whether the physical processor cores are adjacent or non-adjacent to each other. As another example, providing the number of physical processor cores may form an automatic grouping based on the number in the group and the topology of the processors.

At process block 1330, execution of the instruction block may be initiated on the second physical processor core. The instruction block may include instructions to use a given sharable resource configured on the first physical processor core such that the instructions are executed on the second physical processor core using the sharable resource of the first physical processor core. For example, execution of the instruction block may be initiated on the second physical processor core by allocating the instruction block to the second physical processor core. In particular, addresses corresponding to the instruction block may be loaded into a program counter of the second physical processor core so that the second physical processor core may retrieve and execute the instruction block.

FIG. 14 is a flow diagram illustrating an example method 1400 for programming a logical processor to execute threads of a program. For example, a processor may include a plurality of block-based physical processor cores including a first physical processor core and a second physical processor core. A respective physical processor core may include one or more shareable resources such as caches, register files, execution units, and/or various other logical or mathematical components.

At process block 1410, the composition topology register may be programmed to compose a logical processor that includes the first physical processor core and the second physical processor core. For example, the composition topology register may be programmed during runtime of the program, such as by executing instructions of a thread initialization instruction block for a given thread. The logical processor may be used to execute non-speculative and speculative instruction blocks of the given thread of the program. The physical processor cores of the logical processor may be designated by providing a bitmap of the group or by providing the number of physical processor cores in the group. For example, a bitmap may be used to specify physical processor cores from anywhere on the processor, regardless of whether the physical processor cores are in physical proximity. As another example, providing the number of physical processor cores may form an automatic grouping based on the number in the group and the topology of the processor.

The runtime system may be designed to run on a variety of different platforms, and it may select a physical processor core of a logical core based on various factors, such as the topology of the processor, utilization of the physical processor core, or a fixed mapping of functions within the processor. As one example, a processor topology register may store topology information about the processor, such as the arrangement of physical processor cores. Thus, a runtime system may be executed on a processor having physical processor cores arranged in an array of 1 × 4, 2 × 2, 2 × 4, 4 × 4, 8 × 8, 4 × 8, or other size. Using the topology information, the runtime system may determine which physical processor cores are physically adjacent and/or physically close to each other. It may be desirable for the physical processor cores of the logical core to be close together to reduce communication latency between the physical processor cores. The runtime system may maintain a list of free physical processor cores. The runtime system may preferentially select a physical processor core of a logical core from a processor region in which there are a sufficient number of free and adjacent physical processor cores. Alternatively, the runtime system may use idle physical processor cores anywhere on the processor to select the physical processor core of the logical core.
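
The proximity-aware selection described above can be sketched as follows. The row-major core grid and the free-core map maintained by the runtime are assumptions; the patent leaves the selection policy to the runtime system.

    #include <stdbool.h>

    /* Find two free, adjacent cores in the same row of a rows x cols grid.
     * Returns 0 on success; -1 if no adjacent pair is free, in which case
     * the caller may fall back to any idle cores on the processor. */
    int pick_adjacent_pair(const bool *free_core, unsigned rows, unsigned cols,
                           int *core_a, int *core_b) {
        for (unsigned r = 0; r < rows; r++) {
            for (unsigned c = 0; c + 1 < cols; c++) {
                int a = (int)(r * cols + c);
                if (free_core[a] && free_core[a + 1]) {
                    *core_a = a;
                    *core_b = a + 1;
                    return 0;
                }
            }
        }
        return -1;
    }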

At process block 1420, the composition control register of the first physical processor core may optionally be programmed to control sharing of a given resource of the first physical processor core with the second physical processor core. For example, controlling the sharing of the given resource may include keeping the given resource private to the first physical processor core. Thus, within a logical processor, a respective physical processor core may have private resources that are not shared with the other physical processor cores of the logical processor. As another example, controlling the sharing of the given resource may include sharing the given resource with the second physical processor core. When a given resource is shared, it may be part of a larger logical resource. The composition control register of the first physical processor core may also be programmed to configure the first physical processor core in a different power mode. As a specific example, the cache of the first physical processor core may be in a low power mode and not used by the logical processor. This may be beneficial when the first physical processor core is added to an existing logical processor and it is desired that the cache not be remapped. Furthermore, when the data set of the thread is small, powering off the cache of the first physical processor core may save energy when executing the program.

At process block 1430, a given thread of a program may be assigned to a logical processor. For example, a logical processor may be reserved for instruction blocks associated with a thread identifier of a given thread. A logical core number may be assigned to a logical processor and stored in a logical core number register of each physical processor core of the logical processor. The logical core number may be the same as the thread identifier or may be associated with the thread identifier.

At process block 1440, execution of a first instruction block of a given thread may be initiated on a first physical processor core of a logical processor. For example, an address of a first instruction block may be loaded onto a first physical processor core such that instructions of the first instruction block may be fetched and executed by the first physical processor core. As a specific example, the thread initialization instruction block may store an address of a first instruction block of the given thread in a program counter register of the first physical processor core. When the thread initialization instruction block commits and the address of the first instruction block is updated in the program counter register, the first instruction block (and the given thread) may begin execution.

XI. Example method of executing a program

FIGS. 15-17 are flow diagrams illustrating example methods of configuring and/or executing threads of a program on logical processors and/or logical resources of a processor. For example, a processor may include a plurality of block-based physical processor cores including a first physical processor core and a second physical processor core. A respective physical processor core may include one or more shareable resources such as caches, register files, execution units, and/or various other logical or mathematical components. The physical processor cores may be grouped or composed into a logical processor. Similarly, the shareable resources of the physical processor cores may be grouped or composed into logical resources. FIG. 15 is a flow diagram illustrating an example method 1500 for executing instructions of a thread of a program on a physical processor core using shared logical resources.

At process block 1510, a given resource of the first physical processor core may be configured to be shared with the second physical processor core according to the programmable composition control register of the first physical processor core. For example, the composition control register may be programmed by a thread initialization instruction block during runtime of the program. The given resource may be configured to be shared when the thread initialization instruction block is committed. The given resource may be configured as part of a larger logical resource. For example, the given resource may be a cache that forms a bank of a larger logical cache. The shareable resources of the first physical processor core may be selectively shared, such that a first resource is shared while other resources are not shared. The unshared resources may be used locally by the first physical processor core, or the unshared resources may be powered down. For example, the given resource may be the only major functional unit of the first physical processor core that is powered on and operational.

At process block 1520, a set of physical processor cores may optionally be configured to share resources according to the programmable composition topology register. For example, the set of physical processor cores may include the first physical processor core and the second physical processor core. The set of physical processor cores may be specified by a bitmap of the set within a field of the composition topology register. Alternatively, the set of physical processor cores may be specified by the number of physical processor cores of the set programmed into the composition topology register. For example, a bitmap may be used to specify physical processor cores from anywhere on the processor, regardless of whether the physical processor cores are adjacent to each other. As another example, the number of physical processor cores may form an automatic grouping based on the number in the group and the topology of the processor. In particular, for a logical processor of a given size, there may be a fixed mapping of physical processor cores to logical processors, such as described with reference to FIG. 7.

At process block 1530, instructions of the instruction block may be executed on the second physical processor core using the sharable resources of the first physical processor core. In particular, the decoded instructions may be stored in an instruction window of the second physical processor core and issued by instruction issue or scheduling logic of the second physical processor core for execution. The sharable resources of the first physical processor core are used for execution of instructions. As one example, operands of instructions may be loaded or read from shareable resources. As another example, the instruction may be partially or fully executed using a shared execution unit of the first physical processor core. As another example, the instructions may store or write the output results to the shareable resource.

Executing the instructions may include various steps or micro-operations of the instructions, including: managing communication between the second physical processor core and the first physical processor core; mapping an index field associated with the instruction to the shareable resource; scheduling micro-operations of the instruction; and marking the instruction as complete in the instruction window of the second physical processor core. The instruction may take multiple clock cycles to complete to account for communication delays between the second physical processor core and the first physical processor core.

FIG. 16 is a flow diagram illustrating an example method 1600 for executing a thread of a program on a logical processor. For example, a processor may include a plurality of block-based physical processor cores including a first physical processor core and a second physical processor core. A respective physical processor core may include one or more shareable resources such as caches, register files, execution units, and/or various other logical or mathematical components. The physical processor cores may be grouped or composed into a logical processor. Similarly, the shareable resources of the physical processor cores may be grouped or composed into logical resources.

At process block 1610, a logical processor may be composed from programmable composition topology registers. The logical processor may include a first physical processor core and a second physical processor core. For example, the composition topology register may be programmed during runtime of the program, such as by executing instructions of a thread initialization instruction block for a thread. Logical processors may be used to execute non-speculative and speculative instruction blocks of a thread of a program. The physical processor cores of a logical processor may be designated by providing a bitmap of the group or by providing the number of physical processor cores in the group. For example, a bitmap may be used to specify individual physical processor cores from any location on the processor. As another example, the physical processor cores of a group may have a fixed mapping depending on the number in the group and the topology of the processors.

At process block 1620, a given resource of a first physical processor core may optionally be configured to be shared with a second physical processor core according to a composite control register of the first physical processor core. Shareable resources of a physical processor core may have a default setting for indicating whether the resource is shared with other physical processor cores. For example, by default, shareable resources may be private within a physical processor core. As another example, by default, shareable resources may be shared among physical processor cores of a logical processor. Different shareable resources may have different default settings. For example, a cache may default to being shared among physical processor cores of a logical processor, and an execution unit may default to being private within its physical processor cores. The composition control register may be used to override the default shared setting by programming different shared settings in the composition control register.

At process block 1630, a first instruction block of the thread may be non-speculatively executed on the first physical processor core. A thread has a single non-speculative instruction block executing at any given time; the non-speculative instruction block is the first-issued, or oldest, instruction block of the thread. Additional instruction blocks of the thread may be started in different instruction windows or on different physical processor cores before the non-speculative instruction block is committed. However, the additional instruction blocks are speculative until they are known to commit. For example, the non-speculative instruction block may receive an interrupt to be serviced before the additional instruction blocks can be committed. As another example, an additional instruction block may be on a mispredicted path. When the non-speculative block commits, it may send a commit signal, along with the exit branch address, to all of the other physical processor cores of the logical processor. Speculative blocks on the correct path may continue execution, while blocks on untaken paths may be aborted.

The non-speculative execution of the first instruction block may include: predicting a branch target address of the first instruction block such that a path of the speculative instruction block for the thread may be determined; sending the predicted branch target address to one or more physical processor cores of a logical processor; executing instructions of the first instruction block using logical resources shared with another physical processor core; determining whether a commit condition is satisfied; starting a commit phase when a commit condition is satisfied; and sending a commit signal and the target branch address to other physical processor cores of the logical processor when the commit phase is complete.

At process block 1640, a second instruction block of the thread may be speculatively executed on a second physical processor core. The second physical processor core may speculatively issue instructions until it is known whether the second instruction block is on the wrong predicted path or until the second instruction block becomes a non-speculative instruction block. For example, the second instruction block may become a non-speculative instruction block if its starting address matches the target branch address of the committed non-speculative instruction block. Even if the speculative instruction block completes execution before the non-speculative instruction block, the speculative instruction block cannot commit until the speculative instruction block is converted to a non-speculative instruction block.

Speculative execution of the second instruction block may include: fetching the second instruction block prior to the commit of the first instruction block; executing instructions of the second instruction block using logical resources shared with another physical processor core; comparing the start address of the second instruction block with the target branch address when the commit signal is received; converting the second instruction block to a non-speculative instruction block when the start address matches the target branch address; comparing the predicted path to the executed path; aborting the second instruction block when the predicted path differs from the executed path; and delaying the commit phase until the second instruction block is a non-speculative instruction block.

In this manner, the physical processor cores of the logical processor may be used to execute instruction blocks of the thread. In particular, both speculative and non-speculative instruction blocks may be executed concurrently on a physical processor core of a logical processor. When the speculative path is correctly predicted, non-speculative instruction blocks may be migrated between physical processor cores by converting one of the speculative instruction blocks to a non-speculative instruction block at the time of the commit of the earlier non-speculative instruction block. The physical processor core selected to speculatively execute the respective instruction block of the thread may be selected according to an allocation policy. For example, the allocation policy may be a round robin scheduling of the physical processor cores of the logical processor.
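
A minimal sketch of such a round-robin allocation policy over the cores of a logical processor (types and names invented for illustration):

    /* Round-robin selection of the next physical core to receive a
     * speculative instruction block of the thread. */
    typedef struct {
        const int *cores;   /* physical core IDs in the logical processor */
        int count;          /* number of cores in the logical processor */
        int next;           /* index of the next core to allocate */
    } rr_sched;

    static int rr_next_core(rr_sched *s) {
        int core = s->cores[s->next];
        s->next = (s->next + 1) % s->count;
        return core;
    }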

FIG. 17 is a flow diagram illustrating an example method 1700 for initiating execution of a thread of a program on a logical processor. For example, a processor may include multiple block-based physical processor cores. A respective physical processor core may include one or more shareable resources such as caches, register files, execution units, and/or various other logical or mathematical components. The physical processor cores may be grouped or composed into a logical processor. Similarly, the shareable resources of the physical processor cores may be grouped or composed into logical resources.

At 1710, speculative execution may be quiesced for instruction blocks that would execute after the thread initialization instruction block. The thread initialization instruction block may include instructions for configuring resources for a thread of the program to execute. By quiescing, or stopping, speculative execution after the thread initialization instruction block, idle physical processor cores may be identified and programmed to compose a logical processor. The runtime system may dynamically track idle physical processor cores so that the composition of logical processors may be based on current runtime information.

At 1720, store operations to the composition topology and/or composition control registers may be performed from within the thread initialization instruction block. For example, each physical processor core of a logical processor may have a composition control register that specifies whether the physical processor core is to be composed as part of the logical processor core and whether resources of the physical processor core are to be shared. The composition control registers of the respective physical processor cores of the logical processor may be programmed differently, such that different resources of the respective physical processor cores are shared. Each physical processor core of a logical processor may also have a composition topology register for specifying the set of physical processor cores with which resources are to be composed and/or shared.

At 1730, a branch operation to a first instruction block of the thread may be performed during the commit phase of the thread initialization instruction block to initiate execution of the thread on the logical processor. The branch operation may include loading the address of the first instruction block into a program counter of one of the physical processor cores of the logical processor. When the thread initialization instruction block commits, the first instruction block of the thread becomes the non-speculative instruction block, and the thread may begin execution from the first instruction block.
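
Putting steps 1710-1730 together, the thread initialization block behaves roughly as follows. Every function here is a placeholder standing in for hardware behavior, not a real API:

    #include <stdint.h>

    /* Placeholder operations standing in for hardware behavior. */
    void quiesce_speculation(void);                                 /* step 1710 */
    void store_topology_and_control(uint64_t topo, uint32_t mcr);   /* step 1720 */
    void branch_to(uintptr_t first_block_addr);                     /* step 1730 */

    void thread_init_block(uint64_t topo, uint32_t mcr, uintptr_t first_block) {
        quiesce_speculation();                  /* stop speculative fetch past this block */
        store_topology_and_control(topo, mcr);  /* buffered until this block commits */
        branch_to(first_block);                 /* loads the PC; thread starts at commit */
    }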

XII. Example computing environment

FIG. 18 illustrates a generalized example of a suitable computing environment 1800 in which the described embodiments, techniques, and technologies may be implemented.

The computing environment 1800 is not intended to suggest any limitation as to the scope of use or functionality of the technology, as the technology may be implemented in diverse general-purpose or special-purpose computing environments. For example, the disclosed technology may be implemented with other computer system configurations, including hand-held devices, multiprocessor systems, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. The disclosed technology may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules (including executable instructions for block-based instruction blocks) may be located in both local and remote memory storage devices.

Referring to FIG. 18, the computing environment 1800 includes at least one block-based processing unit 1810 and memory 1820. In FIG. 18, this most basic configuration 1830 is included within the dashed line. The block-based processing unit 1810 executes computer-executable instructions and may be a real or a virtual processor. In a multi-processing system, multiple processing units execute computer-executable instructions to increase processing power; as such, multiple processors may run concurrently. The memory 1820 may be volatile memory (e.g., registers, cache, RAM), non-volatile memory (e.g., ROM, EEPROM, flash memory, etc.), or a combination of the two. The memory 1820 stores software 1880, images, and video that may, for example, implement the technologies described herein. The computing environment may have additional features. For example, the computing environment 1800 includes storage 1840, one or more input devices 1850, one or more output devices 1860, and one or more communication connections 1870. An interconnection mechanism (not shown), such as a bus, controller, or network, interconnects the components of the computing environment 1800. Typically, operating system software (not shown) provides an operating environment for other software executing in the computing environment 1800 and coordinates the activities of the components of the computing environment 1800.

Storage 1840 may be removable or non-removable, and includes magnetic disks, magnetic tapes or cassettes, CD-ROMs, CD-RWs, DVDs, or any other medium which can be used to store information and which can be accessed within the computing environment 1800. Storage 1840 stores instructions for the software 1880, plug-in data, and messages, which may be used to implement the technologies described herein.

Input device(s) 1850 may be a touch input device, such as a keyboard, a keypad, a mouse, a touch screen display, a pen or trackball, a voice input device, a scanning device, or another device that provides input to computing environment 1800. For audio, the input device(s) 1850 can be a sound card or similar device that accepts audio input in analog or digital form, or a CD-ROM reader that provides audio samples to the computing environment 1800. Output device(s) 1860 can be a display, printer, speaker, recorder or another device that provides output from computing environment 1800.

Communication connection(s) 1870 enable communication with another computing entity over a communication medium (e.g., a connecting network). The communication medium conveys information such as computer-executable instructions, compressed graphics information, video, or other data in a modulated data signal. The communication connection(s) 1870 are not limited to wired connections (e.g., megabit or gigabit Ethernet, InfiniBand, Fibre Channel over electrical or fiber optic connections) but also include wireless technologies (e.g., RF connections via Bluetooth, WiFi (IEEE 802.11a/b/n), WiMax, cellular, satellite, laser, infrared) and other suitable communication connections for providing network connectivity for the disclosed agents, bridges, and agent data consumers. In a virtual host environment, the communication connection(s) may be a virtualized network connection provided by the virtual host.

Some embodiments of the disclosed methods may be performed using computer-executable instructions that implement all or portions of the disclosed techniques in the computing cloud 1890. For example, the disclosed compilers and/or servers of the block-based processor may be located in the computing environment 1830, or the disclosed compilers may execute on servers located in the computing cloud 1890. In some examples, the disclosed compilers execute on conventional central processing units (e.g., RISC or CISC processors).

Computer readable media are any available media that can be accessed within the computing environment 1800. By way of example, and not limitation, with the computing environment 1800, computer-readable media include memory 1820 and/or storage 1840. As should be readily understood, the term computer-readable storage medium includes media for data storage (such as the memory 1820 and the storage 1840) and not transmission media (such as modulated data signals).

XIII. Additional Examples of the Disclosed Technology

Additional examples of the disclosed subject matter are discussed herein in terms of the examples discussed above.

In one embodiment, a processor includes a plurality of block-based physical processor cores to execute a program including a plurality of instruction blocks arranged in one or more threads. A respective physical processor core is configured to execute an instruction block of a given thread of the program. The respective physical processor core includes a level one data cache and a programmable cache composition control register. The programmable cache composition control register may be used to compose a logical cache for the given thread using the level one data cache and a level one data cache of another of the plurality of block-based physical processor cores. The cache composition control register is adapted to be programmed at run-time of the program. The cache composition control register may be programmable in part by executing a store instruction, of an instruction block executing on any of the plurality of cores, to a memory-mapped address corresponding to the cache composition control register. The value of the cache composition control register may be updated during a commit phase of an instruction block having a store instruction to the memory-mapped address corresponding to the cache composition control register.
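For illustration only — the register address below is a hypothetical placeholder, not defined by the disclosure — a store to the memory-mapped cache composition control register might look like:

```c
#include <stdint.h>

/* Hypothetical memory-mapped address of the cache composition control
 * register; the real address is implementation-defined. */
#define CACHE_COMPOSITION_CTRL ((volatile uint32_t *)0xFFFF0040u)

/* Request a logical cache built from the given number of level one data
 * caches.  Per the commit-phase update described above, the write takes
 * effect when the instruction block containing this store commits. */
static void compose_logical_cache(uint32_t num_l1_banks)
{
    *CACHE_COMPOSITION_CTRL = num_l1_banks; /* e.g., 4 */
}
```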

As one example, the cache composition control register may include a field for programming the number of level one data caches to be used for the logical cache, and the level one data caches to be used for the logical cache are determined only by the relative physical locations of the respective physical processor cores and the number of level one data caches to be used for the logical cache. The mapping of memory addresses to the level one data caches of the logical cache is predetermined by the relative physical location of each level one data cache and the number of level one data caches to be used for the logical cache. Alternatively, the cache composition control register may include a first field for programming the number of level one data caches to be used for the logical cache, and the level one data caches for the logical cache are identified using additional fields of the cache composition control register. Each of the additional fields corresponds to a mapping of an address range to one of the level one data caches of the logical cache.
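The two encodings described above might be sketched as follows; the field positions and widths are assumptions made for illustration, not the disclosed layout:

```c
#include <stdint.h>

/* Encoding 1: a single count field; the participating caches are implied
 * by the cores' relative physical locations. */
#define COMP_NUM_BANKS_SHIFT  0
#define COMP_NUM_BANKS_MASK   0xFu  /* number of L1 data caches in the logical cache */

/* Encoding 2: a count field plus per-bank fields naming the core whose
 * L1 data cache serves each address partition of the logical cache.
 * With 4 bits per bank id, a 32-bit register holds up to 7 bank fields. */
#define COMP_BANK_FIELD_SHIFT(i)  (4u + 4u * (i))
#define COMP_BANK_FIELD_MASK      0xFu

static uint32_t encode_explicit_banks(uint32_t nbanks, const uint8_t *core_ids)
{
    uint32_t value = (nbanks & COMP_NUM_BANKS_MASK) << COMP_NUM_BANKS_SHIFT;
    for (uint32_t i = 0; i < nbanks; i++)
        value |= (uint32_t)(core_ids[i] & COMP_BANK_FIELD_MASK)
                 << COMP_BANK_FIELD_SHIFT(i);
    return value;
}
```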

Processors may be used in a variety of different computing systems. For example, the server computer may include: non-volatile memory and/or storage; network connection; a memory storing one or more instruction blocks; and a processor including a block-based processor core for executing the instruction block. As another example, an apparatus may include: a user interface component; non-volatile memory and/or storage; cellular and/or network connectivity; a memory storing one or more of the instruction blocks; and a processor including a block-based processor for executing the instruction block. The user interface components may include at least one or more of: a display, a touch screen display, a tactile input/output device, a motion sensing input device, and/or a voice input device.

In one embodiment, a processor includes a plurality of block-based processor cores for executing a program including a plurality of instruction blocks. A respective block-based processor core includes one or more shareable resources and a programmable composition control register. The programmable composition control register may be used to configure which of the one or more shareable resources are shared with other processor cores of the plurality of processor cores. The programmable composition control register is dynamically programmable during execution of the program.

As one example, the one or more shareable resources may be shared only within a logical processor core that includes the respective block-based processor core, where the logical processor core is configured to execute a single thread of the program. The respective block-based processor core may also include a programmable composition topology register to specify which of the plurality of block-based processor cores are to be composed to form the logical processor core. Thus, the one or more shareable resources may be programmed to be shared within a single thread. As another example, the one or more shareable resources may include a level one data cache. The respective block-based processor core may be configured to execute a first thread of the program, and the level one data cache may be configured to be shared with a second block-based processor core executing a second thread of the program. Thus, the one or more shareable resources may be programmed to be shared across multiple threads.
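A minimal sketch of programming the composition topology register, assuming (as the claims recite) one bit per physical core and a hypothetical memory-mapped address:

```c
#include <stdint.h>

/* Hypothetical memory-mapped address of the composition topology register. */
#define COMPOSITION_TOPOLOGY ((volatile uint32_t *)0xFFFF0044u)

/* Group physical cores 0..3 into one logical processor core by setting
 * the bit for each participating core; shareable resources are then
 * shared only within this group while it executes a single thread. */
static void compose_logical_core_0_to_3(void)
{
    *COMPOSITION_TOPOLOGY = 0xFu; /* bits 0-3 set: cores 0, 1, 2, 3 */
}
```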

The one or more shareable resources may be selected from one or more of the following: a level one data cache, a register file, an execution unit, and the complete respective block-based processor core. The composition control register may be programmable to place a portion of the respective core in a low power mode while sharing selected ones of the one or more shareable resources. The composition control register may be programmable to disable particular ones of the one or more shareable resources during operation of the respective block-based processor core.
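One way to picture these per-resource controls — the bit assignments below are illustrative assumptions, echoing the enable bits recited in claim 1 rather than a disclosed encoding:

```c
#include <stdint.h>

/* Hypothetical bit layout for a per-core composition control register. */
enum {
    COMP_SHARE_L1_DCACHE  = 1u << 0, /* L1 data cache powered up and shareable */
    COMP_SHARE_REG_FILE   = 1u << 1, /* register file shareable */
    COMP_SHARE_EXEC_UNITS = 1u << 2, /* execution units shareable */
    COMP_POWER_UP_REST    = 1u << 3, /* power up the rest of the core */
};

/* Share only the L1 data cache while the rest of the core stays in a low
 * power mode (COMP_POWER_UP_REST deliberately left clear). */
static const uint32_t share_cache_only = COMP_SHARE_L1_DCACHE;
```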

In one embodiment, a method of executing a program on a processor may be performed. The processor may include a plurality of block-based processor cores, where a respective processor core includes one or more shareable resources. The method includes: configuring a given shareable resource of a first processor core to be shared with a second processor core of the plurality of processor cores in accordance with a programmable composition control register of the first processor core. The method further includes: executing instructions of an instruction block on the second processor core using the given shareable resource configured on the first processor core. The given shareable resource may be a cache, where the cache of the first processor core is a first bank of a logical cache and the cache of the second processor core is a second bank of the logical cache. A memory address of an instruction may include an index field, and the index field may be used to determine whether the memory address maps to the first bank or the second bank.
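A minimal sketch of the bank selection implied by the index field, assuming (for illustration only) a two-bank logical cache, 64-byte cache lines, and the index bit placed just above the line offset:

```c
#include <stdint.h>

#define LINE_OFFSET_BITS 6u   /* 64-byte cache lines */
#define BANK_INDEX_BITS  1u   /* log2(number of banks) for two banks */

/* Returns 0 if the address maps to the first core's cache bank,
 * 1 if it maps to the second core's bank. */
static uint32_t select_bank(uint64_t addr)
{
    return (uint32_t)((addr >> LINE_OFFSET_BITS)
                      & ((1u << BANK_INDEX_BITS) - 1u));
}
```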

As one example, a first processor core may be configured to execute a first thread of instructions of a program and a second processor core may be configured to execute a second thread of instructions of the program. As another example, the first processor core may be in a low power mode. As another example, the first processor core and the second processor core may be combined to form a logical processor core that executes a single thread of the program.

In view of the many possible embodiments to which the principles of the disclosed subject matter may be applied, it should be recognized that the illustrated embodiments are only preferred examples and should not be taken as limiting the scope of the claims to those preferred examples. Rather, the scope of the claimed subject matter is defined by the appended claims. We therefore claim as our invention all that comes within the scope of these claims.
