System with hybrid-threaded processor, hybrid-threaded fabric with configurable computing elements, and hybrid interconnect network
Abstract (created by T. M. Brewer, 2018-10-31): Representative device, method, and system embodiments for configurable computing are disclosed. In a representative embodiment, a system comprises: a first interconnection network; a processor coupled to the interconnection network; a host interface coupled to the interconnection network; and at least one configurable circuit group coupled to the interconnection network, the configurable circuit group including: a plurality of configurable circuits arranged in an array; a second, asynchronous packet network coupled to each configurable circuit of the plurality of configurable circuits of the array; a third, synchronization network coupled to each configurable circuit of the plurality of configurable circuits of the array; a memory interface circuit coupled to the asynchronous packet network and the interconnection network; and a scheduling interface circuit coupled to the asynchronous packet network and the interconnection network.
1. A system, comprising:
a first interconnection network;
a processor coupled to the interconnection network;
a host interface coupled to the interconnection network; and
at least one configurable circuit group coupled to the interconnection network, the configurable circuit group comprising:
a plurality of configurable circuits arranged in an array;
a second asynchronous packet network coupled to each configurable circuit of the plurality of configurable circuits of the array;
a third synchronization network coupled to each configurable circuit of the plurality of configurable circuits of the array;
a memory interface circuit coupled to the asynchronous packet network and the interconnection network; and
a scheduling interface circuit coupled to the asynchronous packet network and the interconnection network.
2. The system of claim 1, wherein the interconnection network comprises:
a first plurality of crossbars having a folded Clos configuration and a plurality of direct mesh connections at interfaces with system endpoints.
3. The system of claim 2, wherein the asynchronous packet network comprises:
a second plurality of crossbars, each crossbar coupled to at least one configurable circuit of the plurality of configurable circuits of the array and another crossbar of the second plurality of crossbars.
4. The system of claim 3, wherein the synchronization network comprises:
a plurality of direct point-to-point connections coupling adjacent configurable circuits in the array of the plurality of configurable circuits of the group of configurable circuits.
5. The system of claim 1, wherein each configurable circuit of the plurality of configurable circuits comprises:
a configurable computing circuit;
a control circuit coupled to the configurable computing circuit, the control circuit comprising: a memory control circuit; a thread control circuit; and a plurality of control registers;
a first memory circuit coupled to the configurable computing circuit;
a plurality of synchronization network inputs coupled to the configurable computing circuit and the synchronization network;
a plurality of synchronization network outputs coupled to the configurable computing circuit and the synchronization network;
an asynchronous network input queue coupled to the asynchronous packet network;
an asynchronous network output queue coupled to the asynchronous packet network;
a second, configuration memory circuit coupled to the configurable computing circuit, the control circuit, the synchronization network inputs, and the synchronization network outputs, the configuration memory circuit comprising:
a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuit; and
a second, instruction and instruction index memory storing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of the synchronization network inputs.
6. The system of claim 5, wherein the second instruction and instruction index memory further stores a plurality of spoke instructions and datapath configuration instruction indices to select a current datapath configuration instruction for the configurable computing circuit.
7. The system of claim 5, wherein the second instruction and instruction index memory further stores a plurality of spoke instructions and datapath configuration instruction indices to select a next datapath configuration instruction for a next configurable computational circuit.
8. The system of claim 5, wherein the second instruction and instruction index memory further stores a plurality of spoke instructions and datapath configuration instruction indices to select a synchronization network output of the plurality of synchronization network outputs.
9. The system of claim 5, wherein each configurable circuit of the plurality of configurable circuits further comprises:
a configuration memory multiplexer coupled to the first instruction memory and the second instruction and instruction index memory.
10. The system of claim 9, wherein the current datapath configuration instruction is selected using an instruction index from the second instruction and instruction index memory when a select input of the configuration memory multiplexer has a first setting.
11. The system of claim 10, wherein when the select input of the configuration memory multiplexer has a second setting different from the first setting, the current datapath configuration instruction is selected using an instruction index from the master synchronization input.
12. The system of claim 5, wherein the second instruction and instruction index memory further stores a plurality of spoke instructions and datapath configuration instruction indices to configure portions of the configurable circuit independent of the current datapath instruction.
13. The system of claim 12, wherein selected ones of the plurality of spoke instructions and datapath configuration instruction indices are selected according to a modulo spoke count.
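The spoke-selection behavior of claims 12-13 can be illustrated with a brief Python sketch. All names below are hypothetical illustrations, not terms from the claims: the second instruction and instruction index memory is modeled as a list (a "spoke RAM") indexed by the current time slice modulo the spoke count, so each configurable circuit cycles through its spoke instructions independently of the current datapath instruction.

```python
# Hypothetical sketch of modulo spoke-count selection (claims 12-13).
# spoke_ram entries pair a spoke instruction with a datapath
# configuration instruction index; names are illustrative only.

def select_spoke_entry(spoke_ram, time_slice):
    """Return the (spoke_instruction, datapath_index) entry for this slice."""
    spoke_count = len(spoke_ram)
    return spoke_ram[time_slice % spoke_count]

spoke_ram = [("cfg_inputs_A", 0), ("cfg_inputs_B", 3), ("cfg_inputs_C", 1)]

# The selection wraps around: slice 4 maps to entry 4 % 3 == 1.
assert select_spoke_entry(spoke_ram, 0) == ("cfg_inputs_A", 0)
assert select_spoke_entry(spoke_ram, 4) == ("cfg_inputs_B", 3)
```

The modulo index means the spoke RAM needs no explicit wrap logic; the circuit's configuration repeats with a period equal to the spoke count.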
14. The system of claim 5, wherein each configurable circuit of the plurality of configurable circuits further comprises:
a conditional logic circuit coupled to the configurable computing circuit.
15. The system of claim 14, wherein the conditional logic circuit is to provide a first next instruction or instruction index for execution by a next configurable circuit, or to provide a second, different next instruction or instruction index for execution by the next configurable circuit, depending on an output from the configurable computing circuit.
16. The system of claim 14, wherein the conditional logic circuit is to modify the next datapath instruction index provided on a selected one of the plurality of synchronization network outputs as a function of an output from the configurable computing circuit.
17. The system of claim 14, wherein the conditional logic circuit, depending on an output from the configurable computing circuit, is to provide a conditional branch by modifying the next datapath instruction or a next datapath instruction index provided on a selected one of the plurality of synchronization network outputs.
18. The system of claim 14, wherein, when enabled, the conditional logic circuit is to provide a conditional branch by OR'ing the output from the configurable computing circuit with the least significant bits of the next datapath instruction or the next datapath instruction index to specify the next datapath instruction or datapath instruction index.
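The conditional-branch mechanism of claim 18 can be sketched in a few lines of Python. The function and parameter names are hypothetical; the sketch assumes a single low bit is OR'ed in, so branch targets are laid out at even indices and a nonzero computation output steers execution to the adjacent odd index.

```python
# Sketch of claim 18's conditional branch: OR the computation output's
# low bit(s) into the next datapath instruction index. Names are
# illustrative, not from the claims.

def branch_index(next_index, alu_output, enabled, lsb_bits=1):
    """Return the datapath instruction index to execute next."""
    if not enabled:
        return next_index          # unconditional: index passes through
    mask = (1 << lsb_bits) - 1
    return next_index | (alu_output & mask)

# A zero output leaves the index unchanged (fall-through); a nonzero
# output selects the odd-numbered branch target.
assert branch_index(4, 0, enabled=True) == 4
assert branch_index(4, 1, enabled=True) == 5
assert branch_index(4, 1, enabled=False) == 4
```

Using an OR rather than an adder keeps the branch logic to a handful of gates, at the cost of constraining where the two branch targets may be placed in instruction memory.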
19. The system of claim 5, wherein the plurality of synchronization network inputs comprises:
a plurality of input registers coupled to a plurality of communication lines of the synchronization network; and
an input multiplexer coupled to the plurality of input registers and the second instruction and instruction index memory to select the master synchronization input.
20. The system of claim 5, wherein the plurality of synchronization network outputs comprises:
a plurality of output registers coupled to a plurality of communication lines of the synchronization network; and
an output multiplexer coupled to the configurable computing circuit to select an output from the configurable computing circuit.
21. The system of claim 5, wherein each configurable circuit of the plurality of configurable circuits further comprises:
an asynchronous fabric state machine coupled to the asynchronous network input queue and the asynchronous network output queue, the asynchronous fabric state machine to decode input packets received from the asynchronous packet network and assemble output packets for transmission over the asynchronous packet network.
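The packet assembly and decoding performed by the asynchronous fabric state machine of claim 21 can be sketched as follows. The packet layout here is entirely hypothetical (the claims do not specify field widths): a 16-bit destination endpoint, an 8-bit message type, and an 8-bit payload length, followed by the payload.

```python
import struct

# Hypothetical packet layout, illustrative only: the claims do not
# define field sizes. Big-endian: u16 destination, u8 type, u8 length.
HEADER = struct.Struct(">HBB")

def assemble_packet(dest, msg_type, payload):
    """Build an output packet for transmission over the packet network."""
    return HEADER.pack(dest, msg_type, len(payload)) + payload

def decode_packet(packet):
    """Decode an input packet into (destination, type, payload)."""
    dest, msg_type, length = HEADER.unpack_from(packet)
    return dest, msg_type, packet[HEADER.size:HEADER.size + length]

pkt = assemble_packet(7, 2, b"\x01\x02")
assert decode_packet(pkt) == (7, 2, b"\x01\x02")
```

In hardware this round-trip is a small state machine stepping through header and payload flits; the Python pack/unpack pair just makes the symmetry explicit.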
22. The system of claim 5, wherein each configurable circuit of the plurality of configurable circuits further comprises:
a direct path connection between the plurality of input registers and the plurality of output registers.
23. The system of claim 22, wherein the direct path connection of a first configurable circuit of the plurality of configurable circuits provides a direct point-to-point connection to transfer data received over the synchronization network from a second configurable circuit of the plurality of configurable circuits to a third configurable circuit of the plurality of configurable circuits.
24. The system of claim 5, wherein each configurable circuit of the plurality of configurable circuits further comprises:
arithmetic, logical, and bit operation circuitry for performing at least one integer operation selected from the group consisting of: signed and unsigned addition, absolute value, negation, logical negation, addition with negation, subtraction A-B, reverse subtraction B-A, signed and unsigned greater-than-or-equal, signed and unsigned less-than-or-equal, equal and not-equal comparisons, logical AND operations, logical OR operations, logical XOR operations, logical NAND operations, logical NOR operations, and conversions between integer and floating point.
25. The system of claim 5, wherein each configurable circuit of the plurality of configurable circuits further comprises:
arithmetic, logical, and bit operation circuitry for performing at least one floating point operation selected from the group consisting of: signed and unsigned addition, absolute value, negation, logical negation, addition with negation, subtraction A-B, reverse subtraction B-A, signed and unsigned greater-than-or-equal, signed and unsigned less-than-or-equal, equal and not-equal comparisons, logical AND operations, logical OR operations, logical XOR operations, logical NAND operations, logical NOR operations, and conversions between integer and floating point.
26. The system of claim 5, wherein each configurable circuit of the plurality of configurable circuits further comprises:
a multiply and shift operation circuit for performing at least one integer operation selected from the group consisting of: multiplication, shifting, passing inputs, signed and unsigned multiplication, signed and unsigned right shifting, signed and unsigned left shifting, bit order reversal, permutation, conversion between integer and floating point, and combinations thereof.
27. The system of claim 5, wherein each configurable circuit of the plurality of configurable circuits further comprises:
a multiply and shift operation circuit for performing at least one floating point operation selected from the group consisting of: multiplication, shifting, passing inputs, signed and unsigned multiplication, signed and unsigned right shifting, signed and unsigned left shifting, bit order reversal, permutation, conversion between integer and floating point, and combinations thereof.
28. The system of claim 5, wherein the scheduling interface circuit is to receive a work descriptor packet over the first interconnection network and, in response to the work descriptor packet, generate one or more data and control packets to the plurality of configurable circuits to configure the plurality of configurable circuits to perform selected computations.
29. The system of claim 5, wherein each configurable circuit of the plurality of configurable circuits further comprises:
a flow control circuit coupled to the asynchronous network output queue, the flow control circuit to generate a stop signal when a predetermined threshold is reached in the asynchronous network output queue.
30. The system of claim 29, wherein each asynchronous network output queue stops outputting data packets on the asynchronous packet network in response to the stop signal.
31. The system of claim 29, wherein each configurable computing circuit, in response to the stop signal, stalls execution after its current instruction is completed.
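The flow-control behavior of claims 29-31 amounts to threshold-based backpressure on the output queue. A minimal Python sketch, with hypothetical names and an arbitrary threshold, would be:

```python
from collections import deque

class AsyncOutputQueue:
    """Sketch of claims 29-31: the flow control circuit raises a stop
    signal once queue depth reaches a predetermined threshold; the
    configurable circuit then stalls after its current instruction."""

    def __init__(self, stop_threshold):
        self.q = deque()
        self.stop_threshold = stop_threshold

    @property
    def stop(self):
        # Stop signal asserted while depth is at or above the threshold.
        return len(self.q) >= self.stop_threshold

    def push(self, packet):
        self.q.append(packet)

    def pop(self):
        return self.q.popleft()

oq = AsyncOutputQueue(stop_threshold=2)
oq.push("pkt0")
assert not oq.stop
oq.push("pkt1")
assert oq.stop        # producers stall after finishing the current instruction
oq.pop()
assert not oq.stop    # draining the queue releases the stall
```

Asserting stop before the queue is completely full leaves headroom for in-flight packets produced between the signal being raised and producers actually stalling.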
32. The system of claim 5, wherein a first plurality of configurable circuits in the array of the plurality of configurable circuits are coupled in a first predetermined order through the synchronization network to form a first synchronization domain; and wherein a second plurality of configurable circuits in the array of the plurality of configurable circuits are coupled in a second predetermined order through the synchronization network to form a second synchronization domain.
33. The system of claim 32, wherein the first synchronization domain is configured to generate a continue message for transmission to the second synchronization domain over the asynchronous packet network.
34. The system of claim 32, wherein the second synchronization domain is configured to generate a completion message for transmission to the first synchronization domain over the asynchronous packet network.
35. The system of claim 5, wherein the plurality of control registers store a completion table having a first data completion count.
36. The system of claim 35, wherein the plurality of control registers further stores the completion table having a second iteration count.
37. The system of claim 35, wherein the plurality of control registers further stores a loop table having a plurality of thread identifiers and, for each thread identifier, a next thread identifier for execution after a current thread is executed.
38. The system of claim 37, wherein the plurality of control registers further store an identification of a first iteration and an identification of a last iteration in the loop table.
39. The system of claim 35, wherein the control circuitry is to queue a thread for execution when the completion count for the thread's thread identifier is decremented to zero and its thread identifier is the next thread identifier for execution.
40. The system of claim 35, wherein the control circuitry is to queue a thread for execution when a completion count for the thread indicates completion of any data dependencies for the thread's thread identifier.
41. The system of claim 35, wherein the completion count indicates, for each selected thread of a plurality of threads, a predetermined number of completion messages received before execution of the selected thread.
42. The system of claim 35, wherein the plurality of control registers further store a completion table having a plurality of types of thread identifiers, wherein each type of thread identifier indicates a loop level for loop and nested loop execution.
43. The system of claim 35, wherein the plurality of control registers further store a completion table having a loop count for the number of active loop threads, and wherein in response to receiving an asynchronous fabric message that returns a thread identifier to a thread identifier pool, the control circuitry decrements the loop count and transmits an asynchronous fabric completion message when the loop count reaches zero.
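The completion-table scheduling of claims 35-43 can be sketched in Python. Field and method names are hypothetical: each thread identifier carries a completion count of outstanding data dependencies, each received completion message decrements it, and the thread is queued for execution only when the count reaches zero.

```python
class CompletionTable:
    """Sketch of the claims' completion table: a thread runs only after
    a predetermined number of completion messages has been received."""

    def __init__(self):
        self.counts = {}   # thread identifier -> remaining dependencies
        self.ready = []    # identifiers queued for execution

    def add_thread(self, tid, dependency_count):
        self.counts[tid] = dependency_count

    def complete(self, tid):
        """Process one completion message for the given identifier."""
        self.counts[tid] -= 1
        if self.counts[tid] == 0:
            self.ready.append(tid)

ct = CompletionTable()
ct.add_thread(tid=0, dependency_count=2)
ct.complete(0)
assert ct.ready == []      # one dependency still outstanding
ct.complete(0)
assert ct.ready == [0]     # count hit zero: thread queued to run
```

The loop count of claim 43 works the same way one level up: each identifier returned to the pool decrements it, and reaching zero triggers the asynchronous fabric completion message.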
44. The system of claim 35, wherein the plurality of control registers further store a top of a thread identifier stack to allow each type of thread identifier to access a private variable for a selected loop.
45. The system of claim 5, wherein the control circuit further comprises:
a continuation queue; and
a re-entry queue.
46. The system of claim 45, wherein the continuation queue stores one or more thread identifiers for compute threads whose completion counts allow execution but that do not yet have an assigned thread identifier.
47. The system of claim 45, wherein the re-entry queue stores one or more thread identifiers for compute threads whose completion counts allow execution and that have an assigned thread identifier.
48. The system of claim 45, wherein any thread in the re-entry queue having a thread identifier is executed before any thread in the continuation queue having a thread identifier is executed.
49. The system of claim 5, wherein the control circuit further comprises:
a priority queue, wherein any thread in the priority queue having a thread identifier executes before any thread in the continuation queue or the re-entry queue having a thread identifier executes.
50. The system of claim 5, wherein the control circuit further comprises:
a run queue, wherein any thread in the run queue having a thread identifier executes at the spoke count at which its thread identifier occurs.
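The ordering imposed by claims 45-49 is a strict priority among the queues: priority queue first, then the re-entry queue (threads that already hold a thread identifier), then the continuation queue (threads still awaiting one). A hypothetical sketch of that selection logic:

```python
from collections import deque

class ThreadSelector:
    """Sketch of the claims' queue ordering: priority, then re-entry,
    then continuation. Names are illustrative only."""

    def __init__(self):
        self.priority = deque()
        self.re_entry = deque()
        self.continuation = deque()

    def next_thread(self):
        """Pop from the highest-priority non-empty queue, else None."""
        for q in (self.priority, self.re_entry, self.continuation):
            if q:
                return q.popleft()
        return None

s = ThreadSelector()
s.continuation.append("t_new")    # ready but no identifier assigned yet
s.re_entry.append("t_loop")       # already holds an identifier
s.priority.append("t_urgent")
assert s.next_thread() == "t_urgent"
assert s.next_thread() == "t_loop"
assert s.next_thread() == "t_new"
```

Favoring re-entry over continuation keeps threads that already hold scarce thread identifiers moving, so identifiers return to the pool sooner for the waiting continuation threads.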
51. The system of claim 5, wherein the control circuitry is to self-schedule a computing thread for execution.
52. The system of claim 5, wherein the control circuitry is to order compute threads for execution.
53. The system of claim 5, wherein the control circuitry is to order loop computation threads for execution.
54. The system of claim 5, wherein the control circuitry is to begin executing a computing thread in response to one or more completion signals from data dependencies.
55. The system of claim 1, wherein the processor comprises:
a processor core to execute received instructions; and
core control circuitry coupled to the processor core, the core control circuitry to automatically schedule instructions for execution by the processor core in response to a received work descriptor packet or in response to a received event packet.
56. The system of claim 1, wherein the processor comprises:
a processor core to execute a thread creation instruction; and
core control circuitry coupled to the processor core, the core control circuitry to automatically schedule the thread creation instruction for execution by the processor core and to generate one or more work descriptor data packets destined for a second processor or the group of configurable circuits to execute a corresponding plurality of execution threads.
57. The system of claim 1, wherein the processor comprises:
a processor core to execute a thread creation instruction; and
core control circuitry coupled to the processor core, the core control circuitry to schedule the thread creation instruction for execution by the processor core, reserve a predetermined amount of memory space in thread control memory to store return arguments, and generate one or more work descriptor data packets destined for a second processor or the group of configurable circuits to execute a corresponding plurality of execution threads.
58. The system of claim 1, wherein the processor comprises:
a processor core; and
a core control circuit, comprising:
an interconnection network interface;
a core control memory coupled to the interconnection network interface;
an execution queue coupled to the core control memory;
control logic and thread selection circuitry coupled to the execution queue and the core control memory; and
an instruction cache coupled to the control logic and thread selection circuitry and the processor core.
59. The system of claim 1, wherein the processor comprises:
a processor core; and
a core control circuit, comprising:
an interconnection network interface;
a thread control memory coupled to the interconnection network interface;
a network response memory coupled to the interconnection network interface;
an execution queue coupled to the thread control memory;
control logic and thread selection circuitry coupled to the execution queue and the thread control memory;
an instruction cache coupled to the control logic and thread selection circuitry and the processor core; and
a command queue coupled to the processor core and the interconnection network interface.
60. The system of claim 1, wherein the processor comprises:
a processor core; and
a core control circuit, comprising:
a thread control memory comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument;
an execution queue coupled to the thread control memory; and
control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, and periodically select the thread identifier for execution by the processor core of instructions of the execution thread, the processor core using data stored in the data cache or general purpose register.
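The thread lifecycle in claim 60, assigning a free thread identifier, storing the received program count and arguments in thread control memory, and enqueueing the identifier for execution, can be sketched as follows. Class and field names are hypothetical.

```python
from collections import deque

class CoreControl:
    """Sketch of claim 60: a thread identifier pool, a thread control
    memory indexed by identifier, and an execution queue."""

    def __init__(self, num_ids):
        self.id_pool = deque(range(num_ids))  # available thread identifiers
        self.thread_memory = {}               # tid -> (program_count, args)
        self.exec_queue = deque()

    def start_thread(self, program_count, args):
        """Assign an identifier, record thread state, enqueue it."""
        tid = self.id_pool.popleft()
        self.thread_memory[tid] = (program_count, args)
        self.exec_queue.append(tid)
        return tid

    def select(self):
        """Periodically pick the next identifier and look up its state."""
        tid = self.exec_queue.popleft()
        return tid, self.thread_memory[tid]

cc = CoreControl(num_ids=4)
tid = cc.start_thread(program_count=0x100, args=(1, 2))
assert cc.select() == (tid, (0x100, (1, 2)))
```

Using the thread identifier as the index into thread control memory is what lets the hardware find a thread's program count and registers in a single lookup once the identifier is selected from the queue.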
61. The system of claim 1, wherein the processor comprises:
a processor core; and
a core control circuit, comprising:
a thread control memory comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a program count register to store a received program count, and a thread status register to store an active status or a suspended status for each thread identifier of the plurality of thread identifiers;
an execution queue coupled to the thread control memory; and
control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue when the thread identifier has an active status, and periodically select the thread identifier for the processor core to execute instructions of the execution thread while the active status remains unchanged, until the execution thread is completed.
62. The system of claim 1, wherein the processor comprises:
a processor core; and
a core control circuit, comprising:
a thread control memory comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a program count register to store a received program count, and a thread status register to store an active status or a suspended status for each thread identifier of the plurality of thread identifiers;
an execution queue coupled to the thread control memory; and
control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue when the thread identifier has an active status, periodically select the thread identifier for the processor core to execute instructions of the execution thread while the active status remains unchanged, and suspend thread execution by not returning the thread identifier to the execution queue when the thread identifier has a suspended status.
63. The system of claim 1, wherein the processor comprises:
a processor core to execute a plurality of instructions; and
core control circuitry coupled to the processor core, the core control circuitry comprising:
an interconnection network interface coupleable to the interconnection network to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments;
a thread control memory coupled to the interconnection network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument;
an execution queue coupled to the thread control memory;
control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; and
an instruction cache coupled to the processor core and the control logic and thread selection circuitry to receive the initial program count and provide a corresponding one of the plurality of instructions to the processor core for execution.
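Claim 63's front end decodes an arriving work descriptor packet into an execution thread with an initial program count and arguments. A hypothetical sketch of that decoding, with an assumed layout (a 32-bit initial program count followed by 64-bit arguments; the claims specify no field widths):

```python
import struct

# Hypothetical work descriptor layout, illustrative only:
# big-endian u32 initial program count, then zero or more u64 arguments.

def decode_work_descriptor(packet):
    """Split a work descriptor packet into (program_count, [args])."""
    (program_count,) = struct.unpack_from(">I", packet)
    args = [struct.unpack_from(">Q", packet, 4 + 8 * i)[0]
            for i in range((len(packet) - 4) // 8)]
    return program_count, args

pkt = struct.pack(">IQQ", 0x2000, 11, 22)
assert decode_work_descriptor(pkt) == (0x2000, [11, 22])
```

After decoding, the program count lands in the program count register and the arguments in the general purpose registers, indexed by the thread identifier the control logic assigns.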
64. The system of claim 1, wherein the processor comprises:
a core control circuit, comprising:
an interconnection network interface coupleable to the interconnection network to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments;
a thread control memory coupled to the interconnection network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument;
an execution queue coupled to the thread control memory;
control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, periodically select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; and
an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution; and
a processor core coupled to the instruction cache of the core control circuitry, the processor core to execute the corresponding instruction.
65. The system of claim 1, wherein the processor comprises:
a core control circuit, comprising:
an interconnection network interface coupleable to the interconnection network to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments;
a thread control memory coupled to the interconnection network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, and a general purpose register to store a received argument;
an execution queue coupled to the thread control memory;
control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread;
an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution; and
a command queue; and
a processor core coupled to the instruction cache and the command queue of the core control circuitry, the processor core to execute the corresponding instruction.
66. The system of claim 1, wherein the processor comprises:
a core control circuit, comprising:
an interconnection network interface coupleable to the interconnection network to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments;
a thread control memory coupled to the interconnection network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, and a general purpose register to store a received argument;
an execution queue coupled to the thread control memory;
control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; and
an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution; and
a processor core coupled to the instruction cache of the core control circuitry, the processor core to execute the corresponding instruction.
67. The system of claim 1, wherein the processor comprises:
a core control circuit, comprising:
an interconnection network interface coupleable to the interconnection network to receive a work descriptor packet, decode the received work descriptor packet into an execution thread having an initial program count and any received arguments, and encode work descriptor packets for transmission to other processing elements;
a thread control memory coupled to the interconnection network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, and a general purpose register to store a received argument;
an execution queue coupled to the thread control memory;
a network response memory coupled to the interconnection network interface;
control logic and thread selection circuitry coupled to the execution queue, the thread control memory, and the instruction cache, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread;
an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution; and
a command queue storing one or more commands to generate one or more work descriptor packets; and
a processor core coupled to the instruction cache and the command queue of the core control circuitry, the processor core to execute the corresponding instruction.
68. The system of claim 1, wherein the processor comprises:
a processor core; and
core control circuitry coupled to the processor core.
69. The system of claim 68, wherein the core control circuitry comprises:
an interconnection network interface coupleable to an interconnection network, the interconnection network interface to receive and decode a work descriptor packet into a thread of execution having an initial program count and any received arguments, the interconnection network interface further to receive and decode an event packet into an event identifier and any received arguments.
70. The system of claim 69, wherein the interconnection network interface is further to generate a point-to-point event data message or a broadcast event data message.
71. The system of claim 69, wherein the core control circuitry comprises:
control logic and thread selection circuitry coupled to the interconnect network interface, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread.
72. The system of claim 71, wherein the core control circuitry comprises:
a thread control memory having a plurality of registers, the plurality of registers comprising:
a thread identifier pool register to store a plurality of thread identifiers;
a thread state register;
a program count register to store the received initial program count; and
a general register to store the received argument.
73. The system of claim 72, wherein the control logic and thread selection circuitry are further to return a corresponding thread identifier for the selected thread to the thread identifier pool register in response to the processor core executing a return instruction.
74. The system of claim 72, wherein the control logic and thread selection circuitry are further to clear the registers of the thread control memory indexed by the corresponding thread identifier of the selected thread in response to the processor core executing a return instruction.
75. The system of claim 72, wherein the thread control memory further comprises one or more registers selected from the group consisting of:
a pending fiber return count register; a return argument buffer or register; a thread return argument linked list register; a custom atomic transaction identifier register; a data cache; an event status register; an event mask register; and combinations thereof.
76. The system of claim 72, wherein the interconnection network interface is further to store the execution thread with the initial program count and any received arguments in the thread control memory using a thread identifier as an index to the thread control memory.
77. The system of claim 72, wherein the core control circuitry further comprises:
an execution queue coupled to the thread control memory, the execution queue storing one or more thread identifiers.
78. The system of claim 77, wherein the control logic and thread selection circuitry are further to place the thread identifier in the execution queue, select the thread identifier from the execution queue for execution, and access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread for execution by the processor core.
79. The system of claim 77, wherein the control logic and thread selection circuitry are further to successively select a next thread identifier from the execution queue to execute a single instruction of a corresponding execution thread.
80. The system of claim 77, wherein the control logic and thread selection circuitry are further to perform a round-robin or barrel selection of a next thread identifier from the execution queue among the plurality of thread identifiers, each for executing a single instruction of a corresponding execution thread until the execution thread is completed.
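The round-robin ("barrel") selection of claims 79 and 80 can be sketched in software. This is an illustrative model only; the function and data-structure names (`barrel_schedule`, `execution_queue`, `threads`) are hypothetical and not taken from the claims:

```python
from collections import deque

def barrel_schedule(execution_queue, threads):
    """Round-robin (barrel) selection: each pass executes a single
    instruction of the thread at the head of the execution queue."""
    trace = []
    while execution_queue:
        tid = execution_queue.popleft()          # select next thread identifier
        thread = threads[tid]
        instr = thread["instructions"].pop(0)    # execute a single instruction
        trace.append((tid, instr))
        if thread["instructions"]:               # not completed: requeue
            execution_queue.append(tid)
    return trace

# Hypothetical two-thread example: instructions interleave one per pass,
# producing [(0, 'i0a'), (1, 'i1a'), (0, 'i0b')].
threads = {
    0: {"instructions": ["i0a", "i0b"]},
    1: {"instructions": ["i1a"]},
}
q = deque([0, 1])
```

Each pass pops the head thread identifier, executes one instruction of that thread, and requeues the identifier until the thread completes, which is the barrel behavior the claim recites.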
81. The system of claim 77, wherein the control logic and thread selection circuitry are further configured to assign an active state or a suspended state to a thread identifier.
82. The system of claim 77, wherein the control logic and thread selection circuitry are further to assign a priority status to a thread identifier.
83. The system of claim 77, wherein the control logic and thread selection circuitry are further to return a corresponding thread identifier having an assigned active state and an assigned priority to the execution queue after execution of a corresponding instruction.
84. The system of claim 77, wherein the execution queue further comprises:
a first priority queue; and
a second priority queue.
85. The system of claim 84, wherein the control logic and thread selection circuitry further comprises:
thread selection control circuitry coupled to the execution queue, the thread selection control circuitry to select a thread identifier from the first priority queue at a first frequency and to select a thread identifier from the second priority queue at a second frequency, the second frequency being lower than the first frequency.
86. The system of claim 85, wherein the thread selection control circuitry is to determine the second frequency as a skip count starting with selecting a thread identifier from the first priority queue.
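The two-queue selection of claims 84 through 86 can be modeled as a skip count: after a fixed number of first-queue selections, one identifier is taken from the second queue. A minimal sketch under that reading (names such as `select_with_skip_count` are hypothetical, not from the claims):

```python
from collections import deque

def select_with_skip_count(first_q, second_q, skip_count, n):
    """Select thread identifiers mostly from the first (higher-priority)
    queue; after every `skip_count` first-queue selections, take one
    identifier from the second (lower-priority) queue instead."""
    selections, since_low = [], 0
    for _ in range(n):
        if since_low >= skip_count and second_q:
            tid = second_q.popleft()
            second_q.append(tid)      # requeue for a later turn
            since_low = 0
        elif first_q:
            tid = first_q.popleft()
            first_q.append(tid)
            since_low += 1
        else:
            break
        selections.append(tid)
    return selections

hi = deque(["h0", "h1"])
lo = deque(["l0"])
# With skip_count=2: two high-priority picks, then one low-priority pick.
```

The second frequency is thus lower than the first by construction: the low-priority queue is consulted only once per `skip_count` high-priority selections.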
87. The system of claim 77, wherein the core control circuitry further comprises:
a network command queue coupled to the processor core.
88. The system of claim 87, wherein the processor core is to execute a fiber create instruction to generate one or more commands that cause the interconnection network interface, via the network command queue, to generate one or more call work descriptor packets destined for another processor core or a hybrid thread fabric circuit.
89. The system of claim 88, wherein the core control circuitry, in response to the processor core executing a fiber create instruction, is to reserve a predetermined amount of memory space in the general purpose register or a return argument register.
90. The system of claim 77, wherein, in response to generating one or more call work descriptor packets destined for another processor core or a hybrid thread fabric, the core control circuitry is to store a thread return count in a thread return register.
91. The system of claim 90 wherein in response to receiving a return data packet, the core control circuitry is to decrement the thread return count stored in the thread return register.
92. The system of claim 91, wherein, in response to the thread return count in the thread return register decrementing to zero, the core control circuitry is to change a suspended state of a corresponding thread identifier to an active state for subsequent execution of a thread return instruction to complete a created fiber or thread.
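The return-count mechanism of claims 90 through 92 amounts to a join counter: the count is set when call packets are sent, decremented once per return packet, and the parent thread is reactivated at zero. An illustrative model (class and field names are hypothetical, not from the claims):

```python
class ThreadReturnTracker:
    """Tracks outstanding fiber returns for a parent thread: the return
    count is stored when call work descriptor packets are sent,
    decremented per return packet, and the parent transitions from
    suspended to active when the count reaches zero."""
    def __init__(self):
        self.return_count = {}   # thread identifier -> outstanding returns
        self.state = {}          # thread identifier -> "active" | "suspended"

    def send_calls(self, tid, num_calls):
        self.return_count[tid] = num_calls
        self.state[tid] = "suspended"    # parent waits for fiber returns

    def receive_return(self, tid):
        self.return_count[tid] -= 1
        if self.return_count[tid] == 0:  # all created fibers have returned
            self.state[tid] = "active"   # eligible to execute the thread return
```

Under this model the parent never polls; the state change is driven entirely by arrival of the final return data packet.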
93. The system of claim 71, wherein the core control circuitry further comprises:
an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution by the processor core.
94. The system of claim 71, wherein the control logic and thread selection circuitry are further to assign an initial valid state to the execution thread.
95. The system of claim 71, wherein the control logic and thread selection circuitry are further to allocate a suspended state to the execution thread in response to the processor core executing a memory load instruction or a memory store instruction.
96. The system of claim 71, wherein the control logic and thread selection circuitry are further to end executing a selected thread in response to the processor core executing a return instruction.
97. The system of claim 71, wherein the interconnection network interface is further to generate a return job descriptor packet in response to the processor core executing a return instruction.
98. The system of claim 68, wherein the core control circuitry further comprises:
a network response memory.
99. The system of claim 98, wherein the network response memory comprises:
a memory request register; and
a thread identifier and a transaction identifier register.
100. The system of claim 99, wherein the network response memory further comprises one or more registers selected from the group consisting of:
a request cache line index register; a byte register; a general purpose register index and type register; and combinations thereof.
101. The system of claim 71, wherein the control logic and thread selection circuitry are further to respond to received event data packets with an event mask stored in the event mask register.
102. The system of claim 71, wherein the control logic and thread selection circuitry are further to determine an event number corresponding to a received event data packet.
103. The system of claim 71, wherein the control logic and thread selection circuitry are further to change the state of a thread identifier from suspended to active to resume execution of a corresponding thread of execution in response to a received event data packet.
104. The system of claim 71, wherein the control logic and thread selection circuitry are further to change the state of a thread identifier from suspended to active to resume execution of a corresponding thread of execution in response to an event number of a received event data packet.
105. The system of claim 71, wherein the interconnection network interface comprises:
an input queue;
a packet decoder circuit coupled to the input queue, the control logic and thread selection circuit, and the thread control memory;
an output queue; and
a packet encoder circuit coupled to the output queue, the network response memory, and the network command queue.
106. The system of claim 71, wherein the core control circuitry further comprises:
data path control circuitry for controlling access size through the first interconnection network.
107. The system of claim 71, wherein the core control circuitry further comprises:
data path control circuitry to increase or decrease a memory load access size in response to a time-averaged usage level.
108. The system of claim 71, wherein the core control circuitry further comprises:
data path control circuitry to increase or decrease a memory storage access size in response to a time-averaged usage level.
109. The system of claim 71, wherein the control logic and thread selection circuitry are further to increase a size of a memory load access request to correspond to a cache line boundary of the data cache.
110. The system of claim 71, wherein the core control circuitry further comprises:
system call circuitry to generate one or more system calls to a host processor.
111. The system of claim 110, wherein the system call circuitry further comprises:
a plurality of system call credit registers storing a predetermined credit count to modulate the number of system calls in any predetermined time period.
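The credit registers of claim 111 implement a simple rate limiter: each system call consumes a credit, and the credit count is replenished once per predetermined time period. A minimal sketch under these assumptions (`SystemCallCredits` and its method names are hypothetical):

```python
class SystemCallCredits:
    """Credit-based limiter for host system calls: each call consumes a
    credit, and credits are restored to the predetermined count at the
    start of each period, bounding calls per period."""
    def __init__(self, credits_per_period):
        self.credits_per_period = credits_per_period
        self.credits = credits_per_period

    def try_system_call(self):
        if self.credits == 0:
            return False          # modulated: defer until the next period
        self.credits -= 1
        return True

    def new_period(self):
        self.credits = self.credits_per_period
```

With a credit count of 2, at most two system calls succeed per period; further attempts are deferred until `new_period()` replenishes the credits.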
112. The system of claim 71, wherein the core control circuitry is further to generate a command to cause a command queue of the interconnect network interface to copy and transmit all data corresponding to a selected thread identifier from the thread control memory to monitor thread status in response to a request from a host processor.
113. The system of claim 68, wherein the processor core is to execute a wait or a non-wait fiber join instruction.
114. The system of claim 68, wherein the processor core is to execute a fiber join all instruction.
115. The system of claim 68, wherein the processor core is to execute a non-cache read or load instruction to specify a general purpose register to store data received from memory.
116. The system of claim 68, wherein the processor core is to execute a non-cache write or store instruction to specify data in a general purpose register for storage in memory.
117. The system of claim 68, wherein the core control circuitry is to assign a transaction identifier to any load or store request to memory and correlate the transaction identifier with a thread identifier.
118. The system of claim 68, wherein the processor core is to execute a first thread priority instruction to assign a first priority to an execution thread having a corresponding thread identifier.
119. The system of claim 118, wherein the processor core is to execute a second thread priority instruction to assign a second priority to an execution thread having a corresponding thread identifier.
120. The system of claim 68, wherein the processor core is to execute a custom atomic return instruction to complete a thread of execution of a custom atomic operation.
121. The system of claim 68, wherein the processor core, in conjunction with a memory controller, is to perform floating point atomic memory operations.
122. The system of claim 68, wherein the processor core, in conjunction with a memory controller, is to perform custom atomic memory operations.
123. The system of claim 1, wherein data communication through the interconnection network occurs using a global virtual address space across all nodes, independent of a physical address space.
124. The system of claim 1, wherein data communication through the interconnection network is conducted using a virtual node identifier.
125. The system of claim 1, further comprising:
one or more memory circuits operable in a plurality of modes, the plurality of modes comprising: private and non-shared; shared and interleaved; and shared and non-interleaved.
126. The system of claim 1, wherein data communications over the interconnection network are conducted using split header and payload configurations in order to pipeline multiple communications to multiple different destinations.
127. The system of claim 1, wherein the interconnection network is to use the split header and payload configuration for delayed payload switching.
128. The system of claim 1, wherein the interconnection network is to route multiple data payloads as consecutive data bursts using a single header.
129. The system of claim 1, wherein the interconnection network is to interleave a first acknowledgement message to a destination in an unused header field of a second message to the destination.
130. The system of claim 1, wherein the interconnection network is adapted for power gating or clock gating based on load requirements.
131. The system of claim 1, wherein each configurable circuit of the plurality of configurable circuits further comprises:
a plurality of delay registers for synchronizing data transmission and reception for thread execution.
132. The system of claim 1, wherein the at least one configurable circuit group further comprises:
a scheduling interface for receiving a work descriptor packet and distributing the work descriptor packet to a selected configurable circuit of the plurality of configurable circuits based on load balancing.
133. The system of claim 1, wherein the at least one configurable circuit group further comprises:
a scheduling interface to receive one or more completion messages from a configurable circuit of the plurality of configurable circuits and generate a return job descriptor packet having a return value.
134. The system of claim 1, wherein each configurable circuit of the plurality of configurable circuits is further to select valid data elements with an execution mask.
135. A system, comprising:
a first interconnection network;
a processor coupled to the interconnection network;
a host interface coupled to the interconnection network; and
at least one configurable circuit group coupled to the interconnection network, the configurable circuit group comprising:
a plurality of configurable circuits arranged in an array, each configurable circuit comprising:
a configurable computing circuit;
a first memory circuit coupled to the configurable computing circuit;
a plurality of synchronous network inputs and outputs coupled to the configurable computing circuitry;
an asynchronous network input queue and an asynchronous network output queue;
a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output, the second configuration memory comprising:
a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and
a second instruction and instruction index memory storing: a plurality of spoke instructions and datapath configuration instruction indices for selecting a master synchronization input of the synchronization network input, for selecting a current datapath configuration instruction of the configurable computation circuit, and for selecting a next datapath instruction or a next datapath instruction index of a next configurable computation circuit; and
a control circuit coupled to the configurable computing circuit, the control circuit comprising:
a memory control circuit;
a plurality of control registers, wherein the plurality of control registers store a completion table having a first data completion count; and
thread control circuitry for queuing threads for execution.
136. A system, comprising:
a first interconnection network;
a host interface coupled to the interconnection network;
at least one configurable circuit group coupled to the interconnection network, the configurable circuit group comprising a plurality of configurable circuits arranged in an array; and
a processor coupled to the interconnection network, the processor comprising:
a processor core to execute a plurality of instructions; and
core control circuitry coupled to the processor core, the core control circuitry comprising:
an interconnection network interface coupleable to the interconnection network to receive a work descriptor packet and decode the received work descriptor packet into an execution thread having an initial program count and any received arguments;
a thread control memory coupled to the interconnect network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument;
an execution queue coupled to the thread control memory;
control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; and
an instruction cache coupled to the processor core and the control logic and thread selection circuitry to receive the initial program count and provide a corresponding one of the plurality of instructions to the processor core for execution.
137. A method of scheduling and controlling execution of instructions or threads of execution in a system having a processor with a processor core and core control circuitry, the processor coupled to a plurality of multi-threaded configurable circuits by a first interconnection network, the method comprising:
receiving, using the core control circuitry, a work descriptor packet;
decoding, using the core control circuitry, the received work descriptor packet into an execution thread having an initial program count and any received arguments;
assigning, using the core control circuitry, an available thread identifier to the execution thread;
automatically queuing, using the core control circuitry, the thread identifier to execute the execution thread; and
periodically selecting the thread identifier for execution of the execution thread by the processor core.
138. The method of claim 137, further comprising:
executing, using the processor core, a fiber create instruction to create a plurality of execution threads for execution by a second processing element; and
in response to executing the fiber create instruction, generating, using network interface circuitry, one or more work descriptor packets destined for the second processing element to execute the plurality of execution threads.
139. The method of claim 138, further comprising:
reserving, using the core control circuitry, a predetermined amount of memory space in a thread control memory to store return arguments in response to executing the fiber create instruction.
140. The method according to claim 137, wherein said queuing and selecting step further comprises:
automatically queuing, using the core control circuitry, the thread identifier to execute the thread of execution when the thread identifier has a valid state; and
periodically selecting, using the core control circuitry, the thread identifier to execute instructions of the execution thread for a duration in which the valid state remains unchanged until the execution thread is completed.
141. The method of claim 140, further comprising:
using the core control circuitry, when the thread identifier has a suspended state, suspending thread execution by not returning the thread identifier to the execution queue.
142. The method of claim 137, further comprising:
accessing thread control memory using the thread identifier as an index using the core control circuitry to select the initial program count for the execution thread.
143. The method of claim 137, further comprising:
receiving, using the core control circuitry, an event data packet; and
decoding, using the core control circuitry, the received event data packet into an event identifier and any received arguments.
144. The method of claim 137, further comprising:
assigning, using the core control circuitry, an initial valid state to the execution thread.
145. The method of claim 137, further comprising:
assigning, using the core control circuitry, a suspended state to the execution thread in response to executing a memory load instruction.
146. The method of claim 137, further comprising:
assigning, using the core control circuitry, a suspended state to the execution thread in response to executing a memory store instruction.
147. The method of claim 137, further comprising:
terminating, using the core control circuitry, execution of the selected thread in response to executing the return instruction.
148. The method of claim 137, further comprising:
returning, using the core control circuitry, a corresponding thread identifier for the selected thread to a thread identifier pool in response to executing a return instruction.
149. The method of claim 137, further comprising:
clearing, using the core control circuitry, registers of a thread control memory indexed by the corresponding thread identifier of the selected thread in response to executing a return instruction.
150. The method of claim 137, further comprising:
generating, using the network interface circuitry, a return work descriptor packet in response to executing a return instruction.
151. The method of claim 137, further comprising:
generating a point-to-point event data message or generating a broadcast event data message using network interface circuitry.
152. The method of claim 137, further comprising:
responding, using the core control circuitry, to a received event data packet using an event mask.
153. The method of claim 137, further comprising:
determining, using the core control circuitry, an event number corresponding to the received event data packet.
154. The method of claim 137, further comprising:
changing, using the core control circuitry, the state of the thread identifier from suspended to active in response to an event number of a received event data packet to resume execution of the corresponding thread of execution.
155. The method of claim 137, further comprising:
using the core control circuitry, successively selecting a next thread identifier from the execution queue to execute a single instruction of a corresponding execution thread.
156. The method of claim 137, further comprising:
performing, using the core control circuitry, a round-robin or barrel selection of a next thread identifier from the execution queue among the plurality of thread identifiers, each selected thread identifier for executing a single instruction of a corresponding execution thread, until the execution thread is completed.
157. The method of claim 137, further comprising:
assigning, using the core control circuitry, a valid state or a suspended state to a thread identifier;
assigning, using the core control circuitry, a priority status to a thread identifier; and
returning, using the core control circuitry, the corresponding thread identifier having an assigned valid state and an assigned priority to the execution queue after execution of a corresponding instruction.
158. The method of claim 137, wherein the selecting step further comprises:
selecting, using the core control circuitry, a thread identifier from a first priority queue at a first frequency and a thread identifier from a second priority queue at a second frequency, the second frequency being lower than the first frequency.
159. The method of claim 158, further comprising:
determining, using the core control circuitry, the second frequency as a skip count starting with selecting a thread identifier from the first priority queue.
160. The method of claim 137, further comprising:
controlling, using data path control circuitry, a data path access size.
161. The method of claim 160, further comprising:
increasing or decreasing, using the data path control circuitry, a memory load access size or a memory store access size in response to a time-averaged usage level.
162. The method of claim 137, further comprising:
increasing, using the core control circuitry, a size of a memory load access request to correspond to a cache line boundary of a data cache.
163. The method of claim 137, further comprising:
generating, using the network interface circuitry, one or more system calls to a host processor.
164. The method of claim 163, further comprising:
modulating, using a predetermined credit count, the number of system calls within any predetermined time period.
165. The method of claim 137, further comprising:
copying and transferring, using the core control circuitry, all data from the thread control memory corresponding to the selected thread identifier in response to a request from the host processor to monitor thread status.
166. The method of claim 137, further comprising:
executing, using the processor core, a fiber create instruction to generate one or more commands that generate one or more call work descriptor packets destined for another processor core or a hybrid thread fabric circuit; and
reserving, using the core control circuitry, a predetermined amount of memory space to store any return arguments in response to executing the fiber create instruction.
167. The method of claim 166, further comprising:
storing, using the core control circuitry, a thread return count in a thread return register in response to generating one or more call work descriptor packets, and decrementing the thread return count stored in the thread return register in response to receiving a return data packet.
168. The method of claim 167, wherein responsive to the thread return count in the thread return register decrementing to zero, using the core control circuitry, a suspended state of a corresponding thread identifier is changed to an active state for subsequent execution of a thread return instruction to complete a created fiber or thread.
169. The method of claim 137, further comprising:
executing a wait or non-wait fiber join instruction using the processor core.
170. The method of claim 137, further comprising:
executing, using the processor core, a fiber join all instruction.
171. The method of claim 137, further comprising:
executing, using the processor core, a non-cache read or load instruction to specify a general purpose register for storing data received from memory.
172. The method of claim 137, further comprising:
executing, using the processor core, a non-cache write or store instruction to designate data in a general purpose register for storage in memory.
173. The method of claim 137, further comprising:
assigning, using the core control circuitry, a transaction identifier to any load or store request to memory and correlating the transaction identifier with a thread identifier.
174. The method of claim 137, further comprising:
executing, using the processor core, a custom atomic return instruction to complete a thread of execution of a custom atomic operation.
175. The method of claim 137, further comprising:
executing, using the processor core, a floating point atomic memory operation.
176. The method of claim 137, further comprising:
using the processor core, performing a custom atomic memory operation.
177. The method of claim 137, further comprising:
providing, using a first instruction memory, a plurality of datapath configuration instructions to configure datapaths of configurable circuits of the plurality of configurable circuits; and
providing a plurality of spoke instructions and a datapath configuration instruction index to select a master synchronization input of the synchronization network inputs using a second instruction and instruction index memory.
178. The method of claim 137, further comprising:
providing, using a first instruction memory, a plurality of datapath configuration instructions to configure datapaths of configurable circuits of the plurality of configurable circuits; and
providing a plurality of spoke instructions and datapath configuration instruction indices to select a current datapath configuration instruction of the configurable computing circuit using a second instruction and instruction index memory.
179. The method of claim 137, further comprising:
providing, using a first instruction memory, a plurality of datapath configuration instructions to configure datapaths of configurable circuits of the plurality of configurable circuits; and
providing a plurality of spoke instructions and a data path configuration instruction index to select a next data path configuration instruction for a next configurable circuit of the plurality of configurable circuits using a second instruction and instruction index memory.
180. The method of claim 137, further comprising:
using a conditional logic circuit, providing a conditional branch by modifying a next data path instruction or a next data path instruction index provided to a next configurable circuit of the plurality of configurable circuits in dependence upon an output from a configurable circuit of the plurality of configurable circuits.
181. The method of claim 137, further comprising:
generating, using a flow control circuit, a stop signal when a predetermined threshold is reached in an asynchronous network output queue.
182. The method of claim 137, further comprising:
storing, using a plurality of control registers, a loop table having a plurality of thread identifiers and, for each thread identifier, a next thread identifier for execution after execution of a current thread, to provide in-order thread execution.
183. The method of claim 137, further comprising:
storing, using a plurality of control registers, a completion table having a first data completion count; and
queuing, using thread control circuitry, a thread for execution when the completion count for the thread identifier of the thread is decremented to zero.
184. The method of claim 137, further comprising:
providing, using a first instruction memory, a plurality of datapath configuration instructions to configure a data path of a first configurable circuit of the plurality of configurable circuits;
providing a plurality of spoke instructions and datapath configuration instruction indices using a second instruction and instruction index memory to select a master synchronization input of a plurality of synchronization network inputs, select a current datapath configuration instruction of the first configurable circuit, and select a next datapath instruction or a next datapath instruction index of a second next configurable circuit of the plurality of configurable circuits;
providing, using a plurality of control registers, a completion table having a first data completion count; and
queuing, using thread control circuitry, a thread for execution when a completion count for the thread identifier of the thread is decremented to zero.
185. The method of claim 184, further comprising:
providing, using the second instruction and instruction index memory, a plurality of spoke instructions and data path configuration instruction indices to select a synchronous network output of the plurality of synchronous network outputs.
186. The method of claim 184, further comprising:
providing, using a configuration memory multiplexer, a first selection setting to select the current datapath configuration instruction using an instruction index from the second instruction and instruction index memory.
187. The method of claim 184, further comprising:
providing, using a configuration memory multiplexer, a second selection setting to select the current datapath configuration instruction using an instruction index from the master synchronization input, the second selection setting being different from the first selection setting.
188. The method of claim 184, further comprising:
providing, using the second instruction and instruction index memory, a plurality of spoke instructions and data path configuration instruction indices to configure a portion of the configurable circuit independent of the current data path instruction.
189. The method of claim 184, further comprising:
selecting, using a configuration memory multiplexer, a spoke instruction and a data path configuration instruction index of the plurality of spoke instruction and data path configuration instruction indices according to a modular spoke count.
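Claim 189 selects a spoke instruction and instruction index entry "according to a modular spoke count", i.e. a wrapping counter time-multiplexes the entries of the second instruction and instruction index memory. A small Python sketch of that selection rule, with illustrative names (`spoke_memory`, `select_spoke_instruction`) not drawn from the claims:

```python
def select_spoke_instruction(spoke_memory, cycle):
    """Select a spoke instruction / instruction index entry by a modular
    (wrapping) spoke count, giving a fixed time-multiplexed schedule."""
    spoke_count = cycle % len(spoke_memory)  # the modular spoke count
    return spoke_memory[spoke_count]

# Three entries are revisited in order as the cycle counter advances.
spoke_memory = ["inst_idx_0", "inst_idx_1", "inst_idx_2"]
schedule = [select_spoke_instruction(spoke_memory, c) for c in range(6)]
# → ['inst_idx_0', 'inst_idx_1', 'inst_idx_2',
#    'inst_idx_0', 'inst_idx_1', 'inst_idx_2']
```

The wrap-around is what lets a fixed-size memory drive an indefinitely long execution, one spoke entry per clock cycle.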
190. The method of claim 137, further comprising:
providing, using a first instruction memory, a plurality of data paths using a configuration instruction to configure a data path of a first configurable circuit of the plurality of configurable circuits;
providing a plurality of spoke instructions and datapath configuration instruction indices using a second instruction and instruction index memory to select a master synchronization input of a plurality of synchronization network inputs, select a current datapath configuration instruction of the first configurable circuit, and select a next datapath instruction or a next datapath instruction index of a second next configurable circuit of the plurality of configurable circuits;
providing, using a plurality of control registers, a completion table having a first data completion count; and
queuing, using thread control circuitry, a thread for execution when its completion count is decremented to zero for its thread identifier and its thread identifier is the next thread identifier.
191. The method of claim 137, further comprising:
storing, using a plurality of control registers, a completion table having a plurality of types of thread identifiers, wherein each type of thread identifier indicates a loop level for loop and nested loop execution, and wherein the plurality of control registers further store a top of a stack of thread identifiers; and
allowing each type of thread identifier access to the private variable for the selected loop.
192. The method of claim 137, further comprising:
storing, using a plurality of control registers, a completion table having a data completion count;
providing, using thread control circuitry, a continuation queue storing one or more thread identifiers for computing threads having completion counts that allow execution but not yet having an assigned thread identifier; and
providing, using the thread control circuitry, a re-entry queue storing one or more thread identifiers for computing threads having completion counts that allow execution and having assigned thread identifiers, such that the threads in the re-entry queue execute after a specified spoke count.
193. The method of claim 137, further comprising:
storing, using a plurality of control registers, a pool of thread identifiers and a completion table having a loop count of the number of active loop threads; and
using thread control circuitry, in response to receiving an asynchronous fabric message that returns a thread identifier to the thread identifier pool, decrementing the loop count, and transmitting an asynchronous fabric completion message when the loop count reaches zero.
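Claim 193's loop-teardown protocol — decrement a loop count each time a thread identifier is returned to the pool, and emit an asynchronous fabric completion message when the count hits zero — can be modeled in a few lines. This is a behavioral sketch only; `LoopTracker` and its flag standing in for the fabric message are illustrative names:

```python
class LoopTracker:
    """Behavioral model of claim 193: each returned thread identifier
    decrements the loop count; at zero, the loop is complete."""
    def __init__(self, thread_ids):
        self.pool = set()                  # thread identifier pool
        self.loop_count = len(thread_ids)  # number of active loop threads
        self.completed = False             # stands in for the async fabric
                                           # completion message

    def return_thread_id(self, tid):
        # An asynchronous fabric message returns tid to the pool.
        self.pool.add(tid)
        self.loop_count -= 1
        if self.loop_count == 0:
            self.completed = True

tracker = LoopTracker(["t0", "t1"])
tracker.return_thread_id("t0")   # one loop thread still active
tracker.return_thread_id("t1")   # last thread returned -> completion
```

The single completion message at count zero is what lets a downstream synchronous domain observe that every iteration of the loop has finished.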
194. The method of claim 137, further comprising:
providing a conditional branch by modifying a next datapath instruction or a next datapath instruction index using a conditional logic circuit and dependent upon an output from a configurable circuit of the plurality of configurable circuits.
195. The method of claim 137, further comprising:
enabling a conditional logic circuit; and
using the conditional logic circuit and in dependence upon an output from a configurable circuit of the plurality of configurable circuits, specifying the next datapath instruction or next datapath instruction index by ORing the least significant bits of the next datapath instruction with the output from the configurable circuit, thereby providing a conditional branch.
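The OR-based branch of claim 195 works because ORing a one-bit datapath result into the low bits of the next instruction index selects between an even fall-through index and the adjacent odd branch target. A minimal sketch (function and parameter names are illustrative, assuming a single condition bit):

```python
def next_instruction_index(base_index, branch_output, enabled=True):
    """Conditional branch per claim 195: OR the least significant bit of
    the datapath output into the next instruction index.
    branch_output LSB == 0 -> fall through; == 1 -> odd branch target."""
    if not enabled:
        return base_index          # conditional logic circuit disabled
    return base_index | (branch_output & 0x1)

assert next_instruction_index(0b0110, 0) == 0b0110  # condition false
assert next_instruction_index(0b0110, 1) == 0b0111  # condition true
```

Placing the two possible successors at an even/odd index pair is the implicit design choice that makes a single OR gate sufficient; no adder or mux tree is needed in the conditional logic circuit.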
196. The method of claim 137, further comprising:
selecting the master synchronization input using an input multiplexer.
197. The method of claim 137, further comprising:
selecting an output from a configurable circuit of the plurality of configurable circuits using an output multiplexer.
198. The method of claim 137, further comprising:
decoding, using an asynchronous fabric state machine coupled to an asynchronous network input queue and an asynchronous network output queue, input data packets received from the asynchronous packet network, and assembling output data packets for transmission over the asynchronous packet network.
199. The method of claim 137, further comprising:
providing a plurality of direct point-to-point connections coupling adjacent ones of the plurality of configurable circuits using a synchronization network.
200. The method of claim 199, further comprising:
providing a direct path connection between a plurality of input registers and a plurality of output registers using a first configurable circuit of the plurality of configurable circuits.
201. The method of claim 200 wherein the direct path connection provides a direct point-to-point connection for data transfer from a second configurable circuit of the plurality of configurable circuits received over the synchronous network to a third configurable circuit of the plurality of configurable circuits transmitted over the synchronous network.
202. The method of claim 137, further comprising:
performing, using a configurable circuit of the plurality of configurable circuits, at least one integer or floating point operation selected from the group consisting of: signed and unsigned addition, absolute value, negation, logical NOT, add-and-negate, subtraction A-B, reverse subtraction B-A, signed and unsigned greater-than-or-equal comparisons, signed and unsigned less-than-or-equal comparisons, equal-to and not-equal-to comparisons, logical AND operations, logical OR operations, logical XOR operations, logical NAND operations, logical NOR operations, and conversions between integer and floating point.
203. The method of claim 137, further comprising:
performing, using a configurable circuit of the plurality of configurable circuits, at least one integer or floating point operation selected from the group consisting of: multiplication, shifting, passing inputs, signed and unsigned multiplication, signed and unsigned right shifting, signed and unsigned left shifting, bit order reversal, permutation, conversion between integer and floating point, and combinations thereof.
204. The method of claim 137, further comprising:
receiving, using a scheduling interface circuit, a second work descriptor packet over the first interconnection network and, in response to the second work descriptor packet, generating one or more data and control packets to the plurality of configurable circuits to configure the plurality of configurable circuits to perform selected computations.
205. The method of claim 137, further comprising:
generating, using a flow control circuit, a stop signal when a predetermined threshold is reached in the asynchronous network output queue.
206. The method of claim 205, wherein each asynchronous network output queue stops outputting data packets on an asynchronous packet network in response to the stop signal.
207. The method of claim 205, wherein each configurable circuit of the plurality of configurable circuits stops executing upon completion of its current instruction in response to the stop signal.
208. The method of claim 137, further comprising:
coupling a first plurality of the plurality of configurable circuits in a first predetermined order through a synchronization network to form a first synchronization domain; and
coupling a second plurality of configurable circuits of the plurality of configurable circuits in a second predetermined order through the synchronization network to form a second synchronization domain.
209. The method of claim 208, further comprising:
generating a continuation message from the first synchronous domain to the second synchronous domain for transmission over the asynchronous packet network.
210. The method of claim 208, further comprising:
generating a completion message from the second synchronous domain to the first synchronous domain for transmission over the asynchronous packet network.
211. The method of claim 137, further comprising:
storing, in a plurality of control registers, a completion table having a first data completion count and a second iteration count.
212. The method of claim 137, further comprising:
storing a loop table having a plurality of thread identifiers in a plurality of control registers, and for each thread identifier, storing a next thread identifier for execution after execution of a current thread; and
storing an identification of a first iteration and an identification of a last iteration in the loop table in the plurality of control registers.
213. The method of claim 137, further comprising:
queuing, using the control circuitry, the thread for execution when the completion count for the thread identifier of the thread is decremented to zero.
214. The method of claim 137, further comprising:
queuing, using the control circuitry, a thread for execution when its completion count is decremented to zero for its thread identifier and its thread identifier is the next thread identifier.
215. The method of claim 137, further comprising:
queuing, using the control circuitry, a thread for execution when a completion count for the thread indicates completion of any data dependencies for the thread identifier of the thread.
216. The method of claim 214, wherein the completion count indicates, for each selected thread of a plurality of threads, a predetermined number of completion messages received before execution of the selected thread.
217. The method of claim 137, further comprising:
storing, in the plurality of control registers, a completion table having a plurality of types of thread identifiers, wherein each type of thread identifier indicates a loop level for loop and nested loop execution.
218. The method of claim 216, further comprising:
storing, in the plurality of control registers, a completion table having a loop count of the number of active loop threads, and wherein in response to receiving an asynchronous fabric message that returns a thread identifier to a thread identifier pool, the loop count is decremented using the control circuitry and the asynchronous fabric completion message is transmitted when the loop count reaches zero.
219. The method of claim 217, further comprising:
storing the top of the thread identifier stack in the plurality of control registers to allow each type of thread identifier to access the private variable for the selected loop.
220. The method of claim 137, further comprising:
storing, using the continuation queue, one or more thread identifiers for computing threads having completion counts that allow execution but not yet having an assigned thread identifier.
221. The method of claim 220, further comprising:
storing, using the re-entry queue, one or more thread identifiers for computing threads having completion counts that allow execution and having assigned thread identifiers.
222. The method of claim 220, further comprising:
executing any threads having thread identifiers in the re-entry queue before executing any threads having thread identifiers in the continuation queue.
223. The method of claim 221, further comprising:
executing any threads having a thread identifier in a priority queue prior to executing any threads having a thread identifier in the continuation queue or the re-entry queue.
224. The method of claim 137, further comprising:
executing any thread in the run queue after the spoke count for the thread identifier occurs.
225. The method of claim 137, further comprising:
self-scheduling, using the control circuitry, computing threads for execution.
226. The method of claim 223, further comprising:
ordering, using the control circuitry, the computing threads for execution.
227. The method of claim 223, further comprising:
ordering, using the control circuitry, loop computation threads for execution.
228. The method of claim 223, further comprising:
commencing, using the control circuitry, execution of a computing thread in response to one or more completion signals from its data dependencies.
229. A configurable circuit, comprising:
a configurable computing circuit;
a plurality of synchronous network inputs coupled to the configurable computing circuitry;
a plurality of synchronous network outputs coupled to the configurable computing circuitry; and
a configuration memory coupled to the configurable computing circuitry, the control circuitry, the synchronous network input, and the synchronous network output, the configuration memory comprising:
a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and
a second instruction and instruction index memory storing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of the synchronization network input.
230. A configurable circuit, comprising:
a configurable computing circuit; and
a configuration memory coupled to the configurable computing circuitry, the control circuitry, the synchronous network input, and the synchronous network output, the configuration memory comprising:
a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and
a second instruction and instruction index memory storing a plurality of spoke instructions and datapath configuration instruction indices to select a current datapath configuration instruction of the configurable computing circuit.
231. A configurable circuit, comprising:
a configurable computing circuit; and
a configuration memory coupled to the configurable computing circuitry, the control circuitry, the synchronous network input, and the synchronous network output, the configuration memory comprising:
a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and
a second instruction and instruction index memory storing a plurality of spoke instructions and data path configuration instruction indices to select a next data path configuration instruction for a next configurable computational circuit.
232. A configurable circuit, comprising:
a configurable computing circuit;
a control circuit coupled to the configurable computing circuit;
a first memory circuit coupled to the configurable computing circuit;
a plurality of synchronous network inputs coupled to the configurable computing circuitry;
a plurality of synchronous network outputs coupled to the configurable computing circuitry; and
a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output, the configuration memory circuit comprising:
a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and
a second instruction and instruction index memory storing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of the synchronization network input.
233. A configurable circuit, comprising:
a configurable computing circuit;
a first memory circuit coupled to the configurable computing circuit;
a plurality of synchronous network inputs coupled to the configurable computing circuitry;
a plurality of synchronous network outputs coupled to the configurable computing circuitry; and
a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output; and
a control circuit coupled to the configurable computing circuit, the control circuit comprising:
a memory control circuit;
a thread control circuit; and
a plurality of control registers.
234. A configurable circuit, comprising:
a configurable computing circuit;
a configuration memory coupled to the configurable computing circuitry, the control circuitry, the synchronous network input, and the synchronous network output, the configuration memory comprising:
a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and
a second instruction and instruction index memory storing a plurality of spoke instructions and data path configuration instruction indices to select a next data path instruction or a next data path instruction index for a next configurable computational circuit; and
a conditional logic circuit coupled to the configurable computation circuit, wherein the conditional logic circuit is to provide a conditional branch by modifying the next data path instruction or next data path instruction index provided on a selected output of the plurality of synchronous network outputs as a function of an output from the configurable computation circuit.
235. A configurable circuit, comprising:
a configurable computing circuit;
a control circuit coupled to the configurable computing circuit;
a first memory circuit coupled to the configurable computing circuit;
a plurality of synchronous network inputs coupled to the configurable computing circuitry;
a plurality of synchronous network outputs coupled to the configurable computing circuitry;
an asynchronous network input queue coupled to an asynchronous packet network and the first memory circuit;
an asynchronous network output queue; and
a flow control circuit coupled to the asynchronous network output queue, the flow control circuit to generate a stop signal when a predetermined threshold is reached in the asynchronous network output queue.
236. A configurable circuit, comprising:
a configurable computing circuit;
a first memory circuit coupled to the configurable computing circuit;
a plurality of synchronous network inputs coupled to the configurable computing circuitry;
a plurality of synchronous network outputs coupled to the configurable computing circuitry; and
a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output; and
a control circuit coupled to the configurable computing circuit, the control circuit comprising:
a memory control circuit;
a thread control circuit; and
a plurality of control registers, wherein the plurality of control registers store a loop table having a plurality of thread identifiers and, for each thread identifier, a next thread identifier for execution after execution of a current thread to provide in-order thread execution.
237. A configurable circuit, comprising:
a configurable computing circuit;
a first memory circuit coupled to the configurable computing circuit;
a plurality of synchronous network inputs coupled to the configurable computing circuitry;
a plurality of synchronous network outputs coupled to the configurable computing circuitry; and
a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output; and
a control circuit coupled to the configurable computing circuit, the control circuit comprising:
a memory control circuit;
a plurality of control registers, wherein the plurality of control registers store a completion table having a first data completion count; and
thread control circuitry to queue a thread for execution when a completion count for the thread is decremented to zero for the thread identifier of the thread.
238. A configurable circuit, comprising:
a configurable computing circuit;
a first memory circuit coupled to the configurable computing circuit;
a plurality of synchronous network inputs and outputs coupled to the configurable computing circuitry;
an asynchronous network input queue and an asynchronous network output queue;
a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output, the second configuration memory comprising:
a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and
a second instruction and instruction index memory storing: a plurality of spoke instructions and datapath configuration instruction indices for selecting a master synchronization input of the synchronization network input, for selecting a current datapath configuration instruction of the configurable computation circuit, and for selecting a next datapath instruction or a next datapath instruction index of a next configurable computation circuit; and
a control circuit coupled to the configurable computing circuit, the control circuit comprising:
a memory control circuit;
a plurality of control registers, wherein the plurality of control registers store a completion table having a first data completion count; and
thread control circuitry to queue a thread for execution when a completion count for the thread is decremented to zero for the thread identifier of the thread.
239. A configurable circuit, comprising:
a configurable computing circuit;
a first memory circuit coupled to the configurable computing circuit;
a plurality of synchronous network inputs coupled to the configurable computing circuitry;
a plurality of synchronous network outputs coupled to the configurable computing circuitry; and
a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output; and
a control circuit coupled to the configurable computing circuit, the control circuit comprising:
a memory control circuit;
a plurality of control registers, wherein the plurality of control registers store a completion table having a first data completion count; and
thread control circuitry for queuing a thread for execution when a completion count for the thread is decremented to zero for the thread identifier of the thread and its thread identifier is the next thread.
240. A configurable circuit, comprising:
a configurable computing circuit;
a first memory circuit coupled to the configurable computing circuit;
a plurality of synchronous network inputs coupled to the configurable computing circuitry;
a plurality of synchronous network outputs coupled to the configurable computing circuitry; and
a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output; and
a control circuit coupled to the configurable computing circuit, the control circuit comprising:
a memory control circuit;
a thread control circuit; and
a plurality of control registers storing a completion table having a plurality of types of thread identifiers, wherein each type of thread identifier indicates a loop level for loop and nested loop execution, and wherein the plurality of control registers further stores a top of a stack of thread identifiers to allow each type of thread identifier to access a private variable for a selected loop.
241. A configurable circuit, comprising:
a configurable computing circuit;
a first memory circuit coupled to the configurable computing circuit;
a plurality of synchronous network inputs coupled to the configurable computing circuitry;
a plurality of synchronous network outputs coupled to the configurable computing circuitry; and
a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output; and
a control circuit coupled to the configurable computing circuit, the control circuit comprising:
a memory control circuit;
a plurality of control registers; and
thread control circuitry, comprising:
a continuation queue that stores one or more thread identifiers for computing threads that have completion counts that allow execution but do not yet have an assigned thread identifier; and
a re-entry queue that stores one or more thread identifiers for computing threads having completion counts that allow execution and having assigned thread identifiers, such that the threads in the re-entry queue execute after a specified spoke count.
242. A configurable circuit, comprising:
a configurable computing circuit;
a first memory circuit coupled to the configurable computing circuit;
a plurality of synchronous network inputs coupled to the configurable computing circuitry;
a plurality of synchronous network outputs coupled to the configurable computing circuitry;
a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output; and
a control circuit coupled to the configurable computing circuit, the control circuit comprising:
a memory control circuit;
a plurality of control registers storing a pool of thread identifiers and a completion table having a loop count of the number of active loop threads; and
thread control circuitry, wherein in response to receiving an asynchronous fabric message that returns a thread identifier to the thread identifier pool, the control circuitry decrements the loop count and transmits an asynchronous fabric completion message when the loop count reaches zero.
243. A system, comprising:
an asynchronous packet network;
a synchronization network; and
a plurality of configurable circuits arranged in an array, each configurable circuit of the plurality of configurable circuits being simultaneously coupled to the synchronous network and the asynchronous packet network, the plurality of configurable circuits being configured to form a plurality of synchronous domains using the synchronous network to perform a plurality of computations, and the plurality of configurable circuits being further configured to generate and transmit a plurality of control messages over the asynchronous packet network, the plurality of control messages including one or more completion messages and continuation messages.
244. A system, comprising:
a plurality of configurable circuits arranged in an array;
a synchronization network coupled to each configurable circuit of the plurality of configurable circuits of the array; and
an asynchronous packet network coupled to each configurable circuit of the plurality of configurable circuits of the array.
245. A system, comprising:
an interconnection network;
a processor coupled to the interconnection network; and
a plurality of configurable circuit groups coupled to the interconnection network.
246. A system, comprising:
an interconnection network;
a processor coupled to the interconnection network;
a host interface coupled to the interconnection network; and
a plurality of configurable circuit groups coupled to the interconnection network, each configurable circuit group of the plurality of configurable circuit groups comprising:
a plurality of configurable circuits arranged in an array;
a synchronization network coupled to each configurable circuit of the plurality of configurable circuits of the array;
an asynchronous packet network coupled to each configurable circuit of the plurality of configurable circuits of the array;
a memory interface coupled to the asynchronous packet network and the interconnection network; and
a scheduling interface coupled to the asynchronous packet network and the interconnection network.
247. A system, comprising:
a hierarchical interconnect network including a first plurality of crossbars having a folded Clos configuration and a plurality of direct mesh connections at interfaces with endpoints;
a processor coupled to the interconnection network;
a host interface coupled to the interconnection network; and
a plurality of configurable circuit groups coupled to the interconnection network, each configurable circuit group of the plurality of configurable circuit groups comprising:
a plurality of configurable circuits arranged in an array;
a synchronization network coupled to each configurable circuit of the plurality of configurable circuits of the array and providing a plurality of direct connections between adjacent configurable circuits of the array;
an asynchronous packet network comprising a second plurality of crossbars, each crossbar coupled to at least one configurable circuit of the plurality of configurable circuits of the array and another crossbar of the second plurality of crossbars;
a memory interface coupled to the asynchronous packet network and the interconnection network; and
a scheduling interface coupled to the asynchronous packet network and the interconnection network.
248. A system, comprising:
an interconnection network;
a processor coupled to the interconnection network;
a host interface coupled to the interconnection network; and
a plurality of configurable circuit groups coupled to the interconnection network, each configurable circuit group of the plurality of configurable circuit groups comprising:
a synchronization network;
an asynchronous packet network;
a memory interface coupled to the asynchronous packet network and the interconnection network;
a scheduling interface coupled to the asynchronous packet network and the interconnection network; and
a plurality of configurable circuits arranged in an array, each configurable circuit comprising:
a configurable computing circuit;
a control circuit coupled to the configurable computing circuit, the control circuit comprising: a memory control circuit;
a thread control circuit; and a plurality of control registers;
a first memory circuit coupled to the configurable computing circuit;
a plurality of synchronization network inputs and outputs coupled to the configurable computing circuitry and the synchronization network;
an asynchronous network input queue and an asynchronous network output queue coupled to the asynchronous packet network;
a second configuration memory circuit coupled to the configurable computing circuit, the control circuit, the plurality of synchronization network inputs, and the plurality of synchronization network outputs, the configuration memory circuit comprising:
a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuit; and
a second instruction and instruction index memory storing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of the plurality of synchronization network inputs.
249. The configurable circuit or system of any of preceding claims 227 to 246, wherein the second instruction and instruction index memory further stores a plurality of spoke instructions and datapath configuration instruction indices to select a current datapath configuration instruction of the configurable computing circuit.
250. The configurable circuit or system of any of preceding claims 227 to 247, wherein the second instruction and instruction index memory further stores a plurality of spoke instructions and datapath configuration instruction indices to select a next datapath configuration instruction for a next configurable computing circuit.
251. The configurable circuit or system of any of preceding claims 227 to 248, wherein the second instruction and instruction index memory further stores a plurality of spoke instructions and datapath configuration instruction indices to select a synchronization network output of the plurality of synchronization network outputs.
252. The configurable circuit or system of any of preceding claims 227-249, further comprising:
a configuration memory multiplexer coupled to the first instruction memory and the second instruction and instruction index memory.
253. The configurable circuit or system of any one of preceding claims 227 to 250, wherein the current datapath configuration instruction is selected using an instruction index from the second instruction and instruction index memory when a select input of the configuration memory multiplexer has a first setting.
254. The configurable circuit or system of any one of preceding claims 227-251, wherein the current datapath configuration instruction is selected using an instruction index from the master synchronization input when the select input of the configuration memory multiplexer has a second setting that is different from the first setting.
255. The configurable circuit or system of any of preceding claims 227 to 252, wherein the second instruction and instruction index memory further stores a plurality of spoke instructions and a data path configuration instruction index to configure portions of the configurable circuit independent of the current data path instruction.
256. The configurable circuit or system of any one of preceding claims 227 to 253, wherein a selected spoke instruction and data path configuration instruction index of the plurality of spoke instruction and data path configuration instruction indices is selected according to a modular spoke count.
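As a brief illustrative sketch (not part of the claims), the modular spoke-count selection of claim 256 can be modeled as indexing a table of spoke instruction and datapath configuration instruction index pairs modulo the table length; all names below are hypothetical.

```python
# Hypothetical model of modular spoke-count instruction selection.
# The spoke memory holds (spoke_instruction, datapath_instruction_index)
# pairs; a free-running spoke counter selects an entry modulo the length.

def select_spoke_entry(spoke_ram, spoke_count):
    """Return the (spoke instruction, datapath index) pair for a spoke count."""
    return spoke_ram[spoke_count % len(spoke_ram)]

spoke_ram = [("spoke0", 3), ("spoke1", 7), ("spoke2", 1)]
# Counts 0, 1, 2, 3, ... wrap around the three entries.
assert select_spoke_entry(spoke_ram, 0) == ("spoke0", 3)
assert select_spoke_entry(spoke_ram, 4) == ("spoke1", 7)
```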
257. The configurable circuit or system of any one of the preceding claims 227-254, further comprising:
a conditional logic circuit coupled to the configurable computing circuit.
258. The configurable circuit or system of any of preceding claims 227 to 255, wherein the conditional logic circuit is to modify the next datapath instruction index provided on a selected output of the plurality of synchronization network outputs in dependence on an output from the configurable computing circuit.
259. The configurable circuit or system of any of preceding claims 227 to 256, wherein the conditional logic circuit is to provide conditional branching by modifying the next datapath instruction or next datapath instruction index provided on a selected output of the plurality of synchronization network outputs in dependence on an output from the configurable computing circuit.
260. The configurable circuit or system of any of preceding claims 227 to 257, wherein, when enabled, the conditional logic circuit is to specify the next datapath instruction or datapath instruction index by ORing least significant bits of the next datapath instruction with the output from the configurable computing circuit, providing a conditional branch.
261. The configurable circuit or system of any of preceding claims 227 to 258, wherein, when enabled, the conditional logic circuit is to specify the next datapath instruction index by ORing least significant bits of the next datapath instruction index with the output from the configurable computing circuit, providing a conditional branch.
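As an illustrative sketch (not part of the claims), the OR-based conditional branch of claims 260 and 261 amounts to merging a one-bit computation result into the low bit of the next instruction index, selecting between two adjacent instructions; the function and names are hypothetical.

```python
# Hypothetical model of the conditional-branch mechanism: when enabled,
# the least significant bit of the next datapath instruction index is
# OR'ed with the (0/1) output of the computation, so the condition bit
# chooses between index and index + 1.

def next_instruction_index(base_index, alu_output_bit, enabled=True):
    """OR the condition bit into the LSB of the next instruction index."""
    if not enabled:
        return base_index
    return base_index | (alu_output_bit & 0x1)

# With an even base index, the condition bit selects index or index + 1.
assert next_instruction_index(8, 0) == 8   # branch not taken
assert next_instruction_index(8, 1) == 9   # branch taken
```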
262. The configurable circuit or system of any one of preceding claims 227-259, wherein the plurality of synchronous network inputs comprises:
a plurality of input registers coupled to a plurality of communication lines of the synchronous network; and
an input multiplexer coupled to the plurality of input registers and the second instruction and instruction index memory to select the master synchronization input.
263. The configurable circuit or system of any of preceding claims 227 to 260, wherein the plurality of synchronous network outputs comprises:
a plurality of output registers coupled to a plurality of communication lines of the synchronous network; and
an output multiplexer coupled to the configurable computing circuitry to select an output from the configurable computing circuitry.
264. The configurable circuit or system of any one of preceding claims 227-261, further comprising:
an asynchronous fabric state machine coupled to the asynchronous network input queue and the asynchronous network output queue, the asynchronous fabric state machine to decode input packets received from the asynchronous packet network and assemble output packets for transmission over the asynchronous packet network.
265. The configurable circuit or system of any of preceding claims 227 to 262, wherein the asynchronous packet network comprises a plurality of crossbars, each crossbar coupled to a plurality of configurable circuits and at least one other crossbar.
266. The configurable circuit or system of any of the preceding claims 227-263, further comprising:
an array of a plurality of configurable circuits, wherein:
each configurable circuit is coupled to the synchronization network through the plurality of synchronization network inputs and the plurality of synchronization network outputs; and
each configurable circuit is coupled to the asynchronous packet network through the asynchronous network input queue and the asynchronous network output queue.
267. The configurable circuit or system of any of preceding claims 227 to 264, wherein the synchronization network comprises a plurality of direct point-to-point connections coupling adjacent configurable circuits in the array of the plurality of configurable circuits.
268. The configurable circuit or system of any one of preceding claims 227 to 265, wherein each configurable circuit further comprises:
a direct path connection between the plurality of input registers and the plurality of output registers.
269. The configurable circuit or system of any one of preceding claims 227 to 266, wherein the direct path connection provides a direct point-to-point connection for data transfer from a second configurable circuit received over the synchronous network to a third configurable circuit transmitted over the synchronous network.
270. The configurable circuit or system of any of preceding claims 227 to 267, wherein the configurable computing circuit comprises arithmetic, logic, and bit-operation circuitry for performing at least one integer operation selected from the group consisting of: signed and unsigned addition, absolute value, negation, logical NOT, add-and-negate, subtraction A-B, reverse subtraction B-A, signed and unsigned greater-than-or-equal, signed and unsigned less-than-or-equal, equal and not-equal comparisons, logical AND, logical OR, logical XOR, logical NAND, logical NOR, conversions between integer and floating point, and combinations thereof.
271. The configurable circuit or system of any of preceding claims 227 to 268, wherein the configurable computing circuit comprises arithmetic, logic, and bit-operation circuitry for performing at least one floating-point operation selected from the group consisting of: signed and unsigned addition, absolute value, negation, logical NOT, add-and-negate, subtraction A-B, reverse subtraction B-A, signed and unsigned greater-than-or-equal, signed and unsigned less-than-or-equal, equal and not-equal comparisons, logical AND, logical OR, logical XOR, logical NAND, logical NOR, conversions between integer and floating point, and combinations thereof.
272. The configurable circuit or system of any of preceding claims 227 to 269, wherein the configurable computing circuit comprises multiply and shift operation circuitry for performing at least one integer operation selected from the group consisting of: multiplication, shifting, passing inputs, signed and unsigned multiplication, signed and unsigned right shift, signed and unsigned left shift, bit order reversal, permutation, conversion between integer and floating point, and combinations thereof.
273. The configurable circuit or system of any of preceding claims 227 to 270, wherein the configurable computing circuit comprises multiply and shift operation circuitry for performing at least one floating-point operation selected from the group consisting of: multiplication, shifting, passing inputs, signed and unsigned multiplication, signed and unsigned right shift, signed and unsigned left shift, bit order reversal, permutation, conversion between integer and floating point, and combinations thereof.
274. The configurable circuit or system of any of preceding claims 227-271, wherein the array of the plurality of configurable circuits is further coupled to a first interconnection network.
275. The configurable circuit or system of any of preceding claims 227-272, wherein the array of the plurality of configurable circuits further comprises:
a third system memory interface circuit; and
a scheduling interface circuit.
276. The configurable circuit or system of any of preceding claims 227 to 273, wherein the scheduling interface circuit is to receive a work descriptor packet over the first interconnection network and, in response to the work descriptor packet, generate one or more data and control packets to the plurality of configurable circuits to configure the plurality of configurable circuits to perform selected calculations.
277. The configurable circuit or system of any one of the preceding claims 227-274, further comprising:
a flow control circuit coupled to the asynchronous network output queue, the flow control circuit to generate a stop signal when a predetermined threshold is reached in the asynchronous network output queue.
278. The configurable circuit or system of any one of preceding claims 227 to 275, wherein each asynchronous network output queue stops outputting data packets on the asynchronous packet network in response to the stop signal.
279. The configurable circuit or system of any one of preceding claims 227 to 276, wherein each configurable computing circuit stops executing after completion of its current instruction in response to the stop signal.
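As an illustrative sketch (not part of the claims), the flow control of claims 277 through 279 can be modeled as an output queue that asserts a stop signal at a predetermined occupancy threshold, after which producers finish their current instruction and halt; the class and names are hypothetical.

```python
# Hypothetical sketch of output-queue flow control: when queue occupancy
# reaches a predetermined threshold, a stop signal is asserted; producers
# then stop emitting packets after completing their current instruction.

from collections import deque

class OutputQueue:
    def __init__(self, threshold):
        self.q = deque()
        self.threshold = threshold

    @property
    def stop(self):
        """Stop signal: asserted once occupancy reaches the threshold."""
        return len(self.q) >= self.threshold

    def push(self, packet):
        self.q.append(packet)

q = OutputQueue(threshold=2)
q.push("pkt0")
assert not q.stop
q.push("pkt1")
assert q.stop   # producers halt after their current instruction
```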
280. The configurable circuit or system of any of preceding claims 227-277, wherein a first plurality of configurable circuits in the array of a plurality of configurable circuits is coupled in a first predetermined order through the synchronization network to form a first synchronization domain; and wherein a second plurality of configurable circuits in the array of configurable circuits is coupled in a second predetermined order through the synchronization network to form a second synchronization domain.
281. The configurable circuit or system of any of preceding claims 227-278, wherein the first synchronization domain is to generate a continue message for transmission over the asynchronous packet network to the second synchronization domain.
282. The configurable circuit or system of any of preceding claims 227 to 279, wherein the second synchronization domain is to generate a completion message for transmission over the asynchronous packet network to the first synchronization domain.
283. The configurable circuit or system of any one of preceding claims 227 to 280, wherein the plurality of control registers store a completion table having a first data completion count.
284. The configurable circuit or system of any one of preceding claims 227-281, wherein the plurality of control registers further store the completion table with a second iteration count.
285. The configurable circuit or system of any one of the preceding claims 227 to 282, wherein the plurality of control registers further store a loop table having a plurality of thread identifiers and, for each thread identifier, a next thread identifier for execution after a current thread is executed.
286. The configurable circuit or system of any one of the preceding claims 227 to 283, wherein the plurality of control registers further store an identification of a first iteration and an identification of a last iteration in the loop table.
287. The configurable circuit or system of any one of preceding claims 227 to 283, wherein the control circuitry is to queue a thread for execution when, for the thread identifier of the thread, a completion count is decremented to zero and its thread identifier is the next thread.
288. The configurable circuit or system of any one of preceding claims 227 to 284, wherein the control circuitry is to queue a thread for execution when a completion count for the thread indicates completion of any data dependencies for the thread identifier of the thread.
289. The configurable circuit or system of any one of preceding claims 227 to 285, wherein the completion count indicates a predetermined number of completion messages received for each selected thread of a plurality of threads before execution of the selected thread.
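As an illustrative sketch (not part of the claims), the completion-count gating of claims 287 through 289 can be modeled as a table that counts outstanding data dependencies per thread identifier and queues a thread once its count reaches zero; the class and names are hypothetical.

```python
# Hypothetical sketch of completion-count thread gating: each thread
# identifier starts with a count of outstanding data dependencies; each
# completion message decrements it, and the thread is queued for
# execution when the count reaches zero.

class CompletionTable:
    def __init__(self):
        self.counts = {}
        self.ready = []

    def init_thread(self, tid, dependency_count):
        self.counts[tid] = dependency_count

    def completion_message(self, tid):
        self.counts[tid] -= 1
        if self.counts[tid] == 0:
            self.ready.append(tid)   # queue the thread for execution

table = CompletionTable()
table.init_thread(tid=5, dependency_count=2)
table.completion_message(5)
assert table.ready == []       # one dependency still outstanding
table.completion_message(5)
assert table.ready == [5]      # all dependencies complete
```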
290. The configurable circuit or system of any one of the preceding claims 227 to 286, wherein the plurality of control registers further store a completion table having a plurality of types of thread identifiers, wherein each type of thread identifier indicates a loop level for loop and nested loop execution.
291. The configurable circuit or system of any one of the preceding claims 227 to 287, wherein the plurality of control registers further store a completion table having a loop count for the number of active loop threads, and wherein in response to receiving an asynchronous fabric message that returns a thread identifier to a thread identifier pool, the control circuitry decrements the loop count and transmits an asynchronous fabric completion message when the loop count reaches zero.
292. The configurable circuit or system of any one of the preceding claims 227 to 288, wherein the plurality of control registers further store a top of a stack of thread identifiers to allow each type of thread identifier to access a private variable for a selected loop.
293. The configurable circuit or system of any of preceding claims 227-289, wherein the control circuit further comprises:
a continuation queue; and
a re-entry queue.
294. The configurable circuit or system of any one of the preceding claims 227 to 290, wherein the continuation queue stores one or more thread identifiers for computing threads that have completion counts allowed for execution but do not yet have an assigned thread identifier.
295. The configurable circuit or system of any one of the preceding claims 227 to 291, wherein the re-entry queue stores one or more thread identifiers for computing threads having completion counts allowed for execution and having an assigned thread identifier.
296. The configurable circuit or system of any one of preceding claims 227 to 292, wherein any thread in the re-entry queue having a thread identifier is executed before any thread in the continuation queue having a thread identifier is executed.
297. The configurable circuit or system of any one of the preceding claims 227-293, wherein the control circuit further comprises:
a priority queue, wherein any thread in the priority queue having a thread identifier executes before any thread in the continuation queue or the re-entry queue having a thread identifier executes.
298. The configurable circuit or system of any of preceding claims 227-294, wherein the control circuit further comprises:
a run queue, wherein any thread in the run queue having a thread identifier executes after the spoke count at which its thread identifier occurs.
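As an illustrative sketch (not part of the claims), the queue ordering of claims 296 through 298 — priority queue first, then re-entry queue, then continuation queue — can be modeled as draining the highest-priority non-empty queue; all names are hypothetical.

```python
# Hypothetical sketch of the thread-queue priority ordering: the priority
# queue is served first, then the re-entry queue, then the continuation
# queue.

from collections import deque

def pick_next_thread(priority_q, reenter_q, continue_q):
    """Dequeue a thread identifier from the highest-priority non-empty queue."""
    for q in (priority_q, reenter_q, continue_q):
        if q:
            return q.popleft()
    return None

pq, rq, cq = deque(), deque([11]), deque([7])
assert pick_next_thread(pq, rq, cq) == 11  # re-entry before continuation
assert pick_next_thread(pq, rq, cq) == 7
assert pick_next_thread(pq, rq, cq) is None
```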
299. The configurable circuit or system of any of preceding claims 227-295, wherein the second configuration memory circuit comprises:
a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuit; and
a second instruction and instruction index memory storing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of the plurality of synchronization network inputs.
300. The configurable circuit or system of any of the preceding claims 227 to 296, wherein the control circuitry is to self-schedule a computing thread for execution.
301. The configurable circuit or system of any of preceding claims 227-297, wherein the condition logic circuitry is to branch to a different second next instruction for execution by a next configurable circuit.
302. The configurable circuit or system of any one of preceding claims 227 to 298, wherein the control circuitry is to order computational threads for execution.
303. The configurable circuit or system of any one of the preceding claims 227 to 299, wherein the control circuit is to order loop computation threads for execution.
304. The configurable circuit or system of any one of preceding claims 227 to 300, wherein the control circuitry is to begin executing a computational thread in response to one or more completion signals from data dependencies.
305. A method of configuring a configurable circuit, comprising:
providing, using a first instruction memory, a plurality of datapath configuration instructions to configure datapaths of a configurable computing circuit; and
providing, using a second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of a plurality of synchronization network inputs.
306. A method of configuring a configurable circuit, comprising:
providing, using a first instruction memory, a plurality of datapath configuration instructions to configure datapaths of a configurable computing circuit; and
providing a plurality of spoke instructions and datapath configuration instruction indices to select a current datapath configuration instruction of the configurable computing circuit using a second instruction and instruction index memory.
307. A method of configuring a configurable circuit, comprising:
providing, using a first instruction memory, a plurality of datapath configuration instructions to configure datapaths of a configurable computing circuit; and
providing, using a second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to select a next datapath configuration instruction of a next configurable computing circuit.
308. A method of controlling thread execution of a multi-threaded configurable circuit, the configurable circuit having configurable computing circuitry, the method comprising:
using conditional logic circuitry, providing a conditional branch by modifying the next datapath instruction or a next datapath instruction index provided to a next configurable circuit in dependence upon an output from the configurable computing circuit.
309. A method of controlling thread execution of a multi-threaded configurable circuit, comprising:
generating, using a flow control circuit, a stop signal when a predetermined threshold is reached in an asynchronous network output queue.
310. A method of controlling thread execution of a multi-threaded configurable circuit, comprising:
storing, using a plurality of control registers, a loop table having a plurality of thread identifiers and, for each thread identifier, a next thread identifier for execution after execution of a current thread, to provide in-order thread execution.
311. A method of controlling thread execution of a multi-threaded configurable circuit, comprising:
storing, using a plurality of control registers, a completion table having a first data completion count; and
queuing, using thread control circuitry, a thread for execution when a completion count for the thread identifier of the thread is decremented to zero.
312. A method of configuring and controlling thread execution of multithreaded configurable circuitry having configurable computing circuitry, the method comprising:
providing, using a first instruction memory, a plurality of configuration instructions to configure a data path of the configurable computing circuit;
providing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of a plurality of synchronization network inputs, select a current datapath configuration instruction of the configurable computing circuit, and select a next datapath instruction or a next datapath instruction index of a next configurable computing circuit using a second instruction and instruction index memory;
providing, using a plurality of control registers, a completion table having a first data completion count; and
queuing, using thread control circuitry, a thread for execution when a completion count for the thread identifier of the thread is decremented to zero.
313. A method of configuring and controlling thread execution of multithreaded configurable circuitry having configurable computing circuitry, the method comprising:
providing, using a first instruction memory, a plurality of configuration instructions to configure a data path of the configurable computing circuit;
providing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of a plurality of synchronization network inputs, select a current datapath configuration instruction of the configurable computing circuit, and select a next datapath instruction or a next datapath instruction index of a next configurable computing circuit using a second instruction and instruction index memory;
providing, using a plurality of control registers, a completion table having a first data completion count; and
queuing, using thread control circuitry, a thread for execution when, for the thread identifier of the thread, its completion count is decremented to zero and its thread identifier is the next thread.
314. A method of controlling thread execution of a multi-threaded configurable circuit, comprising:
storing, using a plurality of control registers, a completion table having a plurality of types of thread identifiers, wherein each type of thread identifier indicates a loop level for loop and nested loop execution, and wherein the plurality of control registers further store a top of a stack of thread identifiers; and
allowing each type of thread identifier to access a private variable for a selected loop.
315. A method of controlling thread execution of a multi-threaded configurable circuit, comprising:
storing, using a plurality of control registers, a completion table having a data completion count;
providing, using thread control circuitry, a continuation queue storing one or more thread identifiers for computing threads having completion counts allowed for execution but not yet having an assigned thread identifier; and
providing, using the thread control circuitry, a re-entry queue storing one or more thread identifiers for computing threads having completion counts allowed for execution and having assigned thread identifiers, such that the threads in the re-entry queue execute after a specified spoke count.
316. A method of controlling thread execution of a multi-threaded configurable circuit, comprising:
storing, using a plurality of control registers, a pool of thread identifiers and a completion table having a loop count of the number of active loop threads; and
using thread control circuitry, decrementing the loop count in response to receiving an asynchronous fabric message that returns a thread identifier to the thread identifier pool, and transmitting an asynchronous fabric completion message when the loop count reaches zero.
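As an illustrative sketch (not part of the claims), the loop-count tracking of claims 291 and 316 can be modeled as a counter of active loop threads that is decremented each time a thread identifier is returned to the pool, with a completion message emitted at zero; the class and names are hypothetical.

```python
# Hypothetical sketch of loop-count tracking: returning a thread
# identifier to the pool decrements the count of active loop threads; a
# completion message is emitted when the count reaches zero.

class LoopTracker:
    def __init__(self, active_threads):
        self.loop_count = active_threads
        self.tid_pool = []
        self.completion_sent = False

    def return_tid(self, tid):
        self.tid_pool.append(tid)
        self.loop_count -= 1
        if self.loop_count == 0:
            self.completion_sent = True  # asynchronous fabric completion message

t = LoopTracker(active_threads=2)
t.return_tid(0)
assert not t.completion_sent       # one loop thread still active
t.return_tid(1)
assert t.completion_sent and t.tid_pool == [0, 1]
```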
317. The method of any of the preceding claims 302-313, further comprising:
providing, using the second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to select a current datapath configuration instruction of the configurable computing circuit.
318. The method of any of the preceding claims 302-314, further comprising:
providing, using the second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to select a next datapath configuration instruction for a next configurable computational circuit.
319. The method of any preceding claim 302-315, further comprising:
providing a plurality of spoke instructions and a data path configuration instruction index to select a synchronized network output of the plurality of synchronized network outputs using the second instruction and instruction index memory.
320. The method of any of the preceding claims 302-316, further comprising:
providing, using a configuration memory multiplexer, a first selection setting to select the current datapath configuration instruction using an instruction index from the second instruction and instruction index memory.
321. The method of any one of the preceding claims 302-317, further comprising:
providing, using the configuration memory multiplexer, a second selection setting to select the current datapath configuration instruction using an instruction index from the master synchronization input, the second selection setting being different from the first selection setting.
322. The method of any one of the preceding claims 302-318, further comprising:
providing a plurality of spoke instructions and a data path configuration instruction index to configure a portion of the configurable circuit independent of the current data path instruction using the second instruction and instruction index memory.
323. The method of any of the preceding claims 302-319, further comprising:
selecting, using a configuration memory multiplexer, a spoke instruction and a data path configuration instruction index of the plurality of spoke instruction and data path configuration instruction indices according to a modular spoke count.
324. The method of any of the preceding claims 302-320, further comprising:
modifying the next datapath instruction or next datapath instruction index using conditional logic circuitry and in dependence upon output from the configurable computation circuitry.
325. The method of any of the preceding claims 302-321, further comprising:
providing a conditional branch by modifying the next datapath instruction or next datapath instruction index using conditional logic circuitry and in dependence upon output from the configurable computation circuitry.
326. The method of any of the preceding claims 302-322, further comprising:
enabling a conditional logic circuit; and
using the conditional logic circuit and in dependence upon an output from the configurable computing circuit, specifying the next datapath instruction or datapath instruction index by ORing the least significant bits of the next datapath instruction with the output from the configurable computing circuit, thereby providing a conditional branch.
327. The method of any preceding claim 302-323, further comprising:
selecting the master synchronization input using an input multiplexer.
328. The method of any of the preceding claims 302-324, further comprising:
selecting an output from the configurable computing circuit using an output multiplexer.
329. The method of any of the preceding claims 302-325, further comprising:
decoding, using an asynchronous fabric state machine coupled to an asynchronous network input queue and an asynchronous network output queue, input data packets received from the asynchronous packet network, and assembling output data packets for transmission over the asynchronous packet network.
330. The method of any of the preceding claims 302-326, further comprising:
providing a plurality of direct point-to-point connections coupling adjacent configurable circuits in the array of the plurality of configurable circuits using the synchronization network.
331. The method of any of the preceding claims 302-327, further comprising:
providing, using the configurable circuit, a direct path connection between a plurality of input registers and a plurality of output registers.
332. The method of any of the preceding claims 302-328, wherein the direct path connection provides a direct point-to-point connection for data transfer from a second configurable circuit received over the synchronous network to a third configurable circuit transmitted over the synchronous network.
333. The method of any one of the preceding claims 302-329, further comprising:
using the configurable computing circuit, performing at least one integer or floating-point operation selected from the group consisting of: signed and unsigned addition, absolute value, negation, logical NOT, add-and-negate, subtraction A-B, reverse subtraction B-A, signed and unsigned greater-than-or-equal, signed and unsigned less-than-or-equal, equal and not-equal comparisons, logical AND, logical OR, logical XOR, logical NAND, logical NOR, conversions between integer and floating point, and combinations thereof.
334. The method of any of the preceding claims 302-330, further comprising:
using the configurable computing circuitry, performing at least one integer or floating point operation selected from the group consisting of: multiplication, shifting, passing inputs, signed and unsigned multiplication, signed and unsigned right shifting, signed and unsigned left shifting, bit order reversal, permutation, conversion between integer and floating point, and combinations thereof.
335. The method of any of the preceding claims 302-331, further comprising:
using a scheduling interface circuit, receiving a work descriptor packet over the first interconnection network, and in response to the work descriptor packet, generating one or more data and control packets to the plurality of configurable circuits to configure the plurality of configurable circuits to perform selected calculations.
336. The method of any of the preceding claims 302-332, further comprising:
generating, using a flow control circuit, a stop signal when a predetermined threshold is reached in the asynchronous network output queue.
337. The method of any one of the preceding claims 302-333, wherein each asynchronous network output queue stops outputting data packets on the asynchronous packet network in response to the stop signal.
338. The method of any of the preceding claims 302-334, wherein each configurable computing circuit, in response to the stop signal, stops executing upon completion of its current instruction.
339. The method of any of the preceding claims 302-335, further comprising:
coupling a first plurality of configurable circuits of the array of configurable circuits in a first predetermined order through the synchronization network to form a first synchronization domain; and
coupling a second plurality of configurable circuits of the array of configurable circuits in a second predetermined order through the synchronization network to form a second synchronization domain.
340. The method of any of the preceding claims 302-336, further comprising:
generating a continuation message from the first synchronous domain to the second synchronous domain for transmission over the asynchronous packet network.
341. The method of any of the preceding claims 302-337, further comprising:
generating a completion message from the second synchronous domain to the first synchronous domain for transmission over the asynchronous packet network.
342. The method of any of the preceding claims 302-338, further comprising:
storing, in the plurality of control registers, a completion table having a first data completion count.
343. The method of any one of the preceding claims 302-339, further comprising:
storing the completion table with a second iteration count in the plurality of control registers.
344. The method of any of the preceding claims 302-340, further comprising:
storing, in the plurality of control registers, a loop table having a plurality of thread identifiers and, for each thread identifier, a next thread identifier for execution after execution of a current thread.
345. The method of any of the preceding claims 302-341, further comprising:
storing an identification of a first iteration and an identification of a last iteration in the loop table in the plurality of control registers.
346. The method of any of the preceding claims 302-342, further comprising:
using the control circuitry, queuing a thread for execution when a completion count for the thread identifier of the thread is decremented to zero.
347. The method of any preceding claim 302-343, further comprising:
using the control circuitry, queuing a thread for execution when the completion count for the thread identifier of the thread is decremented to zero and its thread identifier is the next thread identifier.
348. The method of any of the preceding claims 302-344, further comprising:
using the control circuitry, queuing a thread for execution when a completion count for the thread identifier of the thread indicates completion of any data dependencies for the thread.
349. The method of any preceding claim 302-345, wherein the completion count indicates, for each selected thread of a plurality of threads, a predetermined number of completion messages received before execution of the selected thread.
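The completion-table mechanism of claims 342-349 can be modeled minimally (data structures and thread names here are illustrative assumptions, not the claimed registers): each thread identifier carries a count of expected completion messages, and the thread is queued for execution when its count is decremented to zero.

```python
from collections import deque

# Illustrative completion-count model; a sketch, not the claimed circuitry.
completion_table = {"t0": 2, "t1": 1}   # thread id -> pending completion messages
run_queue = deque()

def receive_completion(tid):
    # Decrement the completion count for this thread identifier; when it
    # reaches zero, all data dependencies are satisfied and the thread
    # is queued for execution.
    completion_table[tid] -= 1
    if completion_table[tid] == 0:
        run_queue.append(tid)

receive_completion("t0")
receive_completion("t1")
receive_completion("t0")
assert list(run_queue) == ["t1", "t0"]
```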
350. The method of any of the preceding claims 302-346, further comprising:
a completion table having a plurality of types of thread identifiers is stored in the plurality of control registers, wherein each type of thread identifier indicates a loop level for loop and nested loop execution.
351. The method of any one of the preceding claims 302-347, further comprising:
storing, in the plurality of control registers, a completion table having a loop count of the number of active loop threads, and wherein, in response to receiving an asynchronous fabric message that returns a thread identifier to a thread identifier pool, the loop count is decremented using the control circuitry, and an asynchronous fabric completion message is transmitted when the loop count reaches zero.
352. The method of any of the preceding claims 302-348, further comprising:
storing a top of a thread identifier stack in the plurality of control registers to allow each type of thread identifier to access a private variable for a selected loop.
353. The method of any one of the preceding claims 302-349, further comprising:
using a continue queue, storing one or more thread identifiers for computing threads having completion counts allowing execution but not yet having an assigned thread identifier.
354. The method of any of the preceding claims 302-350, further comprising:
using a re-entry queue, storing one or more thread identifiers for computing threads having completion counts allowing execution and having assigned thread identifiers.
355. The method of any of the preceding claims 302-351, further comprising:
executing any threads having thread identifiers in the re-entry queue before executing any threads having thread identifiers in the continue queue.
356. The method of any of the preceding claims 302-352, further comprising:
executing any threads having a thread identifier in a priority queue before executing any threads having a thread identifier in the continue queue or the re-entry queue.
357. The method of any of the preceding claims 302-354, further comprising:
executing any thread in the run queue after the spoke count for the thread identifier occurs.
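The queue ordering of claims 355-356 amounts to a fixed selection priority: the priority queue is drained first, then the re-entry queue, then the continue queue. A sketch (queue and thread names are hypothetical):

```python
from collections import deque

# Illustrative three-level selection order; a model, not the claimed circuit.
priority_q, reentry_q, continue_q = deque(), deque(), deque()

def select_next_thread():
    # Consult queues in descending priority; return the first ready thread.
    for q in (priority_q, reentry_q, continue_q):
        if q:
            return q.popleft()
    return None

continue_q.append("tc")
reentry_q.append("tr")
priority_q.append("tp")
assert select_next_thread() == "tp"   # priority queue first
assert select_next_thread() == "tr"   # then re-entry queue
assert select_next_thread() == "tc"   # then continue queue
```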
358. The method of any of the preceding claims 302-354, further comprising:
using the control circuitry, self-scheduling computing threads for execution.
359. The method of any one of the preceding claims 302-355, further comprising:
using the conditional logic circuit, branching to a different second next instruction for execution by a next configurable circuit.
360. The method of any one of the preceding claims 302-356, further comprising:
using the control circuitry, ordering the computing threads for execution.
361. The method of any one of the preceding claims 302-357, further comprising:
using the control circuitry, ordering loop computing threads for execution.
362. The method of any of the preceding claims 302-358, further comprising:
using the control circuitry, commencing execution of a computing thread in response to one or more completion signals from data dependencies.
363. A processor, comprising:
a processor core to execute the received instructions; and
core control circuitry coupled to the processor core, the core control circuitry to automatically schedule instructions for execution by the processor core in response to received work descriptor data packets.
364. A processor, comprising:
a processor core to execute the received instructions; and
core control circuitry coupled to the processor core, the core control circuitry to automatically schedule instructions for execution by the processor core in response to received event data packets.
365. A processor, comprising:
a processor core to execute a fiber creation instruction; and
core control circuitry coupled to the processor core, the core control circuitry to automatically schedule the fiber creation instruction for execution by the processor core and to generate one or more work descriptor data packets destined for another processor or hybrid thread fabric circuitry to execute a corresponding plurality of execution threads.
366. A processor, comprising:
a processor core to execute a fiber creation instruction; and
core control circuitry coupled to the processor core, the core control circuitry to schedule the fiber creation instruction for execution by the processor core, reserve a predetermined amount of memory space in a thread control memory to store return arguments, and generate one or more work descriptor data packets destined for another processor or hybrid thread fabric circuitry to execute a corresponding plurality of execution threads.
367. A processor, comprising:
a core control circuit, comprising:
an interconnection network interface;
a thread control memory coupled to the interconnect network interface;
an execution queue coupled to the thread control memory;
control logic and thread selection circuitry coupled to the execution queue, the thread control memory; and
an instruction cache coupled to the control logic and thread selection circuitry; and
a processor core coupled to the instruction cache of the core control circuitry.
368. A processor, comprising:
a core control circuit, comprising:
an interconnection network interface;
a thread control memory coupled to the interconnect network interface;
a network response memory;
an execution queue coupled to the thread control memory;
control logic and thread selection circuitry coupled to the execution queue, the thread control memory;
an instruction cache coupled to the control logic and thread selection circuitry; and
a command queue; and
a processor core coupled to the instruction cache and the command queue of the core control circuitry.
369. A processor comprising a processor core and core control circuitry coupled to the processor core, the core control circuitry comprising:
an interconnection network interface coupleable to an interconnection network, the interconnection network interface to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments;
an execution queue coupled to the thread control memory; and
control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, and periodically select the thread identifier to execute the execution thread.
370. A processor comprising a processor core and core control circuitry coupled to the processor core, the core control circuitry comprising:
an interconnection network interface coupleable to an interconnection network, the interconnection network interface to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments;
an execution queue coupled to the thread control memory; and
control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, and periodically select the thread identifier for execution of instructions of the execution thread by a processor core.
371. A processor comprising a processor core and core control circuitry coupled to the processor core, the core control circuitry comprising:
an execution queue coupled to the thread control memory; and
control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, and periodically select the thread identifier for execution of instructions of the execution thread by the processor core.
372. A processor comprising a processor core and core control circuitry coupled to the processor core, the core control circuitry comprising:
a thread control memory comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument;
an execution queue coupled to the thread control memory; and
control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, and periodically select the thread identifier for execution by the processor core of instructions of the execution thread, the processor core using data stored in the data cache or general purpose register.
373. A processor comprising a processor core and core control circuitry coupled to the processor core, the core control circuitry comprising:
a thread control memory comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a program count register to store a received program count, and a thread status register to store an active status or a suspended status for each thread identifier of the plurality of thread identifiers;
an execution queue coupled to the thread control memory; and
control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue when the thread identifier has an active state, and periodically select the thread identifier for execution by the processor core of instructions of the execution thread for the duration that the active state remains unchanged, until the execution thread is completed.
374. A processor comprising a processor core and core control circuitry coupled to the processor core, the core control circuitry comprising:
a thread control memory comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a program count register to store a received program count, and a thread status register to store an active status or a suspended status for each thread identifier of the plurality of thread identifiers;
an execution queue coupled to the thread control memory; and
control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue when the thread identifier has an active state, periodically select the thread identifier for the processor core to execute instructions of the execution thread for the duration that the active state remains unchanged, and suspend thread execution by not returning the thread identifier to the execution queue when the thread identifier has a suspended state.
375. A processor comprising a processor core and core control circuitry coupled to the processor core, the core control circuitry comprising:
a thread control memory comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument;
an execution queue coupled to the thread control memory; and
control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, and periodically select the thread identifier for execution of instructions of the execution thread by the processor core.
376. A processor, comprising:
a processor core to execute a plurality of instructions; and
core control circuitry coupled to the processor core, the core control circuitry comprising:
an interconnection network interface coupleable to an interconnection network, the interconnection network interface to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments;
a thread control memory coupled to the interconnect network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument;
an execution queue coupled to the thread control memory;
control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; and
an instruction cache coupled to the processor core and the control logic and thread selection circuitry to receive the initial program count and provide a corresponding one of the plurality of instructions to the processor core for execution.
377. A processor, comprising:
a core control circuit, comprising:
an interconnection network interface coupleable to an interconnection network, the interconnection network interface to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments;
a thread control memory coupled to the interconnect network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument;
an execution queue coupled to the thread control memory;
control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, periodically select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread;
an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution; and
a processor core coupled to the instruction cache of the core control circuitry, the processor core to execute the corresponding instruction.
378. A processor, comprising:
a core control circuit, comprising:
an interconnection network interface coupleable to an interconnection network, the interconnection network interface to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments;
a thread control memory coupled to the interconnect network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, and a general purpose register to store a received argument;
an execution queue coupled to the thread control memory;
control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread;
an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution; and
a command queue; and
a processor core coupled to the instruction cache and the command queue of the core control circuitry, the processor core to execute the corresponding instruction.
379. A processor, comprising:
a core control circuit, comprising:
an interconnection network interface coupleable to an interconnection network, the interconnection network interface to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments;
a thread control memory coupled to the interconnect network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, and a general purpose register to store a received argument;
an execution queue coupled to the thread control memory;
control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; and
an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution; and
a processor core coupled to the instruction cache of the core control circuitry, the processor core to execute the corresponding instruction.
380. A processor, comprising:
a core control circuit, comprising:
an interconnection network interface coupleable to an interconnection network, the interconnection network interface to receive a call work descriptor packet, decode the received work descriptor packet into an execution thread having an initial program count and any received arguments, and encode a work descriptor packet for transmission to other processing elements;
a thread control memory coupled to the interconnect network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, and a general purpose register to store a received argument;
an execution queue coupled to the thread control memory;
a network response memory coupled to the interconnect network interface;
control logic and thread selection circuitry coupled to the execution queue, the thread control memory, and the instruction cache, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread;
an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution; and
a command queue storing one or more commands to generate one or more work descriptor packets; and
a processor core coupled to the instruction cache and the command queue of the core control circuitry, the processor core to execute the corresponding instruction.
381. The processor of any one of the preceding claims 360-377, wherein the core control circuitry comprises:
an interconnection network interface coupleable to an interconnection network, the interconnection network interface to receive a work descriptor packet, decode the received work descriptor packet into an execution thread having an initial program count and any received arguments.
382. The processor of any of the preceding claims 360 to 378, wherein the interconnection network interface is further to receive an event data packet, decode the received event data packet into an event identifier and any received arguments.
383. The processor of any of the preceding claims 360-379, wherein the core control circuitry further comprises:
control logic and thread selection circuitry coupled to the interconnect network interface, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread.
384. The processor of any one of the preceding claims 360-380, wherein the core control circuitry further comprises:
a thread control memory having a plurality of registers, the plurality of registers comprising:
a thread identifier pool register to store a plurality of thread identifiers.
385. The processor of any one of preceding claims 360-381, wherein the thread control memory further comprises:
a thread status register.
386. The processor of any one of the preceding claims 360-382, wherein the thread control memory further comprises:
a program count register to store the received initial program count.
387. The processor of any of preceding claims 360-383, wherein the thread control memory further comprises:
a general register to store the received argument.
388. The processor of any one of the preceding claims 360-384, wherein the thread control memory further comprises:
a pending fiber return count register.
389. The processor of any of the preceding claims 360-385, wherein the thread control memory further comprises:
a return argument buffer or register.
390. The processor of any one of the preceding claims 360-386, wherein the thread control memory further comprises:
a return argument linked list register.
391. The processor of any of the preceding claims 360-387, wherein the thread control memory further comprises:
a custom atomic transaction identifier register.
392. The processor of any one of the preceding claims 360-388, wherein the thread control memory further comprises:
a data cache.
393. The processor of any of preceding claims 360-389, wherein the interconnection network interface is further to store the execution thread having the initial program count and any received arguments in the thread control memory using a thread identifier as an index to the thread control memory.
394. The processor of any one of the preceding claims 360-390, wherein the core control circuitry further comprises:
control logic and thread selection circuitry coupled to the thread control memory and the interconnect network interface, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread.
395. The processor of any one of the preceding claims 360-391, wherein the core control circuitry further comprises:
an execution queue coupled to the thread control memory, the execution queue storing one or more thread identifiers.
396. The processor of any of the preceding claims 360-392, wherein the core control circuitry further comprises:
control logic and thread selection circuitry coupled to the execution queue, the interconnect network interface, and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, and access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread.
397. The processor of any one of the preceding claims 360-393, wherein the core control circuit further comprises:
an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide a corresponding instruction for execution.
398. The processor of any one of the preceding claims 360-394, wherein the processor further comprises:
a processor core coupled to the instruction cache of the core control circuitry, the processor core to execute the corresponding instruction.
399. The processor of any of the preceding claims 360-395, wherein the core control circuitry is further to assign an initial active state to the execution thread.
400. The processor of any one of the preceding claims 360-396, wherein the core control circuitry is further to assign a suspended state to the execution thread in response to the processor core executing a memory load instruction.
401. The processor of any one of the preceding claims 360-397, wherein the core control circuitry is further to assign a suspended state to the execution thread in response to the processor core executing a memory store instruction.
402. The processor of any one of the preceding claims 360-398, wherein the core control circuitry is further to end execution of a selected thread in response to the processor core executing a return instruction.
403. The processor of any one of the preceding claims 360-399, wherein the core control circuitry is further to return a corresponding thread identifier for the selected thread to the thread identifier pool register in response to the processor core executing a return instruction.
404. The processor of any one of the preceding claims 360-400, wherein the core control circuitry is further to clear the register of the thread control memory indexed by the corresponding thread identifier of the selected thread in response to the processor core executing a return instruction.
405. The processor of any one of the preceding claims 360-401, wherein the interconnection network interface is further to generate a return work descriptor packet in response to the processor core executing a return instruction.
406. The processor of any of the preceding claims 360-402, wherein the core control circuitry further comprises:
a network response memory.
407. The processor according to any of the preceding claims 360-403, wherein the network response memory comprises:
a memory request register.
408. The processor of any of preceding claims 360-404, wherein the network response memory comprises:
a thread identifier and a transaction identifier register.
409. The processor of any of the preceding claims 360-405, wherein the network response memory comprises:
a request cache line index register.
410. The processor of any one of the preceding claims 360-406, wherein the network response memory comprises:
a byte register.
411. The processor of any one of the preceding claims 360-407, wherein the network response memory comprises:
a general register index and a type register.
412. The processor of any one of the preceding claims 360-408, wherein the thread control memory further comprises:
an event status register.
413. The processor of any one of the preceding claims 360-409, wherein the thread control memory further comprises:
an event mask register.
414. The processor of any one of the preceding claims 360 to 410, wherein the interconnection network interface is to generate a point-to-point event data message.
415. The processor of any one of the preceding claims 360 to 411, wherein the interconnection network interface is to generate a broadcast event data message.
416. The processor of any of the preceding claims 360-412, wherein the core control circuitry is further to respond to a received event data packet based on an event mask stored in the event mask register.
417. The processor of any of the preceding claims 360-413, wherein the core control circuitry is further to determine an event number corresponding to a received event data packet.
418. The processor of any of the preceding claims 360-414 wherein the core control circuitry is further to change the state of a thread identifier from suspended to active to resume execution of a corresponding thread of execution in response to a received event data packet.
419. The processor of any of the preceding claims 360-415, wherein the core control circuitry is further to change a state of a thread identifier from suspended to active to resume execution of a corresponding thread of execution in response to an event number of a received event data packet.
420. The processor of any one of the preceding claims 360-417, wherein the control logic and thread selection circuitry are further to successively select a next thread identifier from the execution queue to execute a single instruction of a corresponding execution thread.
421. The processor of any one of the preceding claims 360 to 418, wherein the control logic and thread selection circuitry are further to perform a round-robin selection of a next thread identifier from the execution queue among the plurality of thread identifiers, each for executing a single instruction of a corresponding execution thread.
422. The processor of any one of the preceding claims 360 to 419, wherein the control logic and thread selection circuitry are further to perform a round-robin selection of a next thread identifier from the execution queue among the plurality of thread identifiers, each for executing a single instruction of a corresponding execution thread until the execution thread is completed.
423. The processor of any one of preceding claims 360-420, wherein the control logic and thread selection circuitry are further to perform barrel selection of a next thread identifier from the execution queue among the plurality of thread identifiers, each for executing a single instruction of a corresponding execution thread.
424. The processor of any one of the preceding claims 360-421, wherein the control logic and thread selection circuitry are further to assign an active state or a suspended state to a thread identifier.
425. The processor of any one of the preceding claims 360-422, wherein the control logic and thread selection circuitry are further to assign a priority status to a thread identifier.
426. The processor of any one of the preceding claims 360-423, wherein the control logic and thread selection circuitry are further to return the corresponding thread identifier having an assigned valid state and an assigned priority to the execution queue after execution of a corresponding instruction.
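As a non-limiting illustrative sketch (not part of the claims), the round-robin ("barrel") thread selection of claims 420-426 — successively selecting a next thread identifier from the execution queue, executing a single instruction of that thread, and returning the identifier to the queue while it remains active and incomplete — may be modeled as follows; the deque-based queue and the per-thread dictionary are hypothetical:

```python
from collections import deque

def barrel_schedule(execution_queue, threads, max_steps):
    """Barrel selection: each pass pops the next thread identifier (TID)
    and executes one instruction of the corresponding thread. `threads`
    maps a TID to a remaining-instruction count and a state ("active" or
    "suspended"); both fields are an illustrative data model."""
    trace = []
    steps = 0
    while execution_queue and steps < max_steps:
        tid = execution_queue.popleft()        # successively select next TID
        thread = threads[tid]
        thread["remaining"] -= 1               # execute a single instruction
        trace.append(tid)
        steps += 1
        # Return the TID to the queue only if still active and not complete.
        if thread["remaining"] > 0 and thread["state"] == "active":
            execution_queue.append(tid)
    return trace
```

Two active two-instruction threads thus interleave one instruction at a time, while a suspended thread is simply not returned to the queue.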
427. The processor of any one of the preceding claims 360-425, wherein the core control circuitry further comprises:
a network command queue coupled to the processor core.
428. The processor of any of the preceding claims 360-426, wherein the interconnection network interface comprises:
an input queue;
a packet decoder circuit coupled to the input queue, the control logic and thread selection circuit, and the thread control memory;
an output queue; and
a packet encoder circuit coupled to the output queue, the network response memory, and the network command queue.
429. The processor of any one of the preceding claims 360-427, wherein the execution queue further comprises:
a first priority queue; and
a second priority queue.
430. The processor of any one of the preceding claims 360-428, wherein the control logic and thread selection circuitry further comprises:
thread selection control circuitry coupled to the execution queue, the thread selection control circuitry to select a thread identifier from the first priority queue at a first frequency and to select a thread identifier from the second priority queue at a second frequency, the second frequency being lower than the first frequency.
431. The processor of any of the preceding claims 360-429, wherein the thread selection control circuitry is to determine the second frequency as a skip count starting with selecting a thread identifier from the first priority queue.
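A non-limiting illustrative sketch (not part of the claims) of the two-frequency selection of claims 429-431: thread identifiers are drawn from a first priority queue at a first frequency and from a second priority queue at a second, lower frequency determined as a skip count. Here the skip count means "after every `skip_count` high-priority selections, make one low-priority selection"; all names are hypothetical:

```python
from collections import deque

def select_with_skip_count(high_q, low_q, skip_count, n_selects):
    """Select TIDs from the high-priority queue at a higher frequency than
    from the low-priority queue; each selected TID is returned to its
    queue. Data model and skip-count interpretation are assumptions."""
    out = []
    since_low = 0
    for _ in range(n_selects):
        if low_q and since_low >= skip_count:
            tid = low_q.popleft()              # low-priority turn
            low_q.append(tid)
            since_low = 0
        elif high_q:
            tid = high_q.popleft()             # high-priority turn
            high_q.append(tid)
            since_low += 1
        elif low_q:
            tid = low_q.popleft()              # only low-priority work left
            low_q.append(tid)
        else:
            break
        out.append(tid)
    return out
```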
432. The processor of any one of the preceding claims 360-430, wherein the core control circuitry further comprises:
data path control circuitry to control access size through the first interconnection network.
433. The processor of any one of the preceding claims 360-431, wherein the core control circuitry further comprises:
data path control circuitry to increase or decrease a memory load access size in response to a time-averaged usage level.
434. The processor of any one of the preceding claims 360-432, wherein the core control circuitry further comprises:
data path control circuitry to increase or decrease a memory storage access size in response to a time-averaged usage level.
435. The processor of any one of the preceding claims 360-433, wherein the control logic and thread selection circuitry are further to increase a size of a memory load access request to correspond to a cache line boundary of the data cache.
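A non-limiting illustrative sketch (not part of the claims) of the data path control of claims 432-435: the memory load access size is decreased under high time-averaged usage, increased under low usage, and grown to a cache line boundary when it already spans most of a line. The thresholds, the halving/doubling policy, and the 64-byte line are assumptions for illustration:

```python
def adjust_access_size(current_size, utilization_ema, cache_line=64,
                       low=0.25, high=0.75, min_size=8):
    """Increase or decrease a memory load access size in response to a
    time-averaged (e.g., exponential moving average) usage level, rounding
    a near-line-sized access up to the cache line boundary."""
    if utilization_ema > high and current_size > min_size:
        current_size //= 2                 # back off under congestion
    elif utilization_ema < low and current_size < cache_line:
        current_size *= 2                  # grow when the path is idle
    # Align an access spanning most of a line to the full line boundary.
    if current_size > cache_line // 2:
        current_size = cache_line
    return current_size
```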
436. The processor of any one of the preceding claims 360-434, wherein the core control circuitry further comprises:
system call circuitry to generate one or more system calls to a host processor.
437. The processor of any of the preceding claims 360-435, wherein the system call circuitry further comprises:
a plurality of system call credit registers storing a predetermined credit count to modulate the number of system calls in any predetermined time period.
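A non-limiting illustrative sketch (not part of the claims) of the credit-based modulation of claims 436-437: a predetermined credit count bounds the number of host system calls in any period; a call consumes a credit, an exhausted pool defers the call, and a period tick refills credits and releases deferred calls. The class, method names, and refill policy are assumptions:

```python
import collections

class SystemCallThrottle:
    """Modulate the number of system calls to a host processor within any
    predetermined time period using a predetermined credit count."""
    def __init__(self, credits_per_period):
        self.credits_per_period = credits_per_period
        self.credits = credits_per_period
        self.deferred = collections.deque()

    def try_system_call(self, call):
        if self.credits > 0:
            self.credits -= 1
            return True                    # call may be issued now
        self.deferred.append(call)         # hold until credits are refilled
        return False

    def on_period_tick(self):
        """Refill credits and release as many deferred calls as allowed."""
        self.credits = self.credits_per_period
        issued = []
        while self.deferred and self.credits > 0:
            self.credits -= 1
            issued.append(self.deferred.popleft())
        return issued
```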
438. The processor of any of the preceding claims 360-436, wherein, in response to a request from a host processor to monitor thread status, the core control circuitry is further to generate a command to cause the command queue of the interconnection network interface to copy and transmit all data corresponding to a selected thread identifier from the thread control memory.
439. The processor of any one of the preceding claims 360-437, wherein the processor core is to execute a fiber create instruction to generate one or more commands that are to cause the command queue of the interconnection network interface to generate one or more call work descriptor packets destined for another processor core or a hybrid thread fabric circuit.
440. The processor of any one of the preceding claims 360-438, wherein in response to the processor core executing a fiber create instruction, the core control circuitry is to reserve a predetermined amount of memory space in the general purpose register or a return argument register.
441. The processor of any one of the preceding claims 360-439, wherein in response to generating one or more call work descriptor packets destined for another processor core or a hybrid thread fabric circuit, the core control circuitry is to store a thread return count in the thread return register.
442. The processor of any one of the preceding claims 360-440, wherein in response to receiving a return data packet, the core control circuitry is to decrement the thread return count stored in the thread return register.
443. The processor of any one of the preceding claims 360 to 441, wherein in response to the thread return count in the thread return register decrementing to zero, the core control circuitry is to change the suspended state of the corresponding thread identifier to an active state for subsequent execution of a thread return instruction to complete a created fiber or thread.
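A non-limiting illustrative sketch (not part of the claims) of the join mechanism of claims 441-443: a thread return count tracks outstanding created fibers; each received return data packet decrements it, and at zero the creating thread is changed from suspended to active so a thread return instruction can complete it. The dictionary-based thread state is a hypothetical data model:

```python
def on_return_packet(thread_state):
    """Decrement the thread return count stored for a fiber-creating
    thread; when the count reaches zero, change the thread's state from
    suspended to active for subsequent execution of a thread return
    instruction."""
    thread_state["return_count"] -= 1
    if thread_state["return_count"] == 0 and thread_state["state"] == "suspended":
        thread_state["state"] = "active"   # resume for thread return
    return thread_state
```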
444. The processor of any one of the preceding claims 360-442, wherein the processor core is to execute a wait or a non-wait fiber join instruction.
445. The processor of any one of the preceding claims 360-443, wherein the processor core is to execute a fiber join all instruction.
446. The processor of any one of the preceding claims 360-444, wherein the processor core is to execute a non-cache read or load instruction to specify a general purpose register to store data received from memory.
447. The processor of any one of the preceding claims 360-445, wherein the processor core is to execute a non-cache write or store instruction to specify data in a general purpose register for storage in memory.
448. The processor of any one of preceding claims 360 to 446, wherein the core control circuitry is to allocate a transaction identifier to any load or store request to memory and to correlate the transaction identifier with a thread identifier.
449. The processor of any one of the preceding claims 360-447, wherein the processor core is to execute a first thread priority instruction to assign a first priority to an execution thread having a corresponding thread identifier.
450. The processor of any one of the preceding claims 360-448, wherein the processor core is to execute a second thread priority instruction to assign a second priority to an execution thread having a corresponding thread identifier.
451. The processor of any one of claims 360-449, wherein the processor core is to execute a custom atomic return instruction to complete a thread of execution of a custom atomic operation.
452. The processor of any one of the preceding claims 360-450, wherein in conjunction with a memory controller, the processor core is to perform floating point atomic memory operations.
453. The processor of any one of the preceding claims 360-451, wherein the processor core, in conjunction with a memory controller, is to perform custom atomic memory operations.
454. A method of self-scheduling execution of instructions, comprising:
receiving a work descriptor data packet; and
automatically scheduling the instructions for execution in response to the received work descriptor data packet.
455. A method of self-scheduling execution of instructions, comprising:
receiving an event data packet; and
automatically scheduling the instructions for execution in response to the received event data packet.
456. A method of causing a first processing element to generate a plurality of threads of execution for execution by a second processing element, comprising:
executing a fiber create instruction; and
in response to executing the fiber create instruction, generating one or more work descriptor data packets destined for the second processing element to execute the plurality of execution threads.
457. A method of causing a first processing element to generate a plurality of threads of execution for execution by a second processing element, comprising:
executing a fiber create instruction; and
in response to executing the fiber create instruction, reserving a predetermined amount of memory space in a thread control memory to store return arguments and generating one or more work descriptor data packets destined for the second processing element to execute the plurality of execution threads.
458. A method of self-scheduling execution of instructions, comprising:
receiving a work descriptor data packet;
decoding the received work descriptor data packet into an execution thread having an initial program count and any received arguments;
assigning an available thread identifier to the execution thread;
automatically queuing the thread identifier for execution of the execution thread; and
periodically selecting the thread identifier to execute the execution thread.
459. A method of self-scheduling execution of instructions, comprising:
receiving a work descriptor data packet;
decoding the received work descriptor data packet into an execution thread having an initial program count and any received arguments;
assigning an available thread identifier to the execution thread;
automatically queuing the thread identifier for execution of the execution thread when the thread identifier has a valid state; and
periodically selecting the thread identifier to execute instructions of the execution thread for a duration in which the valid state remains unchanged until the execution thread is completed.
460. A method of self-scheduling execution of instructions, comprising:
receiving a work descriptor data packet;
decoding the received work descriptor data packet into an execution thread having an initial program count and any received arguments;
assigning an available thread identifier to the execution thread;
automatically queuing the thread identifier in an execution queue for execution of the execution thread when the thread identifier has a valid state;
periodically selecting the thread identifier to execute instructions of the execution thread for a duration in which the valid state remains unchanged; and
when the thread identifier has a suspended state, suspending thread execution by not returning the thread identifier to the execution queue.
461. A method of self-scheduling execution of instructions, comprising:
receiving a work descriptor data packet;
decoding the received work descriptor data packet into an execution thread having an initial program count and any received arguments;
storing the initial program count and any received arguments in a thread control memory;
assigning an available thread identifier to the execution thread;
automatically queuing the thread identifier for execution of the execution thread when the thread identifier has a valid state;
accessing the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; and
periodically selecting the thread identifier to execute instructions of the execution thread for a duration in which the valid state remains unchanged until the execution thread is completed.
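The full self-scheduling flow of claim 461 can be sketched as follows (a non-limiting illustration, not part of the claims): a received work descriptor data packet is decoded, an available thread identifier is assigned from a free pool, the initial program count and received arguments are stored in thread control memory indexed by that identifier, and the identifier is automatically queued for execution. The packet layout and field names are assumptions:

```python
from collections import deque

def self_schedule(packet, thread_control_memory, free_tids, execution_queue):
    """Decode a work descriptor data packet into an execution thread,
    assign an available thread identifier (TID), store the initial program
    count and arguments in thread control memory indexed by the TID, and
    queue the TID for execution. Returns the TID, or None if no TID is
    available (the caller may then pause the incoming packet)."""
    if not free_tids:
        return None
    tid = free_tids.popleft()                  # assign an available TID
    thread_control_memory[tid] = {
        "program_count": packet["initial_pc"],       # initial program count
        "registers": list(packet.get("args", [])),   # received arguments
        "state": "valid",
    }
    execution_queue.append(tid)                # automatically queue for execution
    return tid
```

A selection loop such as the barrel scheduler above would then use the TID as an index into `thread_control_memory` to fetch the program count.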
462. The method of any of the preceding claims 453-460, further comprising:
receiving an event data packet; and
decoding the received event data packet into an event identifier and any received arguments.
463. The method of any of the preceding claims 453-461, further comprising:
assigning an initial valid state to the execution thread.
464. The method of any of the preceding claims 453-462, further comprising:
assigning a suspend state to the execution thread in response to executing a memory load instruction.
465. The method of any of the preceding claims 453-463, further comprising:
assigning a suspend state to the execution thread in response to executing a memory store instruction.
466. The method of any of the preceding claims 453-464, further comprising:
terminating execution of the selected thread in response to executing a return instruction.
467. The method of any of the preceding claims 453-465, further comprising:
in response to executing a return instruction, returning a corresponding thread identifier for the selected thread to the thread identifier pool.
468. The method of any of the preceding claims 453-466, further comprising:
in response to executing a return instruction, clearing the register of thread control memory indexed by the corresponding thread identifier of the selected thread.
469. The method of any of the preceding claims 453-467, further comprising:
generating a return work descriptor packet in response to executing a return instruction.
470. The method of any of the preceding claims 453-468, further comprising:
generating a point-to-point event data message.
471. The method of any of the preceding claims 453-469, further comprising:
generating a broadcast event data message.
472. The method of any of the preceding claims 453-470, further comprising:
responding to a received event data packet with an event mask.
473. The method of any of the preceding claims 453-471, further comprising:
determining an event number corresponding to a received event data packet.
474. The method of any of the preceding claims 453-472, further comprising:
in response to a received event data packet, changing the state of a thread identifier from suspended to active to resume execution of a corresponding execution thread.
475. The method of any of the preceding claims 453-473, further comprising:
in response to an event number of a received event data packet, changing the state of a thread identifier from suspended to active to resume execution of a corresponding execution thread.
476. The method of any of the preceding claims 453-474, further comprising:
successively selecting a next thread identifier from the execution queue to execute a single instruction of a corresponding execution thread.
477. The method of any of the preceding claims 453-475, further comprising:
performing a round-robin selection of a next thread identifier from the execution queue among the plurality of thread identifiers, each for executing a single instruction of a corresponding execution thread.
478. The method of any of the preceding claims 453-476, further comprising:
performing a round-robin selection of a next thread identifier from the execution queue among the plurality of thread identifiers, each for executing a single instruction of a corresponding execution thread, until the execution thread is completed.
479. The method of any of the preceding claims 453-477, further comprising:
performing a barrel selection of a next thread identifier from the execution queue among the plurality of thread identifiers, each for executing a single instruction of a corresponding execution thread.
480. The method of any of the preceding claims 453-478, further comprising:
assigning a valid state or a suspended state to the thread identifier.
481. The method of any of the preceding claims 453-479, further comprising:
assigning a priority status to the thread identifier.
482. The method of any of the preceding claims 453-480, further comprising:
after executing a corresponding instruction, returning the corresponding thread identifier with an assigned valid state and an assigned priority to the execution queue.
483. The method of any of the preceding claims 453-481, further comprising:
selecting thread identifiers from a first priority queue at a first frequency and selecting thread identifiers from a second priority queue at a second frequency, the second frequency being lower than the first frequency.
484. The method of any of the preceding claims 453-482, further comprising:
determining the second frequency as a skip count from selecting a thread identifier from the first priority queue.
485. The method of any of the preceding claims 453-483, further comprising:
controlling the data path access size.
486. The method of any of the preceding claims 453-484, further comprising:
increasing or decreasing a memory load access size in response to a time-averaged usage level.
487. The method of any of the preceding claims 453-485, further comprising:
increasing or decreasing a memory storage access size in response to a time-averaged usage level.
488. The method of any of the preceding claims 453-486, further comprising:
increasing a size of a memory load access request to correspond to a cache line boundary of the data cache.
489. The method of any of the preceding claims 453-487, further comprising:
generating one or more system calls to a host processor.
490. The method of any of the preceding claims 453-488, further comprising:
modulating the number of system calls within any predetermined time period using a predetermined credit count.
491. The method of any of the preceding claims 453-489, further comprising:
copying and transferring all data from the thread control memory corresponding to a selected thread identifier in response to a request from a host processor to monitor thread status.
492. The method of any of the preceding claims 453-490, further comprising:
executing a fiber create instruction to generate one or more commands that generate one or more call work descriptor packets destined for another processor core or a hybrid thread fabric circuit.
493. The method of any one of the preceding claims 453-491, further comprising:
in response to executing the fiber create instruction, reserving a predetermined amount of memory space to store any return arguments.
494. The method of any of the preceding claims 453-492, further comprising:
storing a thread return count in the thread return register in response to generating one or more call work descriptor packets.
495. The method of any of preceding claims 453-493, wherein the thread return count stored in the thread return register is decremented in response to receipt of a return data packet.
496. The method of any of preceding claims 453-494, wherein in response to the thread return count in the thread return register decrementing to zero, the suspended state of the corresponding thread identifier is changed to an active state for subsequent execution of a thread return instruction to complete the created fiber or thread.
497. The method of any of the preceding claims 453-495, further comprising:
executing a wait or non-wait fiber join instruction.
498. The method of any of the preceding claims 453-496, further comprising:
executing a fiber join all instruction.
499. The method of any of the preceding claims 453-497, further comprising:
executing a non-cache read or load instruction to specify a general purpose register for storing data received from memory.
500. The method of any of the preceding claims 453-498, further comprising:
executing a non-cache write or store instruction to specify data in a general purpose register for storage in memory.
501. The method of any one of the preceding claims 453-499, further comprising:
assigning a transaction identifier to any load or store request to memory and correlating the transaction identifier with a thread identifier.
502. The method of any of the preceding claims 453-500, further comprising:
executing a first thread priority instruction to assign a first priority to an execution thread having a corresponding thread identifier.
503. The method of any of the preceding claims 453-501, further comprising:
executing a second thread priority instruction to assign a second priority to an execution thread having a corresponding thread identifier.
504. The method of any of the preceding claims 453-502, further comprising:
executing a custom atomic return instruction to complete an execution thread of a custom atomic operation.
505. The method of any of the preceding claims 453-503, further comprising:
performing a floating point atomic memory operation.
506. The method of any of the preceding claims 453-504, further comprising:
performing a custom atomic memory operation.
Technical Field
The present invention relates generally to configurable computing circuitry, and more particularly to a heterogeneous computing system including a self-scheduling processor and configurable computing circuitry with embedded interconnect networks, which can be dynamically reconfigured and dynamically controlled for energy or power consumption.
Background
Many existing computing systems reach significant limitations in computing processing power in terms of computing speed, energy (or power) consumption, and associated heat dissipation. For example, as the demand for advanced computing technologies continues to grow, such as to accommodate artificial intelligence and other important computing applications, existing computing solutions are becoming less and less adequate.
Therefore, there is a current need for a computing architecture that can provide a high-performance and energy-efficient solution for compute-intensive kernels, for example, to compute Fast Fourier Transforms (FFTs) and Finite Impulse Response (FIR) filters for sensing, communication, and analysis applications, such as synthetic aperture radar and 5G base stations, and for graph analytics applications, such as, but not limited to, graph clustering using spectral techniques, machine learning, 5G networking algorithms, and large stencil codes.
In addition, there is a need for a configurable computing architecture that can be configured for any of these different applications, but most importantly, that is also capable of dynamic self-configuration and self-reconfiguration. Finally, there is a need for a processor architecture that is capable of massive parallel processing and further interacts with and controls a configurable computing architecture to execute any of these various applications.
Disclosure of Invention
As discussed in more detail below, representative apparatus, systems, and methods provide a computing architecture capable of providing a high-performance and energy-efficient solution for compute-intensive kernels, for example, to compute Fast Fourier Transforms (FFTs) and Finite Impulse Response (FIR) filters for sensing, communication, and analysis applications, such as synthetic aperture radar and 5G base stations, and for graph analytics applications, such as graph clustering using spectral techniques, machine learning, 5G networking algorithms, and large stencil codes, without being limited thereto.
Notably, the various representative embodiments provide a multi-threaded coarse-grained configurable computing architecture that can be configured for any of these different applications, but most importantly, is also capable of self-scheduling, dynamic self-configuration and self-reconfiguration, conditional branching, backpressure control for asynchronous signaling, ordered and loop thread execution (including data dependencies), automatic start of thread execution after data dependency and/or ordering is completed, providing loop access to private variables, providing fast execution of loop threads using reentrant queues, and advanced loop execution using various thread identifiers, including nested loops.
As also discussed in greater detail below, the representative apparatus, system, and method provide a processor architecture capable of self-scheduling, massively parallel processing, and further interacting with and controlling a configurable computing architecture to execute any of these different applications.
In a representative embodiment, a system comprises: a first interconnection network; a processor coupled to the interconnection network; a host interface coupled to the interconnection network; and at least one configurable circuit group coupled to the interconnection network, the configurable circuit group comprising: a plurality of configurable circuits arranged in an array; a second asynchronous packet network coupled to each configurable circuit of the plurality of configurable circuits of the array; a third synchronization network coupled to each configurable circuit of the plurality of configurable circuits of the array; a memory interface circuit coupled to the asynchronous packet network and the interconnection network; and scheduling interface circuitry coupled to the asynchronous packet network and the interconnection network.
For any of the various representative embodiments, the interconnection network may include: a first plurality of crossbar switches having a folded Clos configuration and a plurality of direct mesh connections located at interfaces with system endpoints. For any of the various representative embodiments, the asynchronous packet network may comprise: a second plurality of crossbars, each crossbar coupled to at least one configurable circuit of the plurality of configurable circuits of the array and another crossbar of the second plurality of crossbars. For any of the various representative embodiments, the synchronization network may comprise: a plurality of direct point-to-point connections coupling adjacent configurable circuits of the array of the plurality of configurable circuits of the group of configurable circuits.
In a representative embodiment, a configurable circuit may comprise: a configurable computing circuit; a plurality of synchronous network inputs coupled to the configurable computing circuitry; a plurality of synchronous network outputs coupled to the configurable computing circuitry; and a configuration memory coupled to the configurable computing circuitry, control circuitry, the synchronous network input, and the synchronous network output, wherein the configuration memory comprises: a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and a second instruction and instruction index memory storing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of the synchronization network input.
In a representative embodiment, each configurable circuit of the plurality of configurable circuits comprises: a configurable computing circuit; a control circuit coupled to the configurable computing circuit, the control circuit comprising: a memory control circuit; a thread control circuit; and a plurality of control registers; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronization network inputs coupled to the configurable computing circuitry and the synchronization network; a plurality of synchronization network outputs coupled to the configurable computing circuitry and the synchronization network; an asynchronous network input queue coupled to the asynchronous packet network; an asynchronous network output queue coupled to the asynchronous packet network; a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output, the configuration memory circuit comprising: a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and a second instruction and instruction index memory storing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of the synchronization network input.
In another representative embodiment, a system may comprise: a first interconnection network; a processor coupled to the interconnection network; a host interface coupled to the interconnection network; and at least one configurable circuit group coupled to the interconnection network, the configurable circuit group comprising: a plurality of configurable circuits arranged in an array, each configurable circuit comprising: a configurable computing circuit; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronous network inputs and outputs coupled to the configurable computing circuitry; an asynchronous network input queue and an asynchronous network output queue; a second configuration memory circuit coupled to the configurable computing circuit, control circuitry, the synchronous network input, and the synchronous network output, the second configuration memory comprising: a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and a second instruction and instruction index memory storing: a plurality of spoke instructions and datapath configuration instruction indices for selecting a master synchronization input of the synchronization network input, for selecting a current datapath configuration instruction of the configurable computing circuit, and for selecting a next datapath instruction or a next datapath instruction index of a next configurable computing circuit; and control circuitry coupled to the configurable computing circuitry, the control circuitry comprising: a memory control circuit; a plurality of control registers, wherein the plurality of control registers store a completion table having a first data completion count; and thread control circuitry for queuing threads for execution.
In another representative embodiment, a system may comprise: a first interconnection network; a host interface coupled to the interconnection network; at least one configurable circuit group coupled to the interconnection network, the configurable circuit group comprising a plurality of configurable circuits arranged in an array; and a processor coupled to the interconnection network, the processor comprising: a processor core to execute a plurality of instructions; and core control circuitry coupled to the processor core, the core control circuitry comprising: an interconnection network interface coupleable to the interconnection network to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments; a thread control memory coupled to the interconnection network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument; an execution queue coupled to the thread control memory; control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, and access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; and an instruction cache coupled to the processor core and the control logic and thread selection circuitry to receive the initial program count and provide a corresponding one of the plurality of instructions to the processor core for execution.
In another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; and a configuration memory coupled to the configurable computing circuitry, control circuitry, synchronous network input, and synchronous network output, the configuration memory comprising: a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and a second instruction and instruction index memory storing a plurality of spoke instructions and datapath configuration instruction indices to select a current datapath configuration instruction of the configurable computing circuit.
In another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; and a configuration memory coupled to the configurable computing circuitry, control circuitry, synchronous network input, and synchronous network output, the configuration memory comprising: a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and a second instruction and instruction index memory storing a plurality of spoke instructions and data path configuration instruction indices to select a next data path configuration instruction for a next configurable computational circuit.
In another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; a control circuit coupled to the configurable computing circuit; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronous network inputs coupled to the configurable computing circuitry; a plurality of synchronous network outputs coupled to the configurable computing circuitry; and a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output, the configuration memory circuit comprising: a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and a second instruction and instruction index memory storing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of the synchronization network input.
In yet another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronous network inputs coupled to the configurable computing circuitry; a plurality of synchronous network outputs coupled to the configurable computing circuitry; and a second configuration memory circuit coupled to the configurable computing circuit, control circuitry, the synchronous network input, and the synchronous network output; and control circuitry coupled to the configurable computing circuitry, the control circuitry comprising: a memory control circuit; a thread control circuit; and a plurality of control registers.
In another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; a configuration memory coupled to the configurable computing circuitry, control circuitry, a synchronous network input, and a synchronous network output, the configuration memory comprising: a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and a second instruction and instruction index memory storing a plurality of spoke instructions and data path configuration instruction indices to select a next data path instruction or a next data path instruction index for a next configurable computational circuit; and conditional logic circuitry coupled to the configurable computation circuitry, wherein the conditional logic circuitry is to provide a conditional branch by modifying the next data path instruction or next data path instruction index provided on a selected output of the plurality of synchronous network outputs in dependence upon an output from the configurable computation circuitry.
In another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; a control circuit coupled to the configurable computing circuit; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronous network inputs coupled to the configurable computing circuitry; a plurality of synchronous network outputs coupled to the configurable computing circuitry; an asynchronous network input queue coupled to an asynchronous packet network and the first memory circuit; an asynchronous network output queue; and a flow control circuit coupled to the asynchronous network output queue, the flow control circuit to generate a stop signal when a predetermined threshold is reached in the asynchronous network output queue.
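The flow-control behavior described in this embodiment can be illustrated with a small software model. The following Python sketch is hypothetical (class and attribute names are not from the embodiment); it asserts a stop signal once occupancy of the asynchronous network output queue reaches the predetermined threshold:

```python
from collections import deque

class AsyncOutputQueue:
    """Hypothetical model of an asynchronous network output queue with flow control."""
    def __init__(self, threshold):
        self.queue = deque()
        self.threshold = threshold

    def push(self, packet):
        self.queue.append(packet)

    @property
    def stop_signal(self):
        # Assert the stop signal when occupancy reaches the predetermined threshold.
        return len(self.queue) >= self.threshold

q = AsyncOutputQueue(threshold=3)
for pkt in ("a", "b"):
    q.push(pkt)
print(q.stop_signal)  # False: below threshold
q.push("c")
print(q.stop_signal)  # True: threshold reached
```

In the embodiment, downstream circuits would react to the asserted stop signal by ceasing packet output; the sketch only models the threshold comparison itself.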
In another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronous network inputs coupled to the configurable computing circuitry; a plurality of synchronous network outputs coupled to the configurable computing circuitry; and a second configuration memory circuit coupled to the configurable computing circuit, control circuitry, the synchronous network input, and the synchronous network output; and control circuitry coupled to the configurable computing circuitry, the control circuitry comprising: a memory control circuit; a thread control circuit; and a plurality of control registers, wherein the plurality of control registers store a loop table having a plurality of thread identifiers and, for each thread identifier, a next thread identifier for execution after execution of a current thread to provide in-order thread execution.
In another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronous network inputs coupled to the configurable computing circuitry; a plurality of synchronous network outputs coupled to the configurable computing circuitry; and a second configuration memory circuit coupled to the configurable computing circuit, control circuitry, the synchronous network input, and the synchronous network output; and control circuitry coupled to the configurable computing circuitry, the control circuitry comprising: a memory control circuit; a plurality of control registers, wherein the plurality of control registers store a completion table having a first data completion count; and thread control circuitry to queue a thread for execution when a completion count for the thread is decremented to zero for the thread identifier of the thread.
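The completion-table mechanism above can be modeled in software. This Python sketch is a hypothetical illustration (names are assumptions, not from the embodiment): each completion message decrements a thread's completion count, and the thread is queued for execution when the count reaches zero:

```python
class CompletionTable:
    """Hypothetical model: a thread is queued for execution once its
    completion count (outstanding data dependencies) reaches zero."""
    def __init__(self):
        self.counts = {}     # thread identifier -> outstanding completion count
        self.run_queue = []  # thread identifiers ready to execute

    def add_thread(self, thread_id, dependencies):
        self.counts[thread_id] = dependencies

    def completion_message(self, thread_id):
        # Each completion message decrements the count for that thread identifier.
        self.counts[thread_id] -= 1
        if self.counts[thread_id] == 0:
            self.run_queue.append(thread_id)

table = CompletionTable()
table.add_thread(7, dependencies=2)
table.completion_message(7)
print(table.run_queue)  # []  (one dependency still outstanding)
table.completion_message(7)
print(table.run_queue)  # [7] (all dependencies complete)
```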
In another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronous network inputs and outputs coupled to the configurable computing circuitry; an asynchronous network input queue and an asynchronous network output queue; a second configuration memory circuit coupled to the configurable computing circuit, control circuitry, the synchronous network input, and the synchronous network output, the second configuration memory comprising: a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and a second instruction and instruction index memory storing: a plurality of spoke instructions and datapath configuration instruction indices for selecting a master synchronization input of the synchronization network input, for selecting a current datapath configuration instruction of the configurable computing circuit, and for selecting a next datapath instruction or a next datapath instruction index of a next configurable computing circuit; and the configurable circuit further comprises a control circuit coupled to the configurable computing circuit, the control circuit comprising: a memory control circuit; a plurality of control registers, wherein the plurality of control registers store a completion table having a first data completion count; and thread control circuitry to queue a thread for execution when a completion count for the thread is decremented to zero for the thread identifier of the thread.
In another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronous network inputs coupled to the configurable computing circuitry; a plurality of synchronous network outputs coupled to the configurable computing circuitry; and a second configuration memory circuit coupled to the configurable computing circuit, control circuitry, the synchronous network input, and the synchronous network output; and control circuitry coupled to the configurable computing circuitry, the control circuitry comprising: a memory control circuit; a plurality of control registers, wherein the plurality of control registers store a completion table having a first data completion count; and thread control circuitry for queuing a thread for execution when the completion count for its thread identifier is decremented to zero and its thread identifier is the next thread identifier for execution.
In yet another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronous network inputs coupled to the configurable computing circuitry; a plurality of synchronous network outputs coupled to the configurable computing circuitry; and a second configuration memory circuit coupled to the configurable computing circuit, control circuitry, the synchronous network input, and the synchronous network output; and the configurable circuit further comprises a control circuit coupled to the configurable computing circuit, the control circuit comprising: a memory control circuit; a thread control circuit; and a plurality of control registers storing a completion table having a plurality of types of thread identifiers, wherein each type of thread identifier indicates a loop level for loop and nested loop execution, and wherein the plurality of control registers further store a top of a stack of thread identifiers to allow each type of thread identifier to access a private variable for a selected loop.
In another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronous network inputs coupled to the configurable computing circuitry; a plurality of synchronous network outputs coupled to the configurable computing circuitry; and a second configuration memory circuit coupled to the configurable computing circuit, control circuitry, the synchronous network input, and the synchronous network output; and control circuitry coupled to the configurable computing circuitry, the control circuitry comprising: a memory control circuit; a plurality of control registers; and thread control circuitry comprising: a resume queue that stores one or more thread identifiers for computing threads that have completion counts allowed for execution but do not yet have an assigned thread identifier; and a re-entry queue that stores one or more thread identifiers for computing threads having completion counts allowed for execution and having assigned thread identifiers, such that the threads in the re-entry queue execute after a specified spoke count.
In another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronous network inputs coupled to the configurable computing circuitry; a plurality of synchronous network outputs coupled to the configurable computing circuitry; and a second configuration memory circuit coupled to the configurable computing circuit, control circuitry, the synchronous network input, and the synchronous network output; and control circuitry coupled to the configurable computing circuitry, the control circuitry comprising: a memory control circuit; a plurality of control registers storing a pool of thread identifiers and a completion table having a loop count of an active loop thread number; and thread control circuitry, wherein in response to receiving an asynchronous fabric message returning a thread identifier to the thread identifier pool, the control circuitry decrements the loop count and transmits an asynchronous fabric completion message when the loop count reaches zero.
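The loop-count bookkeeping in this embodiment can also be sketched in software. The following Python model is hypothetical (names are assumptions): returning a thread identifier to the pool decrements the active-loop count, and a completion message is emitted when the count reaches zero:

```python
class LoopTracker:
    """Hypothetical model: returning a thread identifier to the pool
    decrements the active-loop count; a completion message is sent at zero."""
    def __init__(self, active_threads):
        self.loop_count = active_threads  # number of active loop threads
        self.pool = []                    # returned thread identifiers
        self.messages = []                # asynchronous fabric messages sent

    def return_thread_id(self, thread_id):
        self.pool.append(thread_id)
        self.loop_count -= 1
        if self.loop_count == 0:
            # All loop threads have returned: signal loop completion.
            self.messages.append("loop_complete")

t = LoopTracker(active_threads=2)
t.return_thread_id(4)
print(t.messages)  # []  (one loop thread still active)
t.return_thread_id(5)
print(t.messages)  # ['loop_complete']
```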
In a representative embodiment, a system is disclosed that may include: an asynchronous packet network; a synchronization network; and a plurality of configurable circuits arranged in an array, each configurable circuit of the plurality of configurable circuits being simultaneously coupled to the synchronous network and the asynchronous packet network, the plurality of configurable circuits being configured to form a plurality of synchronous domains using the synchronous network to perform a plurality of computations, and the plurality of configurable circuits being further configured to generate and transmit a plurality of control messages over the asynchronous packet network, the plurality of control messages including one or more completion messages and continuation messages.
In another representative embodiment, a system may comprise: a plurality of configurable circuits arranged in an array; a synchronization network coupled to each configurable circuit of the plurality of configurable circuits of the array; and an asynchronous packet network coupled to each configurable circuit of the plurality of configurable circuits of the array.
In another representative embodiment, a system may comprise: an interconnection network; a processor coupled to the interconnection network; and a plurality of groups of configurable circuits coupled to the interconnection network.
In a representative embodiment, a system comprises: an interconnection network; a processor coupled to the interconnection network; a host interface coupled to the interconnection network; and a plurality of configurable circuit groups coupled to the interconnection network, each configurable circuit group of the plurality of configurable circuit groups comprising: a plurality of configurable circuits arranged in an array; a synchronization network coupled to each configurable circuit of the plurality of configurable circuits of the array; an asynchronous packet network coupled to each configurable circuit of the plurality of configurable circuits of the array; a memory interface coupled to the asynchronous packet network and the interconnection network; and a scheduling interface coupled to the asynchronous packet network and the interconnection network.
In another representative embodiment, a system may comprise: a hierarchical interconnect network including a first plurality of crossbars having a folded Clos configuration and a plurality of direct mesh connections at interfaces with endpoints; a processor coupled to the interconnection network; a host interface coupled to the interconnection network; and a plurality of configurable circuit groups coupled to the interconnection network, each configurable circuit group of the plurality of configurable circuit groups comprising: a plurality of configurable circuits arranged in an array; a synchronization network coupled to each configurable circuit of the plurality of configurable circuits of the array and providing a plurality of direct connections between adjacent configurable circuits of the array; an asynchronous packet network comprising a second plurality of crossbars, each crossbar coupled to at least one configurable circuit of the plurality of configurable circuits of the array and another crossbar of the second plurality of crossbars; a memory interface coupled to the asynchronous packet network and the interconnection network; and a scheduling interface coupled to the asynchronous packet network and the interconnection network.
In another representative embodiment, a system may comprise: an interconnection network; a processor coupled to the interconnection network; a host interface coupled to the interconnection network; and a plurality of configurable circuit groups coupled to the interconnection network, each configurable circuit group of the plurality of configurable circuit groups comprising: a synchronization network; an asynchronous packet network; a memory interface coupled to the asynchronous packet network and the interconnection network; a scheduling interface coupled to the asynchronous packet network and the interconnection network; and a plurality of configurable circuits arranged in an array, each configurable circuit comprising: a configurable computing circuit; a control circuit coupled to the configurable computing circuit, the control circuit comprising: a memory control circuit; a thread control circuit; and a plurality of control registers; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronization network inputs and outputs coupled to the configurable computing circuitry and the synchronization network; an asynchronous network input queue and an asynchronous network output queue coupled to the asynchronous packet network; a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output, the configuration memory circuit comprising: a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and a second instruction and instruction index memory storing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of the synchronization network input.
In any of the various representative embodiments, the second instruction and instruction index memory may further store a plurality of spoke instructions and datapath configuration instruction indices to select a current datapath configuration instruction of the configurable computing circuitry.
In any of the various representative embodiments, the second instruction and instruction index memory may further store a plurality of spoke instructions and datapath configuration instruction indices to select a next datapath configuration instruction for a next configurable computational circuit.
In any of the various representative embodiments, the second instruction and instruction index memory may further store a plurality of spoke instructions and datapath configuration instruction indices to select a synchronous network output of the plurality of synchronous network outputs.
In any of the various representative embodiments, the configurable circuit or system may further comprise: a configuration memory multiplexer coupled to the first instruction memory and the second instruction and instruction index memory.
In any of the various representative embodiments, the current datapath configuration instruction may be selected using an instruction index from the second instruction and instruction index memory when the select input of the configuration memory multiplexer has a first setting.
In any of the various representative embodiments, when the select input of the configuration memory multiplexer has a second setting different from the first setting, the current datapath configuration instruction may be selected using an instruction index from the master synchronization input.
In any of the various representative embodiments, the second instruction and instruction index memory may further store a plurality of spoke instructions and datapath configuration instruction indices to configure portions of the configurable circuit independent of the current datapath instruction.
In any of the various representative embodiments, selected ones of the plurality of spoke instructions and datapath configuration instruction indices may be selected according to a modular spoke count.
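The modular spoke-count selection can be illustrated with a short software sketch. This Python model is hypothetical (function and variable names are assumptions): the spoke instruction for a given cycle is chosen by reducing a running count modulo the number of stored spoke instructions:

```python
def select_spoke_instruction(spoke_instructions, clock_cycle):
    """Hypothetical model: the spoke instruction for a given cycle is chosen
    by a modular spoke count over the stored instruction list."""
    spoke_count = clock_cycle % len(spoke_instructions)
    return spoke_instructions[spoke_count]

# Three stored spoke instructions cycle in a fixed, repeating order.
instrs = ["spoke0", "spoke1", "spoke2"]
print(select_spoke_instruction(instrs, 4))  # 'spoke1' (4 mod 3 == 1)
```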
In any of the various representative embodiments, the configurable circuit or system may further comprise: a conditional logic circuit coupled to the configurable computing circuit.
In any of the various representative embodiments, the conditional logic circuit is operable to modify the next datapath instruction index provided on a selected one of the plurality of synchronous network outputs, as a function of the output from the configurable computation circuit.
In any of the various representative embodiments, the conditional logic circuit is operable, in dependence upon an output from the configurable computational circuit, to provide a conditional branch by modifying the next datapath instruction or a next datapath instruction index provided on a selected one of the plurality of synchronous network outputs.
In any of the various representative embodiments, when enabled, the conditional logic circuit may be operative to specify the next datapath instruction or next datapath instruction index by ORing the least significant bits of the next datapath instruction with the output from the configurable computing circuit, providing a conditional branch.
In any of the various representative embodiments, when enabled, the conditional logic circuit may be operative to specify the next datapath instruction index by ORing the least significant bits of the next datapath instruction index with the output from the configurable computing circuit, providing a conditional branch.
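The OR-based conditional branch can be sketched as a one-line bit operation. This Python model is hypothetical (names and the one-bit condition width are assumptions): with an even base index, a true condition sets the least significant bit and thereby selects the alternate branch target:

```python
def conditional_branch(next_index, condition_output, enabled=True):
    """Hypothetical model of the conditional logic: OR the least significant
    bit of the next datapath instruction index with the computing circuit's
    condition output to select between two branch targets."""
    if not enabled:
        return next_index
    return next_index | (condition_output & 0x1)

base = 0b0110  # even base index: the not-taken target
print(conditional_branch(base, 0))  # 6: condition false, index unchanged
print(conditional_branch(base, 1))  # 7: condition true, LSB set selects the branch
```

Placing the two branch targets at adjacent even/odd indices is what lets a single OR of the low bit implement the branch without an adder or comparator.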
In any of the various representative embodiments, the plurality of synchronous network inputs may include: a plurality of input registers coupled to a plurality of communication lines of the synchronous network; and an input multiplexer coupled to the plurality of input registers and the second instruction and instruction index memory to select the master synchronization input.
In any of the various representative embodiments, the plurality of synchronous network outputs may include: a plurality of output registers coupled to a plurality of communication lines of the synchronous network; and an output multiplexer coupled to the configurable computing circuitry to select an output from the configurable computing circuitry.
In any of the various representative embodiments, the configurable circuit or system may further comprise: an asynchronous fabric state machine coupled to the asynchronous network input queue and the asynchronous network output queue, the asynchronous fabric state machine to decode input packets received from the asynchronous packet network and assemble output packets for transmission over the asynchronous packet network.
In any of the various representative embodiments, the asynchronous packet network may include a plurality of crossbars, each crossbar coupled to a plurality of configurable circuits and at least one other crossbar.
In any of the various representative embodiments, the configurable circuit or system may further comprise: an array of a plurality of configurable circuits, wherein: each configurable circuit is coupled to the synchronization network through the plurality of synchronization network inputs and the plurality of synchronization network outputs; and each configurable circuit is coupled to the asynchronous packet network through the asynchronous network input and the asynchronous network output.
In any of the various representative embodiments, the synchronization network may include a plurality of direct point-to-point connections coupling adjacent configurable circuits of the array of the plurality of configurable circuits.
In any of the various representative embodiments, each configurable circuit may comprise: a direct path connection between the plurality of input registers and the plurality of output registers. In any of the various representative embodiments, the direct path connection may provide a direct point-to-point connection, allowing data received over the synchronous network from a second configurable circuit to be transferred over the synchronous network to a third configurable circuit.
In any of the various representative embodiments, the configurable computing circuitry may comprise arithmetic, logic, and bit-operation circuitry for performing at least one integer operation selected from the group consisting of: signed and unsigned addition, absolute value, negation, logical NOT, addition and negation, subtraction A-B, reverse subtraction B-A, signed and unsigned greater-than-or-equal, signed and unsigned less-than-or-equal, equal and not-equal comparisons, logical AND, logical OR, logical XOR, logical NAND, logical NOR, and conversions between integer and floating point.
In any of the various representative embodiments, the configurable computing circuitry may comprise arithmetic, logic, and bit-operation circuitry for performing at least one floating-point operation selected from the group consisting of: signed and unsigned addition, absolute value, negation, logical NOT, addition and negation, subtraction A-B, reverse subtraction B-A, signed and unsigned greater-than-or-equal, signed and unsigned less-than-or-equal, equal and not-equal comparisons, logical AND, logical OR, logical XOR, logical NAND, logical NOR, conversions between integer and floating point, and combinations thereof.
In any of the various representative embodiments, the configurable computing circuitry may comprise multiplication and shift operation circuitry for performing at least one integer operation selected from the group consisting of: multiplication, shifting, pass-through of an input, signed and unsigned multiplication, signed and unsigned right shift, signed and unsigned left shift, bit-order reversal, permutation, conversion between integer and floating point, and combinations thereof.
In any of the various representative embodiments, the configurable computing circuitry may comprise multiplication and shift operation circuitry for performing at least one floating-point operation selected from the group consisting of: multiplication, shifting, pass-through of an input, signed and unsigned multiplication, signed and unsigned right shift, signed and unsigned left shift, bit-order reversal, permutation, conversion between integer and floating point, and combinations thereof.
In any of the various representative embodiments, the array of the plurality of configurable circuits may be further coupled to a first interconnection network. In any of the various representative embodiments, the array of the plurality of configurable circuits may further comprise: a third system memory interface circuit; and a scheduling interface circuit. In any of the various representative embodiments, the scheduling interface circuit may be operative to receive a job descriptor packet over the first interconnection network, and in response to the job descriptor packet, generate one or more data and control packets to the plurality of configurable circuits to configure the plurality of configurable circuits to perform the selected computation.
In any of the various representative embodiments, the configurable circuit or system may further comprise: a flow control circuit coupled to the asynchronous network output queue, the flow control circuit to generate a stop signal when a predetermined threshold is reached in the asynchronous network output queue. In any of the various representative embodiments, each asynchronous network output queue stops outputting data packets on the asynchronous packet network in response to the stop signal. In any of the various representative embodiments, each configurable computing circuit, in response to the stop signal, stops execution after its current instruction is completed.
In any of the various representative embodiments, a first plurality of configurable circuits in an array of a plurality of configurable circuits may be coupled in a first predetermined order through a synchronization network to form a first synchronization domain; and wherein a second plurality of configurable circuits in the array of the plurality of configurable circuits are coupled in a second predetermined order through the synchronization network to form a second synchronization domain. In any of the various representative embodiments, the first synchronization domain may be used to generate a continue message that is transmitted to the second synchronization domain over the asynchronous packet network. In any of the various representative embodiments, the second synchronization domain may be used to generate a completion message that is transmitted to the first synchronization domain over the asynchronous packet network.
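The message exchange between synchronization domains can be modeled in software. The following Python sketch is hypothetical (class, method, and message names are assumptions, and the asynchronous packet network is modeled as a simple shared queue): the first domain sends a continue message to the second, which replies with a completion message:

```python
from collections import deque

class SynchronousDomain:
    """Hypothetical model: synchronous domains exchange control messages
    over a shared asynchronous packet network (modeled as a deque)."""
    def __init__(self, name, network):
        self.name = name
        self.network = network
        self.inbox = []

    def send(self, dest, kind):
        self.network.append((dest, kind))

    def poll(self):
        # Deliver messages addressed to this domain; leave the rest queued.
        remaining = deque()
        while self.network:
            dest, kind = self.network.popleft()
            if dest == self.name:
                self.inbox.append(kind)
            else:
                remaining.append((dest, kind))
        self.network.extend(remaining)

net = deque()
first = SynchronousDomain("first", net)
second = SynchronousDomain("second", net)
first.send("second", "continue")    # first domain hands work to the second
second.poll()
second.send("first", "completion")  # second domain reports completion
first.poll()
print(second.inbox, first.inbox)  # ['continue'] ['completion']
```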
In any of the various representative embodiments, the plurality of control registers may store a completion table having a first data completion count. In any of the various representative embodiments, the plurality of control registers further stores a completion table having a second iteration count. In any of the various representative embodiments, the plurality of control registers may further store a loop table having a plurality of thread identifiers and, for each thread identifier, a next thread identifier for execution after execution of the current thread. In any of the various representative embodiments, the plurality of control registers may further store an identification of a first iteration and an identification of a last iteration in the loop table.
In any of the various representative embodiments, the control circuitry may be operative to queue a thread for execution when the completion count for its thread identifier is decremented to zero and its thread identifier is the next thread identifier for execution.
In any of the various representative embodiments, the control circuitry may be operative to queue a thread for execution when a completion count for the thread indicates completion of any data dependencies for the thread's thread identifier. In any of the various representative embodiments, the completion count may indicate, for each selected thread of the plurality of threads, a predetermined number of completion messages received before execution of the selected thread.
In any of the various representative embodiments, the plurality of control registers may further store a completion table having a plurality of types of thread identifiers, wherein each type of thread identifier indicates a loop level for loop and nested loop execution.
In any of the various representative embodiments, the plurality of control registers may further store a completion table having a loop count of an active loop thread number, and wherein in response to receiving an asynchronous fabric message returning a thread identifier to a thread identifier pool, the control circuitry decrements the loop count and transmits an asynchronous fabric completion message when the loop count reaches zero. In any of the various representative embodiments, the plurality of control registers may further store a top of a thread identifier stack to allow each type of thread identifier to access a private variable for a selected loop.
In any of the various representative embodiments, the control circuitry may further comprise: a continuation queue; and a re-entry queue. In any of the various representative embodiments, the continuation queue stores one or more thread identifiers for computing threads that have completion counts allowing execution but do not yet have an assigned thread identifier. In any of the various representative embodiments, the re-entry queue may store one or more thread identifiers for computing threads having completion counts allowing execution and having assigned thread identifiers. In any of the various representative embodiments, any thread having a thread identifier in the re-entry queue may execute before any thread having a thread identifier in the continuation queue is executed.
In any of the various representative embodiments, the control circuitry may further comprise: a priority queue, wherein any thread having a thread identifier in the priority queue may execute before any thread having a thread identifier in the continuation queue or the re-entry queue is executed.
In any of the various representative embodiments, the control circuitry may further comprise: a run queue, wherein any thread having a thread identifier in the run queue may execute after occurrence of the spoke count for that thread identifier.
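The precedence implied by the surrounding embodiments — priority queue first, then re-entry, then continuation — can be sketched as below. The ordering is drawn from the text; the class and attribute names are assumptions.

```python
from collections import deque

class QueueSet:
    """Sketch of the queue precedence described above: a thread identifier is
    taken from the priority queue if any, else from the re-entry queue, else
    from the continuation queue."""

    def __init__(self):
        self.priority = deque()
        self.reentry = deque()       # threads that already hold an identifier
        self.continuation = deque()  # threads still awaiting an identifier

    def next_thread(self):
        for q in (self.priority, self.reentry, self.continuation):
            if q:
                return q.popleft()
        return None
```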
In any of the various representative embodiments, the second configuration memory circuit may include: a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and a second instruction and instruction index memory storing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of the plurality of synchronization network inputs.
In any of the various representative embodiments, the control circuitry may be used to self-schedule a computing thread for execution.
In any of the various representative embodiments, the conditional logic circuit is operable to branch to a different second next instruction for execution by the next configurable circuit.
In any of the various representative embodiments, the control circuitry may be used to order compute threads for execution. In any of the various representative embodiments, the control circuitry may be used to order loop computation threads for execution. In any of the various representative embodiments, the control circuitry may be operative to begin executing a computing thread in response to one or more completion signals from a data dependency.
Various method embodiments of configuring a configurable circuit are also disclosed. One representative method embodiment may comprise: providing, using a first instruction memory, a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and providing a plurality of spoke instructions and a data path configuration instruction index to select a master synchronization input of the plurality of synchronization network inputs using a second instruction and instruction index memory.
In any of the various representative embodiments, a method of configuring a configurable circuit may comprise: providing, using a first instruction memory, a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and providing a plurality of spoke instructions and datapath configuration instruction indices using a second instruction and instruction index memory to select a current datapath configuration instruction of the configurable computing circuitry.
In any of the various representative embodiments, a method of configuring a configurable circuit may comprise: providing, using a first instruction memory, a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and providing a plurality of spoke instructions and datapath configuration instruction indices to select a next datapath configuration instruction of a next configurable computational circuit using a second instruction and instruction index memory.
Also disclosed is a method of controlling thread execution of a multi-threaded configurable circuit, wherein the configurable circuit has configurable computational circuitry. One representative method embodiment may comprise: providing a conditional branch, using conditional logic circuitry, by modifying a next datapath instruction or a next datapath instruction index provided to a next configurable circuit, depending on an output from the configurable computation circuitry.
Another representative method embodiment of controlling thread execution by a multithreaded configurable circuit may comprise: generating, using a flow control circuit, a stop signal when a predetermined threshold is reached in an asynchronous network output queue.
Another representative method embodiment of controlling thread execution by a multithreaded configurable circuit may comprise: storing, using a plurality of control registers, a loop table having a plurality of thread identifiers and, for each thread identifier, a next thread identifier for execution after execution of a current thread, to provide in-order thread execution.
Another representative method embodiment of controlling thread execution by a multithreaded configurable circuit may comprise: storing, using a plurality of control registers, a completion table having a first data completion count; and queuing, using thread control circuitry, a thread for execution when a completion count of the thread is decremented to zero for a thread identifier of the thread.
A method of configuring and controlling thread execution of multithreaded configurable circuitry having configurable computing circuitry is disclosed, wherein a representative method embodiment comprises: providing, using a first instruction memory, a plurality of configuration instructions to configure a data path of the configurable computing circuit; providing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of a plurality of synchronization network inputs, select a current datapath configuration instruction of the configurable computing circuit, and select a next datapath instruction or a next datapath instruction index of a next configurable computing circuit using a second instruction and instruction index memory; providing, using a plurality of control registers, a completion table having a first data completion count; and queuing, using thread control circuitry, a thread for execution when a completion count of the thread is decremented to zero for a thread identifier of the thread.
Another method of configuring and controlling thread execution of a multi-threaded configurable circuit may comprise: providing, using a first instruction memory, a plurality of configuration instructions to configure a data path of the configurable computing circuit; providing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of a plurality of synchronization network inputs, select a current datapath configuration instruction of the configurable computing circuit, and select a next datapath instruction or a next datapath instruction index of a next configurable computing circuit using a second instruction and instruction index memory; providing, using a plurality of control registers, a completion table having a first data completion count; and using thread control circuitry to queue a thread for execution when, for its thread identifier, the completion count for the thread is decremented to zero and its thread identifier is the next thread.
Another method of controlling thread execution of a multi-threaded configurable circuit may comprise: storing, using a plurality of control registers, a completion table having a plurality of types of thread identifiers, wherein each type of thread identifier indicates a loop level for loop and nested loop execution, and wherein the plurality of control registers further store a top of a stack of thread identifiers; and allowing each type of thread identifier to access the private variable for the selected loop.
Another method of controlling thread execution of a multi-threaded configurable circuit may comprise: storing, using a plurality of control registers, a completion table having a data completion count; providing, using thread control circuitry, a continuation queue storing one or more thread identifiers for computing threads having completion counts allowed for execution but not yet having an assigned thread identifier; and providing, using thread control circuitry, a re-entry queue storing one or more thread identifiers for computing threads having completion counts allowed for execution and having assigned thread identifiers, such that the threads in the re-entry queue execute after a specified spoke count.
Another method of controlling thread execution of a multi-threaded configurable circuit may comprise: storing, using a plurality of control registers, a pool of thread identifiers and a completion table having a loop count of the number of active loop threads; and decrementing, using thread control circuitry, the loop count in response to receiving an asynchronous fabric message that returns a thread identifier to the thread identifier pool, and transmitting an asynchronous fabric completion message when the loop count reaches zero.
In any of the various representative embodiments, the method may further comprise: providing, using the second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to select a current datapath configuration instruction of the configurable computing circuit.
In any of the various representative embodiments, the method may further comprise: providing, using the second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to select a next datapath configuration instruction for a next configurable computational circuit.
In any of the various representative embodiments, the method may further comprise: providing, using the second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to select a synchronization network output of the plurality of synchronization network outputs.
In any of the various representative embodiments, the method may further comprise: providing, using a configuration memory multiplexer, a first selection setting to select the current datapath configuration instruction using an instruction index from the second instruction and instruction index memory.
In any of the various representative embodiments, the method may further comprise: providing, using the configuration memory multiplexer, a second selection setting, different from the first selection setting, to select the current datapath configuration instruction using an instruction index from a master synchronization input.
In any of the various representative embodiments, the method may further comprise: providing a plurality of spoke instructions and datapath configuration instruction indices using the second instruction and instruction index memory to configure a portion of the configurable circuit independent of the current datapath instruction.
In any of the various representative embodiments, the method may further comprise: selecting, using a configuration memory multiplexer, a spoke instruction and a data path configuration instruction index of the plurality of spoke instruction and data path configuration instruction indices according to a modular spoke count.
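Selection by a modular spoke count, as in the embodiment above, amounts to indexing the spoke instruction memory with a counter that wraps at the memory depth. The memory contents and depth below are placeholders, not values from this disclosure.

```python
# Illustrative spoke instruction memory; entries are placeholder strings.
SPOKE_MEMORY = ["spoke_0", "spoke_1", "spoke_2"]

def select_spoke_instruction(clock_cycle, spoke_memory=SPOKE_MEMORY):
    # the spoke count advances each cycle and wraps modulo the memory depth
    spoke_count = clock_cycle % len(spoke_memory)
    return spoke_memory[spoke_count]
```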
In any of the various representative embodiments, the method may further comprise: modifying the next datapath instruction or next datapath instruction index using conditional logic circuitry and in dependence upon output from the configurable computation circuitry.
In any of the various representative embodiments, the method may further comprise: providing a conditional branch by modifying the next datapath instruction or next datapath instruction index using conditional logic circuitry and in dependence upon output from the configurable computation circuitry.
In any of the various representative embodiments, the method may further comprise: enabling conditional logic circuitry; and, using the conditional logic circuitry and in dependence upon an output from the configurable computation circuitry, specifying the next datapath instruction or datapath instruction index by OR-ing the least significant bit of the next datapath instruction with the output from the configurable computation circuitry, thereby providing a conditional branch.
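The conditional branch described above selects between two adjacent instruction entries by merging a one-bit datapath output into the least significant bit of the next index. A minimal sketch, with assumed function and parameter names:

```python
def next_instruction_index(base_index, datapath_output, enabled=True):
    """When conditional logic is enabled, OR the datapath output (0 or 1)
    into the least significant bit of the next instruction index, selecting
    the fall-through entry or the branch-taken entry."""
    if not enabled:
        return base_index
    return base_index | (datapath_output & 1)
```

With an even `base_index`, the branch chooses between `base_index` and `base_index + 1`, which is what OR-ing the low bit achieves.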
In any of the various representative embodiments, the method may further comprise: selecting the primary synchronization input using an input multiplexer. In any of the various representative embodiments, the method may further comprise: selecting an output from the configurable computing circuit using an output multiplexer.
In any of the various representative embodiments, the method may further comprise: decoding input data packets received from the asynchronous packet network and assembling output data packets for transmission over the asynchronous packet network, using an asynchronous fabric state machine coupled to an asynchronous network input queue and an asynchronous network output queue.
In any of the various representative embodiments, the method may further comprise: providing a plurality of direct point-to-point connections coupling adjacent configurable circuits of the array of the plurality of configurable circuits using the synchronization network.
In any of the various representative embodiments, the method may further comprise: providing, using the configurable circuit, a direct path connection between a plurality of input registers and a plurality of output registers. In any of the various representative embodiments, the direct path connection provides a direct point-to-point connection for data received over the synchronous network from a second configurable circuit and transmitted over the synchronous network to a third configurable circuit.
In any of the various representative embodiments, the method may further comprise: using the configurable computing circuitry, performing at least one integer or floating point operation selected from the group consisting of: signed and unsigned addition, absolute value, negation, logical negation, addition and negation, subtraction A-B, reverse subtraction B-A, signed and unsigned greater-than-or-equal, signed and unsigned less-than-or-equal, equal and unequal comparisons, logical AND operations, logical OR operations, logical XOR operations, logical NAND operations, logical NOR operations, and conversions between integer and floating point.
In any of the various representative embodiments, the method may further comprise: using the configurable computing circuitry, performing at least one integer or floating point operation selected from the group consisting of: multiplication, shifting, input pass-through, signed and unsigned multiplication, signed and unsigned right shifts, signed and unsigned left shifts, bit order reversal, permutation, conversions between integer and floating point, and combinations thereof.
In any of the various representative embodiments, the method may further comprise: using a scheduling interface circuit, receiving a work descriptor packet over the first interconnection network, and in response to the work descriptor packet, generating one or more data and control packets to the plurality of configurable circuits to configure the plurality of configurable circuits to perform selected calculations.
In any of the various representative embodiments, the method may further comprise: generating, using a flow control circuit, a stop signal when a predetermined threshold is reached in an asynchronous network output queue. In any of the various representative embodiments, each asynchronous network output queue stops outputting data packets on the asynchronous packet network in response to the stop signal. In any of the various representative embodiments, each configurable computing circuit, in response to the stop signal, stops execution after its current instruction is completed.
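The flow-control behavior above can be sketched as a queue whose occupancy asserts a stop signal at the threshold and releases it when drained below. The class, the list-based queue, and the threshold semantics are illustrative assumptions.

```python
class AsyncOutputQueue:
    """Sketch of threshold-based flow control: reaching the predetermined
    threshold asserts the stop signal; draining below it releases it."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.packets = []
        self.stop = False

    def push(self, packet):
        self.packets.append(packet)
        if len(self.packets) >= self.threshold:
            self.stop = True    # stop signal asserted

    def pop(self):
        packet = self.packets.pop(0) if self.packets else None
        if len(self.packets) < self.threshold:
            self.stop = False   # stop signal released
        return packet
```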
In any of the various representative embodiments, the method may further comprise: coupling a first plurality of configurable circuits in an array of a plurality of configurable circuits in a first predetermined order through the synchronization network to form a first synchronization domain; and coupling a second plurality of configurable circuits in the array of the plurality of configurable circuits in a second predetermined order through the synchronization network to form a second synchronization domain.
In any of the various representative embodiments, the method may further comprise: generating a continuation message from the first synchronous domain to the second synchronous domain for transmission over the asynchronous packet network.
In any of the various representative embodiments, the method may further comprise: generating a completion message from the second synchronous domain to the first synchronous domain for transmission over the asynchronous packet network. In any of the various representative embodiments, the method may further include storing a completion table having a first data completion count in the plurality of control registers.
In any of the various representative embodiments, the method may further comprise: storing the completion table with a second iteration count in the plurality of control registers.
In any of the various representative embodiments, the method may further comprise: storing, in the plurality of control registers, a loop table having a plurality of thread identifiers and, for each thread identifier, a next thread identifier for execution after execution of a current thread.
In any of the various representative embodiments, the method may further comprise: storing an identification of a first iteration and an identification of a last iteration in the loop table in the plurality of control registers.
In any of the various representative embodiments, the method may further comprise: using the control circuitry, a thread is queued for execution when a completion count for the thread is decremented to zero for the thread identifier of the thread.
In any of the various representative embodiments, the method may further comprise: using the control circuitry, a thread is queued for execution when its completion count is decremented to zero for the thread's thread identifier and its thread identifier is the next thread.
In any of the various representative embodiments, the method may further comprise: using the control circuitry, a thread is queued for execution when a completion count for the thread indicates completion of any data dependencies for the thread's thread identifier. In any of the various representative embodiments, the completion count may indicate, for each selected thread of the plurality of threads, a predetermined number of completion messages received before execution of the selected thread.
In any of the various representative embodiments, the method may further comprise: a completion table having a plurality of types of thread identifiers is stored in the plurality of control registers, wherein each type of thread identifier indicates a loop level for loop and nested loop execution.
In any of the various representative embodiments, the method may further comprise: storing, in the plurality of control registers, a completion table having a loop count of the number of active loop threads, and wherein in response to receiving an asynchronous fabric message that returns a thread identifier to a thread identifier pool, the loop count is decremented using the control circuitry and the asynchronous fabric completion message is transmitted when the loop count reaches zero.
In any of the various representative embodiments, the method may further comprise: storing the top of the thread identifier stack in the plurality of control registers to allow each type of thread identifier to access the private variable for the selected loop.
In any of the various representative embodiments, the method may further comprise: storing, using the continuation queue, one or more thread identifiers for computing threads having completion counts allowing execution but not yet having an assigned thread identifier.
In any of the various representative embodiments, the method may further comprise: storing, using the re-entry queue, one or more thread identifiers for computing threads having completion counts allowing execution and having assigned thread identifiers.
In any of the various representative embodiments, the method may further comprise: executing any threads having thread identifiers in the re-entry queue before executing any threads having thread identifiers in the continuation queue.
In any of the various representative embodiments, the method may further comprise: executing any threads having a thread identifier in a priority queue prior to executing any threads having a thread identifier in the continuation queue or the re-entry queue.
In any of the various representative embodiments, the method may further comprise: any thread in the run queue is executed after the spoke count for the thread identifier occurs.
In any of the various representative embodiments, the method may further comprise: using the control circuitry, computing threads are self-scheduled for execution.
In any of the various representative embodiments, the method may further comprise: using the conditional logic circuit, branching to a different second next instruction for execution by a next configurable circuit.
In any of the various representative embodiments, the method may further comprise: using the control circuitry, the computing threads are ordered for execution.
In any of the various representative embodiments, the method may further comprise: using the control circuitry, the loop computation thread is ordered for execution.
In any of the various representative embodiments, the method may further comprise: using the control circuitry, execution of a computing thread is commenced in response to one or more completion signals from the data dependency.
A self-scheduling processor is disclosed. In a representative embodiment, the processor comprises: a processor core to execute the received instructions; and core control circuitry coupled to the processor core, the core control circuitry to automatically schedule instructions for execution by the processor core in response to received work descriptor data packets. In another representative embodiment, the processor comprises: a processor core to execute the received instructions; and core control circuitry coupled to the processor core, the core control circuitry to automatically schedule instructions for execution by the processor core in response to received event data packets.
A multithreaded self-scheduling processor is also disclosed which may create threads on local or remote computing elements. In a representative embodiment, the processor comprises: a processor core to execute a fiber creation instruction; and core control circuitry coupled to the processor core, the core control circuitry to automatically schedule the fiber creation instruction for execution by the processor core and generate one or more work descriptor data packets destined for another processor or hybrid thread fabric circuitry to execute a corresponding plurality of execution threads. In another representative embodiment, the processor comprises: a processor core to execute a fiber creation instruction; and core control circuitry coupled to the processor core, the core control circuitry to schedule the fiber creation instruction for execution by the processor core, reserve a predetermined amount of memory space in a thread control memory to store return arguments, and generate one or more work descriptor data packets destined for another processor or a hybrid thread fabric circuit to execute a corresponding plurality of execution threads.
In another representative embodiment, a processor comprises: a core control circuit, comprising: an interconnection network interface; a thread control memory coupled to the interconnect network interface; an execution queue coupled to the thread control memory; control logic and thread selection circuitry coupled to the execution queue and the thread control memory; and an instruction cache coupled to the control logic and thread selection circuitry; the processor additionally includes a processor core coupled to the instruction cache of the core control circuitry.
In another representative embodiment, a processor comprises: a core control circuit, comprising: an interconnection network interface; a thread control memory coupled to the interconnect network interface; a network response memory; an execution queue coupled to the thread control memory; control logic and thread selection circuitry coupled to the execution queue and the thread control memory; an instruction cache coupled to the control logic and thread selection circuitry; and a command queue; the processor additionally includes a processor core coupled to the instruction cache and the command queue of the core control circuitry.
In another representative embodiment, a processor comprises: a processor core and core control circuitry coupled to the processor core, wherein the core control circuitry comprises: an interconnection network interface coupleable to the interconnection network to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments; an execution queue coupled to the thread control memory; and control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, and periodically select the thread identifier to execute the execution thread.
In another representative embodiment, a processor comprises: a processor core and core control circuitry coupled to the processor core, wherein the core control circuitry comprises: an interconnection network interface coupleable to the interconnection network to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments; an execution queue coupled to the thread control memory; and control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, and periodically select the thread identifier for execution of instructions of the execution thread by the processor core.
In another representative embodiment, a processor comprises: a processor core and core control circuitry coupled to the processor core, wherein the core control circuitry comprises: an execution queue coupled to the thread control memory; and control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, and periodically select the thread identifier for execution of instructions of the execution thread by the processor core.
In another representative embodiment, a processor comprises: a processor core and core control circuitry coupled to the processor core, wherein the core control circuitry comprises: a thread control memory comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument; an execution queue coupled to the thread control memory; and control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, and periodically select the thread identifier for execution of instructions of the execution thread by the processor core, the processor core using data stored in the data cache or general purpose register.
In another representative embodiment, a processor comprises: a processor core and core control circuitry coupled to the processor core, wherein the core control circuitry comprises: a thread control memory comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a program count register to store a received program count, and a thread status register to store an active status or a suspended status for each thread identifier of the plurality of thread identifiers; an execution queue coupled to the thread control memory; and control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue when the thread identifier has the active status, and periodically select the thread identifier for the processor core to execute instructions of the execution thread for as long as the active status remains unchanged, until the execution thread is completed.
In another representative embodiment, a processor comprises: a processor core and core control circuitry coupled to the processor core, wherein the core control circuitry comprises: a thread control memory comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a program count register to store a received program count, and a thread status register to store an active status or a suspended status for each thread identifier of the plurality of thread identifiers; an execution queue coupled to the thread control memory; and control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue when the thread identifier has the active status, periodically select the thread identifier for the processor core to execute instructions of the execution thread for as long as the active status remains unchanged, and suspend thread execution by not returning the thread identifier to the execution queue when the thread identifier has the suspended status.
In another representative embodiment, a processor comprises: a processor core and core control circuitry coupled to the processor core, wherein the core control circuitry comprises: a thread control memory comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument; an execution queue coupled to the thread control memory; and control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, and periodically select the thread identifier for execution of instructions of the execution thread by the processor core.
In another representative embodiment, a processor comprises: a processor core to execute a plurality of instructions; and core control circuitry coupled to the processor core, wherein the core control circuitry comprises: an interconnection network interface coupleable to an interconnection network to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments; a thread control memory coupled to the interconnection network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument; an execution queue coupled to the thread control memory; control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, and access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; and an instruction cache coupled to the processor core and the control logic and thread selection circuitry to receive the initial program count and provide a corresponding one of the plurality of instructions to the processor core for execution.
In another representative embodiment, a processor comprises: a core control circuit, comprising: an interconnection network interface coupleable to an interconnection network to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments; a thread control memory coupled to the interconnection network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument; an execution queue coupled to the thread control memory; control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, periodically select the thread identifier for execution, and access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; and an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution; the processor additionally includes a processor core coupled to the instruction cache of the core control circuitry, the processor core to execute the corresponding instructions.
In another representative embodiment, a processor comprises: a core control circuit, comprising: an interconnection network interface coupleable to an interconnection network to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments; a thread control memory coupled to the interconnection network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, and a general purpose register to store a received argument; an execution queue coupled to the thread control memory; control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, and access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution; and a command queue; the processor additionally includes a processor core coupled to the instruction cache and the command queue of the core control circuitry, the processor core to execute the corresponding instructions.
In another representative embodiment, a processor comprises: a core control circuit, comprising: an interconnection network interface coupleable to an interconnection network to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments; a thread control memory coupled to the interconnection network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, and a general purpose register to store a received argument; an execution queue coupled to the thread control memory; control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, and access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; and an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution; the processor additionally includes a processor core coupled to the instruction cache of the core control circuitry, the processor core to execute the corresponding instructions.
In another representative embodiment, a processor comprises: a core control circuit, comprising: an interconnection network interface coupleable to an interconnection network to receive a call work descriptor packet, decode the received work descriptor packet into an execution thread having an initial program count and any received arguments, and encode work descriptor packets for transmission to other processing elements; a thread control memory coupled to the interconnection network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, and a general purpose register to store a received argument; an execution queue coupled to the thread control memory; a network response memory coupled to the interconnection network interface; control logic and thread selection circuitry coupled to the execution queue, the thread control memory, and an instruction cache, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, and access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; the instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution; and a command queue storing one or more commands to generate one or more work descriptor packets; the processor additionally includes a processor core coupled to the instruction cache and the command queue of the core control circuitry, the processor core to execute the corresponding instructions.
For any of the various representative embodiments, the core control circuitry may further comprise: an interconnection network interface coupleable to an interconnection network, the interconnection network interface to receive a work descriptor packet, decode the received work descriptor packet into an execution thread having an initial program count and any received arguments. For any of the various representative embodiments, the interconnection network interface may be further operable to receive an event data packet, decode the received event data packet into an event identifier and any received arguments.
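The decode step performed by the interconnection network interface can be sketched in Python. The packet layout below is purely hypothetical (the embodiments do not fix a wire format here): a little-endian 4-byte initial program count followed by zero or more 8-byte arguments.

```python
import struct

def decode_work_descriptor(packet: bytes):
    """Decode a work descriptor packet into (initial_pc, args).

    Assumed illustrative layout: <u32 initial program count>
    followed by any number of <u64> arguments.
    """
    (initial_pc,) = struct.unpack_from("<I", packet, 0)
    args = [struct.unpack_from("<Q", packet, 4 + 8 * i)[0]
            for i in range((len(packet) - 4) // 8)]
    return initial_pc, args
```

The resulting program count and arguments would then be stored in the thread control memory, indexed by the assigned thread identifier.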
For any of the various representative embodiments, the core control circuitry may further comprise: control logic and thread selection circuitry coupled to the interconnect network interface, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread.
For any of the various representative embodiments, the core control circuitry may further comprise: a thread control memory having a plurality of registers, wherein the plurality of registers comprises one or more of the following in any selected combination: a thread identifier pool register storing a plurality of thread identifiers; a thread status register; a program count register to store the received initial program count; a general purpose register storing the received argument; a pending fiber return count register; a return argument buffer or register; a return argument linked list register; a custom atomic transaction identifier register; an event status register; an event mask register; and a data cache.
For any of the various representative embodiments, the interconnection network interface may be further operable to store the execution thread with the initial program count and any received arguments in the thread control memory using a thread identifier as an index to the thread control memory.
For any of the various representative embodiments, the core control circuitry may further comprise: control logic and thread selection circuitry coupled to the thread control memory and the interconnection network interface, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread.
For any of the various representative embodiments, the core control circuitry may further comprise: an execution queue coupled to the thread control memory, the execution queue storing one or more thread identifiers.
For any of the various representative embodiments, the core control circuitry may further comprise: control logic and thread selection circuitry coupled to the execution queue, the interconnection network interface, and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, and access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread.
For any of the various representative embodiments, the core control circuitry may further comprise: an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution.
In another representative embodiment, the processor may further comprise: a processor core coupled to the instruction cache of the core control circuitry, the processor core to execute the corresponding instruction.
For any of the various representative embodiments, the core control circuitry may be further operative to assign an initial valid state to the execution thread. For any of the various representative embodiments, the core control circuitry may be further operative to assign a suspended state to the execution thread in response to the processor core executing a memory load instruction. For any of the various representative embodiments, the core control circuitry may be further operative to assign a suspended state to the execution thread in response to the processor core executing a memory store instruction.
For any of the various representative embodiments, the core control circuitry may be further operative to end execution of a selected thread in response to the processor core executing a return instruction. For any of the various representative embodiments, the core control circuitry may be further operative to return the corresponding thread identifier for the selected thread to the thread identifier pool register in response to the processor core executing a return instruction. For any of the various representative embodiments, the core control circuitry may be further operative to clear a register of the thread control memory indexed by the corresponding thread identifier of the selected thread in response to the processor core executing a return instruction.
For any of the various representative embodiments, the interconnection network interface may be further operative to generate a return work descriptor packet in response to the processor core executing a return instruction.
For any of the various representative embodiments, the core control circuitry may further comprise: a network response memory. For any of the various representative embodiments, the network response memory may comprise one or more of the following in any selected combination: a memory request register; a thread identifier and transaction identifier register; a request cache line index register; a byte register; and a general purpose register index and type register.
For any of the various representative embodiments, the interconnection network interface may be operative to generate a point-to-point event data message. For any of the various representative embodiments, the interconnection network interface may be operative to generate a broadcast event data message.
For any of the various representative embodiments, the core control circuitry may be further operative to respond to a received event data packet using an event mask stored in the event mask register. For any of the various representative embodiments, the core control circuitry may be further operative to determine an event number corresponding to the received event data packet. For any of the various representative embodiments, the core control circuitry may be further operative to change the state of the thread identifier from suspended to valid in response to the received event data packet to resume execution of the corresponding thread of execution. For any of the various representative embodiments, the core control circuitry may be further operative to change the state of the thread identifier from suspended to valid in response to the event number of the received event data packet to resume execution of the corresponding thread of execution.
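The event-driven resume path can be sketched as a small Python function. This is an illustrative model under an assumed convention (not stated in the embodiments) that the event mask is a bitmask with bit n set when event number n is enabled; `resume_on_event` and its parameters are hypothetical names.

```python
def resume_on_event(event_number, event_mask, status, tid):
    """Resume a suspended thread if the received event is unmasked.

    status maps thread identifier -> "valid" or "suspended".
    Returns True when the thread was resumed.
    """
    if (event_mask >> event_number) & 1 and status[tid] == "suspended":
        status[tid] = "valid"   # suspended -> valid resumes execution
        return True
    return False
```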
For any of the various representative embodiments, the control logic and thread selection circuitry may be further operative to successively select a next thread identifier from the execution queue to execute a single instruction of a corresponding execution thread. For any of the various representative embodiments, the control logic and thread selection circuitry may be further operative to perform a round-robin selection of a next thread identifier from the execution queue among the plurality of thread identifiers, each for executing a single instruction of a corresponding execution thread. For any of the various representative embodiments, the control logic and thread selection circuitry may be further operative to perform a round-robin selection of a next thread identifier from the execution queue among the plurality of thread identifiers, each for executing a single instruction of a corresponding execution thread until the execution thread is completed. For any of the various representative embodiments, the control logic and thread selection circuitry may be further operative to perform a barrel selection of a next thread identifier from the execution queue among the plurality of thread identifiers, each for executing a single instruction of a corresponding execution thread.
For any of the various representative embodiments, the control logic and thread selection circuitry may be further operative to assign a valid state or a suspended state to the thread identifier. For any of the various representative embodiments, the control logic and thread selection circuitry may be further operative to assign a priority status to the thread identifier. For any of the various representative embodiments, the control logic and thread selection circuitry may be further operative to return a corresponding thread identifier having an assigned valid state and an assigned priority to the execution queue after execution of a corresponding instruction.
For any of the various representative embodiments, the core control circuitry may further comprise: a network command queue coupled to the processor core.
For any of the various representative embodiments, the interconnection network interface may include: an input queue; a packet decoder circuit coupled to the input queue, the control logic and thread selection circuitry, and the thread control memory; an output queue; and a packet encoder circuit coupled to the output queue, the network response memory, and the network command queue.
For any of the various representative embodiments, the execution queue may further comprise: a first priority queue; and a second priority queue. For any of the various representative embodiments, the control logic and thread selection circuitry may further comprise: thread selection control circuitry coupled to the execution queue, the thread selection control circuitry to select a thread identifier from the first priority queue at a first frequency and to select a thread identifier from the second priority queue at a second frequency, the second frequency being lower than the first frequency. For any of the various representative embodiments, the thread selection control circuitry may be operative to determine the second frequency as a skip count starting with selection of a thread identifier from the first priority queue.
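The two-level priority selection with a skip count can be modeled in a few lines of Python. This is an illustrative sketch only: `ThreadSelector` and `skip_count` are hypothetical names, and the model pops identifiers without re-queuing them, which the real circuitry would do after each instruction.

```python
from collections import deque

class ThreadSelector:
    """Pick from the first (high) priority queue, taking one identifier
    from the second (low) priority queue after every `skip_count`
    high-priority selections."""

    def __init__(self, skip_count):
        self.high = deque()          # first priority queue
        self.low = deque()           # second priority queue
        self.skip_count = skip_count
        self._since_low = 0          # high-priority picks since last low pick

    def select(self):
        take_low = self._since_low >= self.skip_count and bool(self.low)
        if take_low or not self.high:
            self._since_low = 0
            return self.low.popleft() if self.low else None
        self._since_low += 1
        return self.high.popleft()
```

With `skip_count = 2`, the low-priority queue is serviced at one third of the high-priority frequency, matching the claim's "second frequency lower than the first frequency."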
For any of the various representative embodiments, the core control circuitry may further comprise: data path control circuitry for controlling access size through the first interconnection network. For any of the various representative embodiments, the core control circuitry may further comprise: data path control circuitry to increase or decrease a memory load access size in response to a time-averaged usage level. For any of the various representative embodiments, the core control circuitry may further comprise: data path control circuitry to increase or decrease a memory store access size in response to a time-averaged usage level. For any of the various representative embodiments, the control logic and thread selection circuitry may be further operative to increase a size of a memory load access request to correspond to a cache line boundary of the data cache.
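A usage-driven access-size policy of this kind can be sketched as follows. The thresholds, bounds, and doubling/halving step are all illustrative assumptions; the embodiments specify only that the size grows or shrinks with a time-averaged usage level.

```python
def adjust_access_size(current_size, usage, low=0.25, high=0.75,
                       min_size=8, max_size=64):
    """Grow or shrink a memory access size (bytes) from a time-averaged
    usage level in [0, 1]. Thresholds and bounds are illustrative."""
    if usage > high and current_size < max_size:
        return current_size * 2    # heavy use: widen the access
    if usage < low and current_size > min_size:
        return current_size // 2   # light use: narrow the access
    return current_size            # within band: leave unchanged
```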
For any of the various representative embodiments, the core control circuitry may further comprise: system call circuitry to generate one or more system calls to a host processor. For any of the various representative embodiments, the system call circuitry may further comprise: a plurality of system call credit registers storing a predetermined credit count to modulate the number of system calls in any predetermined time period.
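The credit-based pacing of system calls can be modeled simply: each call consumes a credit, and the credit count is restored once per predetermined period. `SystemCallThrottle` and its method names are hypothetical; how a deferred call is retried is left out of this sketch.

```python
class SystemCallThrottle:
    """Credit-based modulation of host system calls per time period."""

    def __init__(self, credits_per_period):
        self.credits_per_period = credits_per_period  # predetermined credit count
        self.credits = credits_per_period

    def try_syscall(self):
        if self.credits > 0:
            self.credits -= 1
            return True    # call may be issued now
        return False       # deferred until the next period

    def new_period(self):
        self.credits = self.credits_per_period        # refill at period boundary
```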
For any of the various representative embodiments, the core control circuitry may be further operative to generate a command to cause the command queue of the interconnection network interface to copy and transmit all data corresponding to a selected thread identifier from the thread control memory, to monitor thread status in response to a request from a host processor.
For any of the various representative embodiments, the processor core may be operative to execute a fiber create instruction to generate one or more commands that cause the command queue of the interconnection network interface to generate one or more call work descriptor packets destined for another processor core or a hybrid thread fabric circuit. For any of the various representative embodiments, the core control circuitry may be further operative to reserve a predetermined amount of memory space in the general purpose register or a return argument register in response to the processor core executing a fiber create instruction.
For any of the various representative embodiments, the core control circuitry may be operative to store a thread return count in the thread return register in response to generating one or more call work descriptor packets destined for another processor core or a hybrid thread fabric. For any of the various representative embodiments, in response to receiving a return data packet, the core control circuitry may be operative to decrement the thread return count stored in the thread return register. For any of the various representative embodiments, in response to the thread return count in the thread return register decrementing to zero, the core control circuitry may be operative to change the suspended state of the corresponding thread identifier to a valid state for subsequent execution of a thread return instruction to complete the created fiber or thread.
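The join bookkeeping above reduces to a counter per parent thread: each return packet decrements the pending fiber return count, and when it reaches zero the suspended parent becomes valid again. The sketch below is an illustrative model with hypothetical names (`on_return_packet`, plain dicts standing in for the thread return register and thread status register).

```python
def on_return_packet(return_count, status, tid):
    """Handle one return data packet for parent thread `tid`.

    return_count maps tid -> pending fiber return count;
    status maps tid -> "valid" or "suspended".
    """
    return_count[tid] -= 1
    if return_count[tid] == 0 and status[tid] == "suspended":
        status[tid] = "valid"   # all fibers returned: parent may resume
    return return_count[tid]
```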
For any of the various representative embodiments, the processor core may be operative to execute a wait or a non-wait fiber join instruction. For any of the various representative embodiments, the processor core may be operative to execute a fiber join instruction.
For any of the various representative embodiments, the processor core may be operative to execute a non-cache read or load instruction to specify a general purpose register for storing data received from memory. For any of the various representative embodiments, the processor core may be operative to execute a non-cache write or store instruction to designate data in a general purpose register for storage in memory.
For any of the various representative embodiments, the core control circuitry may be operative to assign a transaction identifier to any load or store request to memory and correlate the transaction identifier with a thread identifier.
For any of the various representative embodiments, the processor core may be operative to execute a first thread priority instruction to assign a first priority to an execution thread having a corresponding thread identifier. For any of the various representative embodiments, the processor core may be operative to execute a second thread priority instruction to assign a second priority to the execution thread having the corresponding thread identifier.
For any of the various representative embodiments, the processor core may be operative to execute a custom atomic return instruction to complete a thread of execution of a custom atomic operation. For any of the various representative embodiments, in conjunction with a memory controller, the processor core may be operative to perform floating point atomic memory operations. For any of the various representative embodiments, in conjunction with a memory controller, the processor core may be operative to perform custom atomic memory operations.
Also disclosed is a method of self-scheduled execution of instructions, wherein an exemplary method embodiment comprises: receiving a work descriptor data packet; and automatically scheduling instructions for execution in response to the received work descriptor data packet.
Another method of self-scheduled execution of instructions is also disclosed, wherein one representative method embodiment comprises: receiving an event data packet; and automatically scheduling instructions for execution in response to the received event data packet.
Also disclosed is a method of a first processing element generating a plurality of threads of execution for execution by a second processing element, wherein a representative method embodiment comprises: executing a fiber create instruction; and in response to execution of the fiber create instruction, generating one or more work descriptor data packets destined for the second processing element to execute the plurality of execution threads.
Also disclosed is a method of a first processing element generating a plurality of threads of execution for execution by a second processing element, wherein a representative method embodiment comprises: executing a fiber create instruction; and in response to executing the fiber create instruction, reserving a predetermined amount of memory space in a thread control memory to store return arguments and generating one or more work descriptor data packets destined for the second processing element to execute the plurality of execution threads.
Also disclosed is a method of self-scheduled execution of instructions, wherein an exemplary method embodiment comprises: receiving a work descriptor data packet; decoding the received work descriptor data packet into an execution thread having an initial program count and any received arguments; assigning an available thread identifier to the execution thread; automatically queuing the thread identifier for execution of the execution thread; and periodically selecting the thread identifier to execute the execution thread.
Another method of self-scheduled execution of instructions is also disclosed, wherein one representative method embodiment comprises: receiving a work descriptor data packet; decoding the received work descriptor data packet into an execution thread having an initial program count and any received arguments; assigning an available thread identifier to the execution thread; automatically queuing the thread identifier for execution of the execution thread when the thread identifier has a valid state; and periodically selecting the thread identifier to execute instructions of the execution thread for a duration in which the valid state remains unchanged until the execution thread is completed.
Another method of self-scheduled execution of instructions is also disclosed, wherein one representative method embodiment comprises: receiving a work descriptor data packet; decoding the received work descriptor data packet into an execution thread having an initial program count and any received arguments; assigning an available thread identifier to the execution thread; automatically queuing the thread identifier in an execution queue for execution of the execution thread when the thread identifier has a valid state; periodically selecting the thread identifier to execute instructions of the execution thread for a duration in which the valid state remains unchanged; and when the thread identifier has a suspended state, suspending thread execution by not returning the thread identifier to the execution queue.
Another method of self-scheduled execution of instructions is also disclosed, wherein one representative method embodiment comprises: receiving a work descriptor data packet; decoding the received work descriptor data packet into an execution thread having an initial program count and any received arguments; storing the initial program count and any received arguments in a thread control memory; assigning an available thread identifier to the execution thread; automatically queuing the thread identifier for execution of the execution thread when the thread identifier has a valid state; accessing the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; and periodically selecting the thread identifier to execute instructions of the execution thread for a duration in which the valid state remains unchanged until the execution thread is completed.
For any of the various representative embodiments, the method may further comprise: receiving an event data packet; and decoding the received event data packet into an event identifier and any received arguments.
For any of the various representative embodiments, the method may further comprise: assigning an initial valid state to the execution thread.
For any of the various representative embodiments, the method may further comprise: assigning a suspended state to the execution thread in response to executing a memory load instruction. For any of the various representative embodiments, the method may further comprise: assigning a suspended state to the execution thread in response to executing a memory store instruction.
For any of the various representative embodiments, the method may further comprise: terminating execution of the selected thread in response to executing a return instruction. For any of the various representative embodiments, the method may further comprise: in response to executing a return instruction, returning a corresponding thread identifier for the selected thread to the thread identifier pool. For any of the various representative embodiments, the method may further comprise: in response to executing a return instruction, clearing registers of a thread control memory indexed by the corresponding thread identifier of the selected thread. For any of the various representative embodiments, the method may further comprise: generating a return work descriptor packet in response to executing a return instruction.
For any of the various representative embodiments, the method may further comprise: generating a point-to-point event data message. For any of the various representative embodiments, the method may further comprise: generating a broadcast event data message.
For any of the various representative embodiments, the method may further comprise: responding to the received event data packet using an event mask. For any of the various representative embodiments, the method may further comprise: determining an event number corresponding to the received event data packet. For any of the various representative embodiments, the method may further comprise: changing the state of the thread identifier from suspended to valid in response to the received event data packet to resume execution of the corresponding execution thread. For any of the various representative embodiments, the method may further comprise: changing the state of the thread identifier from suspended to valid in response to the event number of the received event data packet to resume execution of the corresponding execution thread.
For any of the various representative embodiments, the method may further comprise: successively selecting a next thread identifier from the execution queue to execute a single instruction of a corresponding execution thread. For any of the various representative embodiments, the method may further comprise: performing a round-robin selection of a next thread identifier from the execution queue among the plurality of thread identifiers for executing a single instruction of a corresponding execution thread, respectively. For any of the various representative embodiments, the method may further comprise: performing a round-robin selection of a next thread identifier from the execution queue among the plurality of thread identifiers for executing a single instruction of a corresponding execution thread, respectively, until the execution thread is completed. For any of the various representative embodiments, the method may further comprise: performing a barrel selection of a next thread identifier from the execution queue among the plurality of thread identifiers, each for executing a single instruction of a corresponding execution thread.
For any of the various representative embodiments, the method may further comprise: assigning a valid state or a suspended state to the thread identifier. For any of the various representative embodiments, the method may further comprise: assigning a priority status to the thread identifier.
For any of the various representative embodiments, the method may further comprise: after executing a corresponding instruction, returning the corresponding thread identifier with an assigned valid state and an assigned priority to the execution queue.
For any of the various representative embodiments, the method may further comprise: selecting a thread identifier from a first priority queue at a first frequency and selecting a thread identifier from a second priority queue at a second frequency, the second frequency being lower than the first frequency. For any of the various representative embodiments, the method may further comprise: determining the second frequency as a skip count from selecting a thread identifier from the first priority queue.
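The two-frequency selection above can be sketched as follows; the `skip_count` parameter (how many selections pass before the second priority queue is served) and the requeue-after-one-instruction behavior are illustrative assumptions:

```python
from collections import deque

def select_tids(first, second, skip_count, n):
    """Select `n` thread identifiers, normally from the first (higher)
    priority queue; every `skip_count`-th selection is taken from the
    second (lower) priority queue instead, so the second queue is served
    at a correspondingly lower frequency."""
    first, second = deque(first), deque(second)
    picks = []
    for i in range(1, n + 1):
        use_second = (i % skip_count == 0 and second) or not first
        q = second if use_second else first
        tid = q.popleft()           # select the next thread identifier
        picks.append(tid)
        q.append(tid)               # requeue after one instruction
    return picks

# With skip_count=3, every third selection comes from the second queue:
# select_tids([0, 1], [7], 3, 6) -> [0, 1, 7, 0, 1, 7]
```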
For any of the various representative embodiments, the method may further comprise: controlling the data path access size. For any of the various representative embodiments, the method may further comprise: increasing or decreasing the memory load access size in response to a time-averaged usage level. For any of the various representative embodiments, the method may further comprise: increasing or decreasing the memory store access size in response to a time-averaged usage level. For any of the various representative embodiments, the method may further comprise: increasing the size of a memory load access request to correspond to a cache line boundary of the data cache.
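The access-size control above can be sketched as follows; the usage thresholds, the power-of-two sizes, and the 64-byte cache line are illustrative assumptions:

```python
def adjust_access_size(current_size, time_avg_usage,
                       low=0.25, high=0.75, min_size=8, max_size=64):
    """Increase or decrease the memory load/store access size (in bytes)
    in response to a time-averaged usage level in [0, 1]. The thresholds
    `low`/`high` and the power-of-two sizes are assumptions."""
    if time_avg_usage > high and current_size < max_size:
        return current_size * 2      # most fetched data is being used: widen
    if time_avg_usage < low and current_size > min_size:
        return current_size // 2     # little fetched data is being used: narrow
    return current_size

def round_up_to_cache_line(addr, size, line=64):
    """Grow a load request so that it ends on a data cache line boundary."""
    end = addr + size
    return ((end + line - 1) // line) * line - addr
```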
For any of the various representative embodiments, the method may further comprise: one or more system calls are generated to a host processor. For any of the various representative embodiments, the method may further comprise: the number of system calls within any predetermined time period is modulated using a predetermined credit count.
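A minimal sketch of modulating system calls with a predetermined credit count, assuming credits are replenished once per predetermined time period; the class and method names are hypothetical:

```python
class SystemCallThrottle:
    """Modulates the number of host system calls within any predetermined
    time period using a predetermined credit count: each generated call
    consumes a credit, and credits are replenished once per period."""
    def __init__(self, credits_per_period):
        self.credits_per_period = credits_per_period
        self.credits = credits_per_period

    def try_call(self):
        """Return True if a system call may be generated now."""
        if self.credits > 0:
            self.credits -= 1
            return True
        return False                 # caller must wait for the next period

    def new_period(self):
        """Replenish the credit count at the start of each time period."""
        self.credits = self.credits_per_period
```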
For any of the various representative embodiments, the method may further comprise: in response to a request from the host processor, all data from the thread control memory corresponding to the selected thread identifier is copied and transferred to monitor thread status.
For any of the various representative embodiments, the method may further comprise: executing a fiber create instruction to generate one or more commands that generate one or more call work descriptor packets destined for another processor core or a hybrid thread fabric circuit. For any of the various representative embodiments, the method may further comprise: in response to executing the fiber create instruction, reserving a predetermined amount of memory space to store any return arguments. For any of the various representative embodiments, the method may further comprise: storing a thread return count in a thread return register in response to generating the one or more call work descriptor packets. For any of the various representative embodiments, the method may further comprise: decrementing the thread return count stored in the thread return register in response to receiving a return data packet. For any of the various representative embodiments, the method may further comprise: in response to the thread return count in the thread return register decrementing to zero, changing the suspended state of the corresponding thread identifier to an active state for subsequent execution of a thread return instruction to complete the created fiber or thread.
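The thread-return accounting above can be sketched as follows; the `ThreadContext` class and the state names are illustrative, and packet transport is elided:

```python
class ThreadContext:
    """Sketch of the thread-return accounting described above: a fiber
    create sends call work descriptor packets, stores the outstanding
    count in a thread return register, and the creating thread remains
    suspended until every return data packet has arrived."""
    def __init__(self, tid):
        self.tid = tid
        self.state = "active"
        self.thread_return_count = 0     # the thread return register

    def fiber_create(self, num_fibers):
        # one call work descriptor packet is generated per created fiber
        self.thread_return_count = num_fibers
        self.state = "suspended"         # wait for all returns

    def receive_return_packet(self):
        self.thread_return_count -= 1    # decrement on each return packet
        if self.thread_return_count == 0:
            self.state = "active"        # may now execute the thread return
```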
For any of the various representative embodiments, the method may further comprise: executing a waiting or nonwaiting fiber join instruction. For any of the various representative embodiments, the method may further comprise: executing a fiber join all instruction.
For any of the various representative embodiments, the method may further comprise: a non-cache read or load instruction is executed to specify a general purpose register for storing data received from memory.
For any of the various representative embodiments, the method may further comprise: a non-cache write or store instruction is executed to specify data in the general purpose registers for storage in memory.
For any of the various representative embodiments, the method may further comprise: the transaction identifier is assigned to any load or store request to memory and is correlated with the thread identifier.
For any of the various representative embodiments, the method may further comprise: a first thread priority instruction is executed to assign a first priority to an execution thread having a corresponding thread identifier. For any of the various representative embodiments, the method may further comprise: a second thread priority instruction is executed to assign a second priority to the execution thread having the corresponding thread identifier.
For any of the various representative embodiments, the method may further comprise: executing a custom atomic return instruction to complete an execution thread of a custom atomic operation.
For any of the various representative embodiments, the method may further comprise: a floating point atomic memory operation is performed.
For any of the various representative embodiments, the method may further comprise: a custom atomic memory operation is performed.
Many other advantages and features of the invention will become apparent from the following detailed description of the invention and the examples thereof, the claims, and the accompanying drawings.
Drawings
The objects, features and advantages of the present invention will be more readily understood by reference to the following disclosure when considered in connection with the accompanying drawings in which like reference numerals are used to designate like components in the various figures and in which reference numerals with alphabetic characters are used to designate additional types, examples or variations of selected component embodiments in the various figures, wherein:
FIG. 1 is a block diagram of a representative first embodiment of a hybrid computing system.
FIG. 2 is a block diagram of a representative second embodiment of a hybrid computing system.
FIG. 3 is a block diagram of a representative third embodiment of a hybrid computing system.
FIG. 4 is a block diagram of a representative embodiment of a hybrid thread fabric having configurable computing circuitry coupled to a first interconnection network.
FIG. 5 is a high-level block diagram of a portion of a representative embodiment of a hybrid thread fabric circuit group.
FIG. 6 is a high-level block diagram of a second interconnect network within a hybrid thread fabric circuit group.
FIG. 7 is a detailed block diagram of a representative embodiment of a hybrid thread fabric circuit group.
FIG. 8 is a detailed block diagram of a representative embodiment of a hybrid thread fabric configurable computing circuit (tile).
FIGS. 9A and 9B (collectively FIG. 9) are detailed block diagrams of representative embodiments of hybrid thread fabric configurable computing circuits (tiles).
FIG. 10 is a detailed block diagram of a representative embodiment of a memory control circuit of a hybrid thread fabric configurable computing circuit (tile).
FIG. 11 is a detailed block diagram of a representative embodiment of thread control circuitry of a hybrid thread fabric configurable computing circuit (tile).
FIG. 12 is a diagram of representative hybrid thread fabric configurable computing circuits (tiles) forming synchronous domains and representative asynchronous packet network messaging.
FIG. 13 is a block diagram of a representative embodiment of a memory interface.
FIG. 14 is a block diagram of a representative embodiment of a scheduling interface.
FIG. 15 is a block diagram of a representative embodiment of an optional first network interface.
FIG. 16 is a diagram of representative hybrid thread fabric configurable compute circuits (tiles) forming synchronous domains and representative asynchronous packet network messaging for a group of hybrid thread fabric circuits to perform computations.
FIGS. 17A, 17B, and 17C (collectively FIG. 17) are flow diagrams of representative asynchronous packet network messaging and execution by a hybrid thread fabric configurable compute circuit (tile) for a hybrid thread fabric group to perform the computation of FIG. 16.
FIG. 18 is a diagram of representative hybrid thread fabric configurable compute circuits (tiles) forming synchronous domains and representative asynchronous packet network messaging for a group of hybrid thread fabric circuits to perform computations.
FIGS. 19A and 19B (collectively FIG. 19) are flow diagrams of representative asynchronous packet network messaging and execution by a hybrid thread fabric configurable compute circuit (tile) for a group of hybrid thread fabric circuits to perform the computation of FIG. 18.
FIG. 20 is a diagram of representative hybrid thread fabric configurable compute circuits (tiles) forming synchronous domains and representative asynchronous packet network messaging for a group of hybrid thread fabric circuits to perform a compute loop.
FIG. 21 is a flow diagram of representative asynchronous packet network messaging and execution by a hybrid thread fabric configurable compute circuit (tile) for a group of hybrid thread fabric circuits to perform a loop in the computation of FIG. 20.
FIG. 22 is a block diagram of a representative flow control circuit.
FIG. 23 is a diagram of representative hybrid thread fabric configurable compute circuits (tiles) forming synchronous domains and representative asynchronous packet network messaging and synchronous messaging for a group of hybrid thread fabric circuits to perform a compute loop.
FIG. 24 is a circuit block diagram of a representative embodiment of conditional branch circuitry.
FIG. 25 is a high-level block diagram of a representative embodiment of a hybrid thread processor.
FIG. 26 is a detailed block diagram of a representative embodiment of a thread memory of a hybrid thread processor.
FIG. 27 is a detailed block diagram of a representative embodiment of a network response memory of a hybrid thread processor.
FIG. 28 is a detailed block diagram of a representative embodiment of a hybrid thread processor.
FIGS. 29A and 29B (collectively FIG. 29) are a flow diagram of a representative embodiment of a method for self-scheduling and thread control of a hybrid thread processor.
FIG. 30 is a detailed block diagram of a representative embodiment of thread selection control circuitry of the control logic and thread selection circuitry of a hybrid thread processor.
FIG. 31 is a block diagram of a representative embodiment and a representative data packet of a portion of a first interconnection network.
FIG. 32 is a detailed block diagram of a representative embodiment of data path control circuitry of a hybrid thread processor.
FIG. 33 is a detailed block diagram of representative embodiments of system call circuitry and host interface circuitry of a hybrid thread processor.
FIG. 34 is a block diagram of a representative first embodiment of a first interconnection network.
FIG. 35 is a block diagram of a representative second embodiment of the first interconnection network.
FIG. 36 is a block diagram of a representative third embodiment of the first interconnection network.
FIG. 37 illustrates a representative virtual address space format supported by the system architecture.
FIG. 38 shows a representative translation process for each virtual address format.
FIG. 39 illustrates a representative send call instance for a hybrid thread.
FIG. 40 shows a representative send fork example for a hybrid thread.
FIG. 41 illustrates a representative send transfer example for a hybrid thread.
FIG. 42 illustrates a representative call chain use case for a hybrid thread.
Detailed Description
While this invention is susceptible of embodiment in many different forms, there are shown in the drawings and will be described herein in detail specific exemplary embodiments of the invention, with the understanding that the present disclosure is to be considered as an exemplification of the principles of the invention and is not intended to limit the invention to the specific embodiments illustrated. In this respect, before explaining at least one embodiment in detail, it is to be understood that the invention is not limited in its application to the details of construction and the arrangement of components set forth in the following description, illustrated in the drawings, or otherwise described by way of example. Methods and apparatuses in accordance with the present invention are capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein, as well as the abstract included below, are for the purpose of description and should not be regarded as limiting.
I. Hybrid computing system 100 and interconnection network:
FIGS. 1, 2, and 3 are block diagrams of representative first, second, and third embodiments of hybrid computing systems 100A, 100B, 100C (collectively referred to as system 100). FIG. 4 is a block diagram of a representative embodiment of a hybrid thread fabric ("HTF") 200 having configurable computing circuitry coupled to a first interconnection network 150 (also referred to simply as a "NOC", for "network on chip"). FIG. 5 is a high-level block diagram of a portion of a representative embodiment of a hybrid thread fabric circuit group.
FIG. 8 is a high-level block diagram of a representative embodiment of a hybrid thread fabric configurable computing circuit (tile).
Referring to FIGS. 1-9, the hybrid computing system 100 includes a hybrid thread processor ("HTP") 300, discussed in more detail below with reference to FIGS. 25-33, coupled to one or more hybrid thread fabric ("HTF") circuits 200.
Each node of system 100 runs a separate Operating System (OS) instance, thereby controlling the resources of the associated node. An application that spans multiple nodes is executed by coordinating the multiple OS instances of the spanned nodes. The processes associated with the application running on each node have an address space that provides access to node private memory and to global shared memory distributed across the nodes. Each OS instance contains drivers that manage local node resources. The shared address space of the application is jointly managed by the set of drivers running on the nodes. The shared address space is assigned a global space ID ("GSID"). The number of global spaces that are active at any given time is expected to be relatively small, so the GSID is set to 8 bits wide.
As used herein, hybrid threading refers to the ability to spawn multiple computing fibers and threads across different heterogeneous types of processing circuits (hardware), e.g., across HTF circuits 200 (as a reconfigurable computing fabric) and across processors, such as the HTP 300 or another type of RISC-V processor. Hybrid threading also refers to a programming language/style in which a worker thread transitions from one compute element to the next to move the computation to the location where the data is located, such transitions also being implemented in the representative embodiments.
Also in the representative embodiment, the HTP 300 is a RISC-V ISA based multithreaded processor having one or more processor cores 705 with an extended instruction set, along with one or more core control circuits 710 and one or more second memories 715, referred to as core control (or thread control) memories 715, as discussed in more detail below. In general, the HTP 300 provides barrel-style, round-robin instantaneous thread switching to maintain a high instruction-per-clock rate.
FIG. 31 is a diagram of a representative embodiment and a representative data packet of a portion of the first interconnection network 150.
FIGS. 34-36 are block diagrams of representative first, second, and third embodiments of the first interconnection network 150.
Table 1: generic packet header for the first interconnection network.
Table 2: read request packet for the first interconnection network.
Table 3: 16B read response packet (with 8B flit) for the first interconnection network.
The HTP 300 is a barrel multithreaded processor designed to perform well in applications having a high degree of parallelism operating on sparse data sets (i.e., applications with minimal data reuse). The HTP 300 is based on an open source RISC-V processor and executes in user mode. The HTP 300 supports the standard RISC-V user mode instructions, plus a set of custom instructions that allow threads to be managed and to send and receive events to/from other computing elements of the system 100.
Sparse data sets typically produce poor cache hit rates. The HTP 300, which has many threads per HTP processor core 705, allows some threads to wait for a response from memory while other threads continue to execute instructions.
II. Hybrid threading:
The hybrid threading of system 100 allows computing tasks to transition from one computing element to another, e.g., from the host processor 110 to an HTP 300 or an HTF 200, moving the computation to the location of the data.
A work descriptor packet is used to start work on the HTP 300 and HTF 200 computing elements.
For purposes of this disclosure, at a high or general level, a work descriptor packet includes (e.g., and without limitation): (1) the information required to route the work descriptor packet to its destination; and (2) the information required to initialize the thread context of the HTP 300 and/or the HTF 200.
Once a thread is in the active thread queue on the HTP 300, it is selected to execute instructions. Eventually, the thread will complete its computational task. At this point, the HTP 300 will send a return message back to the calling processor by executing a single custom RISC-V send return instruction. Sending a return is similar to sending a call. The instruction frees the stack and context structures and sends up to four 64-bit arguments back to the calling processor. The calling HTP 300 executes a receive return custom RISC-V instruction to receive the return.
The HTP 300 has three options for sending a work task to another HTP 300 or HTF 200 computing element: call, fork, and transfer, as shown in FIGS. 39 to 41:
(a) The call (901) initiates a computing task on the remote HTP 300 or HTF 200 and suspends further instruction execution until a return (902) has been received.
(b) The fork (903) initiates a computational task on the remote HTP300 or HTF200 and continues executing instructions. A single thread may initiate many computing tasks on a remote HTP300 or HTF200 computing element using a send fork mechanism. The original thread must wait until a return has been received from each of the forked threads before sending its return (902). The return information passed to the remote computing element is used by the remote computing task when it has completed and is ready to return.
(c) The transfer (904) initiates a computing task on the remote HTP 300 or HTF 200 and terminates the original thread. The return (902) information passed to the remote compute element is the return information from the call, fork, or transfer that initiated the current thread.
Although calls, forking, and passing for communication between
Threads may access private memory on local nodes and shared memory on local and remote nodes by referencing virtual address spaces. The HTP300 thread will primarily use the provided inbound call arguments and private memory stack to manipulate data structures stored in shared memory. Similarly, the HTF200 threads will use inbound call arguments and fabric memory to manipulate data structures stored in the shared memory.
When a thread is created, an HTP 300 thread is typically provided with up to four call arguments and a stack. The arguments are located in registers (memory 715, discussed below) and the stack is located in the node private memory. By using a standard stack-frame-based calling method, a thread will typically use the stack for thread-local private variables and for nested calls.
Each HTP 300 thread has a context block provided at the time the thread is launched. The context block provides a location in memory in which the thread's state may be saved.
Typically, when a thread is created, a maximum of four call arguments are also provided to the HTF200 thread. The arguments are in an in-fabric memory structure for access by data stream computations. The in-fabric memory is also used for thread private variables. The HTF200 thread may access the entire partitioned global memory of the application.
The computing elements of system 100 have different capabilities, each uniquely suited to particular computing tasks. The host processor 110 (internal or external to the device) is designed for the lowest possible latency when executing a single thread. The HTP 300 is optimized for parallel execution of a larger set of threads to provide the highest execution throughput. The HTF 200 is optimized for extremely high performance on dataflow-style kernels. The computing elements have been constructed to transfer computing tasks from one element to the next with extreme efficiency, to execute each computing kernel as efficiently as possible. FIG. 42 illustrates a representative call chain use case for hybrid threading with each compute element, and illustrates a traditional hierarchical usage model, such as for simulation. High-throughput, data-intensive applications may use different usage models oriented toward several independent streams.
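Under the hierarchical usage model above, the call chain can be sketched in software, with ordinary function calls standing in for work descriptor packets and returns; all function names here are hypothetical stand-ins, not the system's API:

```python
def htf_kernel(x):
    """Stand-in for a dataflow-style kernel executed on an HTF compute
    element; receives call arguments and produces a return value."""
    return x * x

def htp_thread(args):
    """Stand-in for an HTP thread: forks work to several HTF elements
    (send fork), joins all the returns, then sends its own return."""
    partials = [htf_kernel(a) for a in args]   # fork to HTF, join all
    return sum(partials)                       # send return to the caller

def host_main():
    """The application begins on the host processor, which sends a call
    (a work descriptor packet) to an HTP thread and awaits the return."""
    return htp_thread([1, 2, 3])
```

The returns propagate back up the chain in reverse order of the calls, mirroring the hierarchical model of FIG. 42.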
The entire application begins executing on host processor 110 (internal or external).
Return packets are transmitted back to the calling computing element as each computing task completes.
III. Hybrid thread fabric 200:
By way of overview, the HTF 200 is a reconfigurable computing fabric comprising a plurality of hybrid thread fabric configurable computing circuits (tiles) 210.
In the representative embodiment, in most cases, thread (e.g., kernel) execution and control signaling are separated between two different networks, where thread execution occurs using the synchronous mesh communication network 275, and control signaling occurs using the asynchronous packet network.
Referring to FIG. 8, a representative hybrid thread fabric configurable computing circuit (tile) 210 includes a number of components.
A representative example of each of these various components is shown and discussed below with reference to FIG. 9.
Notably, as discussed in more detail below, the configuration memory (e.g., RAM) 160 includes configuration circuitry (e.g., configuration memory multiplexer 372) and two different configuration memories that perform different configuration functions, namely a first instruction RAM 315 (which is used to configure the internal data paths of the tiles 210) and a second instruction and instruction index memory (RAM) 320, referred to herein as "spoke" RAM 320 (which is used for a number of purposes, including configuring the portions of the tile 210 that are independent of the current data path instruction).
As shown in FIGS. 8 and 9, communication line (or wire) 270 is shown as communication lines (or wires) 270A and 270B, such that communication line (or wire) 270A is the "input" (input communication line (or wire)) that feeds data into the input registers 350, and communication line (or wire) 270B is the "output" (output communication line (or wire)) that shifts data out of the tile 210.
It should be noted that there are various fields in the communication lines of the various groups or buses that form the synchronous mesh communication network 275.
Additionally, as discussed in more detail below, any input received over the synchronous mesh communication network 275 is held in the input registers 350.
Those skilled in the electronic arts will recognize that the connections between
Messages are routed from a source endpoint through the asynchronous packet network to a destination endpoint.
FIG. 13 is a block diagram of a representative embodiment of the memory interface.
FIG. 14 is a block diagram of a representative embodiment of the scheduling interface.
It should be noted that, as mentioned above, multiple levels (or types) of TIDs may be used, and often are used.
It should also be noted that separate transaction IDs are utilized to track individual memory requests through the system 100.
FIG. 15 is a block diagram of a representative embodiment of an optional first network interface. Referring to FIG. 15, when included, each optional first network interface couples the asynchronous packet network to the first interconnection network 150.
Referring again to FIG. 9, a representative HTF reconfigurable computing circuit (tile) 210A includes at least one multiply and shift operation circuit ("MS Op") 305, at least one arithmetic, logical, and bit operation circuit ("ALB Op") 310, a first instruction RAM 315, a second instruction (and index) RAM 320, referred to herein as "spoke" RAM 320, and one or more tile memory circuits (or memories) 325 (shown as memory "0" 325A, memory "1" 325B through memory "N" 325C, individually and collectively referred to as memory 325 or tile memory 325).
The synchronous mesh communication network 275 typically conveys the following fields:
1. Data: typically 64 bits wide, comprising computed data that is passed from one tile 210 to another tile 210.
2. Instruction RAM 315 address: abbreviated "INSTR," typically having a field width of 8 bits, comprising the instruction RAM 315 address for the receiving tile 210.
3. Thread identifier: referred to herein as a "TID," typically having a field width of 8 bits, comprising a unique identifier of the thread of the kernel, with a predetermined number of TIDs (a "TID pool") stored in the control registers 340 and potentially available to a thread if not in use by another thread. The TID is allocated at the first tile (base tile) 210 of the synchronous domain.
4. Transfer identifier: referred to as an "XID," typically having a field width of 8 bits, comprising a unique identifier for passing data from one synchronous domain to another, with a predetermined number of XIDs (an "XID pool") stored in the control registers 340 and potentially available to a thread (if not in use by another thread). The transfer may be a direct write of data from one domain to another, as "XID_WR," or it may be the result of a read of the memory 125 (as "XID_RD"), where the source domain sends a virtual address to the memory interface.
A highly innovative feature of the architecture of the HTF 200 is the formation of synchronous domains of tiles 210 over the synchronous mesh communication network 275.
Each tile 210 has an instruction RAM 315 containing configuration information that sets the data path of the tile 210 for a given operation.
The instruction set supported should meet the needs of the target application, e.g., an application having data types of 32- and 64-bit integer and floating point values. Additional target applications, such as machine learning, image analysis, and 5G wireless processing, may be executed using the HTF 200.
The spoke RAM 320 has multiple functions, and in a representative embodiment, one of those functions is to configure the portion of the tile 210 (time-sliced) that is independent of the current data path instruction.
The spoke RAM 320 also specifies when a synchronous input 270A is written to the tile memory 325. This situation may occur when a tile instruction requires multiple inputs and one of the inputs arrives early. The input arriving early may be written to the tile memory 325 and then read from the memory 325 when the other inputs arrive. For this case, the tile memory 325 is accessed as a FIFO. The FIFO read and write pointers are stored in the tile memory region RAM.
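The early-input buffering above can be sketched as a memory region accessed in FIFO mode, assuming a power-of-two region size with wraparound read/write pointers; the class name is illustrative:

```python
class RegionFifo:
    """Tile memory region accessed as a FIFO, as when one input of a tile
    instruction arrives early: the early input is written to tile memory
    and read back when the remaining inputs arrive. The read/write
    indices model the pointers kept in the tile memory region RAM."""
    def __init__(self, size_w):
        self.mem = [None] * (1 << size_w)   # region of 2**SizeW elements
        self.rd = 0                         # region FIFO read index
        self.wr = 0                         # region FIFO write index

    def write(self, value):
        """Buffer an early-arriving input in the region."""
        self.mem[self.wr] = value
        self.wr = (self.wr + 1) % len(self.mem)

    def read(self):
        """Consume a buffered input once the other inputs are ready."""
        value = self.mem[self.rd]
        self.rd = (self.rd + 1) % len(self.mem)
        return value
```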
The tile memory 325 is typically partitioned into regions. A smaller tile memory region RAM stores the information needed for memory region access. Each region represents a different variable in the kernel. A region may store shared variables (i.e., variables shared by all executing threads). A scalar shared variable has an index value of zero; a shared variable array has a variable index value. A region may store thread-private variables indexed by the TID identifier. Variables may be used to pass data from one synchronous domain to the next; for this case, the variable is written using the XID_WR identifier in the source synchronous domain and read using the XID_RD identifier in the destination domain. Finally, a region may be used to temporarily store data that a tile 210 previously generated in the synchronous data path until the other tile data inputs are ready; for this case, the read and write indices are FIFO pointers. The FIFO pointers are stored in the tile memory region RAM.
The tile memory region RAM typically contains the following fields:
1. Region index upper: the upper bits of the tile memory region index. The lower index bits are obtained from an asynchronous fabric message, a TID, XID_WR, or XID_RD identifier, or from a FIFO read/write index value. The region index upper bits are ORed with the lower index bits to generate the index into the tile memory 325.
2. Region SizeW: the width of the lower index of the memory region. The size of the memory region is 2^SizeW elements.
3. Region FIFO read index: the read index of a memory region acting as a FIFO.
4. Region FIFO write index: the write index of a memory region acting as a FIFO.
The tile 210 performs the computing operations of the HTF 200.
A computing operation is performed by configuring the data path within the tile 210.
The inputs to the ALB Op 310 and the MS Op 305 are selected from the input registers 350, the tile memory 325, and instruction-specified constant values.
Table 4:
Table 5: destination name and destination description.
SYNC_U: synchronous mesh communication network 275 output, up direction.
SYNC_D: synchronous mesh communication network 275 output, down direction.
SYNC_L: synchronous mesh communication network 275 output, left direction.
SYNC_R: synchronous mesh communication network 275 output, right direction.
WRMEM0_Z: memory 0 write. The value zero is used as an index to write to a region of the memory 325.
Memory 0 write. The instruction constant field is used as an index to write to a region of the memory 325.
Memory 0 write. The TID value is used as an index to write to a region of the memory 325.
At a high level, and as a representative example, all without limitation, the general operation of
FIG. 10 is a detailed block diagram of a representative embodiment of the memory control circuitry 330 (with associated control registers 340) of the hybrid thread fabric configurable computing circuit (tile) 210. FIG. 10 shows the read index logic for the tile memories 325 of the memory control circuit 330, which is replicated (not separately shown) for each memory 325. The instruction RAM 315 has a field that specifies which region of the tile memory 325 is being accessed, and a field that specifies the access index mode. The memory region RAM 405 (part of the control registers 340) specifies a region read mask that provides the upper memory address bits of a particular region. The mask is ORed (OR gate 408) with the lower address bits supplied by the read index select multiplexer 403. The memory region RAM 405 also contains the read index value when the tile memory 325 is accessed in FIFO mode. When accessed in FIFO mode, the read index value in the RAM 405 is incremented and written back. In various embodiments, the memory region RAM 405 may also hold the top of a TID stack for nested loops, as described below.
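The index formation above (region read mask ORed with the lower address bits) can be sketched as follows; the parameter names are illustrative, and the region base is assumed to be aligned to the region size so the OR behaves like an add:

```python
def region_index(region_index_upper, size_w, lower_index):
    """Form a tile memory index: the region's upper index bits are ORed
    with lower index bits (taken from a TID, an XID, or a FIFO pointer)
    masked to the region's width of 2**SizeW elements."""
    lower = lower_index & ((1 << size_w) - 1)   # keep SizeW low bits
    return region_index_upper | lower           # OR in the region's upper bits

# A region based at index 0x40 holding 2**4 = 16 elements, indexed by a
# TID of 5, yields tile memory index 0x45; an out-of-range lower index
# wraps into the region because only SizeW low bits are kept.
```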
FIG. 10 also shows that the control information (INSTR, XID, TID) of the synchronous mesh communication network 275 is required a few clocks earlier than the data input. For this reason, the control information is sent a few clocks ahead of the corresponding data.
Table 6:
If the synchronous domain of each
The first or "base" tile 210 of a synchronous domain controls the starting of threads for that domain.
FIG. 11 is a detailed block diagram of a representative embodiment of the thread control circuitry 335 (with associated control registers 340) of the hybrid thread fabric configurable computing circuit (tile) 210. Referring to FIG. 11, a number of registers are contained within the control registers 340, namely a TID pool register 410, an XID pool register 415, a pause table 420, and a completion table 422. In various embodiments, the data of the completion table 422 may equally be held in the pause table 420, and vice versa. The thread control circuitry 335 contains a continue queue 430, a re-entry queue 445, a thread control multiplexer 435, a run queue 440, an iteration increment 447, an iteration index 460, and a loop iteration count 465. Alternatively, the continue queue 430 and the run queue 440 may be equivalently embodied in the control registers 340.
FIG. 12 is a diagram of tiles 210 forming synchronous domains and representative asynchronous packet network messaging.
To run a thread to completion with a fixed pipeline delay, the representative embodiment provides a completion table 422 (or pause table 420), indexed by the thread's TID, at the first tile (base tile) 210 of the synchronous domain.
This type of thread control has several advantages. This thread control waits for all dependencies to complete before starting a thread, allowing the started thread to have a fixed synchronous execution time. The fixed execution time allows register stages to be used throughout the pipeline, rather than FIFOs. In addition, while one thread of a synchronous domain is executing, additional threads may be started behind it in pipelined fashion.
Similar control is provided when traversing synchronous domains, such as for executing multiple threads (e.g., for related compute threads forming a compute fabric). For example, the first synchronous domain will inform the next synchronous domain, via a message over the asynchronous packet network, when a thread may continue.
It should also be mentioned that various delays may need to be implemented, for example both would be needed when the
Pause table 420 is used to hold or pause the creation of a new sync thread in
The continue (or call) queue 430 keeps the thread ready to start on the synchronization domain. When all completions for the call operation are received, the thread is pushed into the continue queue 430. It should be noted that the threads in the continue queue 430 may require that the TID and/or XID be allocated before the threads can start on the synchronization domain, e.g., if all TIDs are in use, the threads in the continue queue 430 may start once the TID is released and available, i.e., the threads may wait until the TID and/or XID are available.
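The completion-count gating just described can be sketched in software (an illustrative model only; the hardware uses a table indexed by TID): each pending thread waits for a configured number of completion messages and, when its count reaches zero, is pushed toward the continue queue.

```python
# Illustrative sketch (not the actual hardware) of a completion table
# indexed by TID: each pending thread waits for a configured number of
# completion messages and becomes ready when its count reaches zero.
class CompletionTable:
    def __init__(self):
        self.pending = {}  # TID -> remaining completion count

    def pause(self, tid, completion_count):
        # A new synchronous thread is paused until its dependencies finish.
        self.pending[tid] = completion_count

    def complete(self, tid):
        # One AF completion message arrived for this TID.
        self.pending[tid] -= 1
        if self.pending[tid] == 0:
            del self.pending[tid]
            return True   # all dependencies done: push to the continue queue
        return False

table = CompletionTable()
table.pause(tid=7, completion_count=2)
assert table.complete(7) is False  # first completion: still waiting
assert table.complete(7) is True   # second completion: thread may start
```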
The re-entry queue 445 holds threads ready to start on the synchronization domain, with execution of those threads taking precedence over those in the continue queue 430. When all completions for a continue operation are received and the thread already has a TID, the thread is pushed into the re-entry queue 445. It should be noted that a thread in the re-entry queue 445 does not need to be allocated a TID. Separate re-entry and continue queues 445, 430 are provided to avoid deadlock situations. One particular type of continue operation is a loop. The loop message contains a loop iteration count, which specifies the number of times the thread is started once the pause operation is completed.
An optional priority queue 425 may also be implemented, such that any thread having a thread identifier in the priority queue 425 executes before any thread having a thread identifier in the continue queue 430 or the re-entry queue 445.
The state of the iteration index 460 is used when starting a thread for loop operations. The iteration index 460 is initialized to zero and starts incrementing for each thread. The iteration index 460 is pushed into the run queue 440 with thread information from the continue queue 430. The iteration index 460 may be used as a selection of the data path input multiplexer 365 within the first tile (base tile) 210 of the synchronization domain.
Loop iteration count 465 is received as part of the loop message, saved in pause table 420, pushed into continue queue 430, and then used to determine when the appropriate number of threads for the loop operation have started.
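The iteration-index mechanism above can be sketched as follows (the function and the multiplexer modeling are illustrative assumptions, not the embodiment's implementation): each started loop thread carries the current iteration index, which may then serve as the data-path input multiplexer select at the base tile.

```python
# A small sketch of the iteration-index mechanism: each loop thread started
# from the continue queue carries the current iteration index, which may
# then select a data-path input at the base tile (modeled as list indexing).
def start_loop(loop_iteration_count, mux_inputs):
    selected = []
    iteration_index = 0                      # initialized to zero
    while iteration_index < loop_iteration_count:
        # The index is pushed to the run queue with the thread, and used as
        # the data-path input multiplexer select in the base tile.
        selected.append(mux_inputs[iteration_index % len(mux_inputs)])
        iteration_index += 1                 # increments for each thread
    return selected

assert start_loop(4, ["a", "b", "c", "d"]) == ["a", "b", "c", "d"]
```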
The run queue 440 holds ready-to-run threads that have an assigned TID and/or XID and can execute upon occurrence of the appropriate spoke count clock. When a thread starts on the synchronization domain, the TID pool 410 provides a unique thread identifier (TID) to the thread. Only threads within the continue queue 430 obtain a TID. The XID pool 415 provides a unique transfer identifier (XID) to a thread when the thread starts on the synchronization domain. A thread from the continue queue 430 may obtain an XID. The assigned XID becomes the XID_WR for the started thread.
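The queue discipline described across the preceding paragraphs can be modeled roughly as follows (class and field names are illustrative): re-entry threads already hold a TID and take precedence, while continue-queue threads must first draw a TID from the pool before moving to the run queue.

```python
# Hedged sketch of thread-start ordering: the re-entry queue takes
# precedence over the continue queue, and only continue-queue threads
# need a TID allocated from the pool before moving to the run queue.
from collections import deque

class ThreadScheduler:
    def __init__(self, tids):
        self.tid_pool = deque(tids)
        self.reentry = deque()   # loop threads that already hold a TID
        self.cont = deque()      # new threads that still need a TID
        self.run = deque()

    def schedule(self):
        if self.reentry:                      # re-entry has precedence
            self.run.append(self.reentry.popleft())
        elif self.cont and self.tid_pool:     # continue needs a free TID
            thread = self.cont.popleft()
            thread["tid"] = self.tid_pool.popleft()
            self.run.append(thread)
        # otherwise: continue-queue threads wait until a TID is released

sched = ThreadScheduler(tids=[0, 1])
sched.cont.append({"name": "new_thread"})
sched.reentry.append({"name": "loop_thread", "tid": 7})
sched.schedule()
assert sched.run[0]["name"] == "loop_thread"   # re-entry beat continue
sched.schedule()
assert sched.run[1]["tid"] == 0                # TID allocated from the pool
```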
For any given or selected program to be executed, code or instructions of that program, written or generated in any suitable or selected programming language, are compiled for system 100 and loaded into system 100, including instructions of HTP300 and
For example, but not limiting of, the core is started with a job descriptor message containing zero or more arguments, typically generated by the
Host messages sent to a core are sent to all
Control messaging over the
(1)
(2) The
(3) The
The
Table 7:
Figs. 16 and 17 provide examples of message passing and thread control within the system 100, where example calculations are provided to illustrate how the synchronous
Fig. 16 is a diagram of a representative hybrid thread fabric configurable compute circuit (tile) 210 forming a synchronous domain and a representative asynchronous packet network messaging for the
At step 510,
At step 514, the
The
Subsequently, at step 528, the
At step 532, once the value is written to the tile memory 325 of the
When execution continues, at step 542,
At step 546,
The
Subsequently, at step 554, the
Once the value is written to the tile memory, an AF-write complete message (560) is sent to the
At step 568, tile 210H within the second synchronization domain performs an addition operation of the B value passed from the host message and the A value read from
At step 572, tile 210J within the second synchronization domain sends an AF message containing the R value to HTF scheduling interface 225 (570). The AF message contains the XID value assigned from
At step 576, the AF message (574) from the second synchronization domain (tile 210K) sends the XID value assigned in the first synchronization domain back to
It should be noted that in this example of fig. 16 and 17, to illustrate the various AF messages that may be used for thread control, a number of
Another message passing example for thread control across multiple synchronous domains is provided in FIGS. 18 and 19, again using AF complete and continue messages over
For this example, the
At step 604, the
At step 614,
At step 618, when the computation continues in the
The AF continue message (616) may include the TID or XID_WR value as an index into the pause table 420 on
At step 624, an AF complete message (622) with a TID value (e.g., (first type) TID of 11) is sent to
Next, at
When the computation of the second sync thread (626) is complete, several housekeeping messages are sent through the
Data is transferred from one synchronization domain to the next using a data transfer operation. Typically, data transfer is used in conjunction with load operations that obtain data from
Next, data transfer operations between the synchronization domains utilize a variation of step 624. Instead of sending the AF complete message in step 624, the
Control of iterative thread loops across synchronization domains utilizes a similar control messaging pattern. The loop message flow allows multiple synchronization domains to be started with a single loop message. Each of the started synchronous threads can access its iteration index. FIG. 20 is a diagram of representative hybrid thread fabric configurable compute circuits (tiles) forming synchronous domains and representative asynchronous packet network messaging for the group of hybrid thread fabric circuits to perform a loop in a computation. FIG. 21 is a flow diagram of representative asynchronous packet network messaging and execution by hybrid thread fabric configurable compute circuits (tiles) for a group of hybrid thread fabric circuits to perform the loop in the computation of FIG. 20.
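The basic loop bookkeeping introduced here — a single loop message carrying an iteration count, with the fabric tracking when the last started thread completes — can be sketched as follows (an illustrative software model; the class name is assumed, not from the embodiment):

```python
# Illustrative sketch of loop completion counting: the loop count from the
# AF loop message fixes how many iterations run, and a single loop-complete
# indication can be raised when the last iteration finishes.
class LoopState:
    def __init__(self, loop_count):
        self.loop_count = loop_count  # iterations requested by the loop message
        self.finished = 0

    def iteration_done(self):
        # Called as each started loop thread completes.
        self.finished += 1
        return self.finished == self.loop_count  # True -> signal loop complete

loop = LoopState(loop_count=3)
assert loop.iteration_done() is False
assert loop.iteration_done() is False
assert loop.iteration_done() is True  # last iteration: loop complete
```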
FIG. 20 shows three synchronization domains, namely a
Referring again to FIG. 11, control register 340 contains a completion table 422 (or a pause table 420). For loops, two types of completion information are maintained in completion table 422: a first completion count related to the number of completion messages that must arrive before a thread can begin, as discussed above, and a second loop or iteration (completion) count for tracking the number of loop threads that have started and completed. The loop begins by sending an AF loop message containing the loop count (and the respective TIDs, as discussed below) to the
The TID is returned to TID pool 410 by sending an AF message from a tile within the synchronization domain at the termination of the thread, which may be an AF complete message or, for the second embodiment, an AF re-entry message. This may also be achieved by a free TID message to the
Referring to FIGS. 20 and 21, at
Next, at
For example, but not limiting of, for nested and doubly nested loops, several additional novel features are utilized in order to minimize idle time, including the re-entry queue 445 and additional sub-TIDs, e.g., TID2 for the outermost loop, TID1 for an intermediate or middle loop, and TID0 for the innermost loop. Each thread executing in a loop also has a unique TID, e.g., TID2 values of 0-49 for an outer loop that will have fifty iterations, which are also used for the corresponding completion messages when each iteration completes execution, again by way of example and not limitation.
Referring again to FIG. 11, several novel mechanisms are provided to support efficient looping and minimize idle time. For example, a loop with a data-dependent end condition (e.g., a "while" loop) requires that the end condition be calculated while the loop is executing. Also, for control and execution of loops, if all TIDs are allocated from TID pool 410, but the thread at the head of the queue is a new loop that cannot execute due to a lack of available TIDs, a potential deadlock problem may arise, preventing other loop threads from completing and releasing their allocated TIDs. Thus, in a representative embodiment, control register 340 includes two separate queues for ready-to-run threads, with a first queue (continue queue 430, also used for non-looping threads) used to initiate a new loop and a separate second queue (re-entry queue 445) used for loop continuation. The continue queue 430 allocates TIDs from TID pool 410 to start a thread, as previously discussed. The re-entry queue 445 uses the previously allocated TID as each iteration of the loop thread executes and transmits an AF re-entry message with the previously allocated TID. Any thread (TID) in the re-entry queue 445 will move to the run queue 440 ahead of threads (TIDs) that may be in other queues (continue queue 430). Thus, once the loop starts, loop iterations can proceed extremely quickly, with each subsequent thread of the loop starting quickly via the separate re-entry queue 445, and further, without potential deadlock issues. In addition, the re-entry queue 445 allows this to be performed quickly, which is extremely important for loops with data-dependent end conditions, which can now run efficiently without interruption through the last iteration that produces the data-dependent end condition.
Referring again to FIGS. 9 and 10, the control register 340 includes a memory area RAM 405. In various embodiments, the memory area RAM 405 may also maintain the top of the TID stack (with an identifier) through nested loops, as described below. As mentioned above, each nested loop initiates a thread with a new (or reused) set of TIDs. A thread that is looping may need to access its own TID as well as the TID of an outer loop thread. Accessing the TID of each nested loop thread allows access to the private variables of each thread, e.g., using the TIDs of different levels or types as described above, TID0, TID1, and TID2. The top-of-stack TID identifier indicates the TID of the active thread, and selects which of the three TIDs (TID0, TID1, and TID2) is used to perform the respective operation. The three TIDs and the top-of-stack TID identifier are included in messages passing through the synchronous mesh communication network 275 and are therefore known to each thread. Because multiple TIDs are included within the synchronous fabric message, along with the top-of-stack TID identifier, the multiple TIDs allow threads in a nested loop to access variables from any level within the nested loop threads. A private thread variable is accessed using the selected TID and the tile memory area RAM 405 identifier.
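The TID-stack selection above can be sketched as follows (function names and the address-mask modeling are illustrative assumptions): the top-of-stack identifier picks one of the three TID levels, and the selected TID combined with the memory area identifier addresses that thread's private variables.

```python
# Hedged sketch of nested-loop TID selection: three TID levels travel with
# each synchronous fabric message, and a top-of-stack identifier picks which
# level's TID is used to address that thread's private variables.
def select_tid(tids, top_of_stack):
    # tids = (TID0, TID1, TID2); top_of_stack names the active nesting level
    return tids[top_of_stack]

def private_var_address(area_mask, tids, top_of_stack):
    # Private thread variables are indexed by the selected TID within the
    # tile memory area (the area identifier modeled as an address mask).
    return area_mask | select_tid(tids, top_of_stack)

tids = (4, 9, 2)          # innermost, intermediate, outermost TIDs
assert select_tid(tids, 0) == 4           # innermost loop active
assert select_tid(tids, 2) == 2           # outer-loop variable access
assert private_var_address(0x20, tids, 1) == 0x29
```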
Another novel feature of the present disclosure is a mechanism to order loop thread execution to handle loop iteration dependencies that also accommodates any delays in completion messages and data received over the
To provide for ordered loop thread execution, in a representative embodiment, additional messaging and additional fields are utilized in completion table 422 for each loop iteration. The
In other words, for each thread that has received all data completions (and is therefore ready to run), thread control circuitry 335 (which generally includes various state machines) checks completion table 422 to determine if the thread is the next thread to run (having the next thread ID, e.g., TID 4). If so, the thread (TID 4) is moved into run queue 440; if not, the thread (e.g., a thread whose data completion count has reached zero but which has TID 5) is not started, and an index of its TID is maintained so it can start next. When the data completion count of the thread with the next TID (in this case TID 4) decrements to zero, and thus all completion messages have arrived, that thread queues up for execution, and the thread to execute (TID 4) also has the next TID, in which case its next TID is
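The ordered-start check can be modeled as below (the class is an illustrative sketch of the assumed mechanism, not the state machines themselves): a ready thread runs only if its TID matches the expected next TID; otherwise it is held and released when its turn arrives.

```python
# Sketch of ordered loop-thread start: a thread whose completions have all
# arrived runs only if its TID matches the expected next TID; otherwise it
# is held, and released (with any held successors) when its turn comes.
class OrderedStarter:
    def __init__(self, first_tid):
        self.next_tid = first_tid
        self.held = set()      # ready threads waiting their turn
        self.run_queue = []

    def thread_ready(self, tid):
        self.held.add(tid)
        # Release the next thread, and any held successors, in TID order.
        while self.next_tid in self.held:
            self.held.remove(self.next_tid)
            self.run_queue.append(self.next_tid)
            self.next_tid += 1

s = OrderedStarter(first_tid=4)
s.thread_ready(5)             # ready out of order: held, not started
assert s.run_queue == []
s.thread_ready(4)             # TID 4 arrives: both can now start in order
assert s.run_queue == [4, 5]
```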
Fig. 24 is a circuit block diagram of a representative embodiment of
However, in the representative embodiment, a mechanism is provided for dynamic self-configuration using
In a representative embodiment, the
IV. Hybrid thread processor 300:
Fig. 25 is a high-level block diagram of a representative embodiment of a hybrid thread processor ("HTP") 300. Fig. 26 is a detailed block diagram of a representative embodiment of the thread storage memory 720 (also referred to as thread control memory 720) of the
The HTP300 generally includes one or more processor cores 705, which may be any type of processor core, such as a RISC-V processor core, an ARM processor core, and the like, all by way of example and not limitation. Core control circuitry 710 and core control memory 715 are provided for each processor core 705, and are shown in FIG. 25 for one processor core 705. For example, when multiple processor cores 705 are implemented, such as in one or more HTPs 300, a corresponding plurality of core control circuits 710 and core control memories 715 are also implemented, where each core control circuit 710 and core control memory 715 is used to control a corresponding processor core 705. In addition, one or more of the
Core control circuitry 710, in turn, includes control logic and thread selection circuitry 730 and network interface circuitry 735. The core control memory 715 includes a plurality of registers or other memory circuits, conceptually divided and referred to herein as a thread memory (or thread control memory) 720 and a network response memory 725. For example, and without limitation, the
Referring to FIG. 26, thread storage 720 includes a plurality of registers, including: a thread ID pool register 722 (storing a predetermined number of thread IDs that may be utilized, typically populated with identification numbers 0 through 31, such as, but not limited to, a total of 32 thread IDs when the system 100 is configured); thread status (table) registers 724 (storing thread information such as valid, idle, paused, waiting for instruction, first (normal) priority, second (low) priority, or a temporary change of priority when resources are not available); a program counter register 726 (e.g., storing the address or virtual address in instruction cache 740 where the thread starts next); general purpose registers 728 for storing integer and floating point data; a pending fiber return count register 732 (which tracks the number of pending threads that will return to complete execution); a return argument buffer 734 ("RAB", e.g., a head RAB serving as the head of a linked list of return argument buffers); a thread return register 736 (e.g., storing a return address, a call identifier, and any thread identifier associated with the calling thread); a custom atomic transaction identifier register 738; a received event mask register 742 (used to specify which events to "listen" to, as discussed in more detail below); an event status register 744; and a data cache 746 (typically 4 to 8 cache lines of cache memory are provided for each thread). All of the different registers of
Referring to fig. 27, network response memory 725 contains a plurality of registers, such as memory request (or command) register 748 (e.g., a command to read, write, or perform a custom atomic operation); a thread ID and transaction identifier ("transaction ID") register 752 (where the transaction ID is used to track any requests to memory and associate each such transaction ID with the thread ID of the thread that generated the request to memory 125); a request cache line index register 754 (used to specify which cache line in the data cache 746 to write when data is received from the memory of a given thread (thread ID)); register byte register 756 (specifying the number of bytes written to general register 728); and general register index and type registers 758 (indicating which general register 728 to write and whether it is a sign extension or floating point).
As described in more detail below, the HTP300 will receive a job descriptor packet. In response, the HTP300 will find an idle or empty context and initialize a context block, assign a thread ID to the thread of execution (referred to herein simply as a "thread") if one is available, and place the thread ID in the execution (i.e., "ready-to-run") queue 745. The threads in the execution (ready-to-run) queue 745 are typically selected for execution in a round-robin or "barrel" selection process, in which a single instruction of a first thread is provided to the
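The round-robin ("barrel") selection can be sketched as below (an illustrative model; function names are assumed): one instruction is issued per ready thread in turn, so a thread that stalls on memory simply drops out of the rotation until its data returns.

```python
# A minimal sketch of barrel scheduling: issue one instruction for each
# ready thread in turn, rotating the thread to the back of the queue.
from collections import deque

def barrel_schedule(ready_queue, cycles):
    issued = []
    q = deque(ready_queue)
    for _ in range(cycles):
        if not q:
            break
        tid = q.popleft()
        issued.append(tid)     # issue a single instruction for this thread
        q.append(tid)          # rotate the thread to the back of the queue
    return issued

assert barrel_schedule([1, 2, 3], 6) == [1, 2, 3, 1, 2, 3]
```

A stalled thread would simply not be re-appended until its memory response arrives, which is why the fixed rotation hides memory latency across many threads.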
Thus, the HTP300 is an "event-driven" processor and will automatically begin thread execution upon receipt of a job descriptor packet (provided that a thread ID is available, but without any other requirement to initiate execution), i.e., arrival of a job descriptor packet automatically triggers the start of local thread execution without reference to the
In addition to
Similarly, when the thread has completed, the HTP300 or
Fig. 28 is a detailed block diagram of a representative embodiment of the
When a work descriptor packet arrives, the control logic and thread selection circuitry 730 assigns an available thread ID from the thread
The thread ID is given a valid status (indicating that it is ready for execution) and the thread ID is pushed to the
Upon completion of executing the instruction, the same triplet information (thread ID, valid status, and priority) may be returned to the execution (ready to run) queue 745 under the control of the control logic and thread selection circuitry 730, depending on various conditions, to continue selection for round robin execution. For example, if the last instruction of the selected thread ID is a return instruction (indicating that thread execution is complete and a return data packet is provided), control logic and thread selection circuitry 730 returns the thread ID to the available thread ID pool in thread
Continuing with the previous example, when the last instruction of the thread ID is the return instruction, the return information (thread ID and return argument) is pushed through the
Continuing with the latter example, the instructions of the thread may be load instructions, i.e., read requests to the
A store request to
Fig. 29 is a flowchart of a representative embodiment of a method for self-scheduling and thread control of an HTP300, and provides a useful overview, in which the HTP300 has been populated with instructions in an
When thread execution is complete, step 822, i.e., when the instruction being executed is a return instruction, the thread ID is returned to the thread
Similarly, the HTP300 may generate calls to create threads on local or remote computing elements to create threads on
Such instructions will also allocate and reserve the associated memory space, such as in return argument buffer 734, if the created thread will have a return argument. If there is insufficient space in return argument buffer 734, the instruction will stall until return argument buffer 734 is available. The number of fibers or threads created is limited only by the amount of space to hold the response arguments. Creating threads without return arguments may avoid reserving return argument space, thereby avoiding a possible suspended state. This mechanism ensures that returns from completed threads always have locations to store their arguments. When returning back to the HTP300 as a data packet on the
As discussed in more detail below, various types of fiber join instructions are utilized to determine when all spawned threads are complete, and may be instructions with or without waiting. A count of the number of spawned threads is held in the pending fiber return count register 732, which is decremented when the HTP300 receives a thread return. The join operation may be implemented by copying the return into a register associated with the generated thread ID. If the join instruction is a wait instruction, it will remain in a paused state until a return specifying the thread ID of a spawned thread arrives. During this time, other instructions are executed by the
A thread return instruction may also be used as the instruction following the fiber create instruction, rather than a join instruction. The thread return instruction may also be executed when the count in the pending fiber return count register 732 reaches zero and the last thread return packet has been received; it indicates that the fiber create operation has completed and all returns have been received, allowing the thread ID, return argument buffer 734, and linked list to be freed for other purposes. In addition, it may also generate and transmit a work descriptor return packet (e.g., with result data) to the source that called the primary thread (e.g., an identifier or address of the source that generated the call).
Not all join instructions need return arguments; an acknowledgement that decrements the count in the pending fiber return count register 732 is sufficient. When the count reaches zero, the thread restarts, because all joins are then complete.
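The join mechanism described above can be sketched as follows (an illustrative model; the class name is assumed): the pending fiber return count tracks spawned threads, each thread-return packet decrements it, and the joining thread may resume once the count reaches zero.

```python
# Hedged sketch of fiber join: fiber creates increment the pending return
# count, thread returns decrement it, and the joining (or returning)
# thread proceeds once the count reaches zero.
class FiberJoin:
    def __init__(self):
        self.pending_returns = 0

    def create_fiber(self):
        self.pending_returns += 1

    def thread_return_received(self):
        self.pending_returns -= 1
        return self.pending_returns == 0  # True -> joining thread resumes

join = FiberJoin()
join.create_fiber()
join.create_fiber()
assert join.thread_return_received() is False
assert join.thread_return_received() is True   # all fibers joined
```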
Communication between processing elements is required to facilitate the processing of parallel algorithms. The representative embodiments provide an efficient means for threads of a set of processing resources to communicate using various event messages, which may also include data (e.g., arguments or results). Event messaging allows any
Event messaging supports point-to-point and broadcast event messages. Each processing resource (HTP 300) may determine when a received event operation is complete and the processing resource should be notified. The event receive modes include simple (a single received event completes the operation), collect (a counter is used to determine when enough events have been received to complete the operation), and broadcast (the operation completes when an event is received on a particular channel). Additionally, an event may be sent with an optional 64-bit data value.
The HTP300 has a set of event receipt status consisting of a 2-bit receipt mode, a 16-bit counter/channel number, and a 64-bit event data value, stored in the event status register 744. The HTP300 may have multiple sets of event receipt status for each thread context, with each set indexed by an event number. Thus, events may be for a particular thread (thread ID) and event number. The sent event may be a point-to-point message with a single destination thread or a broadcast message sent to all threads within a set of processing resources belonging to the same process. When such an event is received, the suspended or sleeping thread may be reactivated to resume processing.
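The three receive modes and the (mode, counter/channel, data) state tuple can be modeled roughly as below (function names are illustrative; the counter field doubles as the channel number in broadcast mode, which is an assumption of this sketch):

```python
# Illustrative sketch of the event-receive modes: simple (one event
# completes), collect (a counter of required events), and broadcast
# (completion on a matching channel number).
def make_event_state(mode, count_or_channel=0):
    return {"mode": mode, "counter": count_or_channel, "data": None}

def receive_event(state, channel=0, data=None):
    state["data"] = data                     # optional 64-bit data value
    if state["mode"] == "simple":
        return True                          # single event completes
    if state["mode"] == "collect":
        state["counter"] -= 1
        return state["counter"] == 0         # enough events collected
    if state["mode"] == "broadcast":
        return channel == state["counter"]   # counter field holds the channel
    return False

s = make_event_state("collect", 2)
assert receive_event(s) is False
assert receive_event(s, data=42) is True
b = make_event_state("broadcast", 3)
assert receive_event(b, channel=1) is False
assert receive_event(b, channel=3) is True
```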
This use of the event status register 744 is much more efficient than a standard Linux-based host processor that can send and receive events over an interface that allows the
The point-to-point message will specify the event number and destination (e.g., node number, which HTP300, which core, and which thread ID). On the receive side, the HTP300 will be configured or programmed with one or more event numbers that are saved in the event status register 744. If the HTP300 receives the event information with the event number, it is triggered and transitions from the halt state to the active state to resume execution, e.g., execute the event received instruction (e.g., EER, infra). The instruction will then determine whether the correct event number was received and, if so, write any associated 64-bit data into the general register 728 for use by another instruction. If the event has received instruction execution and has not received the correct event number, it will pause until the particular event number is received.
An event listen (EEL) instruction may also be utilized, where an event mask is stored in the event received mask register 742 indicating one or more events to be used to trigger or wake a thread. When event information arrives with any of those specified events, the receiving HTP300 will know which event number triggered, e.g., which other process may have completed, and will receive event data from those completed events. The event listen instruction may also have a wait and no wait change, as discussed in more detail below.
For event messaging in the gather mode, the receiving HTP300 will gather (wait for) a set of receive events before triggering, setting the count in the event status register 744 to the required value, which is decremented when the required event message is received, and triggers when the count decrements to zero.
In the broadcast mode, the sender processing resource may transmit a message to any thread within the node. For example, a sending HTP300 may transmit a series of point-to-point messages to every other HTP300 within a node, and then each receiving HTP300 passes the message to each internal core 705. Each core control circuit 710 will examine its thread list to determine whether it corresponds to the event number it was initialized to receive and to determine which lane may have been designated on the
This broadcast mode is particularly useful when thousands of threads can be executed in parallel, with the last thread executed transmitting broadcast event information indicating completion. For example, a first count of all threads that need to complete may be saved in the event status register 744, while a second count of all threads that have already executed may be saved in the
As mentioned above, while the HTP300 may utilize standard RISC-V instructions, it is noted that an extended instruction set may be provided to utilize all of the system 100's computing resources, as discussed in more detail below. The thread created from the
The new load instruction:
The HTP300 has a relatively small number of read/write buffers, also referred to as data cache registers 746, for each thread. A buffer (data cache register 746) temporarily stores shared memory data for use by its own thread. The data cache 746 is managed by a combination of hardware and software. The hardware automatically allocates buffers and evicts data as needed. Using RISC-V instructions, software determines which data should be cached (read and write data), and when a data cache register 746 should be invalidated (if clean) or written back to memory (if dirty). The RISC-V instruction set provides the FENCE instruction, as well as acquire and release indicators on the atomic instructions.
The standard RISC-V load instruction automatically uses the read data cache register 746. The standard load checks to determine if the required data is in the existing data cache register 746. If so, then data is obtained from the data cache register 746 and the execution thread can continue execution without suspension. If the desired data is not in the data cache register 746, the HTP300 looks up the available data cache register 746 (data needs to be evicted from the buffer) and reads 64 bytes from memory into the data cache register 746. The execution thread is suspended until the memory read is complete and the load data is written into the RISC-V register.
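The hit/miss path above can be modeled roughly as follows (class names and the eviction policy are illustrative assumptions; the real hardware would suspend the thread on a miss rather than block):

```python
# Hedged model of the standard-load path: a hit in a data cache register
# returns immediately; a miss evicts a buffer, fetches a 64-byte line, and
# would suspend the thread until the memory read completes.
class DataCache:
    LINE = 64

    def __init__(self, num_buffers, memory):
        self.memory = memory                 # address -> byte value
        self.buffers = {}                    # line address -> line bytes
        self.num_buffers = num_buffers

    def load(self, addr):
        line = addr - (addr % self.LINE)
        if line in self.buffers:
            return self.buffers[line][addr - line], "hit"
        if len(self.buffers) >= self.num_buffers:
            self.buffers.pop(next(iter(self.buffers)))   # evict a buffer
        self.buffers[line] = [self.memory.get(line + i, 0)
                              for i in range(self.LINE)]
        return self.buffers[line][addr - line], "miss"   # thread would suspend

cache = DataCache(num_buffers=2, memory={100: 7})
assert cache.load(100) == (7, "miss")   # first access fills the buffer
assert cache.load(100) == (7, "hit")    # subsequent access hits, no suspend
```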
Read buffering has two main benefits: 1) greater access to the
The new load instructions provide "probabilistic" caching based on expected access frequency, distinguishing frequently used data from infrequently used data. This is particularly important with sparse data sets which, if placed into the data cache registers 746, would overwrite other data needed more frequently, effectively polluting the data cache registers 746. The new load instructions (NB or NC) allow frequently used data to be held in the data cache registers 746, while infrequently used (sparse) data that would normally be cached is instead designated for uncached delivery into the general purpose registers 728.
This type of instruction has an NB suffix (non-buffered) or, equivalently, an NC suffix (non-cached):
LB.NB RA,40(SP)。
NB (NC) load instructions are expected to be used in hand-written assembly runtime libraries.
In Table 8, the following load instructions are added as 32-bit instructions, where Imm is the immediate field, RA is the register name, rs1 is the source index, rd is the destination index, and the bits in fields 14-12 and 6-0 specify the instruction.
Table 8:
bandwidth to memory is often the primary factor limiting application performance. The representative embodiments provide a means to inform the HTP300 about the size of the memory load request that should be issued to the
There is another optimization where the application is aware of the size of the data structure accessed and may specify the amount of data to load into the data cache register 746. As an example, if the algorithm uses a 16 byte size structure, and the structure is scattered in memory, it would be optimal to issue a 16 byte memory read and place the data into the data cache register 746. The representative embodiment defines a set of memory load instructions that provide the size of the operands to be loaded into the HTP300 registers and the size of the access to memory in the event of a load miss to the data cache register 746. The actual load of the
When the requested data is less than a cache line, the load instruction may also request additional data that is not currently needed by the HTP300 but may be needed in the future, which is worth obtaining at the same time (e.g., as a prefetch), optimizing read size access to the
Thus, the representative embodiments minimize wasted bandwidth by requesting only memory data that is known to be needed. The result is an increase in application performance.
A set of load instructions has been defined that allows the amount of data to be accessed to be specified. Data is written to the buffer and invalidated by eviction, by a FENCE, or by an atomic with acquire specified. The load instruction provides a hint as to how much additional data (in 8-byte increments) will be accessed from memory and written to the memory buffer. The load will only access additional data up to the next 64-byte boundary. The load instruction specifies the number of additional 8-byte elements to load using the operation suffixes RB0-RB7:
LD.RB7 RA,40(SP)
The instruction format is shown in Table 9. The number of 8-byte data elements to be loaded into the buffer is specified by bits 6 and 4:3 of the 32-bit instruction. These load instructions may be used in hand-written assembly routines or, ideally, emitted by a compiler. Initially, only hand-written assembly libraries are expected to utilize these instructions.
Table 9:
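The RB0-RB7 sizing rule above can be sketched as a small calculation (an illustrative model; the function name is assumed): the suffix requests up to seven additional 8-byte elements, but the access never crosses the next 64-byte boundary.

```python
# Sketch of the RB0-RB7 sizing rule: the suffix requests up to seven
# additional 8-byte elements, clamped at the next 64-byte boundary.
def sized_load_bytes(addr, rb):
    requested = 8 * (1 + rb)                 # operand plus RB extra elements
    to_boundary = 64 - (addr % 64)           # bytes left in this 64-byte line
    return min(requested, to_boundary)

assert sized_load_bytes(0, 7) == 64     # RB7 at a line start: full line
assert sized_load_bytes(40, 7) == 24    # clamped at the 64-byte boundary
assert sized_load_bytes(16, 1) == 16    # 16-byte structure: two 8B elements
```

The third case mirrors the 16-byte-structure example given earlier: a 16-byte read is issued and placed into the data cache register rather than a full 64-byte line.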
new store instruction
The HTP300 has a small number of store buffers that temporarily hold shared memory data. The store buffers allow multiple writes to memory to be merged into a smaller number of memory write requests. This has two benefits: 1) fewer write requests are more efficient for the
Standard RISC-V store instructions write data to the HTP300 store buffers. However, there are situations where it is known to be preferable to write data directly to memory rather than to a store buffer. One such situation is a scatter operation. A scatter typically writes only a single data value to a store buffer; writing it to the buffer thrashes the buffer, forcing out other stored data that would benefit from write merging. A set of store instructions is therefore defined for the HTP300 to indicate that write buffering should not be used. These instructions write data directly to the
Non-buffered store instructions are expected to be used in hand-assembled libraries, and are indicated with an NB suffix:
ST.NB RA,40(SP)
The following store instructions are added, as shown in table 10.
Table 10:
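The write-merging benefit of the store buffer, and the reason a scatter pattern should bypass it with ST.NB, can be sketched with a toy Python model (the 64-byte line size follows the text; everything else is an illustrative assumption):

```python
class StoreBuffer:
    """Toy write-combining store buffer: stores to the same 64-byte line
    merge into one pending entry, and flush() emits one write request per
    line. Unbuffered (.NB) stores go straight to memory as individual
    requests, leaving the buffer contents undisturbed."""
    LINE = 64

    def __init__(self):
        self.pending = {}          # line number -> {byte offset: value}
        self.memory_requests = 0   # write requests actually sent to memory

    def store(self, addr, value, buffered=True):
        if not buffered:           # ST.NB path: bypass the store buffer
            self.memory_requests += 1
            return
        line = addr // self.LINE
        self.pending.setdefault(line, {})[addr % self.LINE] = value

    def flush(self):
        self.memory_requests += len(self.pending)
        self.pending.clear()
```

Eight buffered 8-byte stores to one line merge into a single write request on flush, while a four-element scatter written with the unbuffered path costs four requests but never evicts useful merge candidates.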
custom atomic store and Clear Lock (CL) instruction:
When the memory controller observes a custom atomic operation, it sets a lock on the provided address. The atomic operation is then performed on the associated HTP300.
The following instruction sequences may be used to implement custom atomic DCAS operations:
// a0 - atomic address
// a1 - 64-bit memory value at a0
// a2 - DCAS compare value 1
// a3 - DCAS compare value 2
// a4 - DCAS swap value 1
// a5 - DCAS swap value 2
atomic_dcas:
bne a1,a2,fail   // first 8-byte comparison
ld.nb a6,8(a0)   // load second 8-byte memory value - should hit in the memory buffer
bne a6,a3,fail   // second 8-byte comparison
sd a4,0(a0)      // store the first 8-byte swap value to the thread store buffer
sd.cl a5,8(a0)   // store the second 8-byte swap value and clear the memory lock
eft x0           // AMO success response
fail:
li a1,1
eft.cl a1,(a0)   // AMO fail response (and clear memory lock)
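The semantics of the assembly sequence above can be summarized with a small Python model of a double compare-and-swap over two adjacent 8-byte words (a sketch of the intended behavior only; the real operation executes under the memory-line lock described above):

```python
def dcas(mem, addr, cmp1, cmp2, swap1, swap2):
    """Double compare-and-swap over two adjacent 8-byte words.
    `mem` is a dict keyed by 8-byte-aligned addresses.
    Returns 0 on success, 1 on failure (the AMO response value);
    in hardware, the memory lock is cleared on either path."""
    if mem[addr] == cmp1 and mem[addr + 8] == cmp2:
        mem[addr], mem[addr + 8] = swap1, swap2
        return 0   # success response
    return 1       # failure response
```

Both 8-byte comparisons must succeed before either swap value is stored, matching the two `bne` guards in the instruction sequence.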
The store instruction indicating that the lock should be cleared is:
SB.CL RA,40(SP)
SH.CL RA,40(SP)
SW.CL RA,40(SP)
SD.CL RA,40(SP)
FSW.CL RA,40(SP)
FSD.CL RA,40(SP)
The format of these store instructions is shown in table 11.
Table 11:
atomic_float_add:
fadd.d a2,a1,a2  // a1 contains the memory value, a2 contains the value to be added
fsd.cl a2,0(a0)  // a0 contains the memory address; clear the lock, terminating the atomic
eft              // evict all lines from the buffer, terminating the atomic thread
A thread creation instruction:
A fiber creation ("EFC") instruction initiates a thread on an HTP300 or HTF200:
EFC.HTP.A4
EFC.HTF.A4
This instruction executes a call on the HTP300 (or HTF200), starting execution at the address in register a0. The instruction suffix DA may be utilized to indicate that the target HTP300 is determined by a virtual address in register a1; if the DA suffix is not present, then the target is an HTP300 on the local system 100. The suffixes A1, A2, and A4 specify the number of additional arguments to be passed to the HTP300 or HTF200.
It should be noted that if the return buffer is not available at the time the instruction is executed, the EFC instruction will wait until the return argument buffer is available to begin execution. Once the EFC instruction successfully creates a shred, the thread continues at the instruction immediately following the EFC instruction.
It should also be noted that the thread created by
Table 12:
thread return instruction:
the thread return (ETR) instruction passes the argument back (either through thread creation by
ETR.A2
This instruction executes a return to the HTP300 or the HTF200 that created the thread.
Table 13:
a fiber adding instruction:
The fiber join (EFJ) instruction checks whether a created fiber has returned. The instruction has two variations: join with wait and join without wait. The wait variation causes thread execution to pause until a fiber has returned. The no-wait variation does not suspend thread execution, but rather provides a success/failure status. For both variations, if the instruction is executed when no fiber return is outstanding, an exception is generated.
The arguments from the returning fiber (up to four) are written to registers a0-a3.
EFJ
EFJ.NW
The format of these fibre join instructions is shown in table 14.
Table 14:
all fiber adding instructions:
The join-all-fibers instruction (EFJ.ALL) pauses until all pending fibers have returned. The instruction may be invoked with zero or more pending fiber returns. No instruction status is provided and no exception is generated. All arguments from the fiber returns are ignored.
EFJ.ALL
The format of these fibre join instructions is shown in table 15.
Table 15:
atomic return instruction:
an atomic return instruction (EAR) of system 100 is used to complete the execution thread of the custom atomic operation and possibly provide a response back to the source that issued the custom atomic request.
The EAR instruction may send zero, one, or two 8-byte argument values back to the issuing compute element. The number of arguments to be sent back is determined by the AC suffix (A1 or A2). No suffix means zero arguments, A1 means a single 8-byte argument, and A2 means two 8-byte arguments. The arguments are obtained from X registers a1 and a2, as needed.
The EAR instruction is also capable of clearing a memory line lock associated with the atomic instruction. The EAR uses the value in the a0 register as the address to send the clear lock operation. If the instruction contains the suffix CL, a clear lock operation is issued.
The following DCAS instances use the EAR instruction to send back a success or failure to the requesting processor:
// a0 - atomic address
// a1 - 64-bit memory value at a0
// a2 - DCAS compare value 1
// a3 - DCAS compare value 2
// a4 - DCAS swap value 1
// a5 - DCAS swap value 2
atomic_dcas:
bne a1,a2,fail   // first 8-byte comparison
ld.nb a6,8(a0)   // load second 8-byte memory value - should hit in the memory buffer
bne a6,a3,fail   // second 8-byte comparison
sd a4,0(a0)      // store the first 8-byte swap value to the thread store buffer
sd.cl a5,8(a0)   // store the second 8-byte swap value and clear the memory lock
li a1,0
ear.a1           // AMO success response
fail:
li a1,1
ear.a1.cl        // AMO failure response (and clear memory lock)
The instruction has two variations that allow the EAR instruction to also clear the memory lock associated with the atomic operation. The format of the supported instructions is shown in table 16.
Table 16:
first and second priority instructions:
The second (or low) priority instruction transitions the current thread from the first priority to the second, low priority. The instruction is typically used when a thread is polling for an event to occur (e.g., at a barrier).
ELP
The format of the ELP instruction is shown in Table 17.
Table 17:
The first (or high) priority instruction transitions the current thread from the second (or low) priority to the first (or high, normal) priority. The instruction is typically used when a thread has been polling and the awaited event has occurred (e.g., at a barrier).
ENP
The format of the ENP instruction is shown in Table 18.
Table 18:
floating point atomic memory operation:
Floating point atomic memory operations are performed by an HTP300 associated with the memory controller.
The aq and rl bits in the instruction specify whether all written data should be visible to other threads before the atomic operation is issued (rl) and whether all previously written data should be visible to this thread after the atomic operation is completed (aq). In other words, the rl bit forces all write buffers to be written back to memory, and the aq bit forces all read buffers to be invalidated. Note that rs1 is an X register value, while rd and rs2 are F register values.
AMOFADD.S rd,rs2,(rs1)
AMOFMIN.S rd,rs2,(rs1)
AMOFMAX.S rd,rs2,(rs1)
AMOFADD.D rd,rs2,(rs1)
AMOFMIN.D rd,rs2,(rs1)
AMOFMAX.D rd,rs2,(rs1)
The format of these floating point atomic memory operation instructions is shown in table 19.
Table 19:
custom atomic memory operation:
The custom atomic operations are performed by an HTP300 associated with the memory controller.
Up to 32 custom atomic operations may be used within the system 100.
The aq and rl bits in the instruction specify whether all written data should be visible to other threads before the atomic operation is issued (rl) and whether all previously written data should be visible to this thread after the atomic operation is completed (aq). In other words, the rl bit forces all write buffers to be written back to memory, and the aq bit forces all read buffers to be invalidated.
The custom atomic operation specifies the memory address using the a0 register. The number of arguments is provided by the suffix (A0, A1, A2, or A4), and the arguments are obtained from registers a1-a4. The number of result values returned from memory may be 0-2 and is defined by the custom memory operation. The result values are written to registers a0-a1.
AMOCUST0.A4
As shown in table 20, the following custom atomic instructions are defined.
Table 20:
the ac field is used to specify the number of arguments (0, 1, 2, or 4). Table 21 below shows the encoding.
Table 21:
eight custom atomic instructions are defined, with each custom atomic instruction having 4 argument count variations, resulting in a total of 32 possible custom atomic operators.
Event management:
the system 100 is an event-driven architecture. Each thread has a set of events that can be monitored using the event received mask register 742 and the event status register 744.
The event triggered bits may be cleared (using the EEC instruction), and all events may be listened for (using the EEL instruction). The listen operation may suspend the thread until an event is triggered, or, in a non-wait mode (.NW), allow the thread to poll periodically while other execution continues.
A thread can send an event to a particular thread using an event send instruction (EES) or broadcast an event to all threads within a node using an event broadcast instruction (EEB). A broadcast event is a named event in which the sending thread specifies an event name (16-bit identifier) and the receiving thread screens for a pre-specified event identifier in the received broadcast event. Once received, the event should be explicitly cleared (EEC) to avoid receiving the same event again. Note that all event-triggered bits are cleared when the thread begins execution.
An event mode instruction:
an event mode (EEM) instruction sets an operation mode of an event.
In simple mode, a received event immediately sets the triggered bit and increments the received event count by one. Each newly received event increments the received event count. The event receive instruction (EER) decrements the received event count by one. The event triggered bit is cleared when the count returns to zero.
In broadcast mode, the channel of the received event is compared with the broadcast channel configured for the event number. If the channels match, then the event triggered bit is set. The EER instruction causes the triggered bit to be cleared.
In collection mode, a received event causes the event trigger count to be decremented by one. When the count reaches zero, the event triggered bit is set. The EER instruction causes the triggered bit to be cleared.
The EEM instruction configures the event number for the selected mode of operation. In simple mode, the 16-bit event counter is set to zero. For broadcast mode, the 16-bit event channel number is set to the value specified by the EEM instruction. For collection mode, the 16-bit event counter is set to the value specified by the EEM instruction. Each of the three modes uses the same 16-bit value in a different manner.
EEM.BM rs1 ; broadcast mode
EEM.CM rs1 ; collection mode
EEM.SM rs1 ; simple mode; rs1 provides the event number
The format of the event mode instruction is shown in table 22.
Table 22:
event destination instruction:
an event destination (EED) instruction provides an identifier of an event within an execution thread. The identifier is unique among all threads of execution within the node. The identifier may be used with an event send instruction to send an event to a thread using an EES instruction. The identifier is an opaque value that contains the information needed to send an event from a source thread to a particular destination thread.
The identifier may also be used to obtain a unique value for transmitting a broadcast event. The identifier reserves space for an event number. The input register rs1 specifies the event number to encode within the destination thread identifier. After the instruction executes, the output register rd contains the identifier.
EED rd,rs1
The format of the event destination instruction is shown in table 23.
Table 23:
event destination instructions may also be used by a process to obtain its own address, which may then be used for other broadcast messages, e.g., to enable the process to receive other event messages as destinations, e.g., for receiving return messages when the process is the primary thread.
An event sending instruction:
an event issue (EES) instruction issues an event to a particular thread. Register rs1 provides the destination thread and event number. Register rs2 provides optional 8 bytes of event data.
EES rs1
EES.A1 rs1,rs2
The rs1 register provides the destination thread identifier (obtained via the EED instruction) and the event number for the event send operation; normal event number values are 2-7. The format of the event send instruction is shown in table 24.
Table 24:
event broadcast instructions:
an event broadcast (EEB) instruction broadcasts an event to all threads within a node. Register rs1 provides the broadcast channel (0-65535) to be sent. Register rs2 provides optional 8 bytes of event data.
EEB rs1
EEB.A1 rs1,rs2
The format of the event broadcast instructions is shown in table 25.
Table 25:
an event listening instruction:
An event listen (EEL) instruction allows a thread to monitor the status of received events. The instruction may operate in one of two modes: wait and no-wait. The wait mode suspends the thread until an event is received; the no-wait mode provides whatever events have been received at the time the instruction executes.
EEL rd,rs1
EEL.NW rd,rs1
Register rd receives a mask of available events as the output of the listen operation. If no events are available, then the no-wait mode returns a value of zero in rd. The format of the event listen instruction is shown in table 26.
Table 26:
an event receiving instruction:
an event receive (EER) instruction is used to receive an event. Receiving an event includes confirming that an event was observed and receiving optional 8 bytes of event data. Register rs1 provides the event number. Register rd contains the optional 8 bytes of event data.
EER rs1
EER.A1 rd,rs1
The format of the event reception instruction is shown in table 27.
Table 27:
the HTP300 instruction format is also provided for call, fork, or pass instructions, as previously discussed.
Sending a calling instruction:
the thread sends a call instruction to initiate a thread on the HTP300 or HTF200 and suspends the current thread until the remote thread performs a return operation:
HTSENDCALL.HTP.DA Ra,Rb,Args.
The thread send call instruction executes a call on the HTP300, beginning execution at an address in register Ra. The instruction suffix DA indicates that the target HTP300 is determined by the virtual address in register Rb. If the DA suffix is not present, then the target is the HTP300 on the local node. The constant integer value Args identifies the number of additional arguments to be passed to the remote HTP300.
table 28:
thread fork instruction:
the thread fork instruction initiates a thread on the HTP300 or HTF200 and continues with the current thread:
HTSENDFORK.HTF.DA Ra,Rb,Args.
The thread fork instruction executes a call on the HTF200 (or HTP300), starting execution at the address in register Ra. The instruction suffix DA indicates that the target HTF200 is determined by the node ID within the virtual address in register Rb. If the DA suffix is not present, then the target is the HTF200 on the local node. The constant integer value Args identifies the number of additional arguments to be passed to the remote HTF. Args is limited to values of 0 to 4 (e.g., so that the packet fits within 64 B). The additional arguments come from register state. It should be noted that if the return buffer is not available when the HTSENDFORK instruction is executed, the HTSENDFORK instruction will wait until the buffer is available to begin execution. Once HTSENDFORK is complete, execution of the thread continues at the instruction immediately following the HTSENDFORK instruction. The thread fork instruction sends a packet for the
table 29:
the thread passes the instruction:
the thread pass instruction initiates a thread on the HTP300 or HTF200 and terminates the current thread:
HTSENDXFER.HTP.DA Ra,Rb,Args.
The thread pass instruction performs a transfer to the HTP300 and begins execution at the address in register Ra. The instruction suffix DA indicates that the target HTP300 is determined by the virtual address in register Rb. If the DA suffix is not present, then the target is the HTP300 on the local node. The constant integer value Args identifies the number of additional arguments to be passed to the HTP300.
table 30:
receiving a return instruction:
The thread receive return instruction HTRECVRTN.WT checks whether a thread return has been received. If the WT suffix is present, then the receive return instruction will wait until a return is received. Otherwise, a testable condition code is set to indicate the status of the instruction. When a return is received, the returned arguments are loaded into registers. The instruction immediately following the HTRECVRTN instruction is executed after the return instruction completes.
Fig. 30 is a detailed block diagram of a representative embodiment of the control logic of HTP300 and thread
As mentioned above, a pair of instructions ENP and ELP are used to transition a thread from a first priority to a second priority (ELP) and vice versa.
A thread in a parallel application typically must wait for other threads to complete before resuming execution (i.e., a barrier operation). The wait operation is accomplished by communication between threads. This communication may be supported by an event that wakes up a suspended thread, or by a waiting thread polling a memory location. When threads poll, they consume processing resources that would otherwise be available to the threads whose work must complete before all threads can resume productive execution, resulting in wasted processing resources. The second, or low, priority is provided for such polling threads.
The configuration register is used to determine the number of high priority threads to be run for each low priority thread, shown in FIG. 30 as a low priority skip count, which is provided to thread
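The skip-count scheduling policy just described can be sketched in Python (a simplified round-robin model; the selection logic of the actual thread control circuitry is not specified in this detail here):

```python
def schedule(high, low, skip_count, n):
    """Pick `n` threads round-robin from `high` and `low` priority lists,
    running `skip_count` high-priority threads for each low-priority
    thread (the 'low priority skip count' configuration value)."""
    out, hi_i, lo_i, since_low = [], 0, 0, 0
    for _ in range(n):
        if high and (since_low < skip_count or not low):
            out.append(high[hi_i % len(high)])
            hi_i += 1
            since_low += 1
        elif low:
            out.append(low[lo_i % len(low)])   # one low-priority slot
            lo_i += 1
            since_low = 0                      # restart the skip window
    return out
```

With a skip count of 3, a single polling low-priority thread gets one execution slot for every three slots given to the high-priority threads.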
Fig. 32 is a detailed block diagram of a representative embodiment of data path control circuitry 795 of
Application performance is typically limited by the bandwidth available to the processor in memory. Performance limitations may be alleviated by ensuring that only data needed by the application enters the
As mentioned above, the computing resources of system 100 may run many applications that use sparse data sets, frequently accessing small blocks of data distributed throughout the data set. Thus, if a large amount of data is accessed per request, much of it may go unused, resulting in wasted bandwidth. For example, a cache line may be 64 bytes but not fully used. At other times, it may be beneficial to use all available bandwidth, for example, for efficient power usage. The data path control circuitry 795 provides dynamic adaptive bandwidth through the
The data path control circuitry 795 monitors the utilization level on the
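One such adaptive policy can be sketched as follows; the thresholds, step sizes, and the 8-to-64-byte request range are illustrative assumptions, not values taken from the design:

```python
def adapt_request_size(utilization: float, current_bytes: int) -> int:
    """Toy adaptive-bandwidth policy: when the monitored interconnect
    utilization is high, shrink memory requests toward the minimum
    8-byte element so only needed data moves; when utilization is low,
    grow requests toward a full 64-byte line to exploit idle bandwidth."""
    HIGH, LOW = 0.75, 0.25          # hypothetical utilization thresholds
    if utilization > HIGH:
        return max(8, current_bytes // 2)
    if utilization < LOW:
        return min(64, current_bytes * 2)
    return current_bytes            # in the middle band, hold steady
```

The key property is hysteresis: request sizes only change when utilization leaves the middle band, avoiding oscillation around a single threshold.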
Fig. 33 is a detailed block diagram of representative embodiments of the
The system call work descriptor packet assembled and transmitted by the
The
However, the
Each processor core 705 includes a
When those system call job descriptor packets are processed by the
Alternatively, the
A mechanism for thread status monitoring is also provided to collect the status of a set of threads running on the HTP300 in hardware, thereby enabling the programmer to see the work of the application. For example, if this feature is present,
All thread state changes may be monitored, and statistics may be kept on the amount of time spent in each state. The processor (110 or 300) collecting the statistical data provides a means for a separate second processor (110 or 300) to access and store the data. Data is collected while the application is running, so that periodic reports showing the amount of time spent in each state can be provided to an application analyst, giving detailed visibility into the running application.
According to a representative embodiment, which may be implemented in hardware or software, all information related to a thread is stored in various registers of the
InStateCount[N]–6b
InStateTimeStamp[N]–64b
InStateTotalTime[N]–64b
v. system memory and virtual addressing:
the system 100 architecture provides a partitioned global address space across all nodes within the system 100. Each node has a portion of the memory of the shared physical system 100. The physical memory of each node is partitioned into local private memory and globally shared distributed memory.
The local
The distributed shared memory of system 100 is accessible by all computing elements (e.g., HTF200 and HTP 300) within all nodes of system 100. The processing elements of system 100 do not have a cache for shared memory, but may have read/write buffers with invalidation/flushing controlled by software to minimize access to the same memory line. The RISC-V ISA provides fence (fence) instructions that can be used to indicate that a memory buffer invalidation/flushing is required. Similarly, the HTF200 supports write suspend operations to indicate that all write operations to memory have completed. These write pause operations may be used to empty the read/write buffer.
The
The
As mentioned above, in representative embodiments, the process virtual address space of system 100 maps to physical memory on one or more physical nodes of system 100. The architecture of system 100 includes the concept of "virtual" nodes. The virtual address of the system 100 contains a virtual node identifier. The virtual node identifier allows the requesting computing element to determine whether the virtual address refers to local node memory or remote node memory. A virtual address referring to local node memory is translated to a local node physical address by the requesting computing element. A virtual address referring to remote node memory is sent to the remote node, where, upon entering the node, the virtual address is translated to a remote node physical address.
The concept of virtual nodes allows processes to use the same set of virtual node identifiers regardless of what physical node the application is actually executing on. The range of virtual node identifiers for a process starts at zero and increases to a value of N-1, where N is the number of virtual nodes in the process. The number of virtual nodes in the process is determined at run-time. The application makes a system call to obtain the physical node. The operating system then determines how many virtual nodes the process will have. The number of physical nodes given to a process is limited by the number of physical nodes in system 100. The number of virtual nodes may be equal to or greater than the number of physical nodes, but must be a power of two. Having a large number of virtual nodes allows the
Having more virtual nodes than physical nodes within a process implies that multiple virtual nodes are assigned to one physical node. The compute elements of the node will each have a small local node virtual node ID table of processes. There will be a maximum number of virtual node IDs per physical node ID. For example, the maximum number of virtual node IDs per physical node ID may be eight, allowing memory and bandwidth to be fairly consistent for different physical nodes without oversizing the virtual node ID table for each computing element.
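The virtual-to-physical node assignment described above can be sketched in Python; the even block-wise distribution shown here is an illustrative assumption (the text requires only a power-of-two virtual node count and at most eight virtual nodes per physical node):

```python
def virtual_to_physical(vnode_id, num_virtual, num_physical):
    """Toy mapping of a process's virtual node ID onto a physical node.
    num_virtual must be a power of two and at least num_physical; at
    most 8 virtual nodes may share one physical node, per the text."""
    assert num_virtual & (num_virtual - 1) == 0   # power-of-two check
    assert num_virtual >= num_physical
    per_physical = -(-num_virtual // num_physical)  # ceiling division
    assert per_physical <= 8
    return vnode_id // per_physical
```

For example, 16 virtual nodes over 4 physical nodes places 4 virtual nodes on each physical node, so virtual node 15 maps to physical node 3.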
The architecture of system 100 defines a single common virtual address space for use by all computing elements. This common virtual address space is used by all threads executing on the computing elements (
The virtual addresses of the current generation of x86 processors are 64 bits wide. However, of these 64 bits, only the lower 48 bits are implemented; the upper 16 bits must contain the sign extension of the lower 48 bits. Consistent with this processor limitation, the virtual address space of an application running on a standard Linux operating system is split into virtual addresses whose upper bits are all zeros or all ones. FIG. 37 illustrates the virtual address space formats supported by the architecture of the system 100.
The virtual addresses of the system 100 are defined to support a full 64-bit virtual address space. The upper three bits of the virtual address are used to specify the address format. The format is defined in table 31.
Table 31:
virtual address format ID description
The exceptions noted in table 31 may occur in two situations: (1) a private address is sent to a remote node HTP or HTF computing element as an argument of a sent call or return operation, or (2) a data structure in shared memory contains a pointer to private memory.
Fig. 38 shows a conversion process of each virtual address format. Referring to fig. 37 and 38:
(a) formats 0 and 7 are used by the
(b) Formats 1 and 6 are used by the local
(c)
(d) Format 3 is used by all
(e) Formats 4 and 5 are not used and in representative embodiments, these formats are illegal, which may generate a reference exception.
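Decoding the 3-bit format ID from the top of a 64-bit address can be sketched as follows (a minimal Python model of the format extraction; the per-format translation behavior in the figure is not modeled):

```python
def va_format(addr: int) -> int:
    """Extract the 3-bit format ID from the upper three bits of a 64-bit
    system 100 virtual address. Formats 4 and 5 are illegal per the text
    and raise an error, modeling the reference exception."""
    fmt = (addr >> 61) & 0x7
    if fmt in (4, 5):
        raise ValueError("illegal virtual address format")
    return fmt
```

An all-zeros address decodes to format 0 and an all-ones address to format 7, matching the sign-extended private address ranges of a standard 48-bit implementation.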
Many of the advantages of the representative embodiments are readily apparent. Representative apparatus, systems, and methods provide a computing architecture capable of providing a high-performance and energy-efficient solution for compute-intensive kernels, for example, computing Fast Fourier Transforms (FFTs) and Finite Impulse Response (FIR) filters for sensing, communication, and analytic applications, such as synthetic aperture radar and 5G base stations, and for graph analysis applications, such as, but not limited to, graph clustering using spectral techniques, machine learning, 5G networking algorithms, and large stencil codes.
Notably, the various representative embodiments provide a multi-threaded coarse-grained configurable computing architecture that can be configured for any of these different applications, but most importantly, is also capable of self-scheduling, dynamic self-configuration and self-reconfiguration, conditional branching, backpressure control for asynchronous signaling, ordered and loop thread execution (including data dependencies), automatic start of thread execution after data dependency and/or ordering is completed, providing loop access to private variables, providing fast execution of loop threads using reentrant queues, and advanced loop execution using various thread identifiers, including nested loops.
Also, the representative devices, systems, and methods provide a processor architecture capable of self-scheduling, massively parallel processing, and further interacting with and controlling a configurable computing architecture to execute any of these different applications.
As used herein, a "processor core" 705 may be any type of processor core and may be embodied as one or more processor cores configured, designed, programmed, or otherwise adapted to perform the functionality discussed herein. As used herein, a "processor" 110 may be any type of processor and may be embodied as one or more processors configured, designed, programmed or otherwise adapted to perform the functionality discussed herein. When the term processor is used herein, the
Depending on the embodiment chosen, the
As noted above, the
The software, metadata or other source code and any resulting bit files (object code, databases or look-up tables) of the invention may be embodied within any tangible non-transitory storage medium (e.g., any of computer or other machine-readable data storage media) as computer-readable instructions, data structures, program modules or other data, such as discussed above with respect to
The
The present disclosure is to be considered as an exemplification of the principles of the invention and is not intended to limit the invention to the specific embodiments illustrated. In this regard, it is to be understood that the invention is not limited in its application to the details of construction and the arrangement of components set forth in the above and below, shown in the drawings, or described in the examples. The systems, methods, and apparatus according to the present invention are capable of other embodiments and of being practiced and carried out in various ways.
Although the present invention has been described with respect to specific embodiments thereof, these embodiments are merely illustrative, and not restrictive of the invention. In the description herein, numerous specific details are provided, such as examples of electronic components, electronic and structural connections, materials, and structural changes, to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that an embodiment of the invention can be practiced without one or more of the specific details, or with other apparatus, systems, assemblies, components, materials, parts, and/or the like. In other instances, well-known structures, materials, or operations are not specifically shown or described in detail to avoid obscuring aspects of embodiments of the invention. Furthermore, the various drawings are not to scale and should not be taken as limiting.
Reference throughout this specification to "one embodiment," an embodiment, "or" particular "embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least one embodiment of the present invention, but not necessarily in all embodiments, and further, does not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics of any specific embodiment of the present invention may be combined in any suitable manner and in any suitable combination with one or more other embodiments, including using the selected features and not using the other features accordingly. In addition, many modifications may be made to adapt a particular application, situation or material to the essential scope and spirit of the present invention. It is to be understood that other variations and modifications of the embodiments of the present invention described and illustrated herein are possible in light of the teachings herein and are to be considered as part of the spirit and scope of the present invention.
To recite a range of values herein, each intervening value, to the same degree of accuracy, is explicitly recited. For example, for the range 6-9, the values 7 and 8 are encompassed in addition to 6 and 9, and for the range 6.0-7.0, the values 6.0, 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 6.9, and 7.0 are expressly encompassed. In addition, each intervening subrange within the ranges in any combination is contemplated as being within the scope of the disclosure. For example, for the range of 5-10, sub-ranges of 5-6, 5-7, 5-8, 5-9, 6-7, 6-8, 6-9, 6-10, 7-8, 7-9, 7-10, 8-9, 8-10, and 9-10 are contemplated as being within the scope of the disclosed ranges.
It will also be appreciated that one or more of the elements depicted in the drawings/figures can also be implemented in a more separated or integrated manner, or even removed or rendered as inoperable in certain cases, as may be useful in accordance with a particular application. It is also within the scope of the invention for the components to be integrally formed into combinations, particularly for embodiments in which the separation or combination of discrete components is not clear or readily discernible. Additionally, as used herein, the term "coupled," including its various forms ("coupled" or "coupled"), refers to and includes any direct or indirect electrical, structural, or magnetic coupling, connection, or attachment, or adaptations or capabilities thereof for such direct or indirect electrical, structural, or magnetic coupling, connection, or attachment, including components that are integrally formed and components that are coupled via or through another component.
With respect to signals, what is referred to herein is a parameter that "represents" a given metric or "represents" a given metric, where a metric is a measure of the state of at least a portion of the regulator or its input or output. A parameter is considered to represent a metric if it is sufficiently directly related to the metric that the adjustment parameter can satisfactorily adjust the metric. If a parameter represents multiple or a portion of a metric, the parameter may be considered an acceptable representation of the metric.
Moreover, any signal arrows in the drawings/figures should be considered only as exemplary, and not limiting, unless otherwise specifically noted. Combinations of components or steps will also be considered within the scope of the invention, particularly where the ability to separate or combine is unclear or foreseeable. The disjunctive term "or" as used herein and throughout the appended claims is generally intended to mean "and/or," having both conjunctive and disjunctive meanings (and is not limited to the exclusive-or meaning), unless otherwise indicated. As described herein and used throughout the appended claims, "a," "an," and "the" include plural references unless the context clearly dictates otherwise. Likewise, as used in the description herein and throughout the appended claims, the meaning of "in" includes "in" and "on" unless the context clearly dictates otherwise.
The foregoing description of illustrated embodiments of the invention, including what is described in the summary or abstract, is not intended to be exhaustive or to limit the invention to the precise forms disclosed herein. From the foregoing, it will be observed that numerous variations, modifications and substitutions are contemplated and may be made without departing from the spirit and scope of the novel concepts of the present invention. It is to be understood that no limitation with respect to the specific methods and apparatus illustrated herein is intended or should be inferred. It is, of course, intended to cover by the appended claims all such modifications as fall within the scope of the claims.