System with hybrid-threaded processor, hybrid-threaded fabric with configurable computing elements, and hybrid interconnect network

Document No.: 1191906    Publication date: 2020-08-28

Note: This technology, "System with hybrid-threaded processor, hybrid-threaded fabric with configurable computing elements, and hybrid interconnect network," was created by T. M. Brewer on 2018-10-31. Abstract: Representative device, method, and system embodiments for configurable computing are disclosed. In a representative embodiment, a system comprises: a first interconnection network; a processor coupled to the interconnection network; a host interface coupled to the interconnection network; and at least one configurable circuit group coupled to the interconnection network, the configurable circuit group including: a plurality of configurable circuits arranged in an array; a second asynchronous packet network coupled to each configurable circuit of the plurality of configurable circuits of the array; a third synchronization network coupled to each configurable circuit of the plurality of configurable circuits of the array; a memory interface circuit coupled to the asynchronous packet network and the interconnection network; and scheduling interface circuitry coupled to the asynchronous packet network and the interconnection network.

1. A system, comprising:

a first interconnection network;

a processor coupled to the interconnection network;

a host interface coupled to the interconnection network; and

at least one configurable circuit group coupled to the interconnection network, the configurable circuit group comprising:

a plurality of configurable circuits arranged in an array;

a second asynchronous packet network coupled to each configurable circuit of the plurality of configurable circuits of the array;

a third synchronization network coupled to each configurable circuit of the plurality of configurable circuits of the array;

a memory interface circuit coupled to the asynchronous packet network and the interconnection network; and

a scheduling interface circuit coupled to the asynchronous packet network and the interconnection network.

2. The system of claim 1, wherein the interconnection network comprises:

a first plurality of crossbars having a folded Clos configuration and a plurality of direct mesh connections at interfaces with system endpoints.

3. The system of claim 2, wherein the asynchronous packet network comprises:

a second plurality of crossbars, each crossbar coupled to at least one configurable circuit of the plurality of configurable circuits of the array and another crossbar of the second plurality of crossbars.

4. The system of claim 3, wherein the synchronization network comprises:

a plurality of direct point-to-point connections coupling adjacent configurable circuits in the array of the plurality of configurable circuits of the group of configurable circuits.

5. The system of claim 1, wherein each configurable circuit of the plurality of configurable circuits comprises:

a configurable computing circuit;

a control circuit coupled to the configurable computing circuit, the control circuit comprising: a memory control circuit; a thread control circuit; and a plurality of control registers;

a first memory circuit coupled to the configurable computing circuit;

a plurality of synchronous network inputs coupled to the configurable computing circuit and the synchronization network;

a plurality of synchronous network outputs coupled to the configurable computing circuit and the synchronization network;

an asynchronous network input queue coupled to the asynchronous packet network;

an asynchronous network output queue coupled to the asynchronous packet network;

a second configuration memory circuit coupled to the configurable computing circuit, the control circuit, the synchronous network inputs, and the synchronous network outputs, the configuration memory circuit comprising:

a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuit; and

a second instruction and instruction index memory storing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input from the plurality of synchronous network inputs.
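
For illustration only, the two configuration memories recited in claim 5 can be modeled in software. The following C sketch is a hypothetical simulator structure, not the patented hardware; all type names, field names, and memory depths are assumptions.

    /* Hypothetical model of the per-circuit configuration memory of claim 5. */
    #include <stdint.h>

    #define NUM_DATAPATH_INSTR 64   /* assumed depth of the first instruction memory */
    #define NUM_SPOKE_INSTR    16   /* assumed depth of the second (spoke) memory    */

    typedef struct {
        uint64_t datapath_cfg;        /* encoded datapath configuration word */
    } datapath_instr_t;

    typedef struct {
        uint8_t master_sync_input;    /* selects the master synchronization input   */
        uint8_t sync_output_sel;      /* selects a synchronous network output       */
        uint8_t datapath_instr_idx;   /* index into the datapath instruction memory */
    } spoke_instr_t;

    typedef struct {
        datapath_instr_t instr_ram[NUM_DATAPATH_INSTR];  /* first instruction memory */
        spoke_instr_t    spoke_ram[NUM_SPOKE_INSTR];     /* second instruction and
                                                            instruction index memory */
    } config_memory_t;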

6. The system of claim 5, wherein the second instruction and instruction index memory further stores a plurality of spoke instructions and datapath configuration instruction indices to select a current datapath configuration instruction for the configurable computing circuit.

7. The system of claim 5, wherein the second instruction and instruction index memory further stores a plurality of spoke instructions and datapath configuration instruction indices to select a next datapath configuration instruction for a next configurable computational circuit.

8. The system of claim 5, wherein the second instruction and instruction index memory further stores a plurality of spoke instructions and datapath configuration instruction indices to select a synchronous network output of the plurality of synchronous network outputs.

9. The system of claim 5, wherein each configurable circuit of the plurality of configurable circuits further comprises:

a configuration memory multiplexer coupled to the first instruction memory and the second instruction and instruction index memory.

10. The system of claim 9, wherein the current datapath configuration instruction is selected using an instruction index from the second instruction and instruction index memory when a select input of the configuration memory multiplexer has a first setting.

11. The system of claim 10, wherein when the select input of the configuration memory multiplexer has a second setting different from the first setting, the current datapath configuration instruction is selected using an instruction index from the master synchronization input.

12. The system of claim 5, wherein the second instruction and instruction index memory further stores a plurality of spoke instructions and datapath configuration instruction indices to configure portions of the configurable circuit independent of the current datapath instruction.

13. The system of claim 12, wherein selected ones of the plurality of spoke instructions and datapath configuration instruction indices are selected according to a modulo spoke count.
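
Claims 9 through 13 describe how the current datapath configuration instruction is chosen. Reusing the types from the sketch after claim 5, the C fragment below illustrates one plausible reading: a free-running time-slot counter selects a spoke instruction modulo the spoke memory depth, and a multiplexer setting decides whether the datapath instruction index comes from the spoke memory or from the master synchronization input. The counter and the one-bit select are assumptions.

    /* Spoke-count selection (claim 13): index the spoke memory modulo its depth. */
    static inline const spoke_instr_t *
    current_spoke(const config_memory_t *cm, unsigned tick)
    {
        return &cm->spoke_ram[tick % NUM_SPOKE_INSTR];
    }

    /* Configuration memory multiplexer (claims 9-11): the first setting takes
       the index from the spoke memory; the second setting takes it from the
       master synchronization input. */
    static inline uint8_t
    select_dp_index(const spoke_instr_t *sp, uint8_t master_sync_idx, int sel)
    {
        return sel ? master_sync_idx : sp->datapath_instr_idx;
    }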

14. The system of claim 5, wherein each configurable circuit of the plurality of configurable circuits further comprises:

a conditional logic circuit coupled to the configurable computing circuit.

15. The system of claim 14, wherein the conditional logic circuitry is to provide a first next instruction or instruction index for execution by a next configurable circuit or to provide a second, different, next instruction or instruction index for execution by the next configurable circuit, in dependence on an output from the configurable computing circuitry.

16. The system of claim 14, wherein the conditional logic circuit is to modify the next datapath instruction index provided on a selected one of the plurality of synchronous network outputs as a function of output from the configurable computation circuit.

17. The system of claim 14, wherein the conditional logic circuit, in dependence upon an output from the configurable computation circuit, is to provide a conditional branch by modifying the next datapath instruction or a next datapath instruction index provided on a selected output of the plurality of synchronous network outputs.

18. The system of claim 14, wherein when enabled, the conditional logic circuit is to provide a conditional branch by ORing the output from the configurable computation circuit with least significant bits of the next datapath instruction or the next datapath instruction index to specify the next datapath instruction or datapath instruction index.
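
Claim 18's conditional branch can be read as steering execution by merging a condition bit into the low bits of the next instruction index. A minimal C sketch of that reading follows; the one-bit condition width and the enable flag are assumptions.

    /* Conditional branch of claim 18: OR the compute output into the least
       significant bit of the next datapath instruction index, so a true
       condition selects the odd-indexed (branch-taken) instruction. */
    #include <stdint.h>

    static inline uint8_t
    next_index_with_branch(uint8_t next_idx, uint8_t cond, int branch_enabled)
    {
        return branch_enabled ? (uint8_t)(next_idx | (cond & 0x1u)) : next_idx;
    }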

19. The system of claim 5, wherein the plurality of synchronous network inputs comprises:

a plurality of input registers coupled to a plurality of communication lines of the synchronous network; and

an input multiplexer coupled to the plurality of input registers and the second instruction and instruction index memory to select the master synchronization input.

20. The system of claim 5, wherein the plurality of synchronous network outputs comprises:

a plurality of output registers coupled to a plurality of communication lines of the synchronous network; and

an output multiplexer coupled to the configurable computing circuitry to select an output from the configurable computing circuitry.

21. The system of claim 5, wherein each configurable circuit of the plurality of configurable circuits further comprises:

an asynchronous fabric state machine coupled to the asynchronous network input queue and the asynchronous network output queue, the asynchronous fabric state machine to decode input packets received from the asynchronous packet network and assemble output packets for transmission over the asynchronous packet network.

22. The system of claim 5, wherein each configurable circuit of the plurality of configurable circuits further comprises:

a direct path connection between the plurality of input registers and the plurality of output registers.

23. The system of claim 22, wherein the direct path connection of a first configurable circuit of the plurality of configurable circuits provides a direct point-to-point connection for transfer of data received over the synchronization network from a second configurable circuit of the plurality of configurable circuits to a third configurable circuit of the plurality of configurable circuits.

24. The system of claim 5, wherein each configurable circuit of the plurality of configurable circuits further comprises:

arithmetic, logical, and bit operation circuitry for performing at least one integer operation selected from the group consisting of: signed and unsigned addition, absolute value, negation, logical NOT, addition and negation, subtraction A-B, reverse subtraction B-A, signed and unsigned greater-than-or-equal, signed and unsigned less-than-or-equal, comparisons for equality or inequality, logical AND operations, logical OR operations, logical XOR operations, logical NAND operations, logical NOR operations, and conversions between integer and floating point.

25. The system of claim 5, wherein each configurable circuit of the plurality of configurable circuits further comprises:

arithmetic, logical, and bit operation circuitry for performing at least one floating point operation selected from the group consisting of: signed and unsigned addition, absolute value, negation, logical NOT, addition and negation, subtraction A-B, reverse subtraction B-A, signed and unsigned greater-than-or-equal, signed and unsigned less-than-or-equal, comparisons for equality or inequality, logical AND operations, logical OR operations, logical XOR operations, logical NAND operations, logical NOR operations, and conversions between integer and floating point.

26. The system of claim 5, wherein each configurable circuit of the plurality of configurable circuits further comprises:

a multiply and shift operation circuit for performing at least one integer operation selected from the group consisting of: multiplication, shifting, passing inputs, signed and unsigned multiplication, signed and unsigned right shifting, signed and unsigned left shifting, bit order reversal, permutation, conversion between integer and floating point, and combinations thereof.

27. The system of claim 5, wherein each configurable circuit of the plurality of configurable circuits further comprises:

a multiply and shift operation circuit for performing at least one floating point operation selected from the group consisting of: multiplication, shifting, passing inputs, signed and unsigned multiplication, signed and unsigned right shifting, signed and unsigned left shifting, bit order reversal, permutation, conversion between integer and floating point, and combinations thereof.

28. The system of claim 5, wherein the scheduling interface circuit is to receive a work descriptor packet over the first interconnection network and, in response to the work descriptor packet, generate one or more data and control packets to the plurality of configurable circuits to configure the plurality of configurable circuits to perform selected computations.

29. The system of claim 5, wherein each configurable circuit of the plurality of configurable circuits further comprises:

a flow control circuit coupled to the asynchronous network output queue, the flow control circuit to generate a stop signal when a predetermined threshold is reached in the asynchronous network output queue.

30. The system of claim 29, wherein each asynchronous network output queue stops outputting data packets on the asynchronous packet network in response to the stop signal.

31. The system of claim 29, wherein each configurable computing circuit, in response to the stop signal, stalls execution after its current instruction is completed.
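
Claims 29 through 31 describe backpressure on the asynchronous packet network. A minimal C sketch of that flow control follows; the queue depth, threshold value, and names are assumptions, not values from the claims.

    /* Flow control of claims 29-31: assert a stop signal when the asynchronous
       output queue reaches a high-water mark; packet output halts, and compute
       stalls after its current instruction completes. */
    #include <stdbool.h>

    #define QUEUE_DEPTH    16   /* assumed queue capacity          */
    #define STOP_THRESHOLD 12   /* assumed predetermined threshold */

    typedef struct {
        unsigned occupancy;     /* packets currently in the queue */
    } async_out_queue_t;

    static inline bool stop_asserted(const async_out_queue_t *q)
    {
        return q->occupancy >= STOP_THRESHOLD;
    }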

32. The system of claim 5, wherein a first plurality of configurable circuits in the array of the plurality of configurable circuits are coupled in a first predetermined order through the synchronization network to form a first synchronization domain; and wherein a second plurality of configurable circuits in the array is coupled in a second predetermined order through the synchronization network to form a second synchronization domain.

33. The system of claim 32, wherein the first synchronization domain is configured to generate a continue message for transmission to the second synchronization domain over the asynchronous packet network.

34. The system of claim 32, wherein the second synchronization domain is configured to generate a completion message for transmission to the first synchronization domain over the asynchronous packet network.

35. The system of claim 5, wherein the plurality of control registers store a completion table having a first data completion count.

36. The system of claim 35, wherein the plurality of control registers further stores the completion table having a second iteration count.

37. The system of claim 35, wherein the plurality of control registers further stores a loop table having a plurality of thread identifiers and, for each thread identifier, a next thread identifier for execution after a current thread is executed.

38. The system of claim 37, wherein the plurality of control registers further store an identification of a first iteration and an identification of a last iteration in the loop table.

39. The system of claim 35, wherein the control circuitry is to queue a thread for execution when the completion count for the thread has decremented to zero and its thread identifier is the next thread identifier for execution.

40. The system of claim 35, wherein the control circuitry is to queue a thread for execution when a completion count for the thread indicates completion of any data dependencies for the thread's thread identifier.

41. The system of claim 35, wherein the completion count indicates, for each selected thread of a plurality of threads, a predetermined number of completion messages received before execution of the selected thread.

42. The system of claim 35, wherein the plurality of control registers further store a completion table having a plurality of types of thread identifiers, wherein each type of thread identifier indicates a loop level for loop and nested loop execution.

43. The system of claim 35, wherein the plurality of control registers further store a completion table having a loop count for the number of active loop threads, and wherein in response to receiving an asynchronous fabric message that returns a thread identifier to a thread identifier pool, the control circuitry decrements the loop count and transmits an asynchronous fabric completion message when the loop count reaches zero.

44. The system of claim 35, wherein the plurality of control registers further store a top of a thread identifier stack to allow each type of thread identifier to access a private variable for a selected loop.
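
Claims 35 through 41 recite a completion table keyed by thread identifier. One plausible software model, sketched in C below, decrements a per-thread completion count as completion messages arrive and queues the thread once all data dependencies are satisfied; the table size and the enqueue hook are hypothetical.

    /* Completion-table sketch for claims 35-41. */
    #define MAX_TIDS 64                      /* assumed thread identifier space */

    typedef struct {
        unsigned completion_count[MAX_TIDS]; /* first data completion count */
    } completion_table_t;

    extern void enqueue_thread(unsigned tid); /* hypothetical scheduler hook */

    static void on_completion_msg(completion_table_t *t, unsigned tid)
    {
        if (t->completion_count[tid] > 0 && --t->completion_count[tid] == 0)
            enqueue_thread(tid);  /* all data dependencies complete (claim 40) */
    }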

45. The system of claim 5, wherein the control circuit further comprises:

a continuation queue; and

a re-entry queue.

46. The system of claim 45, wherein the continuation queue stores one or more thread identifiers for computing threads having completion counts that allow execution but that do not yet have an assigned thread identifier.

47. The system of claim 45, wherein the re-entry queue stores one or more thread identifiers for compute threads having completion counts that allow execution and having an assigned thread identifier.

48. The system of claim 45, wherein any thread in the re-entry queue having a thread identifier is executed before any thread in the continuation queue having a thread identifier is executed.

49. The system of claim 5, wherein the control circuit further comprises:

a priority queue, wherein any thread in the priority queue having a thread identifier executes before any thread in the continuation queue or the re-entry queue having a thread identifier executes.

50. The system of claim 5, wherein the control circuit further comprises:

a run queue, wherein any thread in the run queue having a thread identifier executes upon occurrence of the spoke count for that thread identifier.
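
Claims 45 through 50 imply an ordering among the thread queues: priority first, then re-entry, then continuation. The C sketch below shows that selection order under the assumption of simple pop-style queue accessors returning -1 when empty; the accessors themselves are hypothetical.

    /* Thread selection order implied by claims 48-49. */
    extern int pop_priority(void);      /* hypothetical accessors;    */
    extern int pop_reentry(void);       /* each returns -1 when empty */
    extern int pop_continuation(void);

    static int select_next_tid(void)
    {
        int tid;
        if ((tid = pop_priority()) >= 0)
            return tid;                 /* priority queue runs first   */
        if ((tid = pop_reentry()) >= 0)
            return tid;                 /* then the re-entry queue     */
        return pop_continuation();      /* then the continuation queue */
    }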

51. The system of claim 5, wherein the control circuitry is to self-schedule a computing thread for execution.

52. The system of claim 5, wherein the control circuitry is to order compute threads for execution.

53. The system of claim 5, wherein the control circuitry is to order loop computation threads for execution.

54. The system of claim 5, wherein the control circuitry is to begin executing a computing thread in response to one or more completion signals from data dependencies.

55. The system of claim 1, wherein the processor comprises:

a processor core to execute the received instructions; and

core control circuitry coupled to the processor core, the core control circuitry to automatically schedule instructions for execution by the processor core in response to a received work descriptor packet or in response to a received event packet.

56. The system of claim 1, wherein the processor comprises:

a processor core to execute a fiber creation instruction; and

core control circuitry coupled to the processor core, the core control circuitry to automatically schedule the fiber creation instruction for execution by the processor core and generate one or more work descriptor data packets destined for a second processor or the group of configurable circuits to execute a corresponding plurality of execution threads.

57. The system of claim 1, wherein the processor comprises:

a processor core to execute a fiber creation instruction; and

core control circuitry coupled to the processor core, the core control circuitry to schedule the fiber creation instruction for execution by the processor core, reserve a predetermined amount of memory space in thread control memory to store return arguments, and generate one or more work descriptor data packets destined for a second processor or the group of configurable circuits to execute a corresponding plurality of execution threads.

58. The system of claim 1, wherein the processor comprises:

a processor core; and

a core control circuit, comprising:

an interconnection network interface;

a core control memory coupled to the interconnect network interface;

an execution queue coupled to the core control memory;

control logic and thread selection circuitry coupled to the execution queue and the core control memory; and

an instruction cache coupled to the control logic and thread selection circuitry and the processor core.

59. The system of claim 1, wherein the processor comprises:

a processor core; and

a core control circuit, comprising:

an interconnection network interface;

a thread control memory coupled to the interconnect network interface;

a network response memory coupled to the interconnect network interface;

an execution queue coupled to the thread control memory;

control logic and thread selection circuitry coupled to the execution queue and the thread control memory;

an instruction cache coupled to the control logic and thread selection circuitry and the processor core; and

a command queue coupled to the processor core and the interconnect network interface.

60. The system of claim 1, wherein the processor comprises:

a processor core; and

a core control circuit, comprising:

a thread control memory comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument;

an execution queue coupled to the thread control memory; and

control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, and periodically select the thread identifier for execution by the processor core of instructions of the execution thread, the processor core using data stored in the data cache or general purpose register.

61. The system of claim 1, wherein the processor comprises:

a processor core; and

a core control circuit, comprising:

a thread control memory comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a program count register to store a received program count, and a thread status register to store an active status or a suspended status for each thread identifier of the plurality of thread identifiers;

an execution queue coupled to the thread control memory; and

control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue when the thread identifier has an active status, and periodically select the thread identifier for the processor core to execute instructions of the execution thread for a duration in which the active status remains unchanged until the execution thread is completed.

62. The system of claim 1, wherein the processor comprises:

a processor core; and

a core control circuit, comprising:

a thread control memory comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a program count register to store a received program count, and a thread status register to store an active status or a suspended status for each thread identifier of the plurality of thread identifiers;

an execution queue coupled to the thread control memory; and

control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue when the thread identifier has an active status, periodically select the thread identifier for the processor core to execute instructions of the execution thread for a duration in which the active status remains unchanged, and suspend thread execution by not returning the thread identifier to the execution queue when the thread identifier has a suspended status.
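
Claims 60 through 62 describe the thread lifecycle managed by the control logic and thread selection circuitry. The short C sketch below models one reading of it: a thread identifier is re-queued after each instruction only while its status is active, and is simply not returned to the execution queue while suspended. All names and the status encoding are assumptions.

    /* Thread lifecycle sketch for claims 60-62. */
    typedef enum { TID_ACTIVE, TID_SUSPENDED } tid_status_t;

    extern void push_exec_queue(int tid);     /* hypothetical queue hook  */
    extern tid_status_t status_of(int tid);   /* hypothetical status read */

    static void after_instruction(int tid, int thread_completed)
    {
        /* Suspension works by omission: a suspended thread identifier is
           not returned to the execution queue (claim 62). */
        if (!thread_completed && status_of(tid) == TID_ACTIVE)
            push_exec_queue(tid);
    }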

63. The system of claim 1, wherein the processor comprises:

a processor core to execute a plurality of instructions; and

core control circuitry coupled to the processor core, the core control circuitry comprising:

an interconnection network interface coupleable to the interconnection network to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments;

a thread control memory coupled to the interconnect network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument;

an execution queue coupled to the thread control memory;

control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; and

an instruction cache coupled to the processor core and the control logic and thread selection circuitry to receive the initial program count and provide a corresponding one of the plurality of instructions to the processor core for execution.

64. The system of claim 1, wherein the processor comprises:

a core control circuit, comprising:

an interconnection network interface coupleable to the interconnection network to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments;

a thread control memory coupled to the interconnect network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument;

an execution queue coupled to the thread control memory;

control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, periodically select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; and

an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution; and

a processor core coupled to the instruction cache of the core control circuitry, the processor core to execute the corresponding instruction.

65. The system of claim 1, wherein the processor comprises:

a core control circuit, comprising:

an interconnection network interface coupleable to the interconnection network to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments;

a thread control memory coupled to the interconnect network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, and a general purpose register to store a received argument;

an execution queue coupled to the thread control memory;

control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread;

an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution; and

a command queue; and

a processor core coupled to the instruction cache and the command queue of the core control circuitry, the processor core to execute the corresponding instruction.

66. The system of claim 1, wherein the processor comprises:

a core control circuit, comprising:

an interconnection network interface coupleable to the interconnection network to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments;

a thread control memory coupled to the interconnect network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, and a general purpose register to store a received argument;

an execution queue coupled to the thread control memory;

control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; and

an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution; and

a processor core coupled to the instruction cache of the core control circuitry, the processor core to execute the corresponding instruction.

67. The system of claim 1, wherein the processor comprises:

a core control circuit, comprising:

an interconnection network interface coupleable to the interconnection network to receive a call work descriptor packet, decode the received work descriptor packet into an execution thread having an initial program count and any received arguments, and encode a work descriptor packet for transmission to other processing elements;

a thread control memory coupled to the interconnect network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, and a general purpose register to store a received argument;

an execution queue coupled to the thread control memory;

a network response memory coupled to the interconnect network interface;

control logic and thread selection circuitry coupled to the execution queue, the thread control memory, and the instruction cache, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread;

an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution; and

a command queue storing one or more commands to generate one or more work descriptor packets; and

a processor core coupled to the instruction cache and the command queue of the core control circuitry, the processor core to execute the corresponding instruction.

68. The system of claim 1, wherein the processor comprises:

a processor core; and

core control circuitry coupled to the processor core.

69. The system of claim 68, wherein the core control circuitry comprises:

an interconnection network interface coupleable to an interconnection network, the interconnection network interface to receive and decode a work descriptor packet into a thread of execution having an initial program count and any received arguments, the interconnection network interface further to receive and decode an event packet into an event identifier and any received arguments.

70. The system of claim 69, wherein the interconnection network interface is further to generate a point-to-point event data message or a broadcast event data message.

71. The system of claim 69, wherein the core control circuitry comprises:

control logic and thread selection circuitry coupled to the interconnect network interface, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread.

72. The system of claim 71, wherein the core control circuitry comprises:

a thread control memory having a plurality of registers, the plurality of registers comprising:

a thread identifier pool register to store a plurality of thread identifiers;

a thread state register;

a program count register to store the received initial program count; and

a general register to store the received argument.

73. The system of claim 72, wherein the control logic and thread selection circuitry are further to return a corresponding thread identifier for the selected thread to the thread identifier pool register in response to the processor core executing a return instruction.

74. The system of claim 72, wherein the control logic and thread selection circuitry are further to clear the registers of the thread control memory indexed by the corresponding thread identifier of the selected thread in response to the processor core executing a return instruction.

75. The system of claim 72, wherein the thread control memory further comprises one or more registers selected from the group consisting of:

a pending fiber return count register; a return argument buffer or register; a thread return argument linked list register; a custom atomic transaction identifier register; a data cache; an event status register; an event mask register; and combinations thereof.

76. The system of claim 72, wherein the interconnection network interface is further for storing the execution thread with the initial program count and any received arguments in the thread control memory using a thread identifier as an index to the thread control memory.

77. The system of claim 72, wherein the core control circuitry further comprises:

an execution queue coupled to the thread control memory, the execution queue storing one or more thread identifiers.

78. The system of claim 77, wherein the control logic and thread selection circuitry are further to place the thread identifier in the execution queue, select the thread identifier from the execution queue for execution, and access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread for execution by the processor core.

79. The system of claim 77, wherein the control logic and thread selection circuitry are further to successively select a next thread identifier from the execution queue to execute a single instruction of a corresponding execution thread.

80. The system of claim 77, wherein the control logic and thread selection circuitry are further to perform a round-robin or barrel selection of a next thread identifier from the execution queue among the plurality of thread identifiers, each for executing a single instruction of a corresponding execution thread until the execution thread is completed.

81. The system of claim 77, wherein the control logic and thread selection circuitry are further configured to assign an active state or a suspended state to a thread identifier.

82. The system of claim 77, wherein the control logic and thread selection circuitry are further to assign a priority status to a thread identifier.

83. The system of claim 77, wherein the control logic and thread selection circuitry are further to return a corresponding thread identifier having an assigned active state and an assigned priority to the execution queue after execution of a corresponding instruction.

84. The system of claim 77, wherein the execution queue further comprises:

a first priority queue; and

a second priority queue.

85. The system of claim 84, wherein the control logic and thread selection circuitry further comprises:

thread selection control circuitry coupled to the execution queue, the thread selection control circuitry to select a thread identifier from the first priority queue at a first frequency and to select a thread identifier from the second priority queue at a second frequency, the second frequency being lower than the first frequency.

86. The system of claim 85, wherein the thread selection control circuitry is to determine the second frequency as a skip count starting with selecting a thread identifier from the first priority queue.
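
Claims 84 through 86 describe two priority queues selected at different frequencies, with the lower frequency expressed as a skip count. A minimal C sketch of that policy follows; the skip-count value and queue accessors are assumptions.

    /* Skip-count selection of claims 84-86: consult the second (lower)
       priority queue once every SKIP_COUNT selections. */
    #define SKIP_COUNT 8                /* assumed skip count */

    extern int pop_high(void);          /* hypothetical; -1 when empty */
    extern int pop_low(void);

    static int select_with_skip(void)
    {
        static unsigned n;
        if (++n % SKIP_COUNT == 0) {
            int tid = pop_low();
            if (tid >= 0)
                return tid;
        }
        return pop_high();
    }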

87. The system of claim 77, wherein the core control circuitry further comprises:

a network command queue coupled to the processor core.

88. The system of claim 87, wherein the processor core is to execute a fiber creation instruction to generate one or more commands that cause the network command queue of the interconnect network interface to generate one or more call work descriptor packets destined for another processor core or a hybrid thread fabric circuit.

89. The system of claim 88, wherein the core control circuitry, in response to the processor core executing a fiber creation instruction, is to reserve a predetermined amount of memory space in the general purpose register or a return argument register.

90. The system of claim 77, wherein in response to generating one or more call work descriptor packets destined for another processor core or a hybrid thread fabric, the core control circuitry is to store a thread return count in a thread return register.

91. The system of claim 90 wherein in response to receiving a return data packet, the core control circuitry is to decrement the thread return count stored in the thread return register.

92. The system of claim 91, wherein, in response to the thread return count in the thread return register decrementing to zero, the core control circuitry is to change a suspended state of a corresponding thread identifier to an active state for subsequent execution of a thread return instruction to complete a created fiber or thread.
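
Claims 90 through 92 describe a join mechanism built on a thread return count. The C sketch below models one reading: the count is set when call work descriptor packets are issued, each return packet decrements it, and at zero the thread's suspended state is changed to active so the thread return instruction can complete. Names are illustrative.

    /* Join sketch for claims 90-92. */
    typedef struct {
        unsigned thread_return_count;   /* stored in the thread return register */
    } thread_ctx_t;

    extern void set_active(int tid);    /* hypothetical suspended->active change */

    static void on_return_packet(thread_ctx_t *c, int tid)
    {
        if (c->thread_return_count > 0 && --c->thread_return_count == 0)
            set_active(tid);            /* resume to execute the thread return */
    }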

93. The system of claim 71, wherein the core control circuitry further comprises:

an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution by the processor core.

94. The system of claim 71, wherein the control logic and thread selection circuitry are further to assign an initial active state to the execution thread.

95. The system of claim 71, wherein the control logic and thread selection circuitry are further to allocate a suspended state to the execution thread in response to the processor core executing a memory load instruction or a memory store instruction.

96. The system of claim 71, wherein the control logic and thread selection circuitry are further to end executing a selected thread in response to the processor core executing a return instruction.

97. The system of claim 71, wherein the interconnection network interface is further to generate a return work descriptor packet in response to the processor core executing a return instruction.

98. The system of claim 68, wherein the core control circuitry further comprises:

a network response memory.

99. The system of claim 98, wherein the network response memory comprises:

a memory request register; and

a thread identifier and transaction identifier register.

100. The system of claim 99, wherein the network response memory further comprises one or more registers selected from the group consisting of:

a request cache line index register; a bytes register; a general purpose register index and type register; and combinations thereof.

101. The system of claim 71, wherein the control logic and thread selection circuitry are further to respond to received event data packets with an event mask stored in the event mask register.

102. The system of claim 71, wherein the control logic and thread selection circuitry are further to determine an event number corresponding to a received event data packet.

103. The system of claim 71, wherein the control logic and thread selection circuitry are further to change the state of a thread identifier from suspended to active to resume execution of a corresponding thread of execution in response to a received event data packet.

104. The system of claim 71, wherein the control logic and thread selection circuitry are further to change the state of a thread identifier from suspended to active to resume execution of a corresponding thread of execution in response to an event number of a received event data packet.

105. The system of claim 71, wherein the interconnection network interface comprises:

an input queue;

a packet decoder circuit coupled to the input queue, the control logic and thread selection circuit, and the thread control memory;

an output queue; and

a packet encoder circuit coupled to the output queue, the network response memory, and the network command queue.

106. The system of claim 71, wherein the core control circuitry further comprises:

data path control circuitry for controlling access size through the first interconnection network.

107. The system of claim 71, wherein the core control circuitry further comprises:

data path control circuitry to increase or decrease a memory load access size in response to a time-averaged usage level.

108. The system of claim 71, wherein the core control circuitry further comprises:

data path control circuitry to increase or decrease a memory store access size in response to a time-averaged usage level.

109. The system of claim 71, wherein the control logic and thread selection circuitry are further to increase a size of a memory load access request to correspond to a cache line boundary of the data cache.
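
Claims 106 through 109 describe adapting memory access sizes to interconnect load. A hedged C sketch of one such policy follows; the usage thresholds, size bounds, and 64-byte cache line are assumptions.

    /* Access-size adaptation of claims 107-109. */
    #define CACHE_LINE 64u              /* assumed cache line size */

    static unsigned adapt_access_size(unsigned cur_bytes, unsigned avg_usage_pct)
    {
        if (avg_usage_pct < 25 && cur_bytes < CACHE_LINE)
            cur_bytes *= 2;             /* light load: widen accesses  */
        else if (avg_usage_pct > 75 && cur_bytes > 8)
            cur_bytes /= 2;             /* heavy load: narrow accesses */
        return cur_bytes;
    }

    /* Claim 109: round a load request up to a cache line boundary. */
    static unsigned round_to_line(unsigned bytes)
    {
        return (bytes + CACHE_LINE - 1) & ~(CACHE_LINE - 1);
    }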

110. The system of claim 71, wherein the core control circuitry further comprises:

system call circuitry to generate one or more system calls to a host processor.

111. The system of claim 110, wherein the system call circuitry further comprises:

a plurality of system call credit registers storing a predetermined credit count to modulate the number of system calls in any predetermined time period.
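
Claim 111's credit registers suggest a simple rate limiter: system calls consume credits, and credits are replenished each predetermined period. The C sketch below is one hypothetical realization; the refill value is assumed.

    /* Credit-based system call pacing for claim 111. */
    #include <stdbool.h>

    typedef struct {
        unsigned credits;               /* remaining credits this period */
    } syscall_credit_t;

    static bool try_syscall(syscall_credit_t *c)
    {
        if (c->credits == 0)
            return false;               /* defer until the next period */
        c->credits--;
        return true;
    }

    static void on_period_refill(syscall_credit_t *c)
    {
        c->credits = 16;                /* assumed predetermined credit count */
    }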

112. The system of claim 71, wherein the core control circuitry, in response to a request from a host processor to monitor thread status, is further to generate a command to cause a command queue of the interconnect network interface to copy and transmit all data corresponding to a selected thread identifier from the thread control memory.

113. The system of claim 68, wherein the processor core is to execute a wait or a non-wait fiber join instruction.

114. The system of claim 68, wherein the processor core is to execute a fiber join all instruction.

115. The system of claim 68, wherein the processor core is to execute a non-cache read or load instruction to specify a general purpose register to store data received from memory.

116. The system of claim 68, wherein the processor core is to execute a non-cache write or store instruction to specify data in a general purpose register for storage in memory.

117. The system of claim 68, wherein the core control circuitry is to assign a transaction identifier to any load or store request to memory and correlate the transaction identifier with a thread identifier.

118. The system of claim 68, wherein the processor core is to execute a first thread priority instruction to assign a first priority to an execution thread having a corresponding thread identifier.

119. The system of claim 118, wherein the processor core is to execute a second thread priority instruction to assign a second priority to an execution thread having a corresponding thread identifier.

120. The system of claim 68, wherein the processor core is to execute a custom atomic return instruction to complete a thread of execution of a custom atomic operation.

121. The system of claim 68, wherein the processor core, in conjunction with a memory controller, is to perform floating point atomic memory operations.

122. The system of claim 68, wherein the processor core, in conjunction with a memory controller, is to perform custom atomic memory operations.

123. The system of claim 1, wherein data communication through the interconnection network occurs using a global virtual address space across all nodes, independent of a physical address space.

124. The system of claim 1, wherein data communication through the interconnection network is conducted using a virtual node identifier.

125. The system of claim 1, further comprising:

one or more memory circuits operable in a plurality of modes, the plurality of modes comprising: private and non-shared; shared and interleaved; and shared and non-interleaved.

126. The system of claim 1, wherein data communications over the interconnection network are conducted using split header and payload configurations in order to pipeline multiple communications to multiple different destinations.

127. The system of claim 1, wherein the interconnection network is to use the split header and payload configuration for delayed payload switching.

128. The system of claim 1, wherein the interconnection network is to route multiple data payloads as consecutive data bursts using a single header.

129. The system of claim 1, wherein the interconnection network is to interleave a first acknowledgement message to a destination in an unused header field of a second message to the destination.
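
Claims 124 and 126 through 129 together suggest a header format in which the header travels separately from its payload, can front a multi-payload burst, and carries spare space usable for a piggybacked acknowledgement. The C struct below sketches such a header; every field width is an assumption.

    /* Hypothetical header for claims 124 and 126-129. */
    #include <stdint.h>

    typedef struct {
        uint16_t dest_vnode;   /* virtual node identifier (claim 124)             */
        uint8_t  burst_len;    /* payload flits following this header (claim 128) */
        uint8_t  ack_field;    /* otherwise-unused field carrying a piggybacked
                                  acknowledgement to the destination (claim 129)  */
    } net_header_t;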

130. The system of claim 1, wherein the interconnection network is adapted for power gating or clock gating based on load requirements.

131. The system of claim 1, wherein each configurable circuit of the plurality of configurable circuits further comprises:

a plurality of delay registers for synchronizing data transmission and reception for thread execution.

132. The system of claim 1, wherein the at least one configurable circuit group further comprises:

a scheduling interface for receiving a work descriptor packet and distributing the work descriptor packet to a selected configurable circuit of the plurality of configurable circuits based on load balancing.

133. The system of claim 1, wherein the at least one configurable circuit group further comprises:

a scheduling interface to receive one or more completion messages from a configurable circuit of the plurality of configurable circuits and generate a return work descriptor packet having a return value.

134. The system of claim 1, wherein each configurable circuit of the plurality of configurable circuits is further to select valid data elements with an execution mask.

135. A system, comprising:

a first interconnection network;

a processor coupled to the interconnection network;

a host interface coupled to the interconnection network; and

at least one configurable circuit group coupled to the interconnection network, the configurable circuit group comprising:

a plurality of configurable circuits arranged in an array, each configurable circuit comprising:

a configurable computing circuit;

a first memory circuit coupled to the configurable computing circuit;

a plurality of synchronous network inputs and outputs coupled to the configurable computing circuitry;

an asynchronous network input queue and an asynchronous network output queue;

a second configuration memory circuit coupled to the configurable computing circuit, the control circuit, the synchronous network inputs, and the synchronous network outputs, the second configuration memory comprising:

a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and

a second instruction and instruction index memory storing: a plurality of spoke instructions and datapath configuration instruction indices for selecting a master synchronization input of the synchronous network inputs, for selecting a current datapath configuration instruction of the configurable computing circuit, and for selecting a next datapath instruction or a next datapath instruction index of a next configurable computing circuit; and

a control circuit coupled to the configurable computing circuit, the control circuit comprising:

a memory control circuit;

a plurality of control registers, wherein the plurality of control registers store a completion table having a first data completion count; and

thread control circuitry for queuing threads for execution.

136. A system, comprising:

a first interconnection network;

a host interface coupled to the interconnection network;

at least one configurable circuit group coupled to the interconnection network, the configurable circuit group comprising a plurality of configurable circuits arranged in an array; and

a processor coupled to the interconnection network, the processor comprising:

a processor core to execute a plurality of instructions; and

core control circuitry coupled to the processor core, the core control circuitry comprising:

an interconnection network interface coupleable to the interconnection network to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments;

a thread control memory coupled to the interconnect network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument;

an execution queue coupled to the thread control memory;

control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; and

an instruction cache coupled to the processor core and the control logic and thread selection circuitry to receive the initial program count and provide a corresponding one of the plurality of instructions to the processor core for execution.

137. A method of scheduling and controlling execution of instructions or threads of execution in a system having a processor with a processor core and core control circuitry, the processor coupled to a plurality of multi-threaded configurable circuits by a first interconnection network, the method comprising:

receiving, using the core control circuitry, a work descriptor packet;

decoding, using the core control circuitry, the received work descriptor packet into an execution thread having an initial program count and any received arguments;

assigning, using the core control circuitry, an available thread identifier to the execution thread;

automatically queuing, using the core control circuitry, the thread identifier to execute the execution thread; and

periodically selecting the thread identifier for execution of the execution thread by the processor core.

138. The method of claim 137, further comprising:

executing, using the processor core, a fiber creation instruction to create a plurality of execution threads for execution by a second processing element; and

in response to executing the fiber creation instruction, generating, using network interface circuitry, one or more work descriptor data packets destined for the second processing element to execute the plurality of execution threads.

139. The method of claim 138, further comprising:

reserving, using the core control circuitry, a predetermined amount of memory space in a thread control memory to store return arguments in response to executing the fiber creation instruction.

140. The method of claim 137, wherein said queuing and selecting step further comprises:

automatically queuing, using the core control circuitry, the thread identifier to execute the thread of execution when the thread identifier has an active status; and

periodically selecting, using the core control circuitry, the thread identifier to execute instructions of the execution thread for a duration in which the active status remains unchanged until the execution thread is completed.

141. The method of claim 140, further comprising:

suspending, using the core control circuitry, thread execution by not returning the thread identifier to the execution queue when the thread identifier has a suspended state.

142. The method of claim 137, further comprising:

accessing, using the core control circuitry, thread control memory using the thread identifier as an index to select the initial program count for the execution thread.

143. The method of claim 137, further comprising:

receiving, using the core control circuitry, an event data packet; and

decoding, using the core control circuitry, the received event data packet into an event identifier and any received arguments.

144. The method of claim 137, further comprising:

assigning, using the core control circuitry, an initial valid state to the execution thread.

145. The method of claim 137, further comprising:

assigning, using the core control circuitry, a suspended state to the execution thread in response to executing a memory load instruction.

146. The method of claim 137, further comprising:

assigning, using the core control circuitry, a suspended state to the execution thread in response to executing a memory store instruction.

147. The method of claim 137, further comprising:

terminating, using the core control circuitry, execution of the selected thread in response to executing a return instruction.

148. The method of claim 137, further comprising:

returning, using the core control circuitry, a corresponding thread identifier for the selected thread to a thread identifier pool in response to executing a return instruction.

149. The method of claim 137, further comprising:

clearing, using the core control circuitry, registers of a thread control memory indexed by the corresponding thread identifier of the selected thread in response to executing a return instruction.

150. The method of claim 137, further comprising:

generating, using the network interface circuitry, a return work descriptor packet in response to executing a return instruction.

151. The method of claim 137, further comprising:

generating, using network interface circuitry, a point-to-point event data message or a broadcast event data message.

152. The method of claim 137, further comprising:

responding, using the core control circuitry, to a received event data packet using an event mask.

153. The method of claim 137, further comprising:

determining, using the core control circuitry, an event number corresponding to a received event data packet.

154. The method of claim 137, further comprising:

changing, using the core control circuitry, the state of the thread identifier from suspended to active to resume execution of the corresponding thread of execution in response to an event number of a received event data packet.

155. The method of claim 137, further comprising:

successively selecting, using the core control circuitry, a next thread identifier from the execution queue to execute a single instruction of a corresponding execution thread.

156. The method of claim 137, further comprising:

performing, using the core control circuitry, a round-robin or barrel selection of a next thread identifier from the execution queue among the plurality of thread identifiers, each for executing a single instruction of a corresponding execution thread, until the execution thread is completed.

157. The method of claim 137, further comprising:

assigning, using the core control circuitry, a valid state or a suspended state to a thread identifier;

assigning, using the core control circuitry, a priority status to a thread identifier; and

returning, using the core control circuitry, the corresponding thread identifier having an assigned valid state and an assigned priority to the execution queue after execution of a corresponding instruction.

158. The method of claim 137, wherein the selecting step further comprises:

selecting, using the core control circuitry, a thread identifier from a first priority queue at a first frequency and a thread identifier from a second priority queue at a second frequency, the second frequency being lower than the first frequency.

159. The method of claim 158, further comprising:

determining, using the core control circuitry, the second frequency as a skip count of selections of a thread identifier from the first priority queue.
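
One plausible reading of claims 158 and 159, sketched in Python: the low-priority queue is consulted once every `skip_count` selections, so its selection frequency is lower than that of the high-priority queue. The exact counting policy and function names are assumptions.

```python
from collections import deque

def select_tid(high_q, low_q, skip_count, counter):
    """Illustrative two-level selection: pick from low_q once per
    skip_count selections, otherwise from high_q. Returns (tid, counter)."""
    counter += 1
    if counter >= skip_count and low_q:
        return low_q.popleft(), 0         # second (lower) frequency
    if high_q:
        return high_q.popleft(), counter  # first (higher) frequency
    return (low_q.popleft() if low_q else None), counter

high_q, low_q = deque([1, 2, 3]), deque([9])
tid, counter = select_tid(high_q, low_q, skip_count=3, counter=0)
assert tid == 1
```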

160. The method of claim 137, further comprising:

controlling data path access size using data path control circuitry.

161. The method of claim 160, further comprising:

increasing or decreasing, using the data path control circuitry, the memory load access size or the memory store access size in response to a time-averaged usage level.

162. The method of claim 137, further comprising:

increasing, using the core control circuitry, a size of a memory load access request to correspond to a cache line boundary of a data cache.

163. The method of claim 137, further comprising:

generating, using the network interface circuitry, one or more system calls to the host processor.

164. The method of claim 163, further comprising:

modulating, using a predetermined credit count, the number of system calls within any predetermined time period.
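
For claims 163 and 164, a hedged sketch of credit-based modulation of host system calls: a fixed number of credits is granted per time period and each system call consumes one. The refill policy and all names are assumptions for illustration.

```python
import time

class SyscallCredits:
    """Illustrative credit counter: at most max_credits system calls
    per period; further calls are deferred until credits are refilled."""
    def __init__(self, max_credits, period_s):
        self.max_credits = max_credits
        self.period_s = period_s
        self.credits = max_credits
        self.last_refill = time.monotonic()

    def try_issue(self):
        now = time.monotonic()
        if now - self.last_refill >= self.period_s:
            self.credits = self.max_credits   # start of a new period
            self.last_refill = now
        if self.credits > 0:
            self.credits -= 1
            return True    # system call packet may be sent to the host
        return False       # caller must retry later
```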

165. The method of claim 137, further comprising:

copying and transferring, using the core control circuitry, all data from the thread control memory corresponding to the selected thread identifier in response to a request from the host processor to monitor thread status.

166. The method of claim 137, further comprising:

executing, using the processor core, a fiber create instruction to generate one or more commands that generate one or more call work descriptor packets destined for another processor core or a hybrid thread fabric circuit; and

reserving, using the core control circuitry, a predetermined amount of memory space to store any return arguments in response to executing the fiber create instruction.

167. The method of claim 166, further comprising:

storing, using the core control circuitry, a thread return count in a thread return register in response to generating one or more call work descriptor packets, and decrementing the thread return count stored in the thread return register in response to receiving a return data packet.

168. The method of claim 167, wherein, responsive to the thread return count in the thread return register decrementing to zero, the core control circuitry changes a suspended state of a corresponding thread identifier to an active state for subsequent execution of a thread return instruction to complete a created fiber or thread.
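
A minimal sketch of the return-count join in claims 167 and 168: the creating thread records how many returns are outstanding, suspends, and becomes active again when the count reaches zero. The dictionary fields are hypothetical.

```python
# Illustrative sketch of the thread-return-register behavior.
def on_fibers_created(thread, packets_sent):
    thread["return_count"] = packets_sent  # one return expected per call packet
    thread["state"] = "suspended"          # wait for the created fibers

def on_return_packet(thread):
    thread["return_count"] -= 1
    if thread["return_count"] == 0:
        thread["state"] = "active"  # may now run its thread return instruction

t = {"tid": 3, "state": "active"}
on_fibers_created(t, packets_sent=2)
on_return_packet(t); on_return_packet(t)
assert t["state"] == "active"
```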

169. The method of claim 137, further comprising:

executing a wait or non-wait fiber join instruction using the processor core.

170. The method of claim 137, further comprising:

executing a fiber join all instruction using the processor core.

171. The method of claim 137, further comprising:

executing, using the processor core, a non-cached read or load instruction specifying a general purpose register for storing data received from memory.

172. The method of claim 137, further comprising:

executing, using the processor core, a non-cached write or store instruction designating data in a general purpose register for storage in memory.

173. The method of claim 137, further comprising:

assigning, using the core control circuitry, a transaction identifier to any load or store request to memory, and correlating the transaction identifier with a thread identifier.

174. The method of claim 137, further comprising:

executing, using the processor core, a custom atomic return instruction to complete a thread of execution of a custom atomic operation.

175. The method of claim 137, further comprising:

executing, using the processor core, a floating point atomic memory operation.

176. The method of claim 137, further comprising:

performing, using the processor core, a custom atomic memory operation.

177. The method of claim 137, further comprising:

providing, using a first instruction memory, a plurality of datapath configuration instructions to configure datapaths of configurable circuits of the plurality of configurable circuits; and

providing, using a second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of the synchronization network input.

178. The method of claim 137, further comprising:

providing, using a first instruction memory, a plurality of datapath configuration instructions to configure datapaths of configurable circuits of the plurality of configurable circuits; and

providing, using a second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to select a current datapath configuration instruction of the configurable computing circuit.

179. The method of claim 137, further comprising:

providing, using a first instruction memory, a plurality of datapath configuration instructions to configure datapaths of configurable circuits of the plurality of configurable circuits; and

providing, using a second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to select a next datapath configuration instruction for a next configurable circuit of the plurality of configurable circuits.

180. The method of claim 137, further comprising:

providing, using a conditional logic circuit, a conditional branch by modifying a next datapath instruction or a next datapath instruction index provided to a next configurable circuit of the plurality of configurable circuits, in dependence upon an output from a configurable circuit of the plurality of configurable circuits.

181. The method of claim 137, further comprising:

generating, using a flow control circuit, a stop signal when a predetermined threshold is reached in an asynchronous network output queue.

182. The method of claim 137, further comprising:

storing, using a plurality of control registers, a loop table having a plurality of thread identifiers and, for each thread identifier, a next thread identifier for execution after execution of a current thread, to provide in-order thread execution.

183. The method of claim 137, further comprising:

storing, using a plurality of control registers, a completion table having a first data completion count; and

queuing, using thread control circuitry, a thread for execution when the completion count for the thread identifier of the thread is decremented to zero.
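
The completion-count mechanism of claim 183, sketched informally in Python; representing the completion table as a dictionary keyed by thread identifier is an assumption.

```python
def on_completion_message(completion_table, run_queue, tid):
    """Each completion message decrements the count for tid; at zero all
    data dependencies are satisfied and the thread is queued to run."""
    completion_table[tid] -= 1
    if completion_table[tid] == 0:
        run_queue.append(tid)

table, run_queue = {5: 2}, []
on_completion_message(table, run_queue, 5)
on_completion_message(table, run_queue, 5)
assert run_queue == [5]
```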

184. The method of claim 137, further comprising:

providing, using a first instruction memory, a plurality of datapath configuration instructions to configure a datapath of a first configurable circuit of the plurality of configurable circuits;

providing, using a second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of a plurality of synchronization network inputs, to select a current datapath configuration instruction of the first configurable circuit, and to select a next datapath instruction or next datapath instruction index for a second, next configurable circuit of the plurality of configurable circuits;

providing, using a plurality of control registers, a completion table having a first data completion count; and

queuing, using thread control circuitry, a thread for execution when the completion count for the thread identifier of the thread is decremented to zero.

185. The method of claim 184, further comprising:

providing, using the second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to select a synchronous network output of the plurality of synchronous network outputs.

186. The method of claim 184, further comprising:

providing, using a configuration memory multiplexer, a first selection setting to select the current datapath configuration instruction using an instruction index from the second instruction and instruction index memory.

187. The method of claim 184, further comprising:

providing, using a configuration memory multiplexer, a second selection setting to select the current datapath configuration instruction using an instruction index from a master synchronization input, the second selection setting being different from the first selection setting.

188. The method of claim 184, further comprising:

providing, using the second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to configure a portion of the configurable circuit independently of the current datapath instruction.

189. The method of claim 184, further comprising:

selecting, using a configuration memory multiplexer, a spoke instruction and datapath configuration instruction index of the plurality of spoke instructions and datapath configuration instruction indices according to a modular spoke count.
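
For claim 189, a sketch of selection by modular spoke count: the spoke RAM entry in use rotates with the clock, wrapping modulo the number of entries. The entry fields shown are hypothetical.

```python
def current_spoke_entry(spoke_ram, clock_cycle):
    """Illustrative modular selection of a spoke instruction and datapath
    configuration instruction index."""
    spoke_count = clock_cycle % len(spoke_ram)
    return spoke_ram[spoke_count]

spoke_ram = [{"master_input": 0, "dp_index": 3},
             {"master_input": 2, "dp_index": 1}]
assert current_spoke_entry(spoke_ram, 5) == spoke_ram[1]
```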

190. The method of claim 137, further comprising:

providing, using a first instruction memory, a plurality of datapath configuration instructions to configure a datapath of a first configurable circuit of the plurality of configurable circuits;

providing, using a second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of a plurality of synchronization network inputs, to select a current datapath configuration instruction of the first configurable circuit, and to select a next datapath instruction or next datapath instruction index for a second, next configurable circuit of the plurality of configurable circuits;

providing, using a plurality of control registers, a completion table having a first data completion count; and

queuing, using thread control circuitry, a thread for execution when the completion count for the thread identifier of the thread is decremented to zero and its thread identifier is the next thread identifier.

191. The method of claim 137, further comprising:

storing, using a plurality of control registers, a completion table having a plurality of types of thread identifiers, wherein each type of thread identifier indicates a loop level for loop and nested loop execution, and wherein the plurality of control registers further store a top of a stack of thread identifiers; and

allowing each type of thread identifier to access the private variable for the selected loop.

192. The method of claim 137, further comprising:

storing, using a plurality of control registers, a completion table having a data completion count;

providing, using thread control circuitry, a continuation queue storing one or more thread identifiers for computing threads having completion counts that allow execution but not yet having an assigned thread identifier; and

providing, using the thread control circuitry, a re-entry queue storing one or more thread identifiers for computing threads having completion counts that allow execution and having assigned thread identifiers, such that the threads in the re-entry queue execute after a specified spoke count.

193. The method of claim 137, further comprising:

storing, using a plurality of control registers, a pool of thread identifiers and a completion table having a loop count of the number of active loop threads; and

decrementing, using thread control circuitry, the loop count in response to receiving an asynchronous fabric message that returns a thread identifier to the thread identifier pool, and transmitting an asynchronous fabric completion message when the loop count reaches zero.

194. The method of claim 137, further comprising:

providing, using a conditional logic circuit, a conditional branch by modifying a next datapath instruction or a next datapath instruction index, in dependence upon an output from a configurable circuit of the plurality of configurable circuits.

195. The method of claim 137, further comprising:

enabling a conditional logic circuit; and

specifying, using the conditional logic circuit and in dependence upon an output from a configurable circuit of the plurality of configurable circuits, the next datapath instruction or next datapath instruction index by OR-ing least significant bits of the next datapath instruction with the output from the configurable computing circuit, thereby providing a conditional branch.
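
Claim 195's branch encoding can be pictured as below: OR-ing the condition output into the least significant bit of the next datapath instruction index chooses between two adjacent instructions. The even-aligned base index and function name are assumptions.

```python
def next_dp_index(base_index, condition_output, enabled=True):
    """Illustrative conditional branch: the condition bit is OR-ed into
    the least significant bit of the next instruction index."""
    if not enabled:
        return base_index
    return base_index | (condition_output & 1)

assert next_dp_index(0b0110, 0) == 0b0110  # branch not taken
assert next_dp_index(0b0110, 1) == 0b0111  # branch taken
```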

196. The method of claim 137, further comprising:

selecting the master synchronization input using an input multiplexer.

197. The method of claim 137, further comprising:

selecting an output from a configurable circuit of the plurality of configurable circuits using an output multiplexer.

198. The method of claim 137, further comprising:

decoding, using an asynchronous fabric state machine coupled to an asynchronous network input queue and an asynchronous network output queue, input data packets received from the asynchronous packet network, and assembling output data packets for transmission over the asynchronous packet network.

199. The method of claim 137, further comprising:

providing a plurality of direct point-to-point connections coupling adjacent ones of the plurality of configurable circuits using a synchronization network.

200. The method of claim 199, further comprising:

providing a direct path connection between a plurality of input registers and a plurality of output registers using a first configurable circuit of the plurality of configurable circuits.

201. The method of claim 200, wherein the direct path connection provides a direct point-to-point connection for data received over the synchronous network from a second configurable circuit of the plurality of configurable circuits and transmitted over the synchronous network to a third configurable circuit of the plurality of configurable circuits.

202. The method of claim 137, further comprising:

performing, using a configurable circuit of the plurality of configurable circuits, at least one integer or floating point operation selected from the group consisting of: signed and unsigned addition, absolute value, negation, logical NOT, add and negate, subtraction A-B, reverse subtraction B-A, signed and unsigned greater than or equal, signed and unsigned less than or equal, equal or not equal comparisons, logical AND operations, logical OR operations, logical XOR operations, logical NAND operations, logical NOR operations, logical XNOR operations, and conversions between integer and floating point.

203. The method of claim 137, further comprising:

performing, using a configurable circuit of the plurality of configurable circuits, at least one integer or floating point operation selected from the group consisting of: multiplication, shifting, passing inputs, signed and unsigned multiplication, signed and unsigned right shifting, signed and unsigned left shifting, bit order reversal, permutation, conversion between integer and floating point, and combinations thereof.

204. The method of claim 137, further comprising:

receiving, using a scheduling interface circuit, a second work descriptor packet over the first interconnection network and, in response to the second work descriptor packet, generating one or more data and control packets to the plurality of configurable circuits to configure the plurality of configurable circuits to perform selected computations.

205. The method of claim 137, further comprising:

generating, using a flow control circuit, a stop signal when a predetermined threshold is reached in an asynchronous network output queue.

206. The method of claim 205, wherein each asynchronous network output queue stops outputting data packets on an asynchronous packet network in response to the stop signal.

207. The method of claim 205, wherein each configurable circuit of the plurality of configurable circuits stops executing upon completion of its current instruction in response to the stop signal.
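
A sketch of the flow control described in claims 205 through 207: when an asynchronous network output queue reaches its threshold, a stop signal is asserted and producers stall after finishing their current instruction. The queue API is hypothetical.

```python
class AsyncOutputQueue:
    """Illustrative threshold-based backpressure."""
    def __init__(self, threshold):
        self.packets = []
        self.threshold = threshold

    @property
    def stop(self):
        # Stop signal asserted at or above the predetermined threshold.
        return len(self.packets) >= self.threshold

    def push(self, packet):
        if self.stop:
            raise RuntimeError("stop asserted: producer must stall")
        self.packets.append(packet)
```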

208. The method of claim 137, further comprising:

coupling a first plurality of the plurality of configurable circuits in a first predetermined order through a synchronization network to form a first synchronization domain; and

coupling a second plurality of configurable circuits of the plurality of configurable circuits in a second predetermined order through the synchronization network to form a second synchronization domain.

209. The method of claim 208, further comprising:

generating a continuation message from the first synchronization domain to the second synchronization domain for transmission over the asynchronous packet network.

210. The method of claim 208, further comprising:

generating a completion message from the second synchronization domain to the first synchronization domain for transmission over the asynchronous packet network.

211. The method of claim 137, further comprising:

storing, in a plurality of control registers, a completion table having a first data completion count and a second iteration count.

212. The method of claim 137, further comprising:

storing a loop table having a plurality of thread identifiers in a plurality of control registers, and for each thread identifier, storing a next thread identifier for execution after execution of a current thread; and

storing an identification of a first iteration and an identification of a last iteration in the loop table in the plurality of control registers.

213. The method of claim 137, further comprising:

queuing, using the control circuitry, a thread for execution when the completion count for the thread identifier of the thread is decremented to zero.

214. The method of claim 137, further comprising:

queuing, using the control circuitry, a thread for execution when the completion count for the thread identifier of the thread is decremented to zero and its thread identifier is the next thread identifier.

215. The method of claim 137, further comprising:

queuing, using the control circuitry, a thread for execution when the completion count for the thread identifier of the thread indicates completion of any data dependencies.

216. The method of claim 214, wherein the completion count indicates, for each selected thread of a plurality of threads, a predetermined number of completion messages to be received before execution of the selected thread.

217. The method of claim 137, further comprising:

storing, in the plurality of control registers, a completion table having a plurality of types of thread identifiers, wherein each type of thread identifier indicates a loop level for loop and nested loop execution.

218. The method of claim 216, further comprising:

storing, in the plurality of control registers, a completion table having a loop count of the number of active loop threads; decrementing, using the control circuitry, the loop count in response to receiving an asynchronous fabric message that returns a thread identifier to a thread identifier pool; and transmitting an asynchronous fabric completion message when the loop count reaches zero.

219. The method of claim 217, further comprising:

storing the top of the thread identifier stack in the plurality of control registers to allow each type of thread identifier to access the private variable for the selected loop.

220. The method of claim 137, further comprising:

storing, using the continuation queue, one or more thread identifiers for computing threads having completion counts that allow execution but not yet having an assigned thread identifier.

221. The method of claim 219, further comprising:

storing, using the re-entry queue, one or more thread identifiers for computing threads having completion counts that allow execution and having assigned thread identifiers.

222. The method of claim 220, further comprising:

executing any threads having thread identifiers in the re-entry queue before executing any threads having thread identifiers in the continuation queue.

223. The method of claim 221, further comprising:

executing any threads having a thread identifier in a priority queue prior to executing any threads having a thread identifier in the continuation queue or the re-entry queue.

224. The method of claim 137, further comprising:

executing any thread in the run queue after the spoke count for the thread identifier occurs.
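
Claims 222 through 224 suggest an arbitration order among the thread queues, sketched here under the assumption that all queues are simple FIFOs: priority first, then re-entry, then continuation, with run-queue threads eligible only once their spoke count has occurred.

```python
from collections import deque

def pick_next(priority_q, reentry_q, continuation_q, run_q, spoke_ok):
    """Illustrative arbitration among the queues named in the claims."""
    for q in (priority_q, reentry_q, continuation_q):
        if q:
            return q.popleft()
    if spoke_ok and run_q:
        return run_q.popleft()
    return None

assert pick_next(deque(), deque([4]), deque([8]), deque([2]), True) == 4
```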

225. The method of claim 137, further comprising:

self-scheduling, using the control circuitry, computing threads for execution.

226. The method of claim 223, further comprising:

ordering, using the control circuitry, the computing threads for execution.

227. The method of claim 223, further comprising:

ordering, using the control circuitry, loop computation threads for execution.

228. The method of claim 223, further comprising:

commencing, using the control circuitry, execution of a computing thread in response to one or more completion signals from data dependencies.

229. A configurable circuit, comprising:

a configurable computing circuit;

a plurality of synchronous network inputs coupled to the configurable computing circuitry;

a plurality of synchronous network outputs coupled to the configurable computing circuitry; and

a configuration memory coupled to the configurable computing circuitry, the control circuitry, the synchronous network input, and the synchronous network output, the configuration memory comprising:

a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and

a second instruction and instruction index memory storing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of the synchronization network input.

230. A configurable circuit, comprising:

a configurable computing circuit; and

a configuration memory coupled to the configurable computing circuitry, the control circuitry, the synchronous network input, and the synchronous network output, the configuration memory comprising:

a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and

a second instruction and instruction index memory storing a plurality of spoke instructions and datapath configuration instruction indices to select a current datapath configuration instruction of the configurable computing circuit.

231. A configurable circuit, comprising:

a configurable computing circuit; and

a configuration memory coupled to the configurable computing circuitry, the control circuitry, the synchronous network input, and the synchronous network output, the configuration memory comprising:

a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and

a second instruction and instruction index memory storing a plurality of spoke instructions and data path configuration instruction indices to select a next data path configuration instruction for a next configurable computational circuit.

232. A configurable circuit, comprising:

a configurable computing circuit;

a control circuit coupled to the configurable computing circuit;

a first memory circuit coupled to the configurable computing circuit;

a plurality of synchronous network inputs coupled to the configurable computing circuitry;

a plurality of synchronous network outputs coupled to the configurable computing circuitry; and

a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output, the configuration memory circuit comprising:

a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and

a second instruction and instruction index memory storing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of the synchronization network input.

233. A configurable circuit, comprising:

a configurable computing circuit;

a first memory circuit coupled to the configurable computing circuit;

a plurality of synchronous network inputs coupled to the configurable computing circuitry;

a plurality of synchronous network outputs coupled to the configurable computing circuitry; and

a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output; and

a control circuit coupled to the configurable computing circuit, the control circuit comprising:

a memory control circuit;

a thread control circuit; and

a plurality of control registers.

234. A configurable circuit, comprising:

a configurable computing circuit;

a configuration memory coupled to the configurable computing circuitry, the control circuitry, the synchronous network input, and the synchronous network output, the configuration memory comprising:

a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and

a second instruction and instruction index memory storing a plurality of spoke instructions and data path configuration instruction indices to select a next data path instruction or a next data path instruction index for a next configurable computational circuit; and

a conditional logic circuit coupled to the configurable computation circuit, wherein the conditional logic circuit is to provide a conditional branch by modifying the next data path instruction or next data path instruction index provided on a selected output of the plurality of synchronous network outputs as a function of an output from the configurable computation circuit.

235. A configurable circuit, comprising:

a configurable computing circuit;

a control circuit coupled to the configurable computing circuit;

a first memory circuit coupled to the configurable computing circuit;

a plurality of synchronous network inputs coupled to the configurable computing circuitry;

a plurality of synchronous network outputs coupled to the configurable computing circuitry;

an asynchronous network input queue coupled to an asynchronous packet network and the first memory circuit;

an asynchronous network output queue; and

a flow control circuit coupled to the asynchronous network output queue, the flow control circuit to generate a stop signal when a predetermined threshold is reached in the asynchronous network output queue.

236. A configurable circuit, comprising:

a configurable computing circuit;

a first memory circuit coupled to the configurable computing circuit;

a plurality of synchronous network inputs coupled to the configurable computing circuitry;

a plurality of synchronous network outputs coupled to the configurable computing circuitry; and

a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output; and

a control circuit coupled to the configurable computing circuit, the control circuit comprising:

a memory control circuit;

a thread control circuit; and

a plurality of control registers, wherein the plurality of control registers store a loop table having a plurality of thread identifiers and, for each thread identifier, a next thread identifier for execution after execution of a current thread to provide in-order thread execution.

237. A configurable circuit, comprising:

a configurable computing circuit;

a first memory circuit coupled to the configurable computing circuit;

a plurality of synchronous network inputs coupled to the configurable computing circuitry;

a plurality of synchronous network outputs coupled to the configurable computing circuitry; and

a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output; and

a control circuit coupled to the configurable computing circuit, the control circuit comprising:

a memory control circuit;

a plurality of control registers, wherein the plurality of control registers store a completion table having a first data completion count; and

thread control circuitry to queue a thread for execution when a completion count for the thread is decremented to zero for the thread identifier of the thread.

238. A configurable circuit, comprising:

a configurable computing circuit;

a first memory circuit coupled to the configurable computing circuit;

a plurality of synchronous network inputs and outputs coupled to the configurable computing circuitry;

an asynchronous network input queue and an asynchronous network output queue;

a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output, the second configuration memory comprising:

a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and

a second instruction and instruction index memory storing: a plurality of spoke instructions and datapath configuration instruction indices for selecting a master synchronization input of the synchronization network input, for selecting a current datapath configuration instruction of the configurable computation circuit, and for selecting a next datapath instruction or a next datapath instruction index of a next configurable computation circuit; and

a control circuit coupled to the configurable computing circuit, the control circuit comprising:

a memory control circuit;

a plurality of control registers, wherein the plurality of control registers store a completion table having a first data completion count; and

thread control circuitry to queue a thread for execution when a completion count for the thread is decremented to zero for the thread identifier of the thread.

239. A configurable circuit, comprising:

a configurable computing circuit;

a first memory circuit coupled to the configurable computing circuit;

a plurality of synchronous network inputs coupled to the configurable computing circuitry;

a plurality of synchronous network outputs coupled to the configurable computing circuitry; and

a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output; and

a control circuit coupled to the configurable computing circuit, the control circuit comprising:

a memory control circuit;

a plurality of control registers, wherein the plurality of control registers store a completion table having a first data completion count; and

thread control circuitry for queuing a thread for execution when a completion count for the thread is decremented to zero for the thread identifier of the thread and its thread identifier is the next thread.

240. A configurable circuit, comprising:

a configurable computing circuit;

a first memory circuit coupled to the configurable computing circuit;

a plurality of synchronous network inputs coupled to the configurable computing circuitry;

a plurality of synchronous network outputs coupled to the configurable computing circuitry; and

a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output; and

a control circuit coupled to the configurable computing circuit, the control circuit comprising:

a memory control circuit;

a thread control circuit; and

a plurality of control registers storing a completion table having a plurality of types of thread identifiers, wherein each type of thread identifier indicates a loop level for loop and nested loop execution, and wherein the plurality of control registers further stores a top of a stack of thread identifiers to allow each type of thread identifier to access a private variable for a selected loop.

241. A configurable circuit, comprising:

a configurable computing circuit;

a first memory circuit coupled to the configurable computing circuit;

a plurality of synchronous network inputs coupled to the configurable computing circuitry;

a plurality of synchronous network outputs coupled to the configurable computing circuitry; and

a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output; and

a control circuit coupled to the configurable computing circuit, the control circuit comprising:

a memory control circuit;

a plurality of control registers; and

thread control circuitry, comprising:

a continuation queue that stores one or more thread identifiers for computing threads that have completion counts that allow execution but do not yet have an assigned thread identifier; and

a re-entry queue that stores one or more thread identifiers for computing threads having completion counts allowed for execution and having assigned thread identifiers such that the threads in the re-entry queue execute after a specified spoke count.

242. A configurable circuit, comprising:

a configurable computing circuit;

a first memory circuit coupled to the configurable computing circuit;

a plurality of synchronous network inputs coupled to the configurable computing circuitry;

a plurality of synchronous network outputs coupled to the configurable computing circuitry;

a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output; and

a control circuit coupled to the configurable computing circuit, the control circuit comprising:

a memory control circuit;

a plurality of control registers storing a pool of thread identifiers and a completion table having a loop count of an active loop thread number; and

thread control circuitry, wherein in response to receiving an asynchronous fabric message that returns a thread identifier to the thread identifier pool, the control circuitry decrements the loop count and transmits an asynchronous fabric completion message when the loop count reaches zero.

243. A system, comprising:

an asynchronous packet network;

a synchronization network; and

a plurality of configurable circuits arranged in an array, each configurable circuit of the plurality of configurable circuits being simultaneously coupled to the synchronous network and the asynchronous packet network, the plurality of configurable circuits being configured to form a plurality of synchronous domains using the synchronous network to perform a plurality of computations, and the plurality of configurable circuits being further configured to generate and transmit a plurality of control messages over the asynchronous packet network, the plurality of control messages including one or more completion messages and continuation messages.

244. A system, comprising:

a plurality of configurable circuits arranged in an array;

a synchronization network coupled to each configurable circuit of the plurality of configurable circuits of the array; and

an asynchronous packet network coupled to each configurable circuit of the plurality of configurable circuits of the array.

245. A system, comprising:

an interconnection network;

a processor coupled to the interconnection network; and

a plurality of configurable circuit groups coupled to the interconnection network.

246. A system, comprising:

an interconnection network;

a processor coupled to the interconnection network;

a host interface coupled to the interconnection network; and

a plurality of configurable circuit groups coupled to the interconnection network, each configurable circuit group of the plurality of configurable circuit groups comprising:

a plurality of configurable circuits arranged in an array;

a synchronization network coupled to each configurable circuit of the plurality of configurable circuits of the array;

an asynchronous packet network coupled to each configurable circuit of the plurality of configurable circuits of the array;

a memory interface coupled to the asynchronous packet network and the interconnection network; and

a scheduling interface coupled to the asynchronous packet network and the interconnection network.

247. A system, comprising:

a hierarchical interconnect network including a first plurality of crossbars having a folded Clos configuration and a plurality of direct mesh connections at interfaces with endpoints;

a processor coupled to the interconnection network;

a host interface coupled to the interconnection network; and

a plurality of configurable circuit groups coupled to the interconnection network, each configurable circuit group of the plurality of configurable circuit groups comprising:

a plurality of configurable circuits arranged in an array;

a synchronization network coupled to each configurable circuit of the plurality of configurable circuits of the array and providing a plurality of direct connections between adjacent configurable circuits of the array;

an asynchronous packet network comprising a second plurality of crossbars, each crossbar coupled to at least one configurable circuit of the plurality of configurable circuits of the array and another crossbar of the second plurality of crossbars;

a memory interface coupled to the asynchronous packet network and the interconnection network; and

a scheduling interface coupled to the asynchronous packet network and the interconnection network.

248. A system, comprising:

an interconnection network;

a processor coupled to the interconnection network;

a host interface coupled to the interconnection network; and

a plurality of configurable circuit groups coupled to the interconnection network, each configurable circuit group of the plurality of configurable circuit groups comprising:

a synchronization network;

an asynchronous packet network;

a memory interface coupled to the asynchronous packet network and the interconnection network;

a scheduling interface coupled to the asynchronous packet network and the interconnection network; and

a plurality of configurable circuits arranged in an array, each configurable circuit comprising:

a configurable computing circuit;

a control circuit coupled to the configurable computing circuit, the control circuit comprising: a memory control circuit;

a thread control circuit; and a plurality of control registers;

a first memory circuit coupled to the configurable computing circuit;

a plurality of synchronization network inputs and outputs coupled to the configurable computing circuitry and the synchronization network;

an asynchronous network input queue and an asynchronous network output queue coupled to the asynchronous packet network;

a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output, the configuration memory circuit comprising:

a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and

a second instruction and instruction index memory storing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of the synchronization network input.

249. The configurable circuit or system of any of preceding claims 227 to 246, wherein the second instruction and instruction index memory further stores a plurality of spoke instructions and datapath configuration instruction indices to select a current datapath configuration instruction of the configurable computational circuit.

250. The configurable circuit or system of any of preceding claims 227 to 247, wherein the second instruction and instruction index memory further stores a plurality of spoke instructions and datapath configuration instruction indices to select a next datapath configuration instruction for a next configurable computational circuit.

251. The configurable circuit or system of any one of the preceding claims 227 to 248, wherein the second instruction and instruction index memory further stores a plurality of spoke instructions and a data path configuration instruction index to select a synchronous network output of the plurality of synchronous network outputs.

252. The configurable circuit or system of any of preceding claims 227-249, further comprising:

a configuration memory multiplexer coupled to the first instruction memory and the second instruction and instruction index memory.

253. The configurable circuit or system of any one of preceding claims 227 to 250, wherein the current datapath configuration instruction is selected using an instruction index from the second instruction and instruction index memory when a select input of the configuration memory multiplexer has a first setting.

254. The configurable circuit or system of any one of preceding claims 227-251, wherein the current datapath configuration instruction is selected using an instruction index from the master synchronization input when the select input of the configuration memory multiplexer has a second setting that is different from the first setting.

255. The configurable circuit or system of any of preceding claims 227 to 252, wherein the second instruction and instruction index memory further stores a plurality of spoke instructions and a data path configuration instruction index to configure portions of the configurable circuit independent of the current data path instruction.

256. The configurable circuit or system of any one of preceding claims 227 to 253, wherein a selected spoke instruction and data path configuration instruction index of the plurality of spoke instruction and data path configuration instruction indices is selected according to a modular spoke count.

257. The configurable circuit or system of any one of the preceding claims 227-254, further comprising:

a conditional logic circuit coupled to the configurable computing circuit.

258. The configurable circuit or system of any one of the preceding claims 227 to 255, wherein the conditional logic circuitry is to modify the next datapath instruction index provided on a selected one of the plurality of synchronous network outputs in dependence on an output from the configurable computation circuitry.

259. The configurable circuit or system of any one of the preceding claims 227 to 256, wherein the conditional logic circuitry is to provide conditional branching by modifying the next data path instruction or next data path instruction index provided on a selected output of the plurality of synchronous network outputs in dependence on an output from the configurable computation circuitry.

260. The configurable circuit or system of any of preceding claims 227 to 257, wherein, when enabled, the conditional logic circuitry is to specify the next datapath instruction or datapath instruction index by OR-ing least significant bits of the next datapath instruction with the output from the configurable computation circuitry, providing a conditional branch.

261. The configurable circuit or system of any one of the preceding claims 227 to 258, wherein, when enabled, the conditional logic circuitry is to specify the next datapath instruction index by OR-ing least significant bits of the next datapath instruction index with the output from the configurable computation circuitry, providing a conditional branch.

262. The configurable circuit or system of any one of preceding claims 227-259, wherein the plurality of synchronous network inputs comprises:

a plurality of input registers coupled to a plurality of communication lines of the synchronous network; and

an input multiplexer coupled to the plurality of input registers and the second instruction and instruction index memory to select the master synchronization input.

263. The configurable circuit or system of any of preceding claims 227 to 260, wherein the plurality of synchronous network outputs comprises:

a plurality of output registers coupled to a plurality of communication lines of the synchronous network; and

an output multiplexer coupled to the configurable computing circuitry to select an output from the configurable computing circuitry.

264. The configurable circuit or system of any one of preceding claims 227-261, further comprising:

an asynchronous fabric state machine coupled to the asynchronous network input queue and the asynchronous network output queue, the asynchronous fabric state machine to decode input packets received from the asynchronous packet network and assemble output packets for transmission over the asynchronous packet network.

265. The configurable circuit or system of any of preceding claims 227 to 262, wherein the asynchronous packet network comprises a plurality of crossbars, each crossbar coupled to a plurality of configurable circuits and at least one other crossbar.

266. The configurable circuit or system of any of the preceding claims 227-263, further comprising:

an array of a plurality of configurable circuits, wherein:

each configurable circuit is coupled to the synchronization network through the plurality of synchronization network inputs and the plurality of synchronization network outputs; and

each configurable circuit is coupled to the asynchronous packet network through the asynchronous network input and the asynchronous network output.

267. The configurable circuit or system of any of preceding claims 227 to 264, wherein the synchronization network comprises a plurality of direct point-to-point connections coupling adjacent configurable circuits in the array of the plurality of configurable circuits.

268. The configurable circuit or system of any one of preceding claims 227 to 265, wherein each configurable circuit further comprises:

a direct path connection between the plurality of input registers and the plurality of output registers.

269. The configurable circuit or system of any one of preceding claims 227 to 266, wherein the direct path connection provides a direct point-to-point connection for data received over the synchronous network from a second configurable circuit and transmitted over the synchronous network to a third configurable circuit.

270. The configurable circuit or system of any one of the preceding claims 227 to 267, wherein the configurable computation circuitry comprises arithmetic, logic, and bit operation circuitry for performing at least one integer operation selected from the group consisting of: signed and unsigned addition, absolute value, negation, logical NOT, add and negate, subtraction A-B, reverse subtraction B-A, signed and unsigned greater than or equal, signed and unsigned less than or equal, equal or not equal comparisons, logical AND operations, logical OR operations, logical XOR operations, logical NAND operations, logical NOR operations, logical XNOR operations, and conversions between integer and floating point.

271. The configurable circuit or system of any one of the preceding claims 227 to 268, wherein the configurable computing circuitry comprises arithmetic, logic, and bit operation circuitry for performing at least one floating point operation selected from the group consisting of: signed and unsigned addition, absolute value, negation, logical NOT, add and negate, subtraction A-B, reverse subtraction B-A, signed and unsigned greater than or equal, signed and unsigned less than or equal, equal or not equal comparisons, logical AND operations, logical OR operations, logical XOR operations, logical NAND operations, logical NOR operations, logical XNOR operations, conversions between integer and floating point, and combinations thereof.

272. The configurable circuit or system of any one of preceding claims 227 to 269, wherein the configurable computation circuitry comprises multiply and shift operation circuitry for performing at least one integer operation selected from the group consisting of: multiplication, shifting, passing inputs, signed and unsigned multiplication, signed and unsigned right shifting, signed and unsigned left shifting, bit order reversal, permutation, conversion between integer and floating point, and combinations thereof.

273. The configurable circuit or system of any one of the preceding claims 227 to 270, wherein the configurable computation circuitry comprises multiply and shift operation circuitry for performing at least one floating point operation selected from the group consisting of: multiplication, shifting, passing inputs, signed and unsigned multiplication, signed and unsigned right shifting, signed and unsigned left shifting, bit order reversal, permutation, conversion between integer and floating point, and combinations thereof.

274. The configurable circuit or system of any of preceding claims 227-271, wherein the array of the plurality of configurable circuits is further coupled to a first interconnection network.

275. The configurable circuit or system of any of preceding claims 227-272, wherein the array of the plurality of configurable circuits further comprises:

a third system memory interface circuit; and

a scheduling interface circuit.

276. The configurable circuit or system of any of preceding claims 227 to 273, wherein the scheduling interface circuit is to receive a work descriptor packet over the first interconnection network and, in response to the work descriptor packet, generate one or more data and control packets to the plurality of configurable circuits to configure the plurality of configurable circuits to perform selected calculations.

277. The configurable circuit or system of any one of the preceding claims 227-274, further comprising:

a flow control circuit coupled to the asynchronous network output queue, the flow control circuit to generate a stop signal when a predetermined threshold is reached in the asynchronous network output queue.

278. The configurable circuit or system of any one of preceding claims 227 to 275, wherein each asynchronous network output queue stops outputting data packets on the asynchronous packet network in response to the stop signal.

279. The configurable circuit or system of any one of preceding claims 227 to 276, wherein each configurable computing circuit stops executing after completion of its current instruction in response to the stop signal.

280. The configurable circuit or system of any of preceding claims 227-277, wherein a first plurality of configurable circuits in the array of a plurality of configurable circuits is coupled in a first predetermined order through the synchronization network to form a first synchronization domain; and wherein a second plurality of configurable circuits in the array of configurable circuits is coupled in a second predetermined order through the synchronization network to form a second synchronization domain.

281. The configurable circuit or system of any of preceding claims 227-278, wherein the first synchronization domain is to generate a continuation message for transmission over the asynchronous packet network to the second synchronization domain.

282. The configurable circuit or system of any of preceding claims 227 to 279, wherein the second synchronization domain is to generate a completion message for transmission over the asynchronous packet network to the first synchronization domain.

283. The configurable circuit or system of any one of preceding claims 227 to 280, wherein the plurality of control registers store a completion table having a first data completion count.

284. The configurable circuit or system of any one of preceding claims 227-281, wherein the plurality of control registers further store the completion table with a second iteration count.

285. The configurable circuit or system of any one of the preceding claims 227 to 282, wherein the plurality of control registers further store a loop table having a plurality of thread identifiers and, for each thread identifier, a next thread identifier for execution after a current thread is executed.

286. The configurable circuit or system of any one of the preceding claims 227 to 283, wherein the plurality of control registers further store an identification of a first iteration and an identification of a last iteration in the loop table.

287. The configurable circuit or system of any one of preceding claims 227 to 283, wherein the control circuitry is to queue a thread for execution when, for the thread identifier of the thread, a completion count for the thread is decremented to zero and its thread identifier is the next thread identifier.

288. The configurable circuit or system of any one of preceding claims 227 to 284, wherein the control circuitry is to queue a thread for execution when a completion count for the thread indicates completion of any data dependencies for the thread identifier of the thread.

289. The configurable circuit or system of any one of preceding claims 227 to 285, wherein the completion count indicates a predetermined number of completion messages received for each selected thread of a plurality of threads before execution of the selected thread.

290. The configurable circuit or system of any one of the preceding claims 227 to 286, wherein the plurality of control registers further store a completion table having a plurality of types of thread identifiers, wherein each type of thread identifier indicates a loop level for loop and nested loop execution.

291. The configurable circuit or system of any one of the preceding claims 227 to 287, wherein the plurality of control registers further store a completion table having a loop count for the number of active loop threads, and wherein in response to receiving an asynchronous fabric message that returns a thread identifier to a thread identifier pool, the control circuitry decrements the loop count and transmits an asynchronous fabric completion message when the loop count reaches zero.

292. The configurable circuit or system of any one of the preceding claims 227 to 288, wherein the plurality of control registers further store a top of a stack of thread identifiers to allow each type of thread identifier to access a private variable for a selected loop.

293. The configurable circuit or system of any of preceding claims 227-289, wherein the control circuit further comprises:

a continuation queue; and

a re-entry queue.

294. The configurable circuit or system of any one of the preceding claims 227 to 290, wherein the continuation queue stores one or more thread identifiers for computing threads that have completion counts that allow execution but do not yet have an assigned thread identifier.

295. The configurable circuit or system of any one of the preceding claims 227 to 291, wherein the re-entry queue stores one or more thread identifiers for computing threads having completion counts that allow execution and having an assigned thread identifier.

296. The configurable circuit or system of any one of preceding claims 227 to 292, wherein any thread in the re-entry queue having a thread identifier is executed before any thread in the continuation queue having a thread identifier is executed.

297. The configurable circuit or system of any one of the preceding claims 227-293, wherein the control circuit further comprises:

a priority queue, wherein any thread in the priority queue having a thread identifier executes before any thread in the continuation queue or the re-entry queue having a thread identifier executes.

298. The configurable circuit or system of any of preceding claims 227-294, wherein the control circuit further comprises:

a run queue, wherein any thread in the run queue having a thread identifier executes upon occurrence of the spoke count for that thread identifier.
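
By way of illustration only (not part of the claims), a minimal C sketch of one consistent reading of the queue ordering in claims 296 to 298: the priority queue preempts the re-entry queue, which preempts the continuation queue; the run queue, whose threads fire only at their recorded spoke count, is omitted here. Queue sizes and names are illustrative assumptions.

    #include <stdint.h>
    #include <stdio.h>

    #define QCAP 16                       /* illustrative queue capacity */

    typedef struct { uint8_t buf[QCAP]; unsigned head, tail; } queue_t;

    static int     q_empty(const queue_t *q)       { return q->head == q->tail; }
    static void    q_push(queue_t *q, uint8_t tid) { q->buf[q->tail++ % QCAP] = tid; }
    static uint8_t q_pop(queue_t *q)               { return q->buf[q->head++ % QCAP]; }

    /* Select the next TID: priority first, then re-entry, then continuation. */
    static int select_tid(queue_t *prio, queue_t *reentry, queue_t *cont, uint8_t *out) {
        if (!q_empty(prio))    { *out = q_pop(prio);    return 1; }
        if (!q_empty(reentry)) { *out = q_pop(reentry); return 1; }
        if (!q_empty(cont))    { *out = q_pop(cont);    return 1; }
        return 0;
    }

    int main(void) {
        queue_t prio = {0}, reentry = {0}, cont = {0};
        q_push(&cont, 5);
        q_push(&reentry, 9);
        q_push(&prio, 2);
        uint8_t tid;
        while (select_tid(&prio, &reentry, &cont, &tid))
            printf("execute TID %u\n", tid);   /* prints 2, then 9, then 5 */
        return 0;
    }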

299. The configurable circuit or system of any of preceding claims 227-295, wherein the second configuration memory circuit comprises:

a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and

a second instruction and instruction index memory storing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of the synchronization network inputs.
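
By way of illustration only (not part of the claims), a minimal C sketch of the two configuration memories of claim 299, combined with the modular spoke count of claim 323: each cycle, the spoke count selects a spoke instruction, and that instruction's index selects the current datapath configuration instruction. Memory depths and field names are illustrative assumptions.

    #include <stdint.h>
    #include <stdio.h>

    #define N_DP_INSTR 32                 /* illustrative memory depths */
    #define N_SPOKES   8

    typedef struct { uint32_t dp_cfg; } dp_instr_t;                       /* datapath word */
    typedef struct { uint32_t spoke_cfg; uint8_t dp_index; } spoke_instr_t;

    static dp_instr_t    dp_mem[N_DP_INSTR];   /* first instruction memory */
    static spoke_instr_t spoke_mem[N_SPOKES];  /* second instruction and index memory */

    /* The selected spoke instruction configures tile resources (e.g. the
       master synchronization input) and its index picks the current
       datapath configuration instruction. */
    static void tick(uint32_t cycle, uint32_t *spoke_cfg, uint32_t *dp_cfg) {
        uint32_t spoke = cycle % N_SPOKES;     /* modular spoke count */
        *spoke_cfg = spoke_mem[spoke].spoke_cfg;
        *dp_cfg    = dp_mem[spoke_mem[spoke].dp_index].dp_cfg;
    }

    int main(void) {
        spoke_mem[2].dp_index = 5;
        dp_mem[5].dp_cfg = 0xABCD;
        uint32_t s, d;
        tick(10, &s, &d);                      /* cycle 10 -> spoke 2 */
        printf("spoke_cfg=%u dp_cfg=0x%X\n", (unsigned)s, (unsigned)d);
        return 0;
    }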

300. The configurable circuit or system of any of the preceding claims 227 to 296, wherein the control circuitry is to self-schedule a computing thread for execution.

301. The configurable circuit or system of any of preceding claims 227-297, wherein the conditional logic circuitry is to branch to a different second next instruction for execution by a next configurable circuit.

302. The configurable circuit or system of any one of preceding claims 227 to 298, wherein the control circuitry is to order computing threads for execution.

303. The configurable circuit or system of any one of the preceding claims 227 to 299, wherein the control circuitry is to order loop computing threads for execution.

304. The configurable circuit or system of any one of preceding claims 227 to 300, wherein the control circuitry is to begin executing a computing thread in response to one or more completion signals from data dependencies.

305. A method of configuring a configurable circuit, comprising:

providing, using a first instruction memory, a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and

providing, using a second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of the synchronization network inputs.

306. A method of configuring a configurable circuit, comprising:

providing, using a first instruction memory, a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and

providing a plurality of spoke instructions and datapath configuration instruction indices to select a current datapath configuration instruction of the configurable computing circuit using a second instruction and instruction index memory.

307. A method of configuring a configurable circuit, comprising:

providing, using a first instruction memory, a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and

providing, using a second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to select a next datapath configuration instruction of a next configurable computing circuit.

308. A method of controlling thread execution of a multi-threaded configurable circuit, the configurable circuit having configurable computing circuitry, the method comprising:

providing, using conditional logic circuitry, a conditional branch by modifying the next datapath instruction or a next datapath instruction index provided to a next configurable circuit, in dependence upon an output from the configurable computing circuitry.

309. A method of controlling thread execution of a multi-threaded configurable circuit, comprising:

generating, using a flow control circuit, a stop signal when a predetermined threshold is reached in an asynchronous network output queue.
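
By way of illustration only (not part of the claims), a minimal C sketch of this flow-control rule, assuming an illustrative queue depth and threshold: the stop signal simply tracks whether the output queue's occupancy has reached the threshold.

    #include <stdio.h>

    #define Q_DEPTH        8          /* illustrative output queue depth */
    #define STOP_THRESHOLD 6          /* illustrative stop threshold */

    static int occupancy;
    static int stop_signal;

    static void update_stop(void) { stop_signal = (occupancy >= STOP_THRESHOLD); }

    static void enqueue_packet(void) { if (occupancy < Q_DEPTH) occupancy++; update_stop(); }
    static void dequeue_packet(void) { if (occupancy > 0) occupancy--; update_stop(); }

    int main(void) {
        for (int i = 0; i < 7; i++) enqueue_packet();
        printf("occupancy=%d stop=%d\n", occupancy, stop_signal);  /* stop=1 */
        dequeue_packet();
        dequeue_packet();
        printf("occupancy=%d stop=%d\n", occupancy, stop_signal);  /* stop=0 */
        return 0;
    }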

310. A method of controlling thread execution of a multi-threaded configurable circuit, comprising:

storing, using a plurality of control registers, a loop table having a plurality of thread identifiers and, for each thread identifier, a next thread identifier for execution after execution of a current thread, to provide in-order thread execution.

311. A method of controlling thread execution of a multi-threaded configurable circuit, comprising:

storing, using a plurality of control registers, a completion table having a first data completion count; and

queuing, using thread control circuitry, a thread for execution when the completion count for the thread's thread identifier is decremented to zero.
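
By way of illustration only (not part of the claims), a minimal C sketch of the completion-count rule, assuming a small fixed-size completion table: each completion message decrements the count for its thread identifier, and the thread is queued once the count reaches zero.

    #include <stdint.h>
    #include <stdio.h>

    #define MAX_TIDS 16                       /* illustrative table size */

    static int completion_count[MAX_TIDS];    /* pending dependencies per TID */

    static void queue_for_execution(uint8_t tid) { printf("TID %u queued\n", tid); }

    /* Called for each completion message received for a TID. */
    static void on_completion_message(uint8_t tid) {
        if (--completion_count[tid] == 0)
            queue_for_execution(tid);
    }

    int main(void) {
        completion_count[4] = 2;              /* thread 4 awaits two completions */
        on_completion_message(4);             /* count -> 1 */
        on_completion_message(4);             /* count -> 0: thread queued */
        return 0;
    }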

312. A method of configuring and controlling thread execution of multithreaded configurable circuitry having configurable computing circuitry, the method comprising:

providing, using a first instruction memory, a plurality of configuration instructions to configure a data path of the configurable computing circuit;

providing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of a plurality of synchronization network inputs, select a current datapath configuration instruction of the configurable computing circuit, and select a next datapath instruction or a next datapath instruction index of a next configurable computing circuit using a second instruction and instruction index memory;

providing, using a plurality of control registers, a completion table having a first data completion count; and

queuing, using thread control circuitry, a thread for execution when the completion count for the thread's thread identifier is decremented to zero.

313. A method of configuring and controlling thread execution of multithreaded configurable circuitry having configurable computing circuitry, the method comprising:

providing, using a first instruction memory, a plurality of configuration instructions to configure a data path of the configurable computing circuit;

providing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of a plurality of synchronization network inputs, select a current datapath configuration instruction of the configurable computing circuit, and select a next datapath instruction or a next datapath instruction index of a next configurable computing circuit using a second instruction and instruction index memory;

providing, using a plurality of control registers, a completion table having a first data completion count; and

queuing, using thread control circuitry, a thread for execution when the completion count for the thread's thread identifier is decremented to zero and its thread identifier is the next thread.

314. A method of controlling thread execution of a multi-threaded configurable circuit, comprising:

storing, using a plurality of control registers, a completion table having a plurality of types of thread identifiers, wherein each type of thread identifier indicates a loop level for loop and nested loop execution, and wherein the plurality of control registers further store a top of a stack of thread identifiers; and

allowing each type of thread identifier to access a private variable for a selected loop.

315. A method of controlling thread execution of a multi-threaded configurable circuit, comprising:

storing, using a plurality of control registers, a completion table having a data completion count;

providing, using thread control circuitry, a continuation queue storing one or more thread identifiers for computing threads having completion counts allowing execution but not yet having an assigned thread identifier; and

providing, using the thread control circuitry, a re-entry queue storing one or more thread identifiers for computing threads having completion counts allowing execution and having an assigned thread identifier, such that the threads in the re-entry queue execute after a specified spoke count.

316. A method of controlling thread execution of a multi-threaded configurable circuit, comprising:

storing, using a plurality of control registers, a pool of thread identifiers and a completion table having a loop count of the number of active loop threads; and

using thread control circuitry, in response to receiving an asynchronous fabric message that returns a thread identifier to the thread identifier pool, decrementing the loop count, and transmitting an asynchronous fabric completion message when the loop count reaches zero.
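
By way of illustration only (not part of the claims), a minimal C sketch of this loop bookkeeping, with illustrative types and sizes: returning a thread identifier to the pool decrements the loop count, and the last return triggers the asynchronous fabric completion message.

    #include <stdint.h>
    #include <stdio.h>

    #define MAX_TIDS 16

    static uint8_t tid_pool[MAX_TIDS];
    static int     pool_top;
    static int     loop_count;                /* active loop threads */

    static void send_async_completion(void) { printf("loop complete\n"); }

    /* Handler for the asynchronous fabric message returning a TID. */
    static void on_tid_return(uint8_t tid) {
        tid_pool[pool_top++] = tid;           /* TID is available again */
        if (--loop_count == 0)
            send_async_completion();          /* last active iteration done */
    }

    int main(void) {
        loop_count = 3;                       /* three iterations in flight */
        on_tid_return(0);
        on_tid_return(1);
        on_tid_return(2);                     /* emits the completion message */
        return 0;
    }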

317. The method of any of the preceding claims 302-313, further comprising:

providing, using the second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to select a current datapath configuration instruction of the configurable computing circuit.

318. The method of any of the preceding claims 302-314, further comprising:

providing, using the second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to select a next datapath configuration instruction for a next configurable computing circuit.

319. The method of any of the preceding claims 302-315, further comprising:

providing, using the second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to select a synchronization network output of the plurality of synchronization network outputs.

320. The method of any of the preceding claims 302-316, further comprising:

providing, using a configuration memory multiplexer, a first selection setting to select the current datapath configuration instruction using an instruction index from the second instruction and instruction index memory.

321. The method of any one of the preceding claims 302-317, further comprising:

providing, using a configuration memory multiplexer, a second selection setting to select the current datapath configuration instruction using an instruction index from a master synchronization input, the second selection setting being different from the first selection setting.

322. The method of any one of the preceding claims 302-318, further comprising:

providing, using the second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to configure a portion of the configurable circuit independently of the current datapath instruction.

323. The method of any of the preceding claims 302-319, further comprising:

selecting, using a configuration memory multiplexer, a spoke instruction and datapath configuration instruction index of the plurality of spoke instructions and datapath configuration instruction indices according to a modular spoke count.

324. The method of any of the preceding claims 302-320, further comprising:

modifying, using conditional logic circuitry and in dependence upon an output from the configurable computing circuitry, the next datapath instruction or next datapath instruction index.

325. The method of any of the preceding claims 302-321, further comprising:

providing a conditional branch by modifying, using conditional logic circuitry and in dependence upon an output from the configurable computing circuitry, the next datapath instruction or next datapath instruction index.

326. The method of any of the preceding claims 302-322, further comprising:

enabling a conditional logic circuit; and

using the conditional logic circuit and in dependence upon an output from the configurable computing circuit, specifying the next datapath instruction or next datapath instruction index by OR'ing its least significant bit with the output from the configurable computing circuit, thereby providing a conditional branch.
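
By way of illustration only (not part of the claims), a minimal C sketch of this branch mechanism: if the base next-instruction index is even, OR'ing the one-bit datapath output into its least significant bit selects between two adjacent instructions, which realizes the conditional branch. The even-alignment assumption is illustrative.

    #include <stdint.h>
    #include <stdio.h>

    /* base_index is assumed even so that bit 0 is free to carry the branch. */
    static uint8_t next_index(uint8_t base_index, uint8_t alu_out_bit) {
        return (uint8_t)(base_index | (alu_out_bit & 1u));
    }

    int main(void) {
        printf("not taken -> instruction %u\n", next_index(6, 0));  /* 6 */
        printf("taken     -> instruction %u\n", next_index(6, 1));  /* 7 */
        return 0;
    }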

327. The method of any of the preceding claims 302-323, further comprising:

selecting the primary synchronization input using an input multiplexer.

328. The method of any of the preceding claims 302-324, further comprising:

selecting an output from the configurable computing circuit using an output multiplexer.

329. The method of any of the preceding claims 302-325, further comprising:

decoding, using an asynchronous fabric state machine coupled to an asynchronous network input queue and an asynchronous network output queue, input data packets received from the asynchronous packet network, and assembling output data packets for transmission over the asynchronous packet network.

330. The method of any of the preceding claims 302-326, further comprising:

providing a plurality of direct point-to-point connections coupling adjacent configurable circuits in the array of the plurality of configurable circuits using the synchronization network.

331. The method of any of the preceding claims 302-327, further comprising:

providing, using the configurable circuit, a direct path connection between a plurality of input registers and a plurality of output registers.

332. The method of any of the preceding claims 302-328, wherein the direct path connection provides a direct point-to-point connection for data received over the synchronous network from a second configurable circuit and transmitted over the synchronous network to a third configurable circuit.

333. The method of any one of the preceding claims 302-329, further comprising:

using the configurable computing circuitry, performing at least one integer or floating point operation selected from the group consisting of: signed and unsigned addition, absolute value, negation, logical negation, addition and negation, subtraction A-B, reverse subtraction B-A, signed and unsigned greater than or equal to, signed and unsigned less than or equal to, comparison for equal or not equal, logical AND operations, logical OR operations, logical XOR operations, logical NAND operations, logical NOR operations, and conversions between integer and floating point.

334. The method of any of the preceding claims 302-330, further comprising:

using the configurable computing circuitry, performing at least one integer or floating point operation selected from the group consisting of: multiplication, shifting, passing inputs, signed and unsigned multiplication, signed and unsigned right shifting, signed and unsigned left shifting, bit order reversal, permutation, conversion between integer and floating point, and combinations thereof.

335. The method of any of the preceding claims 302-331, further comprising:

using a scheduling interface circuit, receiving a work descriptor packet over the first interconnection network, and in response to the work descriptor packet, generating one or more data and control packets to the plurality of configurable circuits to configure the plurality of configurable circuits to perform selected calculations.

336. The method of any of the preceding claims 302-332, further comprising:

generating, using a flow control circuit, a stop signal when a predetermined threshold is reached in the asynchronous network output queue.

337. The method of any one of the preceding claims 302-333, wherein each asynchronous network output queue stops outputting data packets on the asynchronous packet network in response to the stop signal.

338. The method of any of the preceding claims 302-334, wherein each configurable computing circuit, in response to the stop signal, stops executing upon completion of its current instruction.

339. The method of any of the preceding claims 302-335, further comprising:

coupling a first plurality of configurable circuits in the array of a plurality of configurable circuits in a first predetermined order through the synchronization network to form a first synchronization domain; and

coupling a second plurality of configurable circuits in the array of configurable circuits in a second predetermined order through the synchronization network to form a second synchronization domain.

340. The method of any of the preceding claims 302-336, further comprising:

generating a continuation message from the first synchronous domain to the second synchronous domain for transmission over the asynchronous packet network.

341. The method of any of the preceding claims 302-337, further comprising:

generating a completion message from the second synchronous domain to the first synchronous domain for transmission over the asynchronous packet network.

342. The method of any of the preceding claims 302-338, further comprising:

storing, in the plurality of control registers, a completion table having a first data completion count.

343. The method of any one of the preceding claims 302-339, further comprising:

storing the completion table with a second iteration count in the plurality of control registers.

344. The method of any of the preceding claims 302-340, further comprising:

storing, in the plurality of control registers, a loop table having a plurality of thread identifiers and, for each thread identifier, a next thread identifier for execution after execution of a current thread.

345. The method of any of the preceding claims 302-341, further comprising:

storing an identification of a first iteration and an identification of a last iteration in the loop table in the plurality of control registers.

346. The method of any of the preceding claims 302-342, further comprising:

queuing, using the control circuitry, a thread for execution when the completion count for the thread's thread identifier is decremented to zero.

347. The method of any of the preceding claims 302-343, further comprising:

queuing, using the control circuitry, a thread for execution when the completion count for the thread's thread identifier is decremented to zero and its thread identifier is the next thread.

348. The method of any of the preceding claims 302-344, further comprising:

queuing, using the control circuitry, a thread for execution when the completion count indicates completion of any data dependencies for the thread's thread identifier.

349. The method of any of the preceding claims 302-345, wherein the completion count indicates, for each selected thread of a plurality of threads, a predetermined number of completion messages received before execution of the selected thread.

350. The method of any of the preceding claims 302-346, further comprising:

storing, in the plurality of control registers, a completion table having a plurality of types of thread identifiers, wherein each type of thread identifier indicates a loop level for loop and nested loop execution.

351. The method of any one of the preceding claims 302-347, further comprising:

storing, in the plurality of control registers, a completion table having a loop count of the number of active loop threads, and, in response to receiving an asynchronous fabric message that returns a thread identifier to a thread identifier pool, decrementing the loop count using the control circuitry and transmitting an asynchronous fabric completion message when the loop count reaches zero.

352. The method of any of the preceding claims 302-348, further comprising:

storing the top of the thread identifier stack in the plurality of control registers to allow each type of thread identifier to access the private variable for the selected loop.

353. The method of any one of the preceding claims 302-349, further comprising:

storing, using the continuation queue, one or more thread identifiers for computing threads having completion counts allowing execution but not yet having an assigned thread identifier.

354. The method of any of the preceding claims 302-350, further comprising:

storing, using the re-entry queue, one or more thread identifiers for computing threads having completion counts allowing execution and having an assigned thread identifier.

355. The method of any of the preceding claims 302-351, further comprising:

executing any threads having thread identifiers in the re-entry queue before executing any threads having thread identifiers in the continuation queue.

356. The method of any of the preceding claims 302-352, further comprising:

executing any threads having a thread identifier in a priority queue prior to executing any threads having a thread identifier in the continuation queue or the re-entry queue.

357. The method of any of the preceding claims 302-354, further comprising:

executing any thread in the run queue upon occurrence of the spoke count for its thread identifier.

358. The method of any of the preceding claims 302-354, further comprising:

self-scheduling, using the control circuitry, computing threads for execution.

359. The method of any one of the preceding claims 302-355, further comprising:

branching, using the conditional logic circuit, to a different second next instruction for execution by a next configurable circuit.

360. The method of any one of the preceding claims 302-356, further comprising:

ordering, using the control circuitry, the computing threads for execution.

361. The method of any one of the preceding claims 302-357, further comprising:

ordering, using the control circuitry, loop computing threads for execution.

362. The method of any of the preceding claims 302-358, further comprising:

commencing, using the control circuitry, execution of a computing thread in response to one or more completion signals from data dependencies.

363. A processor, comprising:

a processor core to execute the received instructions; and

core control circuitry coupled to the processor core, the core control circuitry to automatically schedule instructions for execution by the processor core in response to received work descriptor data packets.

364. A processor, comprising:

a processor core to execute the received instructions; and

core control circuitry coupled to the processor core, the core control circuitry to automatically schedule instructions for execution by the processor core in response to received event data packets.

365. A processor, comprising:

a processor core to execute a fiber create instruction; and

core control circuitry coupled to the processor core, the core control circuitry to automatically schedule the fiber create instruction for execution by the processor core and generate one or more work descriptor data packets destined for another processor or hybrid thread fabric circuitry to execute a corresponding plurality of execution threads.

366. A processor, comprising:

a processor core to execute a fiber create instruction; and

core control circuitry coupled to the processor core, the core control circuitry to schedule the fiber create instruction for execution by the processor core, reserve a predetermined amount of memory space in a thread control memory to store return arguments, and generate one or more work descriptor data packets destined for another processor or a hybrid thread fabric circuitry to execute a corresponding plurality of execution threads.

367. A processor, comprising:

a core control circuit, comprising:

an interconnection network interface;

a thread control memory coupled to the interconnect network interface;

an execution queue coupled to the thread control memory;

control logic and thread selection circuitry coupled to the execution queue and the thread control memory; and

an instruction cache coupled to the control logic and thread selection circuitry; and

a processor core coupled to the instruction cache of the core control circuitry.

368. A processor, comprising:

a core control circuit, comprising:

an interconnection network interface;

a thread control memory coupled to the interconnect network interface;

a network response memory;

an execution queue coupled to the thread control memory;

control logic and thread selection circuitry coupled to the execution queue and the thread control memory;

an instruction cache coupled to the control logic and thread selection circuitry; and

a command queue; and

a processor core coupled to the instruction cache and the command queue of the core control circuitry.

369. A processor comprising a processor core and core control circuitry coupled to the processor core, the core control circuitry comprising:

an interconnection network interface coupleable to an interconnection network to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments;

an execution queue coupled to the thread control memory; and

control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, and periodically select the thread identifier to execute the execution thread.
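
By way of illustration only (not part of the claims), a minimal C sketch of this decode-assign-queue flow; the packet fields, sizes, and names are illustrative assumptions rather than the claimed wire format.

    #include <stdint.h>
    #include <stdio.h>

    #define MAX_TIDS 16

    typedef struct { uint32_t pc; uint32_t arg[4]; int nargs; } thread_ctx_t;

    static thread_ctx_t thread_mem[MAX_TIDS];       /* thread control memory */
    static uint8_t  tid_pool[MAX_TIDS];
    static int      pool_top;
    static uint8_t  exec_queue[MAX_TIDS];
    static unsigned eq_head, eq_tail;

    /* Decode a work descriptor into a thread, assign a TID, and queue it. */
    static void on_work_descriptor(uint32_t pc, const uint32_t *args, int nargs) {
        uint8_t tid = tid_pool[--pool_top];         /* available TID */
        thread_mem[tid].pc = pc;                    /* initial program count */
        thread_mem[tid].nargs = nargs;
        for (int i = 0; i < nargs; i++)
            thread_mem[tid].arg[i] = args[i];
        exec_queue[eq_tail++ % MAX_TIDS] = tid;     /* automatic queueing */
    }

    int main(void) {
        for (int i = 0; i < MAX_TIDS; i++) tid_pool[pool_top++] = (uint8_t)i;
        uint32_t args[2] = { 7, 9 };
        on_work_descriptor(0x1000, args, 2);
        uint8_t tid = exec_queue[eq_head++ % MAX_TIDS];  /* periodic selection */
        printf("run TID %u at pc=0x%X\n", tid, (unsigned)thread_mem[tid].pc);
        return 0;
    }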

370. A processor comprising a processor core and core control circuitry coupled to the processor core, the core control circuitry comprising:

an interconnection network interface coupleable to an interconnection network to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments;

an execution queue coupled to the thread control memory; and

control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, and periodically select the thread identifier for execution of instructions of the execution thread by a processor core.

371. A processor comprising a processor core and core control circuitry coupled to the processor core, the core control circuitry comprising:

an execution queue coupled to the thread control memory; and

control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, and periodically select the thread identifier for execution of instructions of the execution thread by the processor core.

372. A processor comprising a processor core and core control circuitry coupled to the processor core, the core control circuitry comprising:

a thread control memory comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument;

an execution queue coupled to the thread control memory; and

control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, and periodically select the thread identifier for execution by the processor core of instructions of the execution thread, the processor core using data stored in the data cache or general purpose register.

373. A processor comprising a processor core and core control circuitry coupled to the processor core, the core control circuitry comprising:

a thread control memory comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a program count register to store a received program count, and a thread status register to store an active status or a suspended status for each thread identifier of the plurality of thread identifiers;

an execution queue coupled to the thread control memory; and

control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue when the thread identifier has the active status, and periodically select the thread identifier for execution by the processor core of instructions of the execution thread for a duration in which the active status remains unchanged until the execution thread is completed.

374. A processor comprising a processor core and core control circuitry coupled to the processor core, the core control circuitry comprising:

a thread control memory comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a program count register to store a received program count, and a thread status register to store an active status or a suspended status for each thread identifier of the plurality of thread identifiers;

an execution queue coupled to the thread control memory; and

control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue when the thread identifier has the active status, periodically select the thread identifier for the processor core to execute instructions of the execution thread for a duration in which the active status remains unchanged, and suspend thread execution by not returning the thread identifier to the execution queue when the thread identifier has the suspended status.
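
By way of illustration only (not part of the claims), a minimal C sketch of the re-queue rule above: after a thread's instruction slot, its identifier returns to the execution queue only while its status is active; a suspended identifier is simply not re-queued, which pauses the thread until it is resumed.

    #include <stdint.h>
    #include <stdio.h>

    enum status { ACTIVE, SUSPENDED };

    #define MAX_TIDS 4
    static enum status thread_status[MAX_TIDS];

    /* After a TID executes one instruction slot: re-queue only if active. */
    static int requeue(uint8_t tid) {
        return thread_status[tid] == ACTIVE;
    }

    int main(void) {
        thread_status[0] = ACTIVE;
        thread_status[1] = SUSPENDED;               /* e.g. awaiting a memory load */
        printf("tid0 requeued=%d\n", requeue(0));   /* 1 */
        printf("tid1 requeued=%d\n", requeue(1));   /* 0: thread paused */
        thread_status[1] = ACTIVE;                  /* memory response resumes it */
        printf("tid1 requeued=%d\n", requeue(1));   /* 1 */
        return 0;
    }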

375. A processor comprising a processor core and core control circuitry coupled to the processor core, the core control circuitry comprising:

a thread control memory comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument;

an execution queue coupled to the thread control memory; and

control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, and periodically select the thread identifier for execution of instructions of the execution thread by the processor core.

376. A processor, comprising:

a processor core to execute a plurality of instructions; and

core control circuitry coupled to the processor core, the core control circuitry comprising:

an interconnection network interface coupleable to an interconnection network to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments;

a thread control memory coupled to the interconnect network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument;

an execution queue coupled to the thread control memory;

control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; and

an instruction cache coupled to the processor core and the control logic and thread selection circuitry to receive the initial program count and provide a corresponding one of the plurality of instructions to the processor core for execution.

377. A processor, comprising:

a core control circuit, comprising:

an interconnection network interface coupleable to an interconnection network to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments;

a thread control memory coupled to the interconnect network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument;

an execution queue coupled to the thread control memory;

control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, periodically select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread;

an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution; and

a processor core coupled to the instruction cache of the core control circuitry, the processor core to execute the corresponding instruction.

378. A processor, comprising:

a core control circuit, comprising:

an interconnection network interface coupleable to an interconnection network to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments;

a thread control memory coupled to the interconnect network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, and a general purpose register to store a received argument;

an execution queue coupled to the thread control memory;

control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread;

an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution; and

a command queue; and

a processor core coupled to the instruction cache and the command queue of the core control circuitry, the processor core to execute the corresponding instruction.

379. A processor, comprising:

a core control circuit, comprising:

an interconnection network interface coupleable to an interconnection network to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments;

a thread control memory coupled to the interconnect network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, and a general purpose register to store a received argument;

an execution queue coupled to the thread control memory;

control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; and

an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution; and

a processor core coupled to the instruction cache of the core control circuitry, the processor core to execute the corresponding instruction.

380. A processor, comprising:

a core control circuit, comprising:

an interconnection network interface coupleable to an interconnection network to receive a call work descriptor packet, decode the received work descriptor packet into an execution thread having an initial program count and any received arguments, and encode work descriptor packets for transmission to other processing elements;

a thread control memory coupled to the interconnect network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, and a general purpose register to store a received argument;

an execution queue coupled to the thread control memory;

a network response memory coupled to the interconnect network interface;

control logic and thread selection circuitry coupled to the execution queue, the thread control memory, and the instruction cache, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread;

an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution; and

a command queue storing one or more commands to generate one or more work descriptor packets; and

a processor core coupled to the instruction cache and the command queue of the core control circuitry, the processor core to execute the corresponding instruction.

381. The processor of any one of the preceding claims 360-377, wherein the core control circuitry comprises:

an interconnection network interface coupleable to an interconnection network, the interconnection network interface to receive a work descriptor packet, decode the received work descriptor packet into an execution thread having an initial program count and any received arguments.

382. The processor of any of the preceding claims 360 to 378, wherein the interconnection network interface is further to receive an event data packet, decode the received event data packet into an event identifier and any received arguments.

383. The processor of any of the preceding claims 360-379, wherein the core control circuitry further comprises:

control logic and thread selection circuitry coupled to the interconnect network interface, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread.

384. The processor of any one of the preceding claims 360-380, wherein the core control circuitry further comprises:

a thread control memory having a plurality of registers, the plurality of registers comprising:

a thread identifier pool register to store a plurality of thread identifiers.

385. The processor of any one of preceding claims 360-381, wherein the thread control memory further comprises:

a thread status register.

386. The processor of any one of the preceding claims 360-382, wherein the thread control memory further comprises:

a program count register to store the received initial program count.

387. The processor of any of preceding claims 360-383, wherein the thread control memory further comprises:

a general purpose register to store the received argument.

388. The processor of any one of the preceding claims 360-384, wherein the thread control memory further comprises:

a pending fiber return count register.

389. The processor of any of the preceding claims 360-385, wherein the thread control memory further comprises:

a return argument buffer or register.

390. The processor of any one of the preceding claims 360-386, wherein the thread control memory further comprises:

a return argument linked list register.

391. The processor of any of the preceding claims 360-387, wherein the thread control memory further comprises:

a custom atomic transaction identifier register.

392. The processor of any one of the preceding claims 360-388, wherein the thread control memory further comprises:

a data cache.

393. The processor of any of preceding claims 360-389, wherein the interconnection network interface is further to store the execution thread having the initial program count and any received arguments in the thread control memory using a thread identifier as an index to the thread control memory.

394. The processor of any one of the preceding claims 360-390, wherein the core control circuitry further comprises:

control logic and thread selection circuitry coupled to the thread control memory and the interconnect network interface, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread.

395. The processor of any one of the preceding claims 360-391, wherein the core control circuitry further comprises:

an execution queue coupled to the thread control memory, the execution queue storing one or more thread identifiers.

396. The processor of any of the preceding claims 360-392, wherein the core control circuitry further comprises:

control logic and thread selection circuitry coupled to the execution queue, the interconnect network interface, and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, and access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread.

397. The processor of any one of the preceding claims 360-393, wherein the core control circuit further comprises:

an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide a corresponding instruction for execution.

398. The processor of any one of the preceding claims 360-394, wherein the processor further comprises:

a processor core coupled to the instruction cache of the core control circuitry, the processor core to execute the corresponding instruction.

399. The processor of any of the preceding claims 360-395, wherein the core control circuitry is further to assign an initial valid state to the execution thread.

400. The processor of any one of the preceding claims 360-396, wherein the core control circuitry is further to assign a suspended state to the execution thread in response to the processor core executing a memory load instruction.

401. The processor of any one of the preceding claims 360-397, wherein the core control circuitry is further to assign a suspended state to the execution thread in response to the processor core executing a memory store instruction.

402. The processor of any one of the preceding claims 360-398, wherein the core control circuitry is further to end execution of a selected thread in response to the processor core executing a return instruction.

403. The processor of any one of the preceding claims 360-399, wherein the core control circuitry is further to return a corresponding thread identifier for the selected thread to the thread identifier pool register in response to the processor core executing a return instruction.

404. The processor of any one of the preceding claims 360-400, wherein the core control circuitry is further to clear the register of the thread control memory indexed by the corresponding thread identifier of the selected thread in response to the processor core executing a return instruction.

405. The processor of any one of the preceding claims 360-401, wherein the interconnection network interface is further to generate a return work descriptor packet in response to the processor core executing a return instruction.

406. The processor of any of the preceding claims 360-402, wherein the core control circuitry further comprises:

a network response memory.

407. The processor according to any of the preceding claims 360-403, wherein the network response memory comprises:

a memory request register.

408. The processor of any of preceding claims 360-404, wherein the network response memory comprises:

a thread identifier and a transaction identifier register.

409. The processor of any of the preceding claims 360-405, wherein the network response memory comprises:

a request cache line index register.

410. The processor of any one of the preceding claims 360-406, wherein the network response memory comprises:

a byte register.

411. The processor of any one of the preceding claims 360-407, wherein the network response memory comprises:

a general purpose register index and type register.

412. The processor of any one of the preceding claims 360-408, wherein the thread control memory further comprises:

an event status register.

413. The processor of any one of the preceding claims 360-409, wherein the thread control memory further comprises:

an event mask register.

414. The processor of any one of the preceding claims 360 to 410, wherein the interconnection network interface is to generate a point-to-point event data message.

415. The processor of any one of the preceding claims 360 to 411, wherein the interconnection network interface is to generate a broadcast event data message.

416. The processor of any of the preceding claims 360-412, wherein the core control circuitry is further to respond to a received event data packet according to an event mask stored in the event mask register.

417. The processor of any of the preceding claims 360-413, wherein the core control circuitry is further to determine an event number corresponding to a received event data packet.

418. The processor of any of the preceding claims 360-414 wherein the core control circuitry is further to change the state of a thread identifier from suspended to active to resume execution of a corresponding thread of execution in response to a received event data packet.

419. The processor of any of the preceding claims 360-415, wherein the core control circuitry is further to change a state of a thread identifier from suspended to active to resume execution of a corresponding thread of execution in response to an event number of a received event data packet.

420. The processor of any one of the preceding claims 360-417, wherein the control logic and thread selection circuitry are further to successively select a next thread identifier from the execution queue to execute a single instruction of a corresponding execution thread.

421. The processor of any one of the preceding claims 360 to 418, wherein the control logic and thread selection circuitry are further to perform a round-robin selection of a next thread identifier from the execution queue among the plurality of thread identifiers, each for executing a single instruction of a corresponding execution thread.

422. The processor of any one of the preceding claims 360 to 419, wherein the control logic and thread selection circuitry are further to perform a round-robin selection of a next thread identifier from the execution queue among the plurality of thread identifiers, each for executing a single instruction of a corresponding execution thread until the execution thread is completed.

423. The processor of any one of preceding claims 360-420, wherein the control logic and thread selection circuitry are further to perform barrel selection of a next thread identifier from the execution queue among the plurality of thread identifiers, each for executing a single instruction of a corresponding execution thread.
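
By way of illustration only (not part of the claims), a minimal C sketch of barrel (round-robin) selection: each slot pops the next identifier, executes a single instruction of that thread, and re-queues the identifier while the thread remains active, interleaving all threads. Queue size is an illustrative assumption.

    #include <stdint.h>
    #include <stdio.h>

    #define NQ 8
    static uint8_t  q[NQ];
    static unsigned head, tail;

    static void    push(uint8_t tid) { q[tail++ % NQ] = tid; }
    static uint8_t pop(void)         { return q[head++ % NQ]; }

    int main(void) {
        push(1); push(2); push(3);
        for (int slot = 0; slot < 6; slot++) {
            uint8_t tid = pop();           /* barrel order: 1,2,3,1,2,3 */
            printf("slot %d: TID %u runs one instruction\n", slot, tid);
            push(tid);                     /* still active: re-queue */
        }
        return 0;
    }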

424. The processor of any one of the preceding claims 360-421, wherein the control logic and thread selection circuitry are further to assign an active state or a suspended state to a thread identifier.

425. The processor of any one of the preceding claims 360-422, wherein the control logic and thread selection circuitry are further to assign a priority status to a thread identifier.

426. The processor of any one of the preceding claims 360-423, wherein the control logic and thread selection circuitry are further to return the corresponding thread identifier having an assigned valid state and an assigned priority to the execution queue after execution of a corresponding instruction.

427. The processor of any one of the preceding claims 360-425, wherein the core control circuitry further comprises:

a network command queue coupled to the processor core.

428. The processor of any of the preceding claims 360-426, wherein the interconnection network interface comprises:

an input queue;

a packet decoder circuit coupled to the input queue, the control logic and thread selection circuit, and the thread control memory;

an output queue; and

a packet encoder circuit coupled to the output queue, the network response memory, and the network command queue.

429. The processor of any one of the preceding claims 360-427, wherein the execution queue further comprises:

a first priority queue; and

a second priority queue.

430. The processor of any one of the preceding claims 360-428, wherein the control logic and thread selection circuitry further comprises:

thread selection control circuitry coupled to the execution queue, the thread selection control circuitry to select a thread identifier from the first priority queue at a first frequency and to select a thread identifier from the second priority queue at a second frequency, the second frequency being lower than the first frequency.

431. The processor of any of the preceding claims 360-429, wherein the thread selection control circuitry is to determine the second frequency as a skip count from selections of a thread identifier from the first priority queue.
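
By way of illustration only (not part of the claims), a minimal C sketch of one plausible reading of the skip count, with an illustrative value: the scheduler draws from the first priority queue by default, and only every skip-count-th selection draws from the second, lower-frequency queue.

    #include <stdint.h>
    #include <stdio.h>

    #define SKIP_COUNT 3u   /* illustrative: one low-priority pick per three */

    /* Returns 1 to select the first (high) priority queue, 0 for the second. */
    static int pick_first_queue(uint32_t selection_no) {
        return (selection_no % SKIP_COUNT) != (SKIP_COUNT - 1u);
    }

    int main(void) {
        for (uint32_t n = 0; n < 6; n++)
            printf("selection %u -> %s priority queue\n", (unsigned)n,
                   pick_first_queue(n) ? "first" : "second");
        return 0;
    }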

432. The processor of any one of the preceding claims 360-430, wherein the core control circuitry further comprises:

data path control circuitry to control access size through the first interconnection network.

433. The processor of any one of the preceding claims 360-431, wherein the core control circuit further comprises:

data path control circuitry to increase or decrease a memory load access size in response to a time-averaged usage level.

434. The processor of any one of the preceding claims 360-432, wherein the core control circuitry further comprises:

data path control circuitry to increase or decrease a memory store access size in response to a time-averaged usage level.

435. The processor of any one of the preceding claims 360-433, wherein the control logic and thread selection circuitry are further to increase a size of a memory load access request to correspond to a cache line boundary of the data cache.

436. The processor of any one of the preceding claims 360-434, wherein the core control circuitry further comprises:

system call circuitry to generate one or more system calls to a host processor.

437. The processor of any of the preceding claims 360-435, wherein the system call circuitry further comprises:

a plurality of system call credit registers storing a predetermined credit count to modulate the number of system calls in any predetermined time period.
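
By way of illustration only (not part of the claims), a minimal C sketch of credit-based modulation, with an illustrative credit count: each system call consumes a credit, a call with no credit available must wait, and the credits are replenished each period.

    #include <stdio.h>

    #define CREDITS_PER_PERIOD 4    /* illustrative predetermined credit count */

    static int credits = CREDITS_PER_PERIOD;

    static int try_system_call(void) {
        if (credits == 0) return 0;  /* modulated: must wait for the next period */
        credits--;
        return 1;                    /* call is issued to the host processor */
    }

    static void new_period(void) { credits = CREDITS_PER_PERIOD; }

    int main(void) {
        int issued = 0;
        for (int i = 0; i < 6; i++) issued += try_system_call();
        printf("issued %d of 6 calls this period\n", issued);   /* 4 */
        new_period();
        printf("after refill, next call issued=%d\n", try_system_call());
        return 0;
    }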

438. The processor of any of the preceding claims 360-436, wherein the core control circuitry is further to generate a command to cause the command queue of the interconnect network interface to copy and transmit all data corresponding to a selected thread identifier from the thread control memory to monitor thread status in response to a request from a host processor.

439. The processor of any one of the preceding claims 360-437, wherein the processor core is to execute a fiber create instruction to generate one or more commands that cause the command queue of the interconnect network interface to generate one or more call work descriptor packets destined for another processor core or a hybrid thread fabric circuit.

440. The processor of any one of the preceding claims 360-438, wherein in response to the processor core executing a fiber create instruction, the core control circuitry is to reserve a predetermined amount of memory space in the general purpose register or a return argument register.

441. The processor of any one of the preceding claims 360-439, wherein in response to generating one or more call work descriptor packets destined for another processor core or a hybrid thread fabric, the core control circuitry is to store a thread return count in the thread return register.

442. The processor of any one of the preceding claims 360-440, wherein in response to receiving a return data packet, the core control circuitry is to decrement the thread return count stored in the thread return register.

443. The processor of any one of the preceding claims 360 to 441, wherein in response to the thread return count in the thread return register decrementing to zero, the core control circuitry is to change the suspended state of the corresponding thread identifier to an active state for subsequent execution of a thread return instruction to complete a created fiber or thread.
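
By way of illustration only (not part of the claims), a minimal C sketch of the join bookkeeping in claims 441 to 443, with illustrative state: fiber creation stores the expected return count and parks the parent; each return packet decrements the count, and the last one flips the parent back to the active state so the join can complete.

    #include <stdio.h>

    static int thread_return_count;                 /* thread return register */
    static const char *parent_state = "active";

    static void fiber_create(int n_fibers) {
        thread_return_count = n_fibers;             /* store the return count */
        parent_state = "suspended";                 /* parent waits on the join */
    }

    static void on_return_packet(void) {
        if (--thread_return_count == 0)
            parent_state = "active";                /* parent resumes to join */
    }

    int main(void) {
        fiber_create(2);
        on_return_packet();
        printf("after 1 return: %s\n", parent_state);   /* suspended */
        on_return_packet();
        printf("after 2 returns: %s\n", parent_state);  /* active */
        return 0;
    }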

444. The processor of any one of the preceding claims 360-442, wherein the processor core is to execute a wait or a non-wait fiber join instruction.

445. The processor of any one of the preceding claims 360-443, wherein the processor core is to execute a fiber join instruction.

446. The processor of any one of the preceding claims 360-444, wherein the processor core is to execute a non-cache read or load instruction to specify a general purpose register to store data received from memory.

447. The processor of any one of the preceding claims 360-445, wherein the processor core is to execute a non-cache write or store instruction to specify data in a general purpose register for storage in memory.

448. The processor of any one of preceding claims 360 to 446, wherein the core control circuitry is to allocate a transaction identifier to any load or store request to memory and to correlate the transaction identifier with a thread identifier.

449. The processor of any one of the preceding claims 360-447, wherein the processor core is to execute a first thread priority instruction to assign a first priority to an execution thread having a corresponding thread identifier.

450. The processor of any one of the preceding claims 360-448, wherein the processor core is to execute a second thread priority instruction to assign a second priority to an execution thread having a corresponding thread identifier.

451. The processor of any one of the preceding claims 360-449, wherein the processor core is to execute a custom atomic return instruction to complete a thread of execution of a custom atomic operation.

452. The processor of any one of the preceding claims 360-450, wherein the processor core, in conjunction with a memory controller, is to perform floating point atomic memory operations.

453. The processor of any one of the preceding claims 360-451, wherein the processor core, in conjunction with a memory controller, is to perform custom atomic memory operations.

454. A method of self-scheduling execution of instructions, comprising:

receiving a work descriptor data packet; and

automatically scheduling the instructions for execution in response to the received work descriptor data packet.

455. A method of self-scheduling execution of instructions, comprising:

receiving an event data packet; and

automatically scheduling the instructions for execution in response to the received event data packet.

456. A method of causing a first processing element to generate a plurality of threads of execution for execution by a second processing element, comprising:

executing a fiber create instruction; and

in response to executing the fiber create instruction, generating one or more work descriptor data packets destined for the second processing element to execute the plurality of execution threads.

457. A method of causing a first processing element to generate a plurality of threads of execution for execution by a second processing element, comprising:

executing a fiber create instruction; and

in response to executing the fiber create instruction, reserving a predetermined amount of memory space in a thread control memory to store return arguments, and generating one or more work descriptor data packets destined for the second processing element to execute the plurality of execution threads.

458. A method of self-scheduling execution of instructions, comprising:

receiving a work descriptor data packet;

decoding the received work descriptor data packet into an execution thread having an initial program count and any received arguments;

assigning an available thread identifier to the execution thread;

automatically queuing the thread identifier for execution of the execution thread; and

periodically selecting the thread identifier to execute the execution thread.

459. A method of self-scheduling execution of instructions, comprising:

receiving a work descriptor data packet;

decoding the received work descriptor data packet into an execution thread having an initial program count and any received arguments;

assigning an available thread identifier to the execution thread;

automatically queuing the thread identifier for execution of the execution thread when the thread identifier has a valid state; and

periodically selecting the thread identifier to execute instructions of the execution thread for a duration in which the valid state remains unchanged until the execution thread is completed.

460. A method of self-scheduling execution of instructions, comprising:

receiving a work descriptor data packet;

decoding the received work descriptor data packet into an execution thread having an initial program count and any received arguments;

assigning an available thread identifier to the execution thread;

automatically queuing the thread identifier in an execution queue for execution of the execution thread when the thread identifier has a valid state; and

periodically selecting the thread identifier to execute instructions of the execution thread for a duration in which the valid state remains unchanged; and

when the thread identifier has a suspended state, suspending thread execution by not returning the thread identifier to the execution queue.

461. A method of self-scheduling execution of instructions, comprising:

receiving a work descriptor data packet;

decoding the received work descriptor data packet into an execution thread having an initial program count and any received arguments;

storing the initial program count and any received arguments in a thread control memory;

assigning an available thread identifier to the execution thread;

automatically queuing the thread identifier for execution of the execution thread when the thread identifier has a valid state;

accessing the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; and

periodically selecting the thread identifier to execute instructions of the execution thread for a duration in which the valid state remains unchanged until the execution thread is completed.
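
By way of illustration only, the self-scheduling flow recited in claims 458 through 461 can be sketched in C as follows. The structure names, the pool size, and the execute_one_instruction hook are assumptions chosen for readability, not the claimed hardware.

```c
#include <stdint.h>
#include <string.h>

#define NUM_TIDS 32            /* assumed thread identifier pool size      */
#define NUM_ARGS 4             /* assumed maximum arguments per descriptor */

enum state { INVALID, VALID, PAUSED };

struct thread_ctx {            /* one thread control memory entry */
    uint32_t pc;               /* program count register          */
    uint64_t args[NUM_ARGS];   /* general purpose registers       */
    enum state st;             /* thread state register           */
};

static struct thread_ctx tcm[NUM_TIDS];   /* thread control memory  */
static uint8_t exec_queue[NUM_TIDS];      /* execution queue (ring) */
static unsigned q_head, q_tail;

extern void execute_one_instruction(int tid, uint32_t *pc);  /* assumed core hook */

/* Decode a received work descriptor packet into an execution thread,
 * assign an available thread identifier, and queue it for execution. */
int dispatch_work_descriptor(uint32_t initial_pc, const uint64_t *args, unsigned nargs)
{
    for (int tid = 0; tid < NUM_TIDS; tid++) {
        if (tcm[tid].st == INVALID) {        /* available identifier */
            tcm[tid].pc = initial_pc;
            memcpy(tcm[tid].args, args, nargs * sizeof args[0]);
            tcm[tid].st = VALID;
            exec_queue[q_tail++ % NUM_TIDS] = (uint8_t)tid;
            return tid;
        }
    }
    return -1;                               /* identifier pool exhausted */
}

/* Periodically select the next thread identifier and execute one
 * instruction of its thread while its state remains valid. */
void schedule_one(void)
{
    if (q_head == q_tail)
        return;                              /* nothing queued */
    uint8_t tid = exec_queue[q_head++ % NUM_TIDS];
    if (tcm[tid].st != VALID)
        return;                              /* paused: not re-queued */
    execute_one_instruction(tid, &tcm[tid].pc);
    if (tcm[tid].st == VALID)                /* still runnable */
        exec_queue[q_tail++ % NUM_TIDS] = tid;
}
```

Because a thread identifier is re-queued only while its state remains valid, a paused thread simply drops out of the rotation until an event or completion returns it to the active state.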

462. The method of any of the preceding claims 453-460, further comprising:

receiving an event data packet; and

decoding the received event data packet into an event identifier and any received arguments.

463. The method of any of the preceding claims 453-461, further comprising:

assigning an initial valid state to the execution thread.

464. The method of any of the preceding claims 453-462, further comprising:

assigning a suspended state to the execution thread in response to executing a memory load instruction.

465. The method of any of the preceding claims 453-463, further comprising:

assigning a suspended state to the execution thread in response to executing a memory store instruction.

466. The method of any of the preceding claims 453-464, further comprising:

terminating execution of the selected thread in response to executing a return instruction.

467. The method of any of the preceding claims 453-465, further comprising:

in response to executing a return instruction, returning a corresponding thread identifier for the selected thread to the thread identifier pool.

468. The method of any of the preceding claims 453-466, further comprising:

in response to executing a return instruction, clearing the registers of the thread control memory indexed by the corresponding thread identifier of the selected thread.

469. The method of any of the preceding claims 453-467, further comprising:

generating a return work descriptor data packet in response to executing a return instruction.

470. The method of any of the preceding claims 453-468, further comprising:

generating a point-to-point event data message.

471. The method of any of the preceding claims 453-469, further comprising:

generating a broadcast event data message.

472. The method of any of the preceding claims 453-470, further comprising:

responding to the received event data packet using an event mask.

473. The method of any of the preceding claims 453-471, further comprising:

determining an event number corresponding to the received event data packet.

474. The method of any of the preceding claims 453-472, further comprising:

in response to the received event data packet, changing the state of a thread identifier from suspended to active to resume execution of the corresponding execution thread.

475. The method of any of the preceding claims 453-473, further comprising:

in response to an event number of the received event data packet, changing the state of a thread identifier from suspended to active to resume execution of the corresponding execution thread.

476. The method of any of the preceding claims 453-474, further comprising:

successively selecting a next thread identifier from the execution queue to execute a single instruction of a corresponding execution thread.

477. The method of any of the preceding claims 453-475, further comprising:

performing a round-robin selection of a next thread identifier from among the plurality of thread identifiers in the execution queue, each for executing a single instruction of a corresponding execution thread.

478. The method of any of the preceding claims 453-476, further comprising:

performing a round-robin selection of a next thread identifier from among the plurality of thread identifiers in the execution queue, each for executing a single instruction of a corresponding execution thread, until the execution thread is completed.

479. The method of any of the preceding claims 453-477, further comprising:

performing a barrel selection of a next thread identifier from among the plurality of thread identifiers in the execution queue, each for executing a single instruction of a corresponding execution thread.

480. The method of any of the preceding claims 453-478, further comprising:

assigning a valid state or a suspended state to the thread identifier.

481. The method of any of the preceding claims 453-479, further comprising:

assigning a priority status to the thread identifier.

482. The method of any of the preceding claims 453-480, further comprising:

after executing a corresponding instruction, returning the corresponding thread identifier with an assigned valid state and an assigned priority to the execution queue.

483. The method of any of the preceding claims 453-481, further comprising:

selecting thread identifiers from a first priority queue at a first frequency and selecting thread identifiers from a second priority queue at a second frequency, the second frequency being lower than the first frequency.

484. The method of any of the preceding claims 453-482, further comprising:

determining the second frequency as a skip count of selections of a thread identifier from the first priority queue.
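
A minimal sketch of the two-frequency selection of claims 483 and 484, assuming the lower frequency is derived as a skip count over the first priority queue; the queue layout and the SKIP_COUNT value are illustrative assumptions.

```c
#include <stdbool.h>

#define SKIP_COUNT 8   /* assumed: low queue serviced once per 8 high-priority picks */

struct tid_queue { int items[32]; unsigned head, tail; };

static bool q_empty(const struct tid_queue *q) { return q->head == q->tail; }
static int  q_pop(struct tid_queue *q)         { return q->items[q->head++ % 32]; }

static struct tid_queue high_q, low_q;
static unsigned high_picks;

/* Pick the next thread identifier: the first priority queue is served at
 * a first frequency, the second queue at a lower, skip-count-derived one. */
int select_next_tid(void)
{
    if (!q_empty(&high_q) && high_picks < SKIP_COUNT) {
        high_picks++;
        return q_pop(&high_q);
    }
    high_picks = 0;
    if (!q_empty(&low_q))
        return q_pop(&low_q);
    if (!q_empty(&high_q)) {
        high_picks++;
        return q_pop(&high_q);
    }
    return -1;   /* both queues empty */
}
```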

485. The method of any of the preceding claims 453-483, further comprising:

controlling the data path access size.

486. The method of any of the preceding claims 453-484, further comprising:

increasing or decreasing the memory load access size in response to a time-averaged usage level.

487. The method of any of the preceding claims 453-485, further comprising:

increasing or decreasing the memory store access size in response to a time-averaged usage level.

488. The method of any of the preceding claims 453-486, further comprising:

increasing the size of a memory load access request to correspond to a cache line boundary of the data cache.
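
For the cache line sizing of claim 488, a small sketch shows the usual outward rounding of a load request to line boundaries; the 64-byte line size is an assumption, not a value given here.

```c
#include <stdint.h>

#define CACHE_LINE 64u   /* assumed data cache line size in bytes */

/* Grow a memory load access request so that it starts and ends on cache
 * line boundaries; the extra bytes fill the line for later reuse. */
void round_to_cache_line(uint64_t addr, uint32_t size,
                         uint64_t *out_addr, uint32_t *out_size)
{
    uint64_t start = addr & ~(uint64_t)(CACHE_LINE - 1);
    uint64_t end   = (addr + size + CACHE_LINE - 1) & ~(uint64_t)(CACHE_LINE - 1);
    *out_addr = start;
    *out_size = (uint32_t)(end - start);
}
```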

489. The method of any of the preceding claims 453-487, further comprising:

generating one or more system calls to a host processor.

490. The method of any of the preceding claims 453-488, further comprising:

modulating the number of system calls within any predetermined time period using a predetermined credit count.
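
The credit-count modulation of claim 490 might be sketched as below; the credit total, the refill cadence, and the send_syscall_packet hook are assumptions for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

#define SYSCALL_CREDITS 16   /* assumed credits per predetermined time period */

extern void send_syscall_packet(void);   /* assumed transport hook */

static int32_t credits = SYSCALL_CREDITS;

/* Spend one credit per generated host system call; defer the call when
 * the credit count for the current period is exhausted. */
bool try_issue_system_call(void)
{
    if (credits <= 0)
        return false;        /* caller re-queues the request for later */
    credits--;
    send_syscall_packet();
    return true;
}

/* Invoked once per predetermined time period to replenish the credits. */
void refill_credits(void)
{
    credits = SYSCALL_CREDITS;
}
```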

491. The method of any of the preceding claims 453-489, further comprising:

copying and transferring all data from the thread control memory corresponding to a selected thread identifier in response to a request from the host processor to monitor thread status.

492. The method of any of the preceding claims 453-490, further comprising:

executing a fiber create instruction to generate one or more commands that generate one or more call work descriptor packets destined for another processor core or a hybrid-threaded fabric circuit.

493. The method of any one of the preceding claims 453-491, further comprising:

in response to executing the fiber create instruction, reserving a predetermined amount of memory space to store any return arguments.

494. The method of any of the preceding claims 453-492, further comprising:

storing a thread return count in the thread return register in response to generating one or more call work descriptor packets.

495. The method of any of preceding claims 453-493, wherein the thread return count stored in the thread return register is decremented in response to receipt of a return data packet.

496. The method of any of preceding claims 453-494, wherein in response to the thread return count in the thread return register decrementing to zero, the suspended state of the corresponding thread identifier is changed to an active state for subsequent execution of a thread return instruction to complete the created fiber or thread.
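
Claims 494 through 496 together describe a join mechanism; a hedged C sketch follows, with set_state standing in for the unspecified scheduler interface and the table width assumed.

```c
#include <stdint.h>

extern void set_state(int tid, int active);   /* assumed scheduler hook */

static uint32_t thread_return_count[32];      /* thread return registers */

/* Store the thread return count when the call work descriptor packets
 * are generated, and suspend the creating thread until the joins finish. */
void on_fibers_created(int tid, uint32_t count)
{
    thread_return_count[tid] = count;
    set_state(tid, 0);                        /* suspended */
}

/* Each return data packet decrements the count; at zero, the thread is
 * reactivated so it can execute the thread return instruction. */
void on_return_packet(int tid)
{
    if (thread_return_count[tid] > 0 && --thread_return_count[tid] == 0)
        set_state(tid, 1);                    /* active again */
}
```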

497. The method of any of the preceding claims 453-495, further comprising:

executing a wait or non-wait fiber join instruction.

498. The method of any of the preceding claims 453-496, further comprising:

executing a fiber join all instruction.

499. The method of any of the preceding claims 453-497, further comprising:

executing a non-cache read or load instruction to specify a general purpose register for storing data received from memory.

500. The method of any of the preceding claims 453-498, further comprising:

executing a non-cache write or store instruction to specify data in a general purpose register for storage in memory.

501. The method of any one of the preceding claims 453-499, further comprising:

assigning a transaction identifier to any load or store request to memory and correlating the transaction identifier with the thread identifier.

502. The method of any of the preceding claims 453-500, further comprising:

executing a first thread priority instruction to assign a first priority to an execution thread having a corresponding thread identifier.

503. The method of any of the preceding claims 453-501, further comprising:

executing a second thread priority instruction to assign a second priority to an execution thread having a corresponding thread identifier.

504. The method of any of the preceding claims 453-502, further comprising:

executing a custom atomic return instruction to complete an execution thread of a custom atomic operation.

505. The method of any of the preceding claims 453-503, further comprising:

performing a floating point atomic memory operation.

506. The method of any of the preceding claims 453-504, further comprising:

performing a custom atomic memory operation.

Technical Field

The present invention relates generally to configurable computing circuitry, and more particularly to a heterogeneous computing system including a self-scheduling processor and configurable computing circuitry with embedded interconnect networks, which can be dynamically reconfigured and dynamically controlled for power or energy consumption.

Background

Many existing computing systems have reached significant limits on computing processing power, in terms of computing speed, energy (or power) consumption, and associated heat dissipation. For example, as the demand for advanced computing technologies continues to grow, such as to accommodate artificial intelligence and other important computing applications, existing computing solutions are increasingly inadequate.

Therefore, there is a current need for a computing architecture that can provide a high-performance and energy-efficient solution for compute-intensive kernels, for example, to compute Fast Fourier Transforms (FFTs) and Finite Impulse Response (FIR) filters for sensing, communication, and analysis applications, such as synthetic aperture radar and 5G base stations, and for graph analysis applications, such as, but not limited to, graph clustering using spectral techniques, machine learning, 5G networking algorithms, and large stencil codes.

In addition, there is a need for a configurable computing architecture that can be configured for any of these different applications, but most importantly, that is also capable of dynamic self-configuration and self-reconfiguration. Finally, there is a need for a processor architecture that is capable of massive parallel processing and further interacts with and controls a configurable computing architecture to execute any of these various applications.

Disclosure of Invention

As discussed in more detail below, representative apparatus, systems, and methods provide a computing architecture capable of providing a high-performance and energy-efficient solution for compute-intensive kernels, for example, to compute Fast Fourier Transforms (FFTs) and Finite Impulse Response (FIR) filters for sensing, communication, and analysis applications, such as synthetic aperture radar and 5G base stations, and for graph analysis applications, such as graph clustering using spectral techniques, machine learning, 5G networking algorithms, and large stencil codes, but are not so limited.

Notably, the various representative embodiments provide a multi-threaded, coarse-grained configurable computing architecture that can be configured for any of these different applications and, most importantly, is also capable of self-scheduling, dynamic self-configuration and self-reconfiguration, conditional branching, backpressure control for asynchronous signaling, ordered and looped thread execution (including data dependencies), automatic start of thread execution upon completion of data dependencies and/or ordering, loop access to private variables, fast execution of loop threads using re-entry queues, and advanced loop execution, including nested loops, using various thread identifiers.

As also discussed in greater detail below, the representative apparatus, system, and method provide a processor architecture capable of self-scheduling, massively parallel processing, and further interacting with and controlling a configurable computing architecture to execute any of these different applications.

In a representative embodiment, a system comprises: a first interconnection network; a processor coupled to the interconnection network; a host interface coupled to the interconnection network; and at least one configurable circuit group coupled to the interconnection network, the configurable circuit group comprising: a plurality of configurable circuits arranged in an array; a second asynchronous packet network coupled to each configurable circuit of the plurality of configurable circuits of the array; a third synchronization network coupled to each configurable circuit of the plurality of configurable circuits of the array; a memory interface circuit coupled to the asynchronous packet network and the interconnection network; and scheduling interface circuitry coupled to the asynchronous packet network and the interconnection network.

For any of the various representative embodiments, the interconnection network may include: a first plurality of crossbar switches having a folded Clos configuration and a plurality of direct mesh connections at interfaces with system endpoints. For any of the various representative embodiments, the asynchronous packet network may comprise: a second plurality of crossbars, each crossbar coupled to at least one configurable circuit of the plurality of configurable circuits of the array and another crossbar of the second plurality of crossbars. For any of the various representative embodiments, the synchronization network may comprise: a plurality of direct point-to-point connections coupling adjacent configurable circuits of the array of the plurality of configurable circuits of the group of configurable circuits.

In a representative embodiment, a configurable circuit may comprise: a configurable computing circuit; a plurality of synchronous network inputs coupled to the configurable computing circuitry; a plurality of synchronous network outputs coupled to the configurable computing circuitry; and a configuration memory coupled to the configurable computing circuitry, control circuitry, the synchronous network input, and the synchronous network output, wherein the configuration memory comprises: a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and a second instruction and instruction index memory storing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of the synchronization network input.

In a representative embodiment, each configurable circuit of the plurality of configurable circuits comprises: a configurable computing circuit; a control circuit coupled to the configurable computing circuit, the control circuit comprising: a memory control circuit; a thread control circuit; and a plurality of control registers; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronization network inputs coupled to the configurable computing circuitry and the synchronization network; a plurality of synchronization network outputs coupled to the configurable computing circuitry and the synchronization network; an asynchronous network input queue coupled to the asynchronous packet network; an asynchronous network output queue coupled to the asynchronous packet network; a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output, the configuration memory circuit comprising: a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and a second instruction and instruction index memory storing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of the synchronization network input.

In another representative embodiment, a system may comprise: a first interconnection network; a processor coupled to the interconnection network; a host interface coupled to the interconnection network; and at least one configurable circuit group coupled to the interconnection network, the configurable circuit group comprising: a plurality of configurable circuits arranged in an array, each configurable circuit comprising: a configurable computing circuit; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronous network inputs and outputs coupled to the configurable computing circuitry; an asynchronous network input queue and an asynchronous network output queue; a second configuration memory circuit coupled to the configurable computing circuit, control circuitry, the synchronous network input, and the synchronous network output, the second configuration memory comprising: a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and a second instruction and instruction index memory storing: a plurality of spoke instructions and datapath configuration instruction indices for selecting a master synchronization input of the synchronization network input, for selecting a current datapath configuration instruction of the configurable computing circuit, and for selecting a next datapath instruction or a next datapath instruction index of a next configurable computing circuit; and control circuitry coupled to the configurable computing circuitry, the control circuitry comprising: a memory control circuit; a plurality of control registers, wherein the plurality of control registers store a completion table having a first data completion count; and thread control circuitry for queuing threads for execution.

In another representative embodiment, a system may comprise: a first interconnection network; a host interface coupled to the interconnection network; at least one configurable circuit group coupled to the interconnection network, the configurable circuit group comprising a plurality of configurable circuits arranged in an array; and a processor coupled to the interconnection network, the processor comprising: a processor core to execute a plurality of instructions; and core control circuitry coupled to the processor core, the core control circuitry comprising: an interconnection network interface coupleable to the interconnection network to receive a work descriptor data packet and decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments; a thread control memory coupled to the interconnect network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument; an execution queue coupled to the thread control memory; control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; and an instruction cache coupled to the processor core and the control logic and thread selection circuitry to receive the initial program count and provide a corresponding one of the plurality of instructions to the processor core for execution.

In another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; and a configuration memory coupled to the configurable computing circuitry, control circuitry, synchronous network input, and synchronous network output, the configuration memory comprising: a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and a second instruction and instruction index memory storing a plurality of spoke instructions and datapath configuration instruction indices to select a current datapath configuration instruction of the configurable computing circuit.

In another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; and a configuration memory coupled to the configurable computing circuitry, control circuitry, synchronous network input, and synchronous network output, the configuration memory comprising: a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and a second instruction and instruction index memory storing a plurality of spoke instructions and data path configuration instruction indices to select a next data path configuration instruction for a next configurable computational circuit.

In another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; a control circuit coupled to the configurable computing circuit; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronous network inputs coupled to the configurable computing circuitry; a plurality of synchronous network outputs coupled to the configurable computing circuitry; and a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output, the configuration memory circuit comprising: a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and a second instruction and instruction index memory storing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of the synchronization network input.

In yet another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronous network inputs coupled to the configurable computing circuitry; a plurality of synchronous network outputs coupled to the configurable computing circuitry; and a second configuration memory circuit coupled to the configurable computing circuit, control circuitry, the synchronous network input, and the synchronous network output; and control circuitry coupled to the configurable computing circuitry, the control circuitry comprising: a memory control circuit; a thread control circuit; and a plurality of control registers.

In another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; a configuration memory coupled to the configurable computing circuitry, control circuitry, a synchronous network input, and a synchronous network output, the configuration memory comprising: a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and a second instruction and instruction index memory storing a plurality of spoke instructions and data path configuration instruction indices to select a next data path instruction or a next data path instruction index for a next configurable computational circuit; and conditional logic circuitry coupled to the configurable computation circuitry, wherein the conditional logic circuitry is to provide a conditional branch by modifying the next data path instruction or next data path instruction index provided on a selected output of the plurality of synchronous network outputs in dependence upon an output from the configurable computation circuitry.

In another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; a control circuit coupled to the configurable computing circuit; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronous network inputs coupled to the configurable computing circuitry; a plurality of synchronous network outputs coupled to the configurable computing circuitry; an asynchronous network input queue coupled to an asynchronous packet network and the first memory circuit; an asynchronous network output queue; and a flow control circuit coupled to the asynchronous network output queue, the flow control circuit to generate a stop signal when a predetermined threshold is reached in the asynchronous network output queue.
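
The stop-signal behavior described for the flow control circuit could look like the following sketch; the queue depth and threshold are chosen arbitrarily for illustration.

```c
#include <stdbool.h>

#define STOP_THRESHOLD 48u   /* assumed high-water mark for the output queue */

static unsigned out_q_count;   /* entries currently in the async output queue */
static bool     stop_signal;

/* Assert the stop signal when the output queue crosses the predetermined
 * threshold; clear it as the queue drains below the threshold. */
void on_enqueue(void)
{
    if (++out_q_count >= STOP_THRESHOLD)
        stop_signal = true;
}

void on_dequeue(void)
{
    if (out_q_count > 0 && --out_q_count < STOP_THRESHOLD)
        stop_signal = false;
}

/* Producers consult this before transmitting; on a stop, a compute
 * element finishes its current instruction and then stalls. */
bool may_transmit(void)
{
    return !stop_signal;
}
```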

In another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronous network inputs coupled to the configurable computing circuitry; a plurality of synchronous network outputs coupled to the configurable computing circuitry; and a second configuration memory circuit coupled to the configurable computing circuit, control circuitry, the synchronous network input, and the synchronous network output; and control circuitry coupled to the configurable computing circuitry, the control circuitry comprising: a memory control circuit; a thread control circuit; and a plurality of control registers, wherein the plurality of control registers store a loop table having a plurality of thread identifiers and, for each thread identifier, a next thread identifier for execution after execution of a current thread to provide in-order thread execution.

In another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronous network inputs coupled to the configurable computing circuitry; a plurality of synchronous network outputs coupled to the configurable computing circuitry; and a second configuration memory circuit coupled to the configurable computing circuit, control circuitry, the synchronous network input, and the synchronous network output; and control circuitry coupled to the configurable computing circuitry, the control circuitry comprising: a memory control circuit; a plurality of control registers, wherein the plurality of control registers store a completion table having a first data completion count; and thread control circuitry to queue a thread for execution when a completion count for the thread is decremented to zero for the thread identifier of the thread.

In another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronous network inputs and outputs coupled to the configurable computing circuitry; an asynchronous network input queue and an asynchronous network output queue; a second configuration memory circuit coupled to the configurable computing circuit, control circuitry, the synchronous network input, and the synchronous network output, the second configuration memory comprising: a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and a second instruction and instruction index memory storing: a plurality of spoke instructions and datapath configuration instruction indices for selecting a master synchronization input of the synchronization network input, for selecting a current datapath configuration instruction of the configurable computing circuit, and for selecting a next datapath instruction or a next datapath instruction index of a next configurable computing circuit; and the configurable circuit further comprises a control circuit coupled to the configurable computing circuit, the control circuit comprising: a memory control circuit; a plurality of control registers, wherein the plurality of control registers store a completion table having a first data completion count; and thread control circuitry to queue a thread for execution when a completion count for the thread is decremented to zero for the thread identifier of the thread.

In another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronous network inputs coupled to the configurable computing circuitry; a plurality of synchronous network outputs coupled to the configurable computing circuitry; and a second configuration memory circuit coupled to the configurable computing circuit, control circuitry, the synchronous network input, and the synchronous network output; and control circuitry coupled to the configurable computing circuitry, the control circuitry comprising: a memory control circuit; a plurality of control registers, wherein the plurality of control registers store a completion table having a first data completion count; and thread control circuitry for queuing a thread for execution when, for its thread identifier, the thread's completion count is decremented to zero and its thread identifier is the next thread.

In yet another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronous network inputs coupled to the configurable computing circuitry; a plurality of synchronous network outputs coupled to the configurable computing circuitry; and a second configuration memory circuit coupled to the configurable computing circuit, control circuitry, the synchronous network input, and the synchronous network output; and the configurable circuit further comprises a control circuit coupled to the configurable computing circuit, the control circuit comprising: a memory control circuit; a thread control circuit; and a plurality of control registers storing a completion table having a plurality of types of thread identifiers, wherein each type of thread identifier indicates a loop level for loop and nested loop execution, and wherein the plurality of control registers further store a top of a stack of thread identifiers to allow each type of thread identifier to access a private variable for a selected loop.

In another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronous network inputs coupled to the configurable computing circuitry; a plurality of synchronous network outputs coupled to the configurable computing circuitry; and a second configuration memory circuit coupled to the configurable computing circuit, control circuitry, the synchronous network input, and the synchronous network output; and control circuitry coupled to the configurable computing circuitry, the control circuitry comprising: a memory control circuit; a plurality of control registers; and thread control circuitry comprising: a resume queue that stores one or more thread identifiers for computing threads that have completion counts allowed for execution but do not yet have an assigned thread identifier; and a re-entry queue that stores one or more thread identifiers for computing threads having completion counts allowed for execution and having assigned thread identifiers, such that the threads in the re-entry queue execute after a specified spoke count.

In another representative embodiment, a configurable circuit may comprise: a configurable computing circuit; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronous network inputs coupled to the configurable computing circuitry; a plurality of synchronous network outputs coupled to the configurable computing circuitry; and a second configuration memory circuit coupled to the configurable computing circuit, control circuitry, the synchronous network input, and the synchronous network output; and control circuitry coupled to the configurable computing circuitry, the control circuitry comprising: a memory control circuit; a plurality of control registers storing a pool of thread identifiers and a completion table having a loop count of an active loop thread number; and thread control circuitry, wherein in response to receiving an asynchronous fabric message returning a thread identifier to the thread identifier pool, the control circuitry decrements the loop count and transmits an asynchronous fabric completion message when the loop count reaches zero.

In a representative embodiment, a system is disclosed that may include: an asynchronous packet network; a synchronization network; and a plurality of configurable circuits arranged in an array, each configurable circuit of the plurality of configurable circuits being simultaneously coupled to the synchronous network and the asynchronous packet network, the plurality of configurable circuits being configured to form a plurality of synchronous domains using the synchronous network to perform a plurality of computations, and the plurality of configurable circuits being further configured to generate and transmit a plurality of control messages over the asynchronous packet network, the plurality of control messages including one or more completion messages and continuation messages.

In another representative embodiment, a system may comprise: a plurality of configurable circuits arranged in an array; a synchronization network coupled to each configurable circuit of the plurality of configurable circuits of the array; and an asynchronous packet network coupled to each configurable circuit of the plurality of configurable circuits of the array.

In another representative embodiment, a system may comprise: an interconnection network; a processor coupled to the interconnection network; and a plurality of groups of configurable circuits coupled to the interconnection network.

In a representative embodiment, a system comprises: an interconnection network; a processor coupled to the interconnection network; a host interface coupled to the interconnection network; and a plurality of configurable circuit groups coupled to the interconnection network, each configurable circuit group of the plurality of configurable circuit groups comprising: a plurality of configurable circuits arranged in an array; a synchronization network coupled to each configurable circuit of the plurality of configurable circuits of the array; an asynchronous packet network coupled to each configurable circuit of the plurality of configurable circuits of the array; a memory interface coupled to the asynchronous packet network and the interconnection network; and a scheduling interface coupled to the asynchronous packet network and the interconnection network.

In another representative embodiment, a system may comprise: a hierarchical interconnect network including a first plurality of crossbars having a folded Clos configuration and a plurality of direct mesh connections at interfaces with endpoints; a processor coupled to the interconnection network; a host interface coupled to the interconnection network; and a plurality of configurable circuit groups coupled to the interconnection network, each configurable circuit group of the plurality of configurable circuit groups comprising: a plurality of configurable circuits arranged in an array; a synchronization network coupled to each configurable circuit of the plurality of configurable circuits of the array and providing a plurality of direct connections between adjacent configurable circuits of the array; an asynchronous packet network comprising a second plurality of crossbars, each crossbar coupled to at least one configurable circuit of the plurality of configurable circuits of the array and another crossbar of the second plurality of crossbars; a memory interface coupled to the asynchronous packet network and the interconnection network; and a scheduling interface coupled to the asynchronous packet network and the interconnection network.

In another representative embodiment, a system may comprise: an interconnection network; a processor coupled to the interconnection network; a host interface coupled to the interconnection network; and a plurality of configurable circuit groups coupled to the interconnection network, each configurable circuit group of the plurality of configurable circuit groups comprising: a synchronization network; an asynchronous packet network; a memory interface coupled to the asynchronous packet network and the interconnection network; a scheduling interface coupled to the asynchronous packet network and the interconnection network; and a plurality of configurable circuits arranged in an array, each configurable circuit comprising: a configurable computing circuit; a control circuit coupled to the configurable computing circuit, the control circuit comprising: a memory control circuit; a thread control circuit; and a plurality of control registers; a first memory circuit coupled to the configurable computing circuit; a plurality of synchronization network inputs and outputs coupled to the configurable computing circuitry and the synchronization network; an asynchronous network input queue and an asynchronous network output queue coupled to the asynchronous packet network; a second configuration memory circuit coupled to the configurable computing circuit, the control circuitry, the synchronous network input, and the synchronous network output, the configuration memory circuit comprising: a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and a second instruction and instruction index memory storing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of the synchronization network input.

In any of the various representative embodiments, the second instruction and instruction index memory may further store a plurality of spoke instructions and datapath configuration instruction indices to select a current datapath configuration instruction of the configurable computing circuitry.

In any of the various representative embodiments, the second instruction and instruction index memory may further store a plurality of spoke instructions and datapath configuration instruction indices to select a next datapath configuration instruction for a next configurable computational circuit.

In any of the various representative embodiments, the second instruction and instruction index memory may further store a plurality of spoke instructions and a data path configuration instruction index to select a synchronized network output of the plurality of synchronized network outputs.

In any of the various representative embodiments, the configurable circuit or system may further comprise: a configuration memory multiplexer coupled to the first instruction memory and the second instruction and instruction index memory.

In any of the various representative embodiments, the current datapath configuration instruction may be selected using an instruction index from the second instruction and instruction index memory when the select input of the configuration memory multiplexer has a first setting.

In any of the various representative embodiments, when the select input of the configuration memory multiplexer has a second setting different from the first setting, the current datapath configuration instruction may be selected using an instruction index from the master synchronization input.

In any of the various representative embodiments, the second instruction and instruction index memory may further store a plurality of spoke instructions and datapath configuration instruction indices to configure portions of the configurable circuit independent of the current datapath instruction.

In any of the various representative embodiments, selected ones of the plurality of spoke instructions and datapath configuration instruction indices may be selected according to a modulo spoke count.
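
As a sketch of modulo spoke-count selection, each clock tick indexes the spoke instruction memory; the spoke count of 8 is an assumption.

```c
#include <stdint.h>

#define SPOKE_COUNT 8   /* assumed spokes per time-multiplexing wheel */

static uint32_t spoke_memory[SPOKE_COUNT];   /* spoke instruction slots */

/* Select the spoke instruction for the current tick using a modulo
 * spoke count, giving each time slice its own configuration. */
uint32_t current_spoke_instruction(uint64_t tick)
{
    return spoke_memory[tick % SPOKE_COUNT];
}
```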

In any of the various representative embodiments, the configurable circuit or system may further comprise: a conditional logic circuit coupled to the configurable computing circuit.

In any of the various representative embodiments, the conditional logic circuit is operable to modify the next datapath instruction index provided on a selected one of the plurality of synchronous network outputs, as a function of the output from the configurable computation circuit.

In any of the various representative embodiments, the conditional logic circuit is operable, in dependence upon an output from the configurable computational circuit, to provide a conditional branch by modifying the next datapath instruction or a next datapath instruction index provided on a selected one of the plurality of synchronous network outputs.

In any of the various representative embodiments, when enabled, the conditional logic circuit may be operative to specify the next datapath instruction or datapath instruction index by OR'ing the least significant bits of the next datapath instruction with the output from the configurable computing circuit, providing a conditional branch.

In any of the various representative embodiments, when enabled, the conditional logic circuit may be operative to specify the next datapath instruction index by OR'ing the least significant bits of the next datapath instruction index with the output from the configurable computing circuit, providing a conditional branch.
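
The OR-based conditional branch described above reduces to a one-line selection. This sketch assumes a one-bit test output from the compute element; a true condition steers the next circuit to the odd-numbered instruction index.

```c
#include <stdint.h>

/* When conditional branching is enabled, OR the least significant bit of
 * the next datapath instruction index with the compute element's test
 * output; otherwise pass the base index through unchanged. */
uint16_t next_instruction_index(uint16_t base_index,
                                int condition_enabled,
                                uint16_t test_output)
{
    if (!condition_enabled)
        return base_index;
    return base_index | (test_output & 1u);
}
```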

In any of the various representative embodiments, the plurality of synchronized network inputs may include: a plurality of input registers coupled to a plurality of communication lines of the synchronous network; and an input multiplexer coupled to the plurality of input registers and the second instruction and instruction index memory to select the master synchronization input.

In any of the various representative embodiments, the plurality of synchronized network outputs may include: a plurality of output registers coupled to a plurality of communication lines of the synchronous network; and an output multiplexer coupled to the configurable computing circuitry to select an output from the configurable computing circuitry.

In any of the various representative embodiments, the configurable circuit or system may further comprise: an asynchronous fabric state machine coupled to the asynchronous network input queue and the asynchronous network output queue, the asynchronous fabric state machine to decode input packets received from the asynchronous packet network and assemble output packets for transmission over the asynchronous packet network.

In any of the various representative embodiments, the asynchronous packet network may include a plurality of crossbars, each crossbar coupled to a plurality of configurable circuits and at least one other crossbar.

In any of the various representative embodiments, the configurable circuit or system may further comprise: an array of a plurality of configurable circuits, wherein: each configurable circuit is coupled to the synchronization network through the plurality of synchronization network inputs and the plurality of synchronization network outputs; and each configurable circuit is coupled to the asynchronous packet network through the asynchronous network input and the asynchronous network output.

In any of the various representative embodiments, the synchronization network may include a plurality of direct point-to-point connections coupling adjacent configurable circuits of the array of the plurality of configurable circuits.

In any of the various representative embodiments, each configurable circuit may comprise: a direct path connection between the plurality of input registers and the plurality of output registers. In any of the various representative embodiments, the direct path connection may provide a direct point-to-point connection for data transfer from a second configurable circuit received over the synchronous network to a third configurable circuit transmitted over the synchronous network.

In any of the various representative embodiments, the configurable computing circuitry may comprise arithmetic, logic, and bit-operation circuitry for performing at least one integer operation selected from the group consisting of: signed and unsigned addition, absolute value, negation, logical NOT, add and negate, subtraction A-B, reverse subtraction B-A, signed and unsigned greater than or equal, signed and unsigned less than or equal, equal or not-equal comparisons, logical AND operations, logical OR operations, logical XOR operations, logical NAND operations, logical NOR operations, logical XNOR operations, and conversions between integer and floating point.

In any of the various representative embodiments, the configurable computing circuitry may comprise arithmetic, logic, and bit-operation circuitry for performing at least one floating point operation selected from the group consisting of: signed and unsigned addition, absolute value, negation, logical NOT, add and negate, subtraction A-B, reverse subtraction B-A, signed and unsigned greater than or equal, signed and unsigned less than or equal, equal or not-equal comparisons, logical AND operations, logical OR operations, logical XOR operations, logical NAND operations, logical NOR operations, logical XNOR operations, conversions between integer and floating point, and combinations thereof.

In any of the various representative embodiments, the configurable computing circuitry may comprise multiplication and shift operation circuitry for performing at least one integer operation selected from the group consisting of: multiplication, shifting, pass input, signed and unsigned multiplication, signed and unsigned right shift, signed and unsigned left shift, bit order reversal, permutation, conversion between integer and floating point, and combinations thereof.

In any of the various representative embodiments, the configurable computing circuitry may comprise multiply and shift operation circuitry for performing at least one floating point operation selected from the group consisting of: multiplication, shifting, pass input, signed and unsigned multiplication, signed and unsigned right shift, signed and unsigned left shift, bit order reversal, permutation, conversion between integer and floating point, and combinations thereof.

In any of the various representative embodiments, the array of the plurality of configurable circuits may be further coupled to a first interconnection network. In any of the various representative embodiments, the array of the plurality of configurable circuits may further comprise: a third system memory interface circuit; and a scheduling interface circuit. In any of the various representative embodiments, the scheduling interface circuit may be operative to receive a work descriptor packet over the first interconnection network, and in response to the work descriptor packet, generate one or more data and control packets to the plurality of configurable circuits to configure the plurality of configurable circuits to perform the selected computation.

In any of the various representative embodiments, the configurable circuit or system may further comprise: a flow control circuit coupled to the asynchronous network output queue, the flow control circuit to generate a stop signal when a predetermined threshold is reached in the asynchronous network output queue. In any of the various representative embodiments, each asynchronous network output queue stops outputting data packets on the asynchronous packet network in response to the stop signal. In any of the various representative embodiments, each configurable computing circuit, in response to the stop signal, stops execution after its current instruction is completed.

In any of the various representative embodiments, a first plurality of configurable circuits in an array of a plurality of configurable circuits may be coupled in a first predetermined order through a synchronization network to form a first synchronization domain; and wherein a second plurality of configurable circuits in the array of the plurality of configurable circuits are coupled in a second predetermined order through the synchronization network to form a second synchronization domain. In any of the various representative embodiments, the first synchronization domain may be used to generate a continue message that is transmitted to the second synchronization domain over the asynchronous packet network. In any of the various representative embodiments, the second synchronization domain may be used to generate a completion message that is transmitted to the first synchronization domain over the asynchronous packet network.

In any of the various representative embodiments, the plurality of control registers may store a completion table having a first data completion count. In any of the various representative embodiments, the plurality of control registers may further store a completion table having a second iteration count. In any of the various representative embodiments, the plurality of control registers may further store a loop table having a plurality of thread identifiers and, for each thread identifier, a next thread identifier for execution after execution of the current thread. In any of the various representative embodiments, the plurality of control registers may further store an identification of a first iteration and an identification of a last iteration in the loop table.

In any of the various representative embodiments, the control circuitry may be operative to queue a thread for execution when, for the thread identifier of the thread, the completion count of the thread is decremented to zero and its thread identifier is the next thread.

In any of the various representative embodiments, the control circuitry may be operative to queue a thread for execution when a completion count for the thread indicates completion of any data dependencies for the thread's thread identifier. In any of the various representative embodiments, the completion count may indicate, for each selected thread of the plurality of threads, a predetermined number of completion messages received before execution of the selected thread.
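
A sketch of the completion-count mechanism follows, with enqueue_for_execution standing in for the unspecified thread control circuitry and the table width assumed.

```c
#include <stdint.h>

extern void enqueue_for_execution(int tid);   /* assumed scheduler hook */

static uint8_t completion_count[32];   /* outstanding completion messages per TID */

/* Each completion message decrements the thread's completion count; when
 * the count reaches zero, all data dependencies are satisfied and the
 * thread is queued for execution. */
void on_completion_message(int tid)
{
    if (completion_count[tid] > 0 && --completion_count[tid] == 0)
        enqueue_for_execution(tid);
}
```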

In any of the various representative embodiments, the plurality of control registers may further store a completion table having a plurality of types of thread identifiers, wherein each type of thread identifier indicates a loop level for loop and nested loop execution.

In any of the various representative embodiments, the plurality of control registers may further store a completion table having a loop count of an active loop thread number, and wherein in response to receiving an asynchronous fabric message returning a thread identifier to a thread identifier pool, the control circuitry decrements the loop count and transmits the asynchronous fabric completion message when the loop count reaches zero. In any of the various representative embodiments, the plurality of control registers may further store a top of a thread identifier stack to allow each type of thread identifier to access the private variable for a selected loop.

In any of the various representative embodiments, the control circuit may further comprise: a resume queue; and a re-entry queue. In any of the various representative embodiments, the resume queue stores one or more thread identifiers for computing threads that have completion counts allowing execution but do not yet have an assigned thread identifier. In any of the various representative embodiments, the re-entry queue may store one or more thread identifiers for computing threads having completion counts allowing execution and having assigned thread identifiers. In any of the various representative embodiments, any thread having a thread identifier in the re-entry queue may execute before any thread having a thread identifier in the resume queue is executed.

In any of the various representative embodiments, the control circuit may further comprise: a priority queue, wherein any thread in the priority queue having a thread identifier may execute before any thread in the resume queue or the re-entry queue having a thread identifier is executed.

In any of the various representative embodiments, the control circuit may further comprise: a run queue, wherein any thread in the run queue having a thread identifier can execute after a spoke count at which the thread identifier occurs.
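
The relative ordering of the priority, re-entry, and resume queues described above might be sketched as a simple three-way pick; the queue sizes are assumptions, and the spoke-count gating of the run queue is omitted for brevity.

```c
#include <stdbool.h>

struct q { int items[32]; unsigned head, tail; };

static struct q priority_q, reentry_q, resume_q;

static bool empty(const struct q *x) { return x->head == x->tail; }
static int  pop(struct q *x)         { return x->items[x->head++ % 32]; }

/* Priority threads run first, then re-entry threads (which already hold
 * a thread identifier), then resume threads (which still need one). */
int pick_thread(void)
{
    if (!empty(&priority_q)) return pop(&priority_q);
    if (!empty(&reentry_q))  return pop(&reentry_q);
    if (!empty(&resume_q))   return pop(&resume_q);
    return -1;   /* nothing ready */
}
```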

In any of the various representative embodiments, the second configuration memory circuit may include: a first instruction memory storing a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and a second instruction and instruction index memory storing a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of the synchronization network input.

In any of the various representative embodiments, the control circuitry may be used to self-schedule a computing thread for execution.

In any of the various representative embodiments, the conditional logic circuit may be operable to branch to a different second next instruction for execution by the next configurable circuit.

In any of the various representative embodiments, the control circuitry may be used to order compute threads for execution. In any of the various representative embodiments, the control circuitry may be used to order loop computation threads for execution. In any of the various representative embodiments, the control circuitry may be operative to begin executing a computing thread in response to one or more completion signals from a data dependency.

Various method embodiments of configuring a configurable circuit are also disclosed. One representative method embodiment may comprise: providing, using a first instruction memory, a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and providing, using a second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of the plurality of synchronization network inputs.

In any of the various representative embodiments, a method of configuring a configurable circuit may comprise: providing, using a first instruction memory, a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and providing a plurality of spoke instructions and datapath configuration instruction indices using a second instruction and instruction index memory to select a current datapath configuration instruction of the configurable computing circuitry.

In any of the various representative embodiments, a method of configuring a configurable circuit may comprise: providing, using a first instruction memory, a plurality of datapath configuration instructions to configure datapaths of the configurable computing circuitry; and providing, using a second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to select a next datapath configuration instruction of a next configurable computational circuit.

Also disclosed is a method of controlling thread execution of a multi-threaded configurable circuit, wherein the configurable circuit has configurable computational circuitry. One representative method embodiment may comprise: providing a conditional branch, using conditional logic circuitry, by modifying a next datapath instruction or a next datapath instruction index provided to a next configurable circuit, depending on an output from the configurable computation circuitry.

Another representative method embodiment of controlling thread execution by a multithreaded configurable circuit may comprise: generating, using a flow control circuit, a stop signal when a predetermined threshold is reached in an asynchronous network output queue.
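
A minimal C++ sketch of this threshold-based flow control follows; the threshold depth and all names are assumed values for illustration, not taken from the embodiments.

```cpp
#include <cstddef>
#include <cstdint>
#include <queue>

// Assumed threshold depth, for illustration.
constexpr std::size_t kStopThreshold = 12;

// When the output queue fills past the threshold the stop signal is
// asserted; producers pause after finishing their current instruction,
// and the signal is deasserted once the queue drains below the threshold.
struct AsyncOutputQueue {
    std::queue<uint64_t> packets;
    bool stop_asserted = false;

    void push(uint64_t pkt) {
        packets.push(pkt);
        if (packets.size() >= kStopThreshold) stop_asserted = true;
    }
    void pop_and_transmit() {
        packets.pop();
        if (packets.size() < kStopThreshold) stop_asserted = false;
    }
};
```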

Another representative method embodiment of controlling thread execution by a multithreaded configurable circuit may comprise: storing, using a plurality of control registers, a loop table having a plurality of thread identifiers and, for each thread identifier, a next thread identifier for execution after execution of a current thread, to provide in-order thread execution.

Another representative method embodiment of controlling thread execution by a multithreaded configurable circuit may comprise: storing, using a plurality of control registers, a completion table having a first data completion count; and queuing, using thread control circuitry, a thread for execution when a completion count of the thread is decremented to zero for a thread identifier of the thread.

A method of configuring and controlling thread execution of multithreaded configurable circuitry having configurable computing circuitry is disclosed, wherein a representative method embodiment comprises: providing, using a first instruction memory, a plurality of configuration instructions to configure a data path of the configurable computing circuit; providing, using a second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of a plurality of synchronization network inputs, select a current datapath configuration instruction of the configurable computing circuit, and select a next datapath instruction or a next datapath instruction index of a next configurable computing circuit; providing, using a plurality of control registers, a completion table having a first data completion count; and queuing, using thread control circuitry, a thread for execution when a completion count of the thread is decremented to zero for a thread identifier of the thread.

Another method of configuring and controlling thread execution of a multi-threaded configurable circuit may comprise: providing, using a first instruction memory, a plurality of configuration instructions to configure a data path of the configurable computing circuit; providing, using a second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to select a master synchronization input of a plurality of synchronization network inputs, select a current datapath configuration instruction of the configurable computing circuit, and select a next datapath instruction or a next datapath instruction index of a next configurable computing circuit; providing, using a plurality of control registers, a completion table having a first data completion count; and queuing, using thread control circuitry, a thread for execution when, for the thread's thread identifier, the completion count of the thread is decremented to zero and that thread identifier is the next thread identifier for execution.

Another method of controlling thread execution of a multi-threaded configurable circuit may comprise: storing, using a plurality of control registers, a completion table having a plurality of types of thread identifiers, wherein each type of thread identifier indicates a loop level for loop and nested loop execution, and wherein the plurality of control registers further store a top of a stack of thread identifiers; and allowing each type of thread identifier to access the private variable for the selected loop.

Another method of controlling thread execution of a multi-threaded configurable circuit may comprise: storing, using a plurality of control registers, a completion table having a data completion count; providing, using thread control circuitry, a continuation queue storing one or more thread identifiers for computing threads having completion counts allowing execution but not yet having an assigned thread identifier; and providing, using the thread control circuitry, a re-entry queue storing one or more thread identifiers for computing threads having completion counts allowing execution and having assigned thread identifiers, such that the threads in the re-entry queue execute after a specified spoke count.

Another method of controlling thread execution of a multi-threaded configurable circuit may comprise: storing, using a plurality of control registers, a pool of thread identifiers and a completion table having a loop count of the number of active loop threads; and decrementing, using thread control circuitry, the loop count in response to receiving an asynchronous fabric message that returns a thread identifier to the thread identifier pool, and transmitting an asynchronous fabric completion message when the loop count reaches zero.
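
The loop-count bookkeeping in this method can be sketched in a few lines of C++; the structure and function names are hypothetical, and the completion-message transmission is only a placeholder.

```cpp
#include <cstdint>
#include <vector>

// Sketch of loop-completion tracking. Each asynchronous message returning a
// thread identifier decrements the active-loop-thread count; when it reaches
// zero, one asynchronous fabric completion message is transmitted.
struct LoopTracker {
    int active_loop_threads = 0;
    std::vector<uint16_t> tid_pool;  // the thread identifier pool

    void on_tid_return_message(uint16_t tid) {
        tid_pool.push_back(tid);          // identifier returns to the pool
        if (--active_loop_threads == 0)
            send_async_completion();      // whole loop has finished
    }

    void send_async_completion() {
        // Placeholder: a real embodiment assembles and transmits an
        // asynchronous fabric completion packet here.
    }
};
```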

In any of the various representative embodiments, the method may further comprise: providing, using the second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to select a current datapath configuration instruction of the configurable computing circuit.

In any of the various representative embodiments, the method may further comprise: providing, using the second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to select a next datapath configuration instruction for a next configurable computational circuit.

In any of the various representative embodiments, the method may further comprise: providing, using the second instruction and instruction index memory, a plurality of spoke instructions and datapath configuration instruction indices to select a synchronization network output of the plurality of synchronization network outputs.

In any of the various representative embodiments, the method may further comprise: providing, using a configuration memory multiplexer, a first selection setting to select the current datapath configuration instruction using an instruction index from the second instruction and instruction index memory.

In any of the various representative embodiments, the method may further comprise: providing, using a configuration memory multiplexer, a second selection setting, different from the first selection setting, to select the current datapath configuration instruction using an instruction index from a master synchronization input.

In any of the various representative embodiments, the method may further comprise: providing a plurality of spoke instructions and datapath configuration instruction indices using the second instruction and instruction index memory to configure a portion of the configurable circuit independent of the current datapath instruction.

In any of the various representative embodiments, the method may further comprise: selecting, using a configuration memory multiplexer, a spoke instruction and datapath configuration instruction index of the plurality of spoke instructions and datapath configuration instruction indices according to a modular spoke count.

In any of the various representative embodiments, the method may further comprise: modifying the next datapath instruction or next datapath instruction index using conditional logic circuitry and in dependence upon output from the configurable computation circuitry.

In any of the various representative embodiments, the method may further comprise: providing a conditional branch by modifying the next datapath instruction or next datapath instruction index using conditional logic circuitry and in dependence upon output from the configurable computation circuitry.

In any of the various representative embodiments, the method may further comprise: enabling a conditional logic circuit; and, using the conditional logic circuitry and depending on an output from the configurable computation circuitry, specifying the next datapath instruction or datapath instruction index by OR-ing its least significant bit with the output from the configurable computation circuitry, thereby providing a conditional branch.
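
A one-line C++ illustration of this OR-based branch follows, assuming the not-taken target index has its least significant bit clear so that the two branch targets occupy adjacent instruction slots; the function name is hypothetical.

```cpp
#include <cstdint>

// Illustrative conditional branch: OR the datapath's test bit into the
// least significant bit of the next instruction index. base_index is
// assumed to have its least significant bit clear.
uint8_t next_instruction_index(uint8_t base_index, bool condition) {
    return base_index | static_cast<uint8_t>(condition);
}
```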

In any of the various representative embodiments, the method may further comprise: selecting the primary synchronization input using an input multiplexer. In any of the various representative embodiments, the method may further comprise: selecting an output from the configurable computing circuit using an output multiplexer.

In any of the various representative embodiments, the method may further comprise: decoding input data packets received from the asynchronous packet network and assembling output data packets for transmission over the asynchronous packet network, using an asynchronous fabric state machine coupled to an asynchronous network input queue and an asynchronous network output queue.

In any of the various representative embodiments, the method may further comprise: providing a plurality of direct point-to-point connections coupling adjacent configurable circuits of the array of the plurality of configurable circuits using the synchronization network.

In any of the various representative embodiments, the method may further comprise: providing, using the configurable circuit, a direct path connection between a plurality of input registers and a plurality of output registers. In any of the various representative embodiments, the direct path connection provides a direct point-to-point connection for data received over the synchronous network from a second configurable circuit and transmitted over the synchronous network to a third configurable circuit.

In any of the various representative embodiments, the method may further comprise: using the configurable computing circuitry, performing at least one integer or floating point operation selected from the group consisting of: signed and unsigned addition, absolute value, negation, logical negation, add-and-negate, subtraction A-B, reverse subtraction B-A, signed and unsigned greater-than-or-equal, signed and unsigned less-than-or-equal, equal and unequal comparisons, logical AND operations, logical OR operations, logical XOR operations, logical NAND operations, logical NOR operations, and conversions between integer and floating point.

In any of the various representative embodiments, the method may further comprise: using the configurable computing circuitry, performing at least one integer or floating point operation selected from the group consisting of: multiplication, shifting, pass-through of an input (pass input), signed and unsigned multiplication, signed and unsigned right shift, signed and unsigned left shift, bit order reversal, permutation, conversion between integer and floating point, and combinations thereof.
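
For concreteness, the following C++ sketch models a subset of the integer operations from the two lists above as a dispatch over an operation code; the enumerator names are descriptive assumptions, and a hardware datapath would of course realize these combinationally rather than as a switch.

```cpp
#include <cstdint>
#include <cstdlib>

// Illustrative subset of the integer operations listed above.
enum class Op { Add, Sub, RevSub, Abs, Neg, And, Or, Xor, Nand, Nor, Shl, Shr, Mul };

int64_t alu(Op op, int64_t a, int64_t b) {
    switch (op) {
        case Op::Add:    return a + b;
        case Op::Sub:    return a - b;          // subtraction A - B
        case Op::RevSub: return b - a;          // reverse subtraction B - A
        case Op::Abs:    return std::llabs(a);
        case Op::Neg:    return -a;
        case Op::And:    return a & b;
        case Op::Or:     return a | b;
        case Op::Xor:    return a ^ b;
        case Op::Nand:   return ~(a & b);
        case Op::Nor:    return ~(a | b);
        case Op::Shl:    return static_cast<int64_t>(static_cast<uint64_t>(a) << (b & 63));
        case Op::Shr:    return a >> (b & 63);  // arithmetic right shift for signed a
        case Op::Mul:    return a * b;
    }
    return 0;
}
```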

In any of the various representative embodiments, the method may further comprise: using a scheduling interface circuit, receiving a work descriptor packet over the first interconnection network, and in response to the work descriptor packet, generating one or more data and control packets to the plurality of configurable circuits to configure the plurality of configurable circuits to perform selected calculations.

In any of the various representative embodiments, the method may further comprise: generating, using a flow control circuit, a stop signal when a predetermined threshold is reached in an asynchronous network output queue. In any of the various representative embodiments, each asynchronous network output queue stops outputting data packets on the asynchronous packet network in response to the stop signal. In any of the various representative embodiments, each configurable computing circuit, in response to the stop signal, stops execution after its current instruction is completed.

In any of the various representative embodiments, the method may further comprise: coupling a first plurality of configurable circuits in an array of a plurality of configurable circuits in a first predetermined order through the synchronization network to form a first synchronization domain; and coupling a second plurality of configurable circuits in the array of the plurality of configurable circuits in a second predetermined order through the synchronization network to form a second synchronization domain.

In any of the various representative embodiments, the method may further comprise: generating a continuation message from the first synchronous domain to the second synchronous domain for transmission over the asynchronous packet network.

In any of the various representative embodiments, the method may further comprise: generating a completion message from the second synchronous domain to the first synchronous domain for transmission over the asynchronous packet network. In any of the various representative embodiments, the method may further include storing a completion table having a first data completion count in the plurality of control registers.

In any of the various representative embodiments, the method may further comprise: storing the completion table with a second iteration count in the plurality of control registers.

In any of the various representative embodiments, the method may further comprise: storing, in the plurality of control registers, a loop table having a plurality of thread identifiers and, for each thread identifier, a next thread identifier for execution after execution of a current thread.

In any of the various representative embodiments, the method may further comprise: storing an identification of a first iteration and an identification of a last iteration in the loop table in the plurality of control registers.

In any of the various representative embodiments, the method may further comprise: queuing, using the control circuitry, a thread for execution when, for the thread's thread identifier, the completion count for the thread is decremented to zero.

In any of the various representative embodiments, the method may further comprise: queuing, using the control circuitry, a thread for execution when, for the thread's thread identifier, the completion count for the thread is decremented to zero and that thread identifier is the next thread identifier for execution.

In any of the various representative embodiments, the method may further comprise: queuing, using the control circuitry, a thread for execution when a completion count for the thread indicates completion of any data dependencies for the thread's thread identifier. In any of the various representative embodiments, the completion count may indicate, for each selected thread of the plurality of threads, a predetermined number of completion messages to be received before execution of the selected thread.

In any of the various representative embodiments, the method may further comprise: storing, in the plurality of control registers, a completion table having a plurality of types of thread identifiers, wherein each type of thread identifier indicates a loop level for loop and nested loop execution.

In any of the various representative embodiments, the method may further comprise: storing, in the plurality of control registers, a completion table having a loop count of the number of active loop threads, wherein, in response to receiving an asynchronous fabric message that returns a thread identifier to a thread identifier pool, the control circuitry decrements the loop count and transmits an asynchronous fabric completion message when the loop count reaches zero.

In any of the various representative embodiments, the method may further comprise: storing the top of the thread identifier stack in the plurality of control registers to allow each type of thread identifier to access the private variable for the selected loop.

In any of the various representative embodiments, the method may further comprise: storing, using the continuation queue, one or more thread identifiers for computing threads having completion counts allowing execution but not yet having an assigned thread identifier.

In any of the various representative embodiments, the method may further comprise: storing, using the re-entry queue, one or more thread identifiers for computing threads having completion counts allowing execution and having assigned thread identifiers.

In any of the various representative embodiments, the method may further comprise: executing any thread having a thread identifier in the re-entry queue before executing any thread having a thread identifier in the continuation queue.

In any of the various representative embodiments, the method may further comprise: executing any thread having a thread identifier in the priority queue before executing any thread having a thread identifier in the continuation queue or the re-entry queue.

In any of the various representative embodiments, the method may further comprise: executing any thread having a thread identifier in the run queue when the spoke count for that thread identifier occurs.

In any of the various representative embodiments, the method may further comprise: self-scheduling, using the control circuitry, computing threads for execution.

In any of the various representative embodiments, the method may further comprise: using the conditional logic circuit, branching to a different second next instruction for execution by a next configurable circuit.

In any of the various representative embodiments, the method may further comprise: ordering, using the control circuitry, computing threads for execution.

In any of the various representative embodiments, the method may further comprise: ordering, using the control circuitry, loop computation threads for execution.

In any of the various representative embodiments, the method may further comprise: beginning, using the control circuitry, execution of a computing thread in response to one or more completion signals from a data dependency.

A self-scheduling processor is disclosed. In a representative embodiment, the processor comprises: a processor core to execute the received instructions; and core control circuitry coupled to the processor core, the core control circuitry to automatically schedule instructions for execution by the processor core in response to received work descriptor data packets. In another representative embodiment, the processor comprises: a processor core to execute the received instructions; and core control circuitry coupled to the processor core, the core control circuitry to automatically schedule instructions for execution by the processor core in response to received event data packets.

A multithreaded self-scheduling processor which may create threads on local or remote computing elements is also disclosed. In a representative embodiment, the processor comprises: a processor core to execute a fiber create instruction; and core control circuitry coupled to the processor core, the core control circuitry to automatically schedule the fiber create instruction for execution by the processor core and generate one or more work descriptor data packets destined for another processor or hybrid thread fabric circuitry to execute a corresponding plurality of execution threads. In another representative embodiment, the processor comprises: a processor core to execute a fiber create instruction; and core control circuitry coupled to the processor core, the core control circuitry to schedule the fiber create instruction for execution by the processor core, reserve a predetermined amount of memory space in a thread control memory to store return arguments, and generate one or more work descriptor data packets destined for another processor or a hybrid thread fabric circuit to execute a corresponding plurality of execution threads.

In another representative embodiment, a processor comprises: a core control circuit, comprising: an interconnection network interface; a thread control memory coupled to the interconnect network interface; an execution queue coupled to the thread control memory; control logic and thread selection circuitry coupled to the execution queue and the thread control memory; and an instruction cache coupled to the control logic and thread selection circuitry; the processor additionally includes a processor core coupled to the instruction cache of the core control circuitry.

In another representative embodiment, a processor comprises: a core control circuit, comprising: an interconnection network interface; a thread control memory coupled to the interconnect network interface; a network response memory; an execution queue coupled to the thread control memory; control logic and thread selection circuitry coupled to the execution queue and the thread control memory; an instruction cache coupled to the control logic and thread selection circuitry; and a command queue; the processor additionally includes a processor core coupled to the instruction cache and the command queue of the core control circuitry.

In another representative embodiment, a processor comprises: a processor core and core control circuitry coupled to the processor core, wherein the core control circuitry comprises: an interconnection network interface coupleable to the interconnection network to receive the work descriptor data packet, decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments; an execution queue coupled to the thread control memory; and control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, and periodically select the thread identifier to execute the execution thread.

In another representative embodiment, a processor comprises: a processor core and core control circuitry coupled to the processor core, wherein the core control circuitry comprises: an interconnection network interface coupleable to the interconnection network to receive the work descriptor data packet, decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments; an execution queue coupled to the thread control memory; and control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, and periodically select the thread identifier for execution of instructions of the execution thread by a processor core.
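
The following C++ sketch traces this self-scheduling path end to end: a work descriptor packet is decoded into a program count and arguments, a free thread identifier is assigned, and the identifier is queued and later selected. All structure and field names, and the pool size of 64, are assumptions for illustration.

```cpp
#include <cstdint>
#include <deque>
#include <optional>
#include <vector>

// A decoded work descriptor: initial program count plus any arguments.
struct WorkDescriptor {
    uint64_t program_count;
    std::vector<uint64_t> args;
};

struct CoreControl {
    std::deque<uint16_t> tid_pool;            // pre-filled with free identifiers
    std::deque<uint16_t> execution_queue;     // identifiers ready to run
    uint64_t program_count[64] = {};          // indexed by thread identifier
    std::vector<uint64_t> general_regs[64];   // received arguments per thread

    // Decode, assign an available thread identifier, and queue automatically.
    std::optional<uint16_t> on_work_descriptor(const WorkDescriptor &wd) {
        if (tid_pool.empty()) return std::nullopt;  // no identifier available
        uint16_t tid = tid_pool.front();
        tid_pool.pop_front();
        program_count[tid] = wd.program_count;
        general_regs[tid]  = wd.args;
        execution_queue.push_back(tid);
        return tid;
    }

    // Periodic selection: take the next identifier; the core executes one or
    // more instructions of that thread, then the identifier may be requeued.
    std::optional<uint16_t> select_next() {
        if (execution_queue.empty()) return std::nullopt;
        uint16_t tid = execution_queue.front();
        execution_queue.pop_front();
        return tid;
    }
};
```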

In another representative embodiment, a processor comprises: a processor core and core control circuitry coupled to the processor core, wherein the core control circuitry comprises: an execution queue coupled to the thread control memory; and control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, and periodically select the thread identifier for execution by the processor core of an instruction of the execution thread.

In another representative embodiment, a processor comprises: a processor core and core control circuitry coupled to the processor core, wherein the core control circuitry comprises: a thread control memory comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument; an execution queue coupled to the thread control memory; and control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, and periodically select the thread identifier for execution of instructions of the execution thread by the processor core, the processor core using data stored in the data cache or general purpose register.

In another representative embodiment, a processor comprises: a processor core and core control circuitry coupled to the processor core, wherein the core control circuitry comprises: a thread control memory comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a program count register to store a received program count, and a thread status register to store a valid status or a suspended status for each thread identifier of the plurality of thread identifiers; an execution queue coupled to the thread control memory; and control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue when the thread identifier has a valid status, and periodically select the thread identifier for the processor core to execute instructions of the execution thread for the duration in which the valid status remains unchanged, until the execution thread is completed.

In another representative embodiment, a processor comprises: a processor core and core control circuitry coupled to the processor core, wherein the core control circuitry comprises: a thread control memory comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a program count register to store a received program count, and a thread status register to store a valid status or a suspended status for each thread identifier of the plurality of thread identifiers; an execution queue coupled to the thread control memory; and control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue when the thread identifier has a valid status, periodically select the thread identifier for the processor core to execute instructions of the execution thread for the duration that the valid status remains unchanged, and suspend thread execution by not returning the thread identifier to the execution queue when the thread identifier has a suspended status.
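
The valid/suspended discipline reduces to a simple requeueing rule, sketched below in C++; the names and the table size of 64 are assumptions, and "suspend by not requeueing" is the key point the sketch illustrates.

```cpp
#include <cstdint>
#include <deque>

enum class ThreadState { Valid, Suspended };

// After each instruction, the identifier is returned to the execution queue
// only while its state is valid; a suspended thread simply is not requeued,
// which suspends execution until an event flips it back to valid.
struct Scheduler {
    std::deque<uint16_t> execution_queue;
    ThreadState state[64] = {};  // defaults to Valid (first enumerator)

    void after_instruction(uint16_t tid, bool thread_completed) {
        if (thread_completed) return;          // identifier returns to the pool
        if (state[tid] == ThreadState::Valid)
            execution_queue.push_back(tid);    // keep executing
        // Suspended: deliberately not requeued.
    }
};
```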

In another representative embodiment, a processor comprises: a processor core and core control circuitry coupled to the processor core, wherein the core control circuitry comprises: a thread control memory comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument; an execution queue coupled to the thread control memory; and control logic and thread selection circuitry coupled to the execution queue, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, and periodically select the thread identifier for execution by the processor core of an instruction of the execution thread.

In another representative embodiment, a processor comprises: a processor core to execute a plurality of instructions; and core control circuitry coupled to the processor core, wherein the core control circuitry comprises: an interconnection network interface coupleable to the interconnection network to receive the work descriptor data packet, decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments; a thread control memory coupled to the interconnect network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument; an execution queue coupled to the thread control memory; control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; and an instruction cache coupled to the processor core and the control logic and thread selection circuitry to receive the initial program count and provide a corresponding one of the plurality of instructions to the processor core for execution.

In another representative embodiment, a processor comprises: a core control circuit, comprising: an interconnection network interface coupleable to the interconnection network to receive the work descriptor data packet, decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments; a thread control memory coupled to the interconnect network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, a data cache, and a general purpose register to store a received argument; an execution queue coupled to the thread control memory; control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, automatically place the thread identifier in the execution queue, periodically select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; and an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution; the processor additionally includes a processor core coupled to the instruction cache of the core control circuitry, the processor core to execute the corresponding instruction.

In another representative embodiment, a processor comprises: a core control circuit, comprising: an interconnection network interface coupleable to the interconnection network to receive the work descriptor data packet, decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments; a thread control memory coupled to the interconnect network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, and a general purpose register to store a received argument; an execution queue coupled to the thread control memory; control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution; and a command queue; the processor additionally includes a processor core coupled to the instruction cache and the command queue of the core control circuitry, the processor core to execute the corresponding instruction.

In another representative embodiment, a processor comprises: a core control circuit, comprising: an interconnection network interface coupleable to the interconnection network to receive the work descriptor data packet, decode the received work descriptor data packet into an execution thread having an initial program count and any received arguments; a thread control memory coupled to the interconnect network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, and a general purpose register to store a received argument; an execution queue coupled to the thread control memory; control logic and thread selection circuitry coupled to the execution queue and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; and an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution; the processor additionally includes a processor core coupled to the instruction cache of the core control circuitry, the processor core to execute the corresponding instruction.

In another representative embodiment, a processor comprises: a core control circuit, comprising: an interconnection network interface coupleable to the interconnection network to receive a call work descriptor packet, decode the received work descriptor packet into an execution thread having an initial program count and any received arguments, and encode the work descriptor packet for transmission to other processing elements; a thread control memory coupled to the interconnect network interface and comprising a plurality of registers including a thread identifier pool register to store a plurality of thread identifiers, a thread status register, a program count register to store a received program count, and a general purpose register to store a received argument; an execution queue coupled to the thread control memory; a network response memory coupled to the interconnect network interface; control logic and thread selection circuitry coupled to the execution queue, the thread control memory, and an instruction cache, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution; and a command queue storing one or more commands to generate one or more work descriptor packets; the processor additionally includes a processor core coupled to the instruction cache and the command queue of the core control circuitry, the processor core to execute the corresponding instruction.

For any of the various representative embodiments, the core control circuitry may further comprise: an interconnection network interface coupleable to an interconnection network, the interconnection network interface to receive a work descriptor packet, decode the received work descriptor packet into an execution thread having an initial program count and any received arguments. For any of the various representative embodiments, the interconnection network interface may be further operable to receive an event data packet, decode the received event data packet into an event identifier and any received arguments.

For any of the various representative embodiments, the core control circuitry may further comprise: control logic and thread selection circuitry coupled to the interconnect network interface, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread.

For any of the various representative embodiments, the core control circuitry may further comprise: a thread control memory having a plurality of registers, wherein the plurality of registers comprises one or more of the following, in any selected combination: a thread identifier pool register storing a plurality of thread identifiers; a thread state register; a program count register to store the received initial program count; a general purpose register storing the received argument; a pending fiber return count register; a return argument buffer or register; a return argument linked list register; a custom atomic transaction identifier register; an event status register; an event mask register; and a data cache.

For any of the various representative embodiments, the interconnect network interface may be further operable to store the execution thread with the initial program count and any received arguments in the thread control memory using a thread identifier as an index to the thread control memory.

For any of the various representative embodiments, the core control circuitry may further comprise: control logic and thread selection circuitry coupled to the thread control memory and the interconnect network interface, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread.

For any of the various representative embodiments, the core control circuitry may further comprise: an execution queue coupled to the thread control memory, the execution queue storing one or more thread identifiers.

For any of the various representative embodiments, the core control circuitry may further comprise: control logic and thread selection circuitry coupled to the execution queue, the interconnect network interface, and the thread control memory, the control logic and thread selection circuitry to assign an available thread identifier to the execution thread, place the thread identifier in the execution queue, select the thread identifier for execution, and access the thread control memory using the thread identifier as an index to select the initial program count for the execution thread.

For any of the various representative embodiments, the core control circuitry may further comprise: an instruction cache coupled to the control logic and thread selection circuitry to receive the initial program count and provide corresponding instructions for execution.

In another representative embodiment, the processor may further comprise: a processor core coupled to the instruction cache of the core control circuitry, the processor core to execute the corresponding instruction.

For any of the various representative embodiments, the core control circuitry may be further operative to assign an initial valid state to the execution thread. For any of the various representative embodiments, the core control circuitry may be further operative to assign a halt state to the execution thread in response to the processor core executing a memory load instruction. For any of the various representative embodiments, the core control circuitry may be further operative to assign a halt state to the execution thread in response to the processor core executing a memory store instruction.

For any of the various representative embodiments, the core control circuitry may be further operative to end execution of a selected thread in response to the processor core executing a return instruction. For any of the various representative embodiments, the core control circuitry may be further operative to return the corresponding thread identifier for the selected thread to the thread identifier pool register in response to the processor core executing a return instruction. For any of the various representative embodiments, the core control circuitry may be further operative to clear a register of the thread control memory indexed by the corresponding thread identifier of the selected thread in response to the processor core executing a return instruction.

For any of the various representative embodiments, the interconnect network interface may be further operative to generate a return work descriptor packet in response to the processor core executing a return instruction.

For any of the various representative embodiments, the core control circuitry may further comprise: a network response memory. For any of the various representative embodiments, the network response memory may comprise one or more of the following, in any selected combination: a memory request register; a thread identifier and transaction identifier register; a request cache line index register; a byte register; and a general purpose register index and type register.

For any of the various representative embodiments, the interconnection network interface may be used to generate a point-to-point event data message. For any of the various representative embodiments, the interconnection network interface may be operative to generate a broadcast event data message.

For any of the various representative embodiments, the core control circuitry may be further operative to respond to a received event data packet using an event mask stored in the event mask register. For any of the various representative embodiments, the core control circuitry may be further operative to determine an event number corresponding to the received event data packet. For any of the various representative embodiments, the core control circuitry may be further operative to change the state of the thread identifier from suspended to valid in response to the received event data packet, to resume execution of the corresponding thread of execution. For any of the various representative embodiments, the core control circuitry may be further operative to change the state of the thread identifier from suspended to valid in response to the event number of the received event data packet, to resume execution of the corresponding thread of execution.
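
A compact C++ sketch of this event handling follows; the per-thread mask layout, the table size, and all names are assumptions for illustration.

```cpp
#include <cstdint>
#include <deque>

// Sketch of event handling: a per-thread event mask gates which event
// numbers the thread reacts to; a matching event changes the thread from
// suspended to valid and requeues its identifier so execution resumes.
struct EventControl {
    uint64_t event_mask[64] = {};   // event mask register per thread identifier
    bool     suspended[64]  = {};
    std::deque<uint16_t> execution_queue;

    void on_event_packet(uint16_t tid, unsigned event_number) {
        bool masked_in = (event_mask[tid] >> (event_number & 63)) & 1u;
        if (masked_in && suspended[tid]) {
            suspended[tid] = false;          // suspended -> valid
            execution_queue.push_back(tid);  // resume the execution thread
        }
    }
};
```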

For any of the various representative embodiments, the control logic and thread selection circuitry may be further operative to successively select a next thread identifier from the execution queue to execute a single instruction of a corresponding execution thread. For any of the various representative embodiments, the control logic and thread selection circuitry may be further operative to perform a round-robin selection of a next thread identifier from the execution queue among the plurality of thread identifiers, each for executing a single instruction of a corresponding execution thread. For any of the various representative embodiments, the control logic and thread selection circuitry may be further operative to perform a round-robin selection of a next thread identifier from the execution queue among the plurality of thread identifiers, each for executing a single instruction of a corresponding execution thread until the execution thread is completed. For any of the various representative embodiments, the control logic and thread selection circuitry may be further operative to perform a barrel selection of a next thread identifier from the execution queue among the plurality of thread identifiers, each for executing a single instruction of a corresponding execution thread.

For any of the various representative embodiments, the control logic and thread selection circuitry may be further operative to assign an active state or a suspended state to the thread identifier. For any of the various representative embodiments, the control logic and thread selection circuitry may be further operative to assign a priority status to the thread identifier. For any of the various representative embodiments, the control logic and thread selection circuitry may be further operative to return a corresponding thread identifier having an assigned valid state and an assigned priority to the execution queue after execution of a corresponding instruction.

For any of the various representative embodiments, the core control circuitry may further comprise: a network command queue coupled to the processor core.

For any of the various representative embodiments, the interconnection network interface may include: an input queue; a packet decoder circuit coupled to the input queue, the control logic and thread selection circuitry, and the thread control memory; an output queue; and a packet encoder circuit coupled to the output queue, the network response memory, and the network command queue.

For any of the various representative embodiments, the execution queue may further comprise: a first priority queue; and a second priority queue. For any of the various representative embodiments, the control logic and thread selection circuitry may further comprise: thread selection control circuitry coupled to the execution queue, the thread selection control circuitry to select a thread identifier from the first priority queue at a first frequency and to select a thread identifier from the second priority queue at a second frequency, the second frequency being lower than the first frequency. For any of the various representative embodiments, the thread selection control circuitry may be operative to determine the second frequency as a skip count starting with selection of a thread identifier from the first priority queue.
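
The skip-count relationship between the two priority queues can be illustrated in C++ as below; the ratio of 8 and all names are assumed values, not taken from the embodiments.

```cpp
#include <cstdint>
#include <deque>
#include <optional>

// Two-level selection: the second priority queue is consulted only once per
// `skip_count` selections, making its selection frequency lower than the
// first queue's.
struct PrioritySelector {
    std::deque<uint16_t> first_priority, second_priority;
    unsigned skip_count = 8;           // assumed ratio, for illustration
    unsigned selections_since_low = 0;

    std::optional<uint16_t> select() {
        if (++selections_since_low >= skip_count && !second_priority.empty()) {
            selections_since_low = 0;
            uint16_t tid = second_priority.front();
            second_priority.pop_front();
            return tid;
        }
        if (!first_priority.empty()) {
            uint16_t tid = first_priority.front();
            first_priority.pop_front();
            return tid;
        }
        return std::nullopt;
    }
};
```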

For any of the various representative embodiments, the core control circuitry may further comprise: data path control circuitry for controlling access size through the first interconnection network. For any of the various representative embodiments, the core control circuitry may further comprise: data path control circuitry to increase or decrease a memory load access size in response to a time-averaged usage level. For any of the various representative embodiments, the core control circuitry may further comprise: data path control circuitry to increase or decrease a memory store access size in response to a time-averaged usage level. For any of the various representative embodiments, the control logic and thread selection circuitry may be further operative to increase the size of a memory load access request to correspond to a cache line boundary of the data cache.

For any of the various representative embodiments, the core control circuitry may further comprise: system call circuitry to generate one or more system calls to a host processor. For any of the various representative embodiments, the system call circuitry may further comprise: a plurality of system call credit registers storing a predetermined credit count to modulate the number of system calls in any predetermined time period.
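
A minimal C++ sketch of this credit-based pacing, assuming a single credit counter refilled once per period; the structure and method names are hypothetical.

```cpp
#include <cstdint>

// Credit-based pacing of host system calls: each call consumes a credit and
// is held when none remain; credits refill once per predetermined period.
struct SystemCallCredits {
    uint32_t credits;
    const uint32_t max_credits;  // the predetermined credit count

    explicit SystemCallCredits(uint32_t n) : credits(n), max_credits(n) {}

    bool try_issue_call() {
        if (credits == 0) return false;  // modulated: the call must wait
        --credits;
        return true;
    }
    void on_period_elapsed() { credits = max_credits; }
};
```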

For any of the various representative embodiments, the core control circuitry may be further operative, in response to a request from a host processor, to generate a command causing the command queue of the interconnect network interface to copy and transmit all data corresponding to a selected thread identifier from the thread control memory, so that the host processor can monitor thread status.

For any of the various representative embodiments, the processor core may be used to execute a fiber create instruction to generate one or more commands that cause the command queue of the interconnect network interface to generate one or more call work descriptor packets destined for another processor core or a hybrid thread fabric circuit. For any of the various representative embodiments, the core control circuitry may be further operative to reserve a predetermined amount of memory space in the general purpose register or a return argument register in response to the processor core executing a fiber create instruction.

For any of the various representative embodiments, the core control circuitry may be operative to store a thread return count in the thread return register in response to generating one or more call work descriptor packets destined for another processor core or a hybrid thread fabric. For any of the various representative embodiments, in response to receiving a return data packet, the core control circuitry may be operative to decrement the thread return count stored in the thread return register. For any of the various representative embodiments, in response to the thread return count in the thread return register decrementing to zero, the core control circuitry may be operative to change the suspended state of the corresponding thread identifier to a valid state for subsequent execution of a thread return instruction to complete the created fiber or thread.
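
The return-count mechanism amounts to a small amount of bookkeeping, sketched below in C++; the table size of 64 and all names are assumptions for illustration.

```cpp
#include <cstdint>

// Sketch of fiber-return accounting: the creating thread records how many
// return packets it expects, suspends, and is made valid again when the
// count reaches zero so its join instruction can complete.
struct FiberReturnControl {
    uint32_t thread_return_count[64] = {};  // thread return register file
    bool     suspended[64]           = {};

    void on_fibers_created(uint16_t creator_tid, uint32_t expected_returns) {
        thread_return_count[creator_tid] = expected_returns;
        suspended[creator_tid] = true;       // wait for all returns
    }
    void on_return_packet(uint16_t creator_tid) {
        if (--thread_return_count[creator_tid] == 0)
            suspended[creator_tid] = false;  // suspended -> valid; join proceeds
    }
};
```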

For any of the various representative embodiments, the processor core may be used to execute a wait or a non-wait fiber join instruction. For any of the various representative embodiments, the processor core may be operative to execute a fiber join all instruction.

For any of the various representative embodiments, the processor core may be operative to execute a non-cache read or load instruction to specify a general purpose register for storing data received from memory. For any of the various representative embodiments, the processor core may be operative to execute a non-cache write or store instruction to designate data in a general purpose register for storage in memory.

For any of the various representative embodiments, the core control circuitry may be operative to assign a transaction identifier to any load or store request to memory and correlate the transaction identifier with a thread identifier.

For any of the various representative embodiments, the processor core may be operative to execute a first thread priority instruction to assign a first priority to an execution thread having a corresponding thread identifier. For any of the various representative embodiments, the processor core may be operative to execute a second thread priority instruction to assign a second priority to the execution thread having the corresponding thread identifier.

For any of the various representative embodiments, the processor core may be operative to execute a custom atomic return instruction to complete a thread of execution of a custom atomic operation. For any of the various representative embodiments, in conjunction with a memory controller, the processor core may be used to perform floating point atomic memory operations. For any of the various representative embodiments, in conjunction with a memory controller, the processor core may be used to perform custom atomic memory operations.

Also disclosed is a method of self-scheduled execution of instructions, wherein an exemplary method embodiment comprises: receiving a work descriptor data packet; and automatically scheduling instructions for execution in response to the received work descriptor data packet.

Another method of self-scheduled execution of instructions is also disclosed, wherein one representative method embodiment comprises: receiving an event data packet; and automatically scheduling instructions for execution in response to the received event data packet.

Also disclosed is a method of a first processing element generating a plurality of threads of execution for execution by a second processing element, wherein a representative method embodiment comprises: executing a fiber create instruction; and in response to executing the fiber create instruction, generating one or more work descriptor data packets destined for the second processing element to execute the plurality of execution threads.

Also disclosed is a method of a first processing element generating a plurality of threads of execution for execution by a second processing element, wherein a representative method embodiment comprises: executing a fiber create instruction; and in response to executing the fiber create instruction, reserving a predetermined amount of memory space in a thread control memory to store return arguments and generating one or more work descriptor data packets destined for the second processing element to execute the plurality of execution threads.

Also disclosed is a method of self-scheduled execution of instructions, wherein an exemplary method embodiment comprises: receiving a work descriptor data packet; decoding the received job descriptor packet into an execution thread having an initial program count and any received arguments; assigning an available thread identifier to the execution thread; automatically queuing the thread identifier for execution of the execution thread; and periodically selecting the thread identifier to execute the execution thread.

Another method of self-scheduled execution of instructions is also disclosed, wherein one representative method embodiment comprises: receiving a work descriptor data packet; decoding the received job descriptor packet into an execution thread having an initial program count and any received arguments; assigning an available thread identifier to the execution thread; automatically queuing the thread identifier for execution of the execution thread when the thread identifier has a valid state; and periodically selecting the thread identifier to execute instructions of the execution thread for a duration in which the valid state remains unchanged until the execution thread is completed.

Another method of self-scheduled execution of instructions is also disclosed, wherein one representative method embodiment comprises: receiving a work descriptor data packet; decoding the received work descriptor data packet into an execution thread having an initial program count and any received arguments; assigning an available thread identifier to the execution thread; automatically queuing the thread identifier in an execution queue for execution of the execution thread when the thread identifier has a valid state; periodically selecting the thread identifier to execute instructions of the execution thread for the duration in which the valid state remains unchanged; and when the thread identifier has a suspended state, suspending thread execution by not returning the thread identifier to the execution queue.

Another method of self-scheduled execution of instructions is also disclosed, wherein one representative method embodiment comprises: receiving a work descriptor data packet; decoding the received work descriptor data packet into an execution thread having an initial program count and any received arguments; storing the initial program count and any received arguments in a thread control memory; assigning an available thread identifier to the execution thread; automatically queuing the thread identifier for execution of the execution thread when the thread identifier has a valid state; accessing the thread control memory using the thread identifier as an index to select the initial program count for the execution thread; and periodically selecting the thread identifier to execute instructions of the execution thread for the duration in which the valid state remains unchanged, until the execution thread is completed.
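The self-scheduling method recited above, receiving a work descriptor packet, decoding it into an execution thread with an initial program count and arguments, assigning an available thread identifier, storing the context in a thread control memory indexed by that identifier, queuing the identifier, and periodically selecting it for execution, can be illustrated with a small behavioral model. The following Python sketch is a simplified, hypothetical rendering of that flow; all class, field, and method names are illustrative assumptions, not the actual hardware.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class ThreadContext:
    program_count: int
    arguments: tuple
    state: str = "valid"  # "valid" or "suspended"

class SelfSchedulingCore:
    def __init__(self, num_tids=4):
        self.tid_pool = deque(range(num_tids))  # pool of available thread identifiers
        self.thread_control_memory = {}         # indexed by thread identifier (TID)
        self.execution_queue = deque()          # TIDs queued for execution

    def receive_work_descriptor(self, initial_pc, args=()):
        """Decode a work descriptor packet into a thread and queue it, with no OS involvement."""
        tid = self.tid_pool.popleft()           # assign an available TID
        self.thread_control_memory[tid] = ThreadContext(initial_pc, tuple(args))
        self.execution_queue.append(tid)        # automatically queued while valid
        return tid

    def step(self):
        """Periodically select a TID and execute one instruction of its thread."""
        if not self.execution_queue:
            return
        tid = self.execution_queue.popleft()
        ctx = self.thread_control_memory[tid]   # TID indexes the thread control memory
        done = self._execute_one_instruction(ctx)
        if done:
            del self.thread_control_memory[tid]
            self.tid_pool.append(tid)           # TID returns to the pool on completion
        elif ctx.state == "valid":
            self.execution_queue.append(tid)    # requeue for its next instruction

    def _execute_one_instruction(self, ctx):
        ctx.program_count += 4                  # stand-in for the real datapath
        return ctx.program_count >= 16          # pretend the thread ends at PC 16

core = SelfSchedulingCore()
core.receive_work_descriptor(initial_pc=0, args=(1, 2))
while core.thread_control_memory:
    core.step()
print(core.tid_pool)  # deque([1, 2, 3, 0]): the TID was recycled
```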

For any of the various representative embodiments, the method may further comprise: receiving an event data packet; and decoding the received event data packet into an event identifier and any received arguments.

For any of the various representative embodiments, the method may further comprise: an initial valid state is assigned to the execution thread.

For any of the various representative embodiments, the method may further comprise: a suspend state is assigned to the execution thread in response to executing a memory load instruction. For any of the various representative embodiments, the method may further comprise: a suspend state is assigned to the execution thread in response to executing a memory store instruction.

For any of the various representative embodiments, the method may further comprise: in response to executing the return instruction, execution of the selected thread is terminated. For any of the various representative embodiments, the method may further comprise: in response to executing a return instruction, returning a corresponding thread identifier for the selected thread to the thread identifier pool. For any of the various representative embodiments, the method may further comprise: in response to executing a return instruction, clearing registers of a thread control memory indexed by the corresponding thread identifier of the selected thread. For any of the various representative embodiments, the method may further comprise: in response to executing the return instruction, a return work descriptor packet is generated.

For any of the various representative embodiments, the method may further comprise: a point-to-point event data message is generated. For any of the various representative embodiments, the method may further comprise: a broadcast event data message is generated.

For any of the various representative embodiments, the method may further comprise: the received event data packet is responded to with an event mask. For any of the various representative embodiments, the method may further comprise: an event number corresponding to the received event data packet is determined. For any of the various representative embodiments, the method may further comprise: the state of the thread identifier is changed from suspended to valid in response to the received event data packet, to resume execution of the corresponding execution thread. For any of the various representative embodiments, the method may further comprise: the state of the thread identifier is changed from suspended to valid in response to the event number of the received event data packet, to resume execution of the corresponding execution thread.

For any of the various representative embodiments, the method may further comprise: successively selecting a next thread identifier from the execution queue to execute a single instruction of the corresponding execution thread. For any of the various representative embodiments, the method may further comprise: performing a round-robin selection of a next thread identifier from the execution queue among the plurality of thread identifiers, each for executing a single instruction of a corresponding execution thread. For any of the various representative embodiments, the method may further comprise: performing a round-robin selection of a next thread identifier from the execution queue among the plurality of thread identifiers, each for executing a single instruction of a corresponding execution thread, until the execution thread is completed. For any of the various representative embodiments, the method may further comprise: performing a barrel selection of a next thread identifier from the execution queue among the plurality of thread identifiers, each for executing a single instruction of a corresponding execution thread.
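The selection variants recited above all amount to a barrel scheduler: each selection takes the next thread identifier from the execution queue and executes exactly one instruction of the corresponding execution thread before moving on. A minimal illustrative sketch, with invented names, follows.

```python
# Illustrative barrel ("round-robin") selection: each ready thread identifier
# executes exactly one instruction per turn, so a long-latency operation in one
# thread never monopolizes the pipeline. All names here are hypothetical.
from collections import deque

def barrel_schedule(execution_queue: deque, execute_one, is_done):
    """Pop the next TID, run a single instruction, requeue unless the thread completed."""
    while execution_queue:
        tid = execution_queue.popleft()
        execute_one(tid)                 # one instruction per selection
        if not is_done(tid):
            execution_queue.append(tid)  # back of the queue: fair rotation

# Example: three threads, each needing a different number of instructions.
remaining = {0: 2, 1: 3, 2: 1}
trace = []
def execute_one(tid):
    remaining[tid] -= 1
    trace.append(tid)

barrel_schedule(deque([0, 1, 2]), execute_one, lambda tid: remaining[tid] == 0)
print(trace)  # [0, 1, 2, 0, 1, 1]: single instructions interleaved across threads
```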

For any of the various representative embodiments, the method may further comprise: a valid state or a suspended state is assigned to the thread identifier. For any of the various representative embodiments, the method may further comprise: a priority status is assigned to the thread identifier.

For any of the various representative embodiments, the method may further comprise: after executing a corresponding instruction, returning the corresponding thread identifier with an assigned valid state and an assigned priority to the execution queue.

For any of the various representative embodiments, the method may further comprise: selecting a thread identifier from a first priority queue at a first frequency and selecting a thread identifier from a second priority queue at a second frequency, the second frequency being lower than the first frequency. For any of the various representative embodiments, the method may further comprise: determining the second frequency as a skip count of selections of thread identifiers from the first priority queue.
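As one possible reading of the skip-count mechanism just recited, the second (lower-frequency) priority queue is served only after a fixed number of selections from the first priority queue. The sketch below models that behavior; the queue structure and threshold are assumptions for illustration, not the actual selection circuit.

```python
from collections import deque

class PriorityThreadSelector:
    def __init__(self, skip_count):
        self.high = deque()      # first (higher-frequency) priority queue
        self.low = deque()       # second (lower-frequency) priority queue
        self.skip_count = skip_count
        self._since_low = 0

    def select(self):
        # Serve the low-priority queue after skip_count high-priority picks.
        if self.low and (self._since_low >= self.skip_count or not self.high):
            self._since_low = 0
            return self.low.popleft()
        if self.high:
            self._since_low += 1
            return self.high.popleft()
        return None

sel = PriorityThreadSelector(skip_count=3)
sel.high.extend("ABC"); sel.low.extend("xy")
picks = []
for _ in range(5):
    tid = sel.select()
    picks.append(tid)
    # In hardware the TID is requeued after its instruction completes; here we
    # requeue immediately to show the steady-state selection pattern.
    (sel.high if tid in "ABC" else sel.low).append(tid)
print("".join(picks))  # "ABCxA": one low-priority pick per three high-priority picks
```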

For any of the various representative embodiments, the method may further comprise: controlling the data path access size. For any of the various representative embodiments, the method may further comprise: the memory load access size is increased or decreased in response to a time-averaged usage level. For any of the various representative embodiments, the method may further comprise: the memory store access size is increased or decreased in response to a time-averaged usage level. For any of the various representative embodiments, the method may further comprise: the size of a memory load access request is increased to correspond to a cache line boundary of the data cache.
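The access-size control recited above can be pictured as a simple feedback rule: widen the memory load or store access size when the time-averaged usage of fetched data is high, narrow it when usage is low, and round a load request up to the data cache line boundary. The following sketch illustrates one such rule; the thresholds, the size range, and the 64-byte cache line are assumptions.

```python
CACHE_LINE = 64  # bytes, assumed for illustration

def next_access_size(current_size, time_averaged_usage,
                     grow_threshold=0.75, shrink_threshold=0.25,
                     min_size=8, max_size=CACHE_LINE):
    """Increase or decrease the memory access size based on time-averaged usage."""
    if time_averaged_usage > grow_threshold and current_size < max_size:
        return current_size * 2
    if time_averaged_usage < shrink_threshold and current_size > min_size:
        return current_size // 2
    return current_size

def widen_to_cache_line(address, size):
    """Grow a load request so its end lands on a data cache line boundary."""
    end = address + size
    aligned_end = ((end + CACHE_LINE - 1) // CACHE_LINE) * CACHE_LINE
    return aligned_end - address

print(next_access_size(16, 0.9))        # 32: usage is high, so the access widens
print(widen_to_cache_line(0x1030, 8))   # 16: the request grows to the line boundary
```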

For any of the various representative embodiments, the method may further comprise: one or more system calls are generated to a host processor. For any of the various representative embodiments, the method may further comprise: the number of system calls within any predetermined time period is modulated using a predetermined credit count.
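A credit count, as recited above, bounds how many system calls can be issued to the host processor within any time window: each call consumes a credit, and credits are replenished at a fixed rate. A minimal illustrative model, with assumed parameter values, follows.

```python
# Hypothetical credit-count throttle for system calls to the host processor.
class SystemCallThrottle:
    def __init__(self, max_credits=8, refill_per_tick=1):
        self.credits = max_credits
        self.max_credits = max_credits
        self.refill_per_tick = refill_per_tick

    def tick(self):
        """Called once per time unit to replenish credits."""
        self.credits = min(self.max_credits, self.credits + self.refill_per_tick)

    def try_system_call(self):
        """Issue a system call to the host only when a credit is available."""
        if self.credits > 0:
            self.credits -= 1
            return True
        return False  # the call is deferred, not lost; the caller retries later

throttle = SystemCallThrottle(max_credits=2, refill_per_tick=1)
print([throttle.try_system_call() for _ in range(3)])  # [True, True, False]
throttle.tick()
print(throttle.try_system_call())                      # True: a credit returned
```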

For any of the various representative embodiments, the method may further comprise: in response to a request from the host processor, all data from the thread control memory corresponding to the selected thread identifier is copied and transferred to monitor thread status.

For any of the various representative embodiments, the method may further comprise: a fiber create instruction is executed to generate one or more commands which generate one or more call work descriptor packets destined for another processor core or a hybrid thread fabric circuit. For any of the various representative embodiments, the method may further comprise: in response to executing the fiber create instruction, a predetermined amount of memory space is reserved to store any return arguments. For any of the various representative embodiments, the method may further comprise: storing a thread return count in a thread return register in response to generating the one or more call work descriptor packets. For any of the various representative embodiments, the method may further comprise: in response to receiving a return data packet, the thread return count stored in the thread return register is decremented. For any of the various representative embodiments, the method may further comprise: in response to the thread return count in the thread return register decrementing to zero, the suspended state of the corresponding thread identifier is changed to a valid state for subsequent execution of a thread return instruction to complete the created fiber or thread.
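The fiber create and thread return count mechanism just recited can be modeled as a join counter: space is reserved for return arguments, a thread return register holds the number of outstanding call work descriptor packets, each received return packet decrements it, and the suspended parent thread becomes valid again when the count reaches zero. The sketch below is a behavioral approximation with invented names.

```python
class ParentThread:
    def __init__(self, fiber_count):
        self.return_space = [None] * fiber_count   # reserved for return arguments
        self.thread_return_register = fiber_count  # outstanding call packets
        self.state = "suspended"                   # waits for all fibers to return

    def on_return_packet(self, fiber_index, return_args):
        self.return_space[fiber_index] = return_args
        self.thread_return_register -= 1           # one fewer outstanding return
        if self.thread_return_register == 0:
            self.state = "valid"                   # resume to execute the join

parent = ParentThread(fiber_count=2)
parent.on_return_packet(0, (42,))
print(parent.state)   # 'suspended': one return is still outstanding
parent.on_return_packet(1, (7,))
print(parent.state)   # 'valid': all fibers have returned; the join can complete
```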

For any of the various representative embodiments, the method may further comprise: a waiting or non-waiting fiber join instruction is executed. For any of the various representative embodiments, the method may further comprise: a fiber join all instruction is executed.

For any of the various representative embodiments, the method may further comprise: a non-cache read or load instruction is executed to specify a general purpose register for storing data received from memory.

For any of the various representative embodiments, the method may further comprise: a non-cache write or store instruction is executed to specify data in the general purpose registers for storage in memory.

For any of the various representative embodiments, the method may further comprise: the transaction identifier is assigned to any load or store request to memory and is correlated with the thread identifier.

For any of the various representative embodiments, the method may further comprise: a first thread priority instruction is executed to assign a first priority to an execution thread having a corresponding thread identifier. For any of the various representative embodiments, the method may further comprise: a second thread priority instruction is executed to assign a second priority to the execution thread having the corresponding thread identifier.

For any of the various representative embodiments, the method may further comprise: a custom atomic return instruction is executed to complete an execution thread of a custom atomic operation.

For any of the various representative embodiments, the method may further comprise: a floating point atomic memory operation is performed.

For any of the various representative embodiments, the method may further comprise: a custom atomic memory operation is performed.

Many other advantages and features of the invention will become apparent from the following detailed description of the invention and the examples thereof, the claims, and the accompanying drawings.

Drawings

The objects, features and advantages of the present invention will be more readily understood by reference to the following disclosure when considered in connection with the accompanying drawings in which like reference numerals are used to designate like components in the various figures and in which reference numerals with alphabetic characters are used to designate additional types, examples or variations of selected component embodiments in the various figures, wherein:

FIG. 1 is a block diagram of a representative first embodiment of a hybrid computing system.

FIG. 2 is a block diagram of a representative second embodiment of a hybrid computing system.

FIG. 3 is a block diagram of a representative third embodiment of a hybrid computing system.

FIG. 4 is a block diagram of a representative embodiment of a hybrid thread fabric having configurable computing circuitry coupled to a first interconnection network.

FIG. 5 is a high-level block diagram of a portion of a representative embodiment of a hybrid thread fabric circuit group.

Fig. 6 is a high-level block diagram of a second interconnect network within a hybrid thread fabric circuit group.

FIG. 7 is a detailed block diagram of a representative embodiment of a hybrid thread fabric circuit group.

FIG. 8 is a detailed block diagram of a representative embodiment of a hybrid thread fabric configurable computing circuit (tile).

Fig. 9A and 9B (collectively fig. 9) are detailed block diagrams of representative embodiments of hybrid thread fabric configurable computing circuits (tiles).

FIG. 10 is a detailed block diagram of a representative embodiment of a memory control circuit of a hybrid thread fabric configurable computing circuit (tile).

FIG. 11 is a detailed block diagram of a representative embodiment of thread control circuitry of a hybrid thread fabric configurable computing circuit (tile).

FIG. 12 is a diagram of representative hybrid thread fabric configurable computing circuits (tiles) forming synchronous domains and representative asynchronous packet network messaging.

FIG. 13 is a block diagram of a representative embodiment of a memory interface.

FIG. 14 is a block diagram of a representative embodiment of a scheduling interface.

Fig. 15 is a block diagram of a representative embodiment of an optional first network interface.

FIG. 16 is a diagram of representative hybrid thread fabric configurable compute circuits (tiles) forming synchronous domains and representative asynchronous packet network messaging for a group of hybrid thread fabric circuits to perform computations.

17A, 17B, and 17C (collectively FIG. 17) are flow diagrams of representative asynchronous packet network messaging and execution by a hybrid thread fabric configurable compute circuit (tile) for a hybrid thread fabric group to perform the computation of FIG. 16.

FIG. 18 is a diagram of representative hybrid thread fabric configurable compute circuits (tiles) forming synchronous domains and representative asynchronous packet network messaging for a group of hybrid thread fabric circuits to perform computations.

Fig. 19A and 19B (collectively fig. 19) are flow diagrams of representative asynchronous packet network messaging and execution by a hybrid thread fabric configurable compute circuit (tile) for a group of hybrid thread fabric circuits to perform the computation of fig. 18.

FIG. 20 is a diagram of representative hybrid thread fabric configurable compute circuits (tiles) forming synchronous domains and representative asynchronous packet network messaging for a group of hybrid thread fabric circuits to perform a compute loop.

Fig. 21 is a flow diagram of representative asynchronous packet network messaging and execution by a hybrid thread fabric configurable compute circuit (tile) for a group of hybrid thread fabric circuits to perform a loop in the computation of fig. 20.

FIG. 22 is a block diagram of a representative flow control circuit.

FIG. 23 is a diagram of representative hybrid thread fabric configurable compute circuits (tiles) forming synchronous domains and representative asynchronous packet network messaging and synchronous messaging for a group of hybrid thread fabric circuits to perform a compute loop.

FIG. 24 is a circuit block diagram of a representative embodiment of conditional branch circuitry.

Fig. 25 is a high-level block diagram of a representative embodiment of a hybrid-thread processor 300.

FIG. 26 is a detailed block diagram of a representative embodiment of a thread memory of a hybrid thread processor.

FIG. 27 is a detailed block diagram of a representative embodiment of a network response memory of a hybrid thread processor.

FIG. 28 is a detailed block diagram of a representative embodiment of a hybrid thread processor.

Fig. 29A and 29B (collectively fig. 29) are a flow diagram of a representative embodiment of a method for self-scheduling and thread control of a hybrid thread processor.

FIG. 30 is a detailed block diagram of a representative embodiment of thread selection control circuitry of the control logic and thread selection circuitry of a hybrid thread processor.

FIG. 31 is a block diagram of a representative embodiment and a representative data packet of a portion of a first interconnection network.

FIG. 32 is a detailed block diagram of a representative embodiment of data path control circuitry of a hybrid thread processor.

FIG. 33 is a detailed block diagram of representative embodiments of system call circuitry and host interface circuitry of a hybrid thread processor.

Fig. 34 is a block diagram of a representative first embodiment of a first interconnection network.

Fig. 35 is a block diagram of a representative second embodiment of the first interconnection network.

Fig. 36 is a block diagram of a representative third embodiment of the first interconnection network.

FIG. 37 illustrates a representative virtual address space format supported by the system architecture.

Fig. 38 shows a representative conversion process for each virtual address format.

FIG. 39 illustrates a representative send call example for a hybrid thread.

FIG. 40 illustrates a representative send fork example for a hybrid thread.

FIG. 41 illustrates a representative send transfer example for a hybrid thread.

FIG. 42 illustrates a representative call chain use case for a hybrid thread.

Detailed Description

While this invention is susceptible of embodiment in many different forms, there is shown in the drawings and will herein be described in detail specific exemplary embodiments of the invention, with the understanding that the present disclosure is to be considered as an exemplification of the principles of the invention and is not intended to limit the invention to the specific embodiments illustrated. In this respect, before explaining at least one embodiment in detail, it is to be understood that the invention is not limited in its application to the details of construction and to the arrangements of the components set forth in the following description or illustrated in the drawings, or otherwise described by way of example. Methods and apparatus in accordance with the present invention are capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein, as well as the abstract included below, are for the purpose of description and should not be regarded as limiting.

I. Hybrid computing system 100 and interconnection network:

Fig. 1, 2, and 3 are block diagrams of representative first, second, and third embodiments of hybrid computing systems 100A, 100B, 100C (collectively referred to as system 100). FIG. 4 is a block diagram of a representative embodiment of a hybrid thread fabric ("HTF") 200 having configurable computing circuitry coupled to a first interconnection network 150 (also referred to simply as a "NOC", for "network on a chip"). Fig. 5 is a high-level block diagram of a portion of a representative embodiment of a hybrid thread fabric circuit group 205 having a second interconnection network 250. Fig. 6 is a high-level block diagram of the second interconnection network within the hybrid thread fabric group 205. FIG. 7 is a detailed block diagram of a representative embodiment of a hybrid thread fabric (HTF) group 205.

FIG. 8 is a high-level block diagram of a representative embodiment of a hybrid thread-fabric configurable computing circuit 210, referred to as a "tile" 210. FIG. 9 is a detailed block diagram of a representative embodiment of a hybrid thread fabric configurable computing circuit 210A, referred to as a "tile" 210A, as a particular representative example of a tile 210. Unless specifically referred to as tile 210A, references to tile 210 shall refer individually and collectively to tile 210 and tile 210A. The hybrid thread fabric configurable computing circuits 210 are referred to as "tiles" 210 because, in representative embodiments, all such hybrid thread fabric configurable computing circuits 210 are identical to one another and may be arranged and connected in any order, i.e., each hybrid thread fabric configurable computing circuit 210 may be "tiled" to form a hybrid thread fabric group 205.

Referring to fig. 1-9, the hybrid computing system 100 includes a hybrid thread processor ("HTP") 300, discussed in more detail below with reference to fig. 25-33, coupled to one or more hybrid thread fabric ("HTF") circuits 200 through a first interconnection network 150. It should be understood that, as used herein, the term "fabric" refers to and encompasses an array of computing circuits, which in this case are reconfigurable computing circuits. Fig. 1, 2, and 3 illustrate different arrangements of systems 100A, 100B, and 100C, including additional components forming comparatively larger and smaller systems 100, all within the scope of the present disclosure. As shown in fig. 1 and 2, each of which may be an arrangement suitable for a system on a chip ("SOC"), for example and without limitation, the hybrid computing systems 100A, 100B may also optionally include, in the various combinations shown, a memory controller 120, which memory controller 120 may be coupled to a memory 125 (which may also be a separate integrated circuit), any of various communication interfaces 130 (e.g., a PCIe communication interface), one or more host processors 110, and a host interface ("HIF") 115. As shown in fig. 3, which may be an arrangement suitable for a "chiplet" configuration on a common substrate 101, for example and without limitation, the hybrid computing system 100C may also optionally include a communication interface 130, with or without these other components. These arrangements are all within the scope of the present disclosure and are collectively referred to herein as system 100. Any of these hybrid computing systems 100 may also be considered a "node", operating under a single operating system ("OS"), and may also be coupled to other such local and remote nodes.

Each node of system 100 runs a separate operating system (OS) instance, which controls the resources of the associated node. An application that spans multiple nodes is executed by coordinating the multiple OS instances of the spanned nodes. The processes associated with the application running on each node have an address space that provides access to node private memory and to global shared memory distributed across the nodes. Each OS instance contains drivers that manage local node resources. The shared address space of the application is managed in common by the set of drivers running on the nodes. Each shared address space is assigned a global space ID (GSID). The number of global spaces active at any given time is expected to be relatively small; the GSID is set to 8 bits wide.

As used herein, hybrid threading refers to the ability to spawn multiple computing fibers and threads across different heterogeneous types of processing circuits (hardware), e.g., across the HTF circuit 200 (as a reconfigurable computing fabric) and across processors, such as the HTP300 or another type of RISC-V processor. Hybrid threading also refers to a programming language/style in which a worker thread transitions from one compute element to the next to move the computation to the location where the data resides, such transitions also being implemented in representative embodiments. The host processor 110 is typically a multi-core processor, which may be embedded within the hybrid computing system 100, or an external host processor coupled to the hybrid computing system 100 through a communication interface 130 (e.g., a PCIe-based interface). These processors, such as the HTP300, and the one or more host processors 110 are described in more detail below.

Memory controller 120 may be implemented as is known, or become known in the electronic arts. Alternatively, in a representative embodiment, the memory controller 120 may be implemented as described in the related application. The first memory 125 may also be implemented as known, or become known in the electronic arts, as described in more detail below.

Also in the representative embodiment, the HTP300 is a RISC-V ISA based multithreaded processor having one or more processor cores 705 with an extended instruction set, along with one or more core control circuits 710 and one or more second memories 715, referred to as core control (or thread control) memories 715, as discussed in more detail below. In general, the HTP300 provides barrel-style round-robin instantaneous thread switching to maintain a high instruction-per-clock rate.

For purposes herein, the HIF 115 enables the host processor 110 to send work to the HTP300 and HTF circuit 200, and the HTP300 to send work to the HTF circuit 200, in each case as a "work descriptor packet" transmitted over the first interconnection network 150. A general mechanism is provided to start and end work on the HTP300 and HTF circuit 200: work on the HTP300 and HTF circuit 200 is started with a "call" work descriptor packet, and work on the HTP300 and HTF circuit 200 is ended with a "return" work descriptor packet. The HIF 115 contains scheduling circuitry and queues (referred to simply as "dispatch queues" 105), which also provide management functionality for monitoring the load provided to the HTF circuits 200 and/or HTPs 300 and the availability of HTF circuit 200 and/or HTP300 resources. When resources are available on the HTF circuits 200 and/or HTPs 300, the dispatch queue 105 determines the least loaded HTF circuit 200 and/or HTP300 resources. In the event that multiple HTF circuit groups 205 have the same or similar loading, it selects the HTF circuit group 205 that is currently executing the same core, if possible (to avoid loading or reloading the core configuration). Similar HIF 115 functionality may also be included in the HTP300, particularly for arrangements of system 100 that may not include a separate HIF 115. Other functions of the HIF 115 are described in more detail below. The HIF 115 may be implemented as is known or becomes known in the electronic arts, e.g., as one or more state machines with registers (forming FIFOs, queues, etc.).

The first interconnection network 150 is a packet-based communication network that provides packet routing between the HTF circuit 200, the mixed-thread processor 300, and other optional components (e.g., the memory controller 120, the communication interface 130, and the host processor 110). For purposes of this disclosure, the first interconnection network 150 forms part of an asynchronous switch fabric ("AF"), which means that data packets may be routed along any of a variety of paths such that arrival of any selected data packet to an addressed destination may occur at any of a number of different times depending on the route. This is in contrast to the synchronous mesh communication network 275 of the second interconnection network 250, which is discussed in more detail below.

Fig. 31 is a diagram of a representative embodiment of a portion of the first interconnection network 150 and representative data packets. In the representative embodiment, the first interconnection network 150 comprises a network bus structure 152 (a plurality of wires or lines), with a first plurality of network lines 154 dedicated to addressing (or routing) data packets 158 and to setting up data paths through the various crossbar switches, and a remaining second plurality of network lines 156 dedicated to transmitting, over the path established by the addressing lines (the first plurality of network lines 154), the data packets containing operand data, arguments, results, etc. (the data load, shown as a series of "N" data packets 162₁ through 162ₙ). Two such network bus structures 152 are typically provided as channels to and from each computing resource, a first channel for receiving data and a second channel for transmitting data. A single first addressing (or routing) data packet (shown as addressing (or routing) data packet 158₁) may be used to establish a route to a first designated destination, and may be followed (generally after a number of clock cycles to allow the switches to be set) by one or more data packets 162 to be transmitted to the first designated destination, up to a predetermined number of data packets 162 (e.g., up to N data packets). While the predetermined number of data packets 162 are being routed, another, second addressing (or routing) data packet (shown as addressing (or routing) data packet 158₂) may be transmitted and utilized to establish a route to a second designated destination for one or more subsequent data packets 162 destined for that second designated destination (shown as data packet 162ₙ₊₁).
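One way to picture the separation of the addressing lines 154 from the data lines 156 is as two interleaved streams, where each routing packet 158 establishes the path for up to N following data packets 162, and the next routing packet can be sent while the previous data packets are still in flight. The Python sketch below is a toy serialization under that reading; the classes and the value of N are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Iterator

N = 4  # assumed maximum number of data packets per established route

@dataclass
class RoutingPacket:
    destination: int   # sets the crossbar switches toward this endpoint

@dataclass
class DataPacket:
    payload: bytes     # operand data, arguments, results, etc.

def frame_transfers(transfers) -> Iterator[object]:
    """Emit a routing packet, then up to N data packets, per destination.

    A transfer larger than N packets re-sends its routing packet, mirroring
    the 'up to a predetermined number of data packets' rule in the text.
    """
    for destination, chunks in transfers:
        for i, chunk in enumerate(chunks):
            if i % N == 0:
                yield RoutingPacket(destination)
            yield DataPacket(chunk)

for pkt in frame_transfers([(7, [b"a", b"b"]), (3, [b"c"])]):
    print(pkt)
# RoutingPacket(destination=7), DataPacket(a), DataPacket(b),
# RoutingPacket(destination=3), DataPacket(c)
```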

Fig. 34-36 are block diagrams of representative first, second, and third embodiments of the first interconnection network 150, showing, by way of example, various topologies of the first interconnection network 150, such as first interconnection networks 150A, 150B, 150C (collectively referred to herein as the first interconnection network 150). The first interconnection network 150 is generally embodied as a plurality of crossbars 905, 910 having a folded Clos configuration, shown as central (or center) crossbars 905 coupled through queues 925 to peripheral (or edge) crossbars 910, where the peripheral crossbars 910 are in turn coupled (also through queues 925) to a mesh network 920, the mesh network 920 providing, for example, a plurality of additional direct connections 915 between chiplets, e.g., up, down, left, right, depending on the embodiment of the system 100. Many network topologies are available and are within the scope of the present disclosure, such as shown in fig. 35 and 36, where the first interconnection networks 150B, 150C further comprise endpoint crossbars 930.

Routing through any of the various first interconnection networks 150 involves load balancing, such that packets moving from the peripheral (or edge) crossbars 910 toward the central (or center) crossbars 905 may be routed through any available central (or center) crossbar 905, and packets moving from the endpoint crossbars 930 toward the peripheral (or edge) crossbars 910 may be routed through any available peripheral (or edge) crossbar 910, for example with round-robin or randomly distributed routing to any available switch 905, 910. For routing from the central (or center) crossbars 905 through the peripheral (or edge) crossbars 910 and/or the endpoint crossbars 930 to any endpoint (or destination), which may be an HTF200 or HTP300, the memory controller 120, the host processor 110, etc., an identifier or address (e.g., virtual) of the endpoint (or destination) is utilized, typically an address or identifier having five fields: (a) a first (or horizontal) identifier; (b) a second (or vertical) identifier; (c) a third, edge identifier; (d) a fourth, group identifier; and (e) a fifth, endpoint identifier. The first (or horizontal) identifier and the second (or vertical) identifier are used for routing to the correct destination center, the edge identifier is used for routing to the edge (of the four available edges) of the selected chip or chiplet, the group identifier is used for routing to the selected communication interface that may be at the selected edge, and the endpoint identifier is used for any additional routing, such as through an endpoint crossbar 930 or the mesh network 920. To save power, any of the various central (or center) crossbars 905, peripheral (or edge) crossbars 910, and endpoint crossbars 930 may be power-gated or clock-gated, turning off various switches when routing requirements and required capacity are lower, and turning on various switches when routing requirements and required capacity are higher. Additional aspects of the first interconnection network 150 are discussed in more detail below with reference to fig. 30.
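The five-field endpoint identifier described above can be illustrated as a packed bit field consumed outermost-first by successive routing stages. In the sketch below, the field widths are pure assumptions (the text does not specify them); only the five-field structure comes from the description.

```python
# (name, assumed bit width): horizontal, vertical, edge, group, endpoint
FIELDS = [("horizontal", 4), ("vertical", 4), ("edge", 2), ("group", 3), ("endpoint", 4)]

def pack_endpoint_address(**values):
    """Pack the five routing fields into a single integer identifier."""
    addr = 0
    for name, width in FIELDS:
        v = values[name]
        assert 0 <= v < (1 << width), f"{name} out of range"
        addr = (addr << width) | v
    return addr

def unpack_endpoint_address(addr):
    """Recover the routing fields; each routing stage strips the field it needs."""
    out = {}
    for name, width in reversed(FIELDS):
        out[name] = addr & ((1 << width) - 1)
        addr >>= width
    return out

a = pack_endpoint_address(horizontal=3, vertical=5, edge=1, group=2, endpoint=9)
print(unpack_endpoint_address(a))
# {'endpoint': 9, 'group': 2, 'edge': 1, 'vertical': 5, 'horizontal': 3}
```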

The packets of the first interconnection network 150 consist of a fixed generic packet header plus a variable-sized packet payload. A single packet header is required for each packet and is used to route the packet from a source component to a destination component within system 100. The size of the payload may vary depending on the type of request or response packet. Table 1 shows the information contained in the generic header of a packet of the first interconnection network 150, Table 2 shows the information contained in a read request packet of the first interconnection network 150, and Table 3 shows the information contained in a read response packet of the first interconnection network 150 for a 16B read having an 8B flit size.

Table 1: Generic packet header for the first interconnection network 150

Table 2: Read request packet for the first interconnection network 150

Table 3: 16B read response packet (with 8B flits) for the first interconnection network 150


The HTF circuit 200, in turn, typically includes a plurality of HTF circuit groups 205, wherein each HTF circuit group 205 is coupled to the first interconnection network 150 for data packet communication. Each HTF circuit group 205 may operate independently of each of the other HTF circuit groups 205. Each HTF circuit group 205, in turn, includes a plurality of arrays of HTF reconfigurable computing circuits 210, referred to herein equivalently as "tiles" 210, and a second interconnection network 250. Tiles 210 are embedded or otherwise coupled to a second interconnection network 250, second interconnection network 250 comprising two different types of networks discussed in more detail below. In the representative embodiment, each HTF circuit group 205 further includes a memory interface 215, an optional first network interface 220 (which provides an interface for coupling to the first interconnection network 150), and an HTF scheduling interface 225. Each of the memory interface 215, the HTF scheduling interface 225, and the optional first network interface 220 may be implemented using any suitable circuitry, such as one or more state machine circuits, to perform the functionality described in more detail below.

The HTP300 is a barrel multithreaded processor designed to perform well on applications with a high degree of parallelism operating on sparse data sets (i.e., applications with minimal data reuse). The HTP300 is based on the open source RISC-V processor and executes in user mode. The HTP300 supports most RISC-V user mode instructions, plus a set of custom instructions that allow threads to be managed and to send and receive events to/from other HTPs 300, the HTF circuit 200, and the one or more host processors 110, as well as instructions for efficiently accessing memory 125.

Sparse data sets typically produce poor cache hit rates. The HTP300, with many threads per HTP processor core 705, allows some threads to wait for a response from the memory 125 while other threads continue executing instructions. This computational style tolerates memory 125 latency and sustains a high rate of executed instructions per clock. The event mechanism allows threads on many HTP cores 705 to communicate in an efficient manner. A thread suspends instruction execution while waiting for a response or event message from the memory 125, allowing other threads to use the instruction execution resources. The HTP300 is self-scheduling and event-driven, allowing threads to efficiently create, destroy, and communicate with other threads. The HTP300 is discussed in more detail below with reference to fig. 25-33.

II. Hybrid threading:

The hybrid threading of system 100 allows computing tasks to transition from the host processor 110 to the HTP300 and/or HTF200 on one node, and then to an HTP300 or HTF200 on a different node, if any. Throughout this entire sequence of transitioning work from one computing element to another, all aspects are handled completely in user space. Additionally, the transition of a computing task from an HTP300 to another HTP300 or HTF200 may be made by executing a single HTP300 instruction and without reference to the memory 125. This extremely lightweight thread management mechanism allows an application to quickly create a large number of threads to handle a parallelizable core of the application and then rejoin when the core completes. The computing elements of the HTP300 and HTF200 handle computing tasks in very different ways (RISC-V instruction execution versus dataflow), but they both support the hybrid threading approach and can interact seamlessly on behalf of the application.

A work descriptor packet is used to start work on the HTP300 and HTF circuit 200. The receipt of a work descriptor packet by the HTP300 and/or HTF200 constitutes an "event" in the HTP300 and/or HTF200, which will trigger hardware-based self-scheduling and subsequent execution of the associated function or work (referred to as a thread of execution) without requiring further access to main memory 125. Once a thread starts, it executes instructions until either a thread return instruction is executed (by the HTP300) or a return message is generated (by the HTF200). The thread return instruction sends a return work descriptor packet to the original caller.

For purposes of this disclosure, at a high or general level, a work descriptor packet includes (for example and without limitation): (1) the information required to route the work descriptor packet to its destination; (2) thread context initialization information for the HTP300 and/or HTF circuit 200, such as a program count (e.g., as a 64-bit address) at which a stored instruction (stored in the instruction cache 740 of fig. 28 or the first instruction RAM315 of fig. 9) begins thread execution; (3) any arguments, or addresses in the first memory 125 from which to obtain arguments or other information, to be used in thread execution; and (4) a return address for transmitting the computation results. Although a "work descriptor packet" is referred to in the singular, such a work descriptor packet is actually divided into a plurality of packets for transmission over the first interconnection network 150, namely an addressing (or routing) data packet 158 and one or more data packets 162. There may be many different kinds of work descriptor packets depending on the operation or instruction to be performed, with many examples shown and discussed below. The instruction cache 740 and/or first instruction RAM315 may have been populated prior to any execution, such as during an initial system 100 configuration. Generally, for the HTF circuit 200, a work descriptor call packet will also have similar information, such as addressing, payload (e.g., configuration, argument values, etc.), a call identifier (ID), and return information (e.g., for providing results to the endpoint), among other information, as discussed in more detail below.

At a high or general level, and as discussed in more detail below, the host processor 110 or HTP300 may initiate a thread on another HTP300 or HTF200 by sending a call work descriptor packet. The call information contains the destination node, the entry instruction address of the call, and up to four 64-bit argument values. Each HTP300 is initialized with a pool of stack and context structures. These structures reside in user space. When the HTP300 receives a call, it selects a stack and context structure from the free pool. Next, the HTP300 initializes the new thread with the call information and the stack structure address. At this point, the initialized thread is in the active thread queue to begin execution. The step of initiating a thread on the HTP300 may be implemented as a hardware state machine (as opposed to executing instructions) to maximize thread creation throughput. There is a similar hardware-based method for initiating work on the HTF200, also discussed below.
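The hardware call-initiation flow just described, selecting a stack and context structure from a free pool, initializing the new thread with the call information, and placing it on the active thread queue, might be modeled behaviorally as follows. Names and structures are illustrative assumptions, not the actual state machine.

```python
from collections import deque

class HtpCallHandler:
    def __init__(self, num_contexts=8):
        self.free_pool = deque(range(num_contexts))  # stack/context structure IDs
        self.active_threads = deque()

    def on_call_packet(self, entry_address, args, return_info):
        if not self.free_pool:
            raise RuntimeError("no free stack/context structure")
        ctx_id = self.free_pool.popleft()            # select from the free pool
        thread = {
            "context": ctx_id,
            "pc": entry_address,                     # entry instruction address of the call
            "args": args[:4],                        # up to four 64-bit arguments
            "return_info": return_info,              # source HTP + calling thread ID
        }
        self.active_threads.append(thread)           # ready to begin execution
        return thread

    def on_send_return(self, thread, return_args):
        """Send-return frees the structures and returns up to four values."""
        self.free_pool.append(thread["context"])
        return {"to": thread["return_info"], "args": return_args[:4]}

handler = HtpCallHandler()
t = handler.on_call_packet(0x4000, args=[1, 2], return_info=("caller HTP", 7))
print(handler.on_send_return(t, [99]))  # routed back to the caller's thread
```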

Once a thread is in the active thread queue on an HTP300, it is selected to execute instructions. Eventually, the thread will complete its computational task. At that point, the HTP300 will send a return message back to the calling processor by executing a single custom RISC-V send-return instruction. Sending a return is similar to sending a call. The instruction frees the stack and context structures and sends up to four 64-bit parameters back to the calling processor. The calling HTP300 executes a receive-return custom RISC-V instruction to receive the return. The calling HTP300 copies the return arguments into ISA-visible registers for access by the executing thread. The original send call contains the information needed for the called HTP300 to know where to send its return. That information consists of the source HTP300 and the thread ID of the calling thread.

The HTP300 has three options for sending a work task to another HTP300 or HTF200 computing element. These options, call, fork, and transfer, are shown in fig. 39 to 41:

(a) The call (901) initiates a computing task on the remote HTP300 or HTF200 and suspends further instruction execution until a return (902) is received. The return information passed to the remote computing element is used by the remote computing task when it has completed and is ready to return.

(b) The fork (903) initiates a computational task on the remote HTP300 or HTF200 and continues executing instructions. A single thread may initiate many computing tasks on a remote HTP300 or HTF200 computing element using a send fork mechanism. The original thread must wait until a return has been received from each of the forked threads before sending its return (902). The return information passed to the remote computing element is used by the remote computing task when it has completed and is ready to return.

(c) The transfer (904) initiates a computing task on the remote HTP300 or HTF200 and terminates the original thread. The return (902) information passed to the remote computing element is the return information from the call, fork, or transfer that initiated the current thread. The send fork (903) contains return information directed to the thread executing the send fork instruction on the first HTP300. The send transfer (Xfer) performed on the second HTP300 contains information that is returned to the thread executing the send fork instruction on the first HTP300. Essentially, the send transfer simply passes along the return information provided at initiation. Finally, the thread executing on the third or fourth HTP300 that sends the return uses the return information it received to determine the destination of the return.
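The essential difference between the three send variants in (a) through (c) is what happens to the caller and where the return information points. The hypothetical sketch below models only that return-information propagation: a call suspends its caller, a fork counts outstanding returns, and a transfer forwards the return information it was itself given, so the final return bypasses the intermediate thread.

```python
def send_call(caller, work):
    """Call: the remote task returns directly to the caller, which suspends."""
    return {"work": work, "return_info": caller, "caller_suspends": True}

def send_fork(caller, work):
    """Fork: like call, but the caller keeps executing; it must collect a
    return from every fork before sending its own return."""
    caller["outstanding_returns"] = caller.get("outstanding_returns", 0) + 1
    return {"work": work, "return_info": caller, "caller_suspends": False}

def send_transfer(current_thread, work):
    """Transfer: pass along the return info this thread itself was given, then
    terminate; whoever finally returns replies to the original forker."""
    return {"work": work, "return_info": current_thread["return_info"],
            "terminates_current": True}

first_htp = {"name": "thread on first HTP"}
forked = send_fork(first_htp, "task A")                        # runs on a second HTP
moved = send_transfer({"return_info": forked["return_info"]},  # second HTP transfers on
                      "task A, part 2")
print(moved["return_info"] is first_htp)  # True: the final return goes to the forker
```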

Although call, fork, and transfer for communication between HTPs 300 are shown in fig. 39-41, an HTP300 may also send similar work descriptor packets to the HTF200, as shown for the call chain instance in fig. 42.

Threads may access private memory on local nodes and shared memory on local and remote nodes by referencing virtual address spaces. The HTP300 thread will primarily use the provided inbound call arguments and private memory stack to manipulate data structures stored in shared memory. Similarly, the HTF200 threads will use inbound call arguments and fabric memory to manipulate data structures stored in the shared memory.

When a thread is created, an HTP300 thread is typically provided with up to four call arguments and a stack. The arguments are located in registers (memory 715, discussed below) and the stack is located in node private memory. Using a standard stack-frame-based calling method, a thread will typically use the stack for thread-private local variables and for calls local to the HTP300. The HTP300 thread also has access to the entire partitioned global memory of the application. Application data structures are expected to be allocated primarily from the partitioned global address space, to allow all node compute elements to participate in computations with direct load/store access.

Each HTP300 thread has a context block provided at the time the thread is launched. The context block provides a location in memory 125 where the thread context can be saved when needed. Typically, this is done for debugging purposes, and it will also occur if more threads are created than there are hardware resources available to process them. The user may limit the number of active threads to prevent threads from constantly writing state to their memory-based context structures (except possibly for debug visibility).

Typically, when a thread is created, a maximum of four call arguments are also provided to the HTF200 thread. The arguments are in an in-fabric memory structure for access by data stream computations. The in-fabric memory is also used for thread private variables. The HTF200 thread may access the entire partitioned global memory of the application.

The computing elements of system 100 have different capabilities that are each uniquely suited to a particular computing task. Host processor 110 (internal or external to the device) is designed for the lowest possible latency when executing a single thread. The HTP300 is optimized for parallel execution of a larger set of threads to provide the highest execution throughput. The HTF200 is optimized for the extremely high performance of the dataflow style core. Computing elements have been constructed to transfer computing tasks from one element to the next with extreme efficiency to execute the computing core as efficiently as possible. FIG. 42 illustrates a representative call chain usage instance for a hybrid thread with each compute element and illustrates a traditional hierarchical usage model similar to a simulation. High-throughput data-intensive applications may use different usage models that are oriented towards several independent streams.

The entire application begins executing on host processor 110 (internal or external). Host processor 110 will typically make a set of nested calls when it decides to take the appropriate action based on the input parameters. Eventually, the application reaches the computational phase of the program. The computation phase may be best suited for execution on host processor 110, or for accelerated execution by invoking HTP300 and/or HTF200 computing elements. FIG. 42 shows host processor 110 executing multiple calls to HTP300 (901). Each HTP300 will typically fork 903 several threads to perform its computational tasks. Each thread may perform computations (integer and floating point), access memory (read, write), and pass thread execution to another HTP300 or HTF200 (on the same or a remote node), such as through a call to the HTF200 (901). The ability to move execution of a core to another node may be advantageous in that it allows computing tasks to be performed close to the memory that needs to be accessed. Performing work on the appropriate node devices may greatly reduce inter-node memory traffic and speed up application execution. It should be noted that in representative embodiments, the HTF200 does not make calls to the host processor 110 or the HTP300, and in special cases (i.e., when limited to compile time) only makes calls to the HTF 200.

The host processor 110 can initiate a thread on either an HTP300 or HTF200 on the local node. For an external host processor 110, the local node is the node connected to the host through PCIe or another communication interface 130. For an internal host processor 110, the local node is the node in which the host processor 110 is embedded. A description of how the host processor 110 initiates work on an HTP core 705 follows; a similar method is used to initiate work on the HTF200.

Host processor 110 initiates work on HTP core 705 by writing work descriptors to dispatch queue 105 of Host Interface (HIF) 115. The dispatch queue 105 is located in private memory so that the host processor 110 writes to the buffered data to optimize the performance of the host processor 110. The entry size in the scheduling queue 105 is typically 64 bytes so that there is enough space for remote invocation information and up to four 64-bit parameters. It should be noted that in the representative embodiment, there is one dispatch queue 105 per application per node. For a 64 node system, there will be 64 operating system instances. Each OS instance will have one or more processes, each with their own scheduling queue 105.
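A 64-byte dispatch queue entry, as characterized above, has room for the remote call information plus up to four 64-bit arguments. The exact field split below is an assumption made for illustration; only the 64-byte entry size and the four-argument limit come from the text.

```python
import struct

ENTRY_FMT = "<QQQQQQQQ"  # eight 64-bit words = 64 bytes

def make_dispatch_entry(dest_node, entry_address, return_info, args):
    """Pack one hypothetical dispatch queue entry."""
    assert len(args) <= 4, "at most four 64-bit call arguments"
    padded = list(args) + [0] * (4 - len(args))
    # word 0: destination node, word 1: call entry address,
    # word 2: return information, word 3: reserved, words 4-7: arguments
    return struct.pack(ENTRY_FMT, dest_node, entry_address, return_info, 0, *padded)

entry = make_dispatch_entry(dest_node=2, entry_address=0x4000,
                            return_info=0xBEEF, args=[10, 20])
print(len(entry))  # 64: one cache-line-sized write by the host
```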

The HIF 115 monitors the write pointer of the dispatch queue 105 to determine when an entry has been inserted. When a new entry exists, the HIF 115 verifies that there is room in the return queue for a 64-byte return message to the host processor 110. This check is needed to ensure that the status of a completed call is not discarded due to a lack of return queue space. Assuming return space exists, the HIF 115 reads the call entry from the dispatch queue 105 and forwards it as a work descriptor packet to the HTP300 or HTF200. The HTP300 or HTF200 then processes the work descriptor packet, as discussed in more detail below, and generates a return packet.

The entire process of the host processor 110 starting a new thread on the HTP300 or HTF200 requires the call information to be staged through the dispatch queue 105 (64 bytes written to the queue, 64 bytes read from the queue), but there is no other access to the DRAM memory. The buffering of the call information by the dispatch queue 105 provides the required backpressure mechanism. If the dispatch queue 105 becomes full, the host processor 110 will pause until progress is made and an entry of the dispatch queue 105 becomes available.
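Putting the last few paragraphs together, the HIF behaves like a forwarding stage with two sources of backpressure: the host stalls when the dispatch queue is full, and the HIF forwards an entry only when the return queue can absorb the eventual 64-byte return message. A behavioral sketch under those assumptions (queue depths and names invented):

```python
from collections import deque

class HostInterface:
    def __init__(self, dispatch_depth=4, return_depth=4):
        self.dispatch_queue = deque(maxlen=dispatch_depth)
        self.return_queue = deque(maxlen=return_depth)
        self.in_flight = 0  # forwarded calls whose returns have not yet arrived

    def host_write(self, entry):
        if len(self.dispatch_queue) == self.dispatch_queue.maxlen:
            return False                       # host pauses: backpressure
        self.dispatch_queue.append(entry)
        return True

    def forward_work(self):
        """Forward one entry only if its future return is guaranteed a slot."""
        room = self.return_queue.maxlen - len(self.return_queue) - self.in_flight
        if self.dispatch_queue and room > 0:
            self.in_flight += 1
            return self.dispatch_queue.popleft()  # becomes a work descriptor packet
        return None

    def on_return_packet(self, packet):
        self.in_flight -= 1
        self.return_queue.append(packet)       # host polls this queue in FIFO order

hif = HostInterface()
hif.host_write("call kernel")
work = hif.forward_work()
hif.on_return_packet("status: done")
print(hif.return_queue.popleft())              # 'status: done'
```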

Return packets are transmitted to HIF 115 through first interconnection network 150. HIF 115 writes the return packet to an available return queue entry. The host processor 110 will typically periodically poll the return queue to complete the call and obtain the status of any returns. Note that the return queue is accessed in FIFO order. If the return must match a particular call, then the runtime library can be used to perform this ordering. For many applications it is sufficient to know that all returns have been received and that the next phase of the application can start.

III. Hybrid thread fabric 200:

By way of overview, the HTF circuit 200 is a coarse-grained reconfigurable computing fabric that includes interconnected compute tiles 210. The tiles 210 are interconnected with a synchronous fabric, referred to as a synchronous mesh communication network 275, allowing data to pass from one tile 210 to another tile 210 without queuing. This synchronous mesh communication network 275 allows many tiles 210 to be pipelined together to produce a continuous stream of data through arithmetic operations, and each such pipeline of tiles 210 connected through the synchronous mesh communication network 275 for the execution of one or more computing threads is referred to herein as a "synchronous domain", which may have series connections, parallel connections, and possibly branching connections. The first tile 210 of a synchronous domain is referred to herein as the "base" tile 210.

The tiles 210 are also interconnected with an asynchronous fabric called an asynchronous packet network 265, allowing bridging of the synchronous domains of computation through asynchronous operations, where in a representative embodiment all packets on the asynchronous packet network 265 are able to be transmitted in a single clock cycle. These asynchronous operations include initiating synchronous domain operations, passing data from one synchronous domain to another, accessing system memory 125 (reads and writes), and performing branch and loop constructs. Together, synchronous and asynchronous organization allow the tile 210 to efficiently perform high-level language constructs. Asynchronous packet network 265 differs from first interconnection network 150 in a number of ways, including (for example and without limitation) requiring less addressing, being a single channel, queuing with depth-based backpressure, and using packed data operands, such as data paths having 128 bits. Note that the internal data path for each tile 210 is also 128 bits, again by way of example and not limitation. For example, but not limiting of, an instance of a synchronous domain and an instance of a synchronous domain communicating with each other over an asynchronous packet network 265 are shown in fig. 16, 18, 20.

In the representative embodiment, in most cases, thread (e.g., core) execution and control signaling are separated between the two different networks, where thread execution occurs using the synchronous mesh communication network 275 to form multiple synchronous domains for the various tiles 210, and control signaling occurs using messaging packets transmitted between the various tiles 210 over the asynchronous packet network 265. For example, the plurality of configurable circuits are configured to form a plurality of synchronous domains using the synchronous mesh communication network 275 to perform a plurality of computations, and the plurality of configurable circuits are further configured to generate and transmit a plurality of control messages over the asynchronous packet network 265, wherein the plurality of control messages include, for example and without limitation, one or more completion messages and continuation messages.

In the exemplary embodiment, the second interconnection network 250 generally includes two different types of networks, each providing data communication between the tiles 210: a first asynchronous packet network 265 and a second synchronous mesh communication network 275 overlaid or combined with it, as shown in FIGS. 6 and 7. The asynchronous packet network 265 includes: a plurality of AF switches 260, typically implemented as crossbar switches (which may also have, for example and without limitation, additional or optional Clos or folded Clos configurations); and a plurality of communication lines (or wires) 280, 285 that connect the AF switches 260 to the tiles 210, thereby providing packet communication between the tiles 210 and the other illustrated components discussed below. The synchronous mesh communication network 275 provides multiple direct (i.e., switchless, point-to-point) connections between tiles 210 over communication lines (or wires) 270, which may be buffered by registers at the inputs and outputs of the tiles 210, but otherwise requires no queuing between tiles 210 and no packet formation. (In FIG. 6, to better illustrate the overlay of these two networks with the tiles 210 embedded in the second interconnection network 250, the tiles 210 are represented as the vertices of the synchronous mesh communication network 275, and the AF switches 260 are shown as "Xs," as indicated.)

Referring to fig. 8, a tile 210 includes one or more configurable computing circuits 155, control circuitry 145, one or more memories 325, configuration memory (e.g., RAM)160, synchronous network inputs 135 (coupled to a synchronous mesh communication network 275), synchronous network outputs 170 (also coupled to the synchronous mesh communication network 275), asynchronous (packet) network inputs 140 (coupled to an asynchronous packet network 265), and asynchronous (packet) network outputs 165 (also coupled to the asynchronous packet network 265). Each of these different components is shown coupled to each other by a bus 180, 185, in various combinations, as shown. Those skilled in the electronic arts will recognize that fewer or more components, as well as any of a variety of coupling combinations, may be included in tile 210, all of which are considered equally valid and considered within the scope of this disclosure.

A representative example of each of these various components is shown and discussed below with reference to fig. 9. For example, in a representative embodiment, the one or more configurable computing circuits 155 are embodied as multiply and shift operation circuits ("MS Op") 305 and arithmetic, logic and bit operation circuits ("ALB Op") 310, with associated configuration capabilities, such as (for example and without limitation) through an intermediate multiplexer 365 and an associated register, such as register 312. Also in representative embodiments, the one or more configurable computing circuits 155 may include, again by way of example and not limitation, a write mask generator 375 and conditional (branch) logic circuitry 370. Also in a representative embodiment, the control circuitry 145 can include, for example and without limitation, memory control circuitry 330, thread control circuitry 335, and control registers 340, such as those shown for tile 210A. Continuing with the example, the synchronous network input 135 may include input registers 350 and an input multiplexer 355, the synchronous network output 170 may include output registers 380 and an output multiplexer 395, the asynchronous (packet) network input 140 may include an AF input queue 360, and the asynchronous (packet) network output 165 may include an AF output queue 390, which may also contain or share an AF message state machine 345.

Notably, as discussed in more detail below, the configuration memory (e.g., RAM) 160 includes configuration circuitry (e.g., a configuration memory multiplexer 372) and two different configuration memories that perform different configuration functions: a first instruction RAM 315 (which is used to configure the internal data paths of the tiles 210), and a second instruction and instruction index memory (RAM) 320, referred to herein as the "spoke" RAM 320 (which is used for a number of purposes, including configuring the portions of a tile 210 that are independent of the current instruction, selecting instructions (and instruction indices) for the current tile 210 and the next tile 210, and selecting the master synchronization input, all of which are discussed in more detail below).

As shown in fig. 8 and 9, communication line (or wire) 270 is shown as communication lines (or wires) 270A and 270B, such that communication line (or wire) 270A is the "input" (input communication line (or wire)) that feeds data into the input registers 350, and communication line (or wire) 270B is the "output" (output communication line (or wire)) that shifts data out of the output registers 380. In a representative embodiment, there are multiple sets or busses of communication lines (or wires) 270 to and from each tile 210, from and to each adjacent tile (e.g., the up, down, left, and right links of the synchronous mesh communication network 275), and from and to other components, to distribute various signals such as data write masks, stop signals, and instructions or instruction indices provided from one tile 210 to another tile 210, as discussed in more detail below. Alternatively, and not separately shown, there may also be various dedicated communication lines, for example for asserting a stop signal, so that a stop signal generated by any tile 210 in the HTF circuit group 205 can be received by all other tiles 210 in the HTF circuit group 205 within a limited number of clock cycles.

It should be noted that there are various fields in the communication lines of the various groups or busses that form the synchronous mesh communication network 275. For example, fig. 8 and 9 show four incoming and four outgoing busses of communication lines (or wires) 270A and 270B, respectively. Each of these sets of communication lines (or wires) 270A and 270B may carry different information, such as data, an instruction index, control information, and thread information (e.g., TID, XID, loop-dependent information, write mask bits for selecting valid bits, etc.). One of the inputs 270A may also be designated the master synchronization input, including, for example and without limitation, an input internal to the tile 210 (feedback from an output), which may change for each time slice of the tile 210 and may carry data with the instruction index for that tile 210 of the synchronization domain, as discussed in more detail below.

Additionally, as discussed in more detail below, for any input received over the synchronous mesh communication network 275 and held in one or more input registers 350 (of the synchronous network input 135), each tile 210 may pass the input directly to one or more output registers 380 (of the synchronous network output 170) to be output (typically in a single clock cycle) to another location of the synchronous mesh communication network 275, thereby allowing a first tile 210 to communicate with any other third tile 210 within the HTF circuit group 205 through one or more intermediate second tiles 210. The synchronous mesh communication network 275 thus enables the configuration (and reconfiguration) of statically scheduled synchronous pipelines between tiles 210, such that once a thread begins as a synchronous domain along a selected data path between tiles 210, data processing is completed within a fixed period of time. In addition, the synchronous mesh communication network 275 serves to minimize the number of required accesses to the memory 125, because accesses to the memory 125 may not be needed to complete the computations, as the threads execute along the selected data path between the tiles 210.

In the asynchronous packet network 265, each AF switch 260 is typically coupled to multiple tiles 210 and to one or more other AF switches 260 through communication lines (or wires) 280. In addition, one or more selected AF switches 260 are also coupled (via communication lines (or wires) 285) to one or more of the memory interfaces 215, the optional first network interfaces 220, and the HTF scheduling interface 225. As shown, for example and without limitation, the HTF circuit group 205 includes a single HTF scheduling interface 225, two memory interfaces 215, and two optional first network interfaces 220. Also as shown, for example and without limitation, one of the AF switches 260 is further coupled, in addition to being coupled to the other AF switches 260, to a memory interface 215, an optional first network interface 220, and the HTF scheduling interface 225, while another of the AF switches 260 is further coupled to the other memory interface 215 and the other optional first network interface 220.

In accordance with selected embodiments, each of the memory interfaces 215 and the HTF scheduling interface 225 may also be directly connected to the first interconnection network 150 and be capable of receiving, generating, and transmitting data packets through both the first interconnection network 150 and the asynchronous packet network 265, in which case the first network interface 220 is not used in, or included in, the HTF circuit group 205. For example, the HTF scheduling interface 225 may be used by any of the various tiles 210 to transfer data packets to and from the first interconnection network 150. In other embodiments, either of the memory interface 215 and the HTF scheduling interface 225 may utilize the first network interface 220 to receive, generate, and transmit data packets over the first interconnection network 150, with the first network interface 220 providing any additional addressing required by the first interconnection network 150.

Those skilled in the electronic arts will recognize that the connections between the AF switches 260, the tiles 210, the optional first network interfaces 220, the memory interfaces 215, and the HTF scheduling interface 225 may be made in any selected combination, with any selected number of components, and all such component selections and combinations are considered equivalent and within the scope of this disclosure. Those skilled in the electronic arts will also recognize that the HTF circuit 200 need not be divided into multiple HTF circuit groups 205; rather, this conceptual division is provided to describe the various components and the connections between them. For example, although the HTF circuit group 205 is shown with sixteen tiles 210, four AF switches 260, a single HTF scheduling interface 225, two memory interfaces 215, and two (optional) first network interfaces 220, more or fewer of any of these components may be included in the HTF circuit group 205 or the HTF circuit 200, or both. As described in more detail below, for any selected embodiment, the HTF circuit group 205 may also be partitioned to change the number and types of components that are active (e.g., powered on and running) at any selected time.

The synchronous mesh communication network 275 allows multiple tiles 210 to be pipelined without the need for data queuing. All of the tiles 210 participating in a synchronization domain act as a single pipelined data path. The first tile in the series of tiles 210 that forms a single pipelined data path is referred to herein as the "base" tile 210 of the synchronization domain, and that base tile 210 initiates worker threads through the pipelined tiles 210. The base tile 210 is responsible for starting work at a predefined cadence, referred to herein as the "spoke count." As an example, if the spoke count is three, the base tile 210 may initiate work every three clocks. It should also be noted that the computations within each tile 210 may themselves be pipelined, such that portions of different instructions may be executed while other instructions are executing, e.g., while a current operation is being executed, the data for the next operation is input.

Each of the tiles 210, the memory interfaces 215, and the HTF scheduling interface 225 has a different or unique address (e.g., a 5-bit wide endpoint ID) within any selected HTF circuit group 205, serving as a destination or endpoint. For example, the tiles 210 may have endpoint IDs 0-15, the memory interfaces 215 (0 and 1) may have endpoint IDs 20 and 21, and the HTF scheduling interface 225 may have endpoint ID 18 (with no address provided to the optional first network interface 220 unless it is included in selected embodiments). The HTF scheduling interface 225 receives a data packet, referred to as a work descriptor packet, containing work to be performed by one or more of the tiles 210 that have been configured for various operations, as discussed in more detail below. The work descriptor packet will have one or more arguments, which the HTF scheduling interface 225 then provides or distributes to the various tiles as packets or messages (AF messages) transmitted through the AF switches 260 to the selected, addressed tiles 210. In addition, these arguments will typically include an identification of the region in tile memory 325 where the data (arguments) are stored and a thread identifier ("TID") for tracking and identifying the associated computations and their completion.
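By way of illustration only, the following C sketch models the endpoint addressing and work descriptor fields just described as simple data types. The field names, widths, and the four-argument maximum are assumptions made for the example, not the patent's actual packet format.

#include <stdint.h>

/* Hypothetical 5-bit endpoint IDs within one HTF circuit group, following
 * the example assignments in the text: tiles 0-15, scheduling interface 18,
 * memory interfaces 20 and 21. */
enum {
    EP_TILE_FIRST = 0,
    EP_TILE_LAST  = 15,
    EP_DISPATCH   = 18,
    EP_MEM_IF_0   = 20,
    EP_MEM_IF_1   = 21
};

/* Illustrative work descriptor contents: arguments, a TID for tracking
 * completion, and the tile memory region where the arguments are stored. */
typedef struct {
    uint8_t  thread_id;      /* TID identifying the computation          */
    uint8_t  memory_region;  /* region of tile memory 325 for arguments  */
    uint8_t  arg_count;      /* number of valid entries in args[]        */
    uint64_t args[4];        /* argument values (four is an assumption)  */
} work_descriptor_t;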

Messages are routed from a source endpoint through the asynchronous packet network 265 to a destination endpoint. Messages from different sources to the same destination take different paths and may experience different levels of congestion. Messages may arrive in an order different from their order of issuance. The message passing mechanism is built to function properly with a non-deterministic order of arrival.

Fig. 13 is a block diagram of a representative embodiment of the memory interface 215. Referring to fig. 13, each memory interface 215 includes a state machine (and other logic circuitry) 480, one or more registers 485, and optionally one or more queues 474. The state machine 480 receives, generates, and transmits data packets over the asynchronous packet network 265 and the first interconnection network 150. The registers 485 store addressing information used for physical addressing within a given node, such as the virtual addresses of the tiles 210, along with various tables used to translate virtual addresses to physical addresses. The optional queues 474 store messages waiting to be transmitted over the first interconnection network 150 and/or the asynchronous packet network 265.

The memory interface 215 allows the tiles 210 within the HTF circuit group 205 to make requests of the system memory 125, such as DRAM memory. The types of memory requests supported by the memory interface 215 are load, store, and atomic. From the perspective of the memory interface 215, a load sends an address to the memory 125 and returns data; a store sends both an address and data to the memory 125 and returns a completion message; and an atomic operation sends an address and data to the memory 125 and returns data. It should be noted that only atomic operations that return data from memory (e.g., fetch-and-increment) are processed by the memory interface 215 as load requests. All memory interface 215 operations require a single 64-bit virtual address. The data size of an operation may vary from a single byte to 64 bytes. Larger data payload sizes are more efficient for the device and may be used; however, the data payload size will be limited by the ability of the high-level language compiler to detect accesses to large blocks of data.
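A minimal C sketch of the request shape just described follows; the type and field names are illustrative assumptions, not the interface's actual encoding.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical memory interface 215 request: load, store, or atomic, a
 * single 64-bit virtual address, and a payload of 1 to 64 bytes. */
typedef enum { MEM_LOAD, MEM_STORE, MEM_ATOMIC } mem_op_t;

typedef struct {
    mem_op_t op;
    uint64_t virtual_addr;  /* single 64-bit virtual address      */
    uint8_t  size_bytes;    /* 1 to 64                            */
    uint64_t store_data;    /* used by MEM_STORE and MEM_ATOMIC   */
} mem_request_t;

/* Per the text, loads and data-returning atomics (e.g., fetch-and-
 * increment) return data; stores return only a completion message. */
static bool returns_data(const mem_request_t *r) {
    return r->op == MEM_LOAD || r->op == MEM_ATOMIC;
}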

Fig. 14 is a block diagram of a representative embodiment of the HTF scheduling interface 225. Referring to fig. 14, the HTF scheduling interface 225 includes a state machine (and other logic circuitry) 470, one or more registers 475, and one or more dispatch queues 472. The state machine 470 receives, generates, and transmits data packets over the asynchronous packet network 265 and the first interconnection network 150. The registers 475 store addressing information, such as the virtual addresses of the tiles 210, as well as various tables that track the configurations and workloads distributed to the various tiles, as discussed in more detail below. The dispatch queues 472 store messages waiting to be transmitted over the first interconnection network 150 and/or the asynchronous packet network 265.

As mentioned above, the HTF scheduling interface 225 receives a work descriptor call packet (message), for example, from the host interface 115 through the first interconnection network 150. The work descriptor call packet will have various information, such as a payload (e.g., configuration, argument values, etc.), a call identifier (ID), and return information (e.g., for providing results to the endpoint). The HTF scheduling interface 225 will create various AF data messages for transmission over the asynchronous packet network 265 to the tiles 210, including messages writing data into the memories 325 and identifying the tile 210 that will serve as the base tile 210 (the base tile ID, used for transmitting AF completion messages), along with a thread ID (thread identifier or "TID"); a continue message (e.g., with a completion count and other counts for each TID) will also be sent to the base tile 210 so that the base tile 210 can begin execution once enough completion messages have been received. The HTF scheduling interface 225 maintains various tables in the registers 475 to track, for each thread ID and XID, what has been transferred to which tile 210. Upon completion of result generation or execution, the HTF scheduling interface 225 will receive either an AF data message (indicating completion and carrying data) or an AF completion message (indicating completion without data). The HTF scheduling interface 225 also keeps (in the registers 475) respective counts of the number of completion and data messages it must receive to know that core execution has completed; it will then assemble and transmit a work descriptor return packet over the first interconnection network 150 with the resulting data, the call ID, and the return information (e.g., the address of the requestor), and release the TID. Additional features and functionality of the HTF scheduling interface 225 are described in more detail below.
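The per-TID bookkeeping just described can be sketched in C as follows; the table size, field names, and function signature are assumptions for illustration.

#include <stdbool.h>
#include <stdint.h>

/* The scheduling interface counts the completion and data messages still
 * owed before it can assemble the work descriptor return packet. */
typedef struct {
    int  completions_owed;  /* AF completion messages still expected */
    int  data_owed;         /* AF data messages still expected       */
    bool active;
} dispatch_entry_t;

static dispatch_entry_t dispatch_table[256];  /* indexed by TID */

/* Called for each AF message received back from the tiles; returns true
 * when core execution is known to be complete for this TID. */
bool on_af_result(uint8_t tid, bool has_data) {
    dispatch_entry_t *e = &dispatch_table[tid];
    if (has_data) e->data_owed--; else e->completions_owed--;
    if (e->active && e->completions_owed == 0 && e->data_owed == 0) {
        e->active = false;  /* assemble return packet, release the TID */
        return true;
    }
    return false;
}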

It should be noted that, as mentioned above, multiple levels (or types) of TIDs may be used, and often are. For example, the HTF scheduling interface 225 allocates a TID of a first type from its TID pool, which is transmitted to the base tile 210. The base tile 210 may in turn allocate additional TIDs, such as TIDs of second and third types, for example and without limitation, for tracking threads of loops and nested loops. These different TIDs can then also be used to access variables private to a given loop. For example, a TID of the first type may be used for the outer loop, while TIDs of the second and third types may be used to track iterations of nested loops.

It should also be noted that separate transaction IDs are utilized to track individual memory requests through the first interconnection network 150.

Fig. 15 is a block diagram of a representative embodiment of an optional first network interface 220. Referring to fig. 15, when included, each first network interface 220 includes a state machine (and other logic circuitry) 490 and one or more registers 495. The state machine 490 receives, generates, and transmits data packets over the asynchronous packet network 265 and the first interconnection network 150. The registers 495 store addressing information used for physical addressing within a given node, such as the virtual addresses of the tiles 210, along with various tables used to translate virtual addresses to physical addresses.

Referring again to fig. 9, a representative HTF reconfigurable computing circuit (tile) 210A includes at least one multiply and shift operation circuit ("MS Op") 305, at least one arithmetic, logical and bit operation circuit ("ALB Op") 310, a first instruction RAM 315, a second instruction (and index) RAM 320, referred to herein as the "spoke" RAM 320, and one or more tile memory circuits (or memories) 325 (shown as memory "0" 325A, memory "1" 325B, through memory "N" 325C, and referred to individually and collectively as memory 325 or tile memory 325). In addition, as previously mentioned, the representative tile 210A also typically includes input registers 350 and output registers 380 coupled to the synchronous mesh communication network 275 via communication lines (or wires) 270A, 270B, and AF input queues 360 and AF output queues 390 coupled to an AF switch 260 via communication lines (or wires) 280 of the asynchronous packet network 265. Control circuitry 145 is also typically included in a tile 210, such as the memory control circuitry 330, thread control circuitry 335, and control registers 340 shown for tile 210A. An AF message state machine 345 is also typically included in the tile 210 for decoding data packets received from, and for preparing (assembling) data packets provided to, the asynchronous packet network 265. As part of the configurability of the tile 210, one or more multiplexers, shown as input multiplexer 355, output multiplexer 395, and one or more intermediate multiplexers 365, are typically included for selecting the inputs to the MS Op 305 and the ALB Op 310. Optionally, other components may also be included in the tile 210, such as conditional (branch) logic circuitry 370, a write mask generator 375, and flow control circuitry 385 (which is shown included as part of the AF output queue 390 and may equivalently be provided as a separate flow control circuit). The capabilities of the MS Op 305 and the ALB Op 310 are described in more detail below.

The synchronous mesh communication network 275 passes the information needed for the synchronous domains to function. The synchronous mesh communication network 275 contains the fields specified below. In addition, many parameters for these fields are stored in the control registers 340 and assigned to threads to be executed in the synchronization domain formed by the plurality of tiles 210. The designated fields of the synchronous mesh communication network 275 include:

1. Data: typically 64 bits wide, comprising computed data that is passed from one tile 210 to the next tile 210 in the synchronization domain.

2. Instruction RAM 315 address: abbreviated "INSTR," typically has a field width of 8 bits and holds the instruction RAM 315 address for the next tile 210. The base tile 210 specifies the instruction RAM 315 address of the first tile 210 in the domain. Subsequent tiles 210 may pass the instruction unmodified, or may conditionally alter the instruction of the next tile 210 to allow conditional execution (i.e., if-then-else or switch statements), as described in more detail below.

3. Thread identifier: referred to herein as a "TID," typically has a field width of 8 bits, and comprises a unique identifier of a thread of the kernel, with a predetermined number of TIDs (the "TID pool") stored in the control registers 340 and potentially available to a thread if not in use by another thread. A TID is allocated at the base tile 210 of the synchronization domain and can be used as a read index into tile memory 325. TIDs may be passed from one synchronous domain to another through the asynchronous packet network 265. Because a limited number of TIDs is available for use, a TID should eventually be freed back to the TID pool of the allocating base tile 210 for subsequent reuse, so that other functions or calculations can be performed. The release is accomplished using an asynchronous fabric message transmitted over the asynchronous packet network 265.

4. Transfer identifier: referred to as an "XID," typically has a field width of 8 bits and comprises a unique identifier used for passing data from one synchronization domain to another, with a predetermined number of XIDs (the "XID pool") stored in the control registers 340 and potentially available to a thread (if not in use by another thread). The transfer may be a direct write of data from one domain to another, as "XID_WR," or it may be the result of a read of the memory 125 (as "XID_RD"), where the source domain sends a virtual address to the memory 125 and the destination domain receives the memory read data. XID_WR is assigned at the base tile 210 of the source domain, and the XID_WR of the source domain becomes XID_RD in the destination domain. XID_WR may be used as a write index into the tile memory 325 in the destination domain, and XID_RD is used in the destination domain as a read index into tile memory 325. Because the number of XIDs available for use is limited, an XID should eventually be released back to the XID pool of the allocating base tile for subsequent reuse, so that other functions or calculations can be performed. The destination domain releases the XID by sending an asynchronous message to the base tile 210 of the source domain, also via the asynchronous packet network 265. (A minimal sketch of this pool allocation and release follows this list.)
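As a minimal sketch of the pool behavior described in items 3 and 4 above, the following C code models a TID pool as a free stack at the allocating base tile, with release arriving as an asynchronous fabric message. The pool size and stack representation are assumptions; an XID pool would behave identically.

#include <stdint.h>

#define TID_POOL_SIZE 16  /* assumed pool size */

static uint8_t tid_free[TID_POOL_SIZE];
static int     tid_top;   /* number of free TIDs on the stack */

void tid_pool_init(void) {
    for (int i = 0; i < TID_POOL_SIZE; i++) tid_free[i] = (uint8_t)i;
    tid_top = TID_POOL_SIZE;
}

/* Returns -1 when the pool is empty; per the text, the requesting thread
 * then waits (e.g., in the continue queue) until a TID is released. */
int tid_alloc(void) {
    return (tid_top > 0) ? tid_free[--tid_top] : -1;
}

/* Invoked when an AF "free TID" message reaches the allocating base tile;
 * assumes each TID is released exactly once. */
void tid_release(uint8_t tid) {
    tid_free[tid_top++] = tid;
}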

The synchronous mesh communication network 275 provides both data and control information. The control information (INSTR, XID, TID) is used to set up the data path, and the DATA field may be selected as the source for the configured data path. It should be noted that the control fields are required much earlier than the data field (in order to configure the data path). To minimize the synchronous domain pipeline delay through a tile 210, the control information therefore arrives at the tile 210 several clock cycles ahead of the data.

A highly innovative feature of the architecture of the HTF circuit 200, its constituent HTF circuit groups 205, and their constituent tiles 210 is the use of two different configuration RAMs: the instruction RAM 315 for data path configuration, and the spoke RAM 320 for a number of other functions, including configuring the portions of a tile 210 that are independent of any selected or given data path, selecting data path instructions from the instruction RAM 315, selecting the master synchronization input (from among the available inputs 270A) for each clock cycle, and so forth. As discussed in more detail below, this novel use of the instruction RAM 315 and the independent spoke RAM 320 enables dynamic self-configuration and self-reconfiguration of the HTF circuit group 205 and the HTF circuit 200 as a whole, among other things.

Each tile has an instruction RAM 315 containing configuration information that sets the data path of the tile 210 for a particular operation, i.e., a data path instruction that determines, for example, whether a multiply, shift, add, etc. will be performed in a given time slice of the tile 210, and which data to use (e.g., data from memory 325, or data from an input register 350). The instruction RAM 315 has multiple entries, allowing the tile 210 to be time-sliced to perform multiple different operations in a pipelined synchronous domain; representative pipeline stages 304, 306, 307, 308, and 309 of the tile 210 are shown in FIG. 9. Any given instruction may also specify which inputs 270A will carry the data and/or control information used by the instruction. In addition, each time slice may conditionally execute a different instruction, depending on a data-dependent conditional operation of the previous tile's 210 time slice, as discussed with reference to FIG. 24. The number of entries in the instruction RAM 315 is typically about 256, although the number may vary depending on experience gained in migrating cores to the HTF circuit 200.

The supported instruction set should meet the needs of the target applications, e.g., applications having 32- and 64-bit integer and floating-point data types. Additional prospective applications such as machine learning, image analysis, and 5G wireless processing may also be executed using the HTF circuit 200. This full set of applications would require 16-, 32-, and 64-bit floating-point and 8-, 16-, 32-, and 64-bit integer data types. The supported instruction set must support these data types for load, store, and arithmetic operations, and the supported operations require instructions that allow a compiler to efficiently map high-level language source code to the tiles 210. In a representative embodiment, the tiles 210 support the same instruction set as a standard high-performance processor, including single instruction multiple data (SIMD) instruction variants.

The spoke RAM 320 has multiple functions, and in a representative embodiment one of those functions is to configure the (time-sliced) portion of the tile 210 that is independent of the current data path instruction, i.e., the tile 210 configuration held in the spoke RAM 320 can be used to configure the invariant portions of the tile 210 configuration, e.g., those settings that remain the same across different data path instructions. For example, as the selection control for the input multiplexer 355, the spoke RAM 320 is used to specify which input of the tile 210 (e.g., one of the array input communication lines 270A, or one of the input registers 350) is the master synchronization input for each clock cycle. This is critical, because the instruction index that selects the instruction (from the instruction RAM 315) for a given time slice of the tile 210, along with the thread ID (TID), is provided on the master synchronization input. Thus, even though the actual instruction index provided on an input 270A to a given tile 210 may change (as described with reference to FIG. 24), the set of inputs 270A carrying the selected instruction index does not change, so that any given tile 210 knows in advance on which input it will receive the selected instruction index, regardless of the instruction specified by that index. The configuration stored in the spoke RAM 320 also typically specifies which outputs 270B are to be used for the selected instruction (or time slice). The spoke RAM 320 reads its address input, i.e., the spoke index, from a counter that counts modulo the spoke count, from zero up to the spoke count minus one. All tiles 210 within the HTF circuit group 205 should have substantially the same spoke RAM input value at each clock for proper synchronous domain operation. The spoke RAM 320 also stores an instruction index and is also used to select instructions from the instruction RAM 315, such that for a base tile 210 of a synchronization domain, as the spoke RAM 320 count changes, a series of instructions may be selected for the tile 210 to execute. For subsequent tiles in the synchronization domain, the instruction index may be provided by the previous tile 210 of the synchronization domain. This aspect of the spoke RAM 320 is also discussed with reference to fig. 24, as the spoke RAM 320 is highly innovative, enabling dynamic self-configuration and reconfiguration of the HTF circuit group 205.
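The lookup just described can be sketched in C as follows: a modulo counter produces the spoke index, and the selected entry configures the instruction-independent portion of the tile for that time slice. The entry fields and the depth of eight are illustrative assumptions.

#include <stdint.h>

typedef struct {
    uint8_t master_input_sel;  /* which input 270A / register is the master sync input */
    uint8_t instr_index;       /* instruction RAM 315 index (used at a base tile)      */
    uint8_t output_enable;     /* which outputs 270B this time slice drives            */
} spoke_entry_t;

static spoke_entry_t spoke_ram[8];
static unsigned      spoke_count = 8;  /* must not exceed the RAM depth */

spoke_entry_t spoke_lookup(unsigned clock) {
    unsigned spoke_index = clock % spoke_count;  /* counts 0 .. spoke_count-1 */
    return spoke_ram[spoke_index];
}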

The spoke RAM 320 also specifies when a synchronous input 270A is written to tile memory 325. This situation occurs when a tile instruction requires multiple inputs and one of the inputs arrives early. The early-arriving input may be written to tile memory 325 and then read from the memory 325 when the other inputs arrive. For this case, the tile memory 325 is accessed as a FIFO; the FIFO read and write pointers are stored in the tile memory region RAM.

Each tile 210 contains one or more memories 325, each typically having the width of the data path (64 bits) and a depth in the range of, for example, 512 to 1024 elements. The tile memory 325 is used to store data needed to support data path operations. The stored data may be constants that are part of the kernel configuration loaded into the HTF circuit group 205, or variables calculated as part of the data flow. The tile memory 325 may be written from the synchronous mesh communication network 275, as the result of either a data transfer from another synchronous domain or a load operation initiated by another synchronous domain. Tile memory 325 reads are performed only by synchronous data path instructions.

The tile memory 325 is typically partitioned into regions. A smaller tile memory region RAM stores the information needed for memory region accesses. Each region represents a different variable in the kernel. A region may store shared variables (i.e., variables shared by all executing threads); a scalar shared variable has an index value of zero, while a shared variable array has variable index values. A region may store thread-private variables indexed by the TID identifier. Variables may also be used to pass data from one synchronization domain to the next; in that case, the variable is written using the XID_WR identifier in the source synchronization domain and read using the XID_RD identifier in the destination domain. Finally, a region may be used to temporarily store data that a tile 210 previously generated in the synchronous data path, until the other tile data inputs are ready; in that case, the read and write indices are FIFO pointers, which are stored in the tile memory region RAM.

The tile memory region RAM typically contains the following fields:

1. Region index upper: the upper bits of the tile memory region index. The lower index bits are obtained from an asynchronous fabric message, a TID, XID_WR, or XID_RD identifier, or from a FIFO read/write index value. The region index upper bits are ORed with the lower index bits to generate the index into tile memory 325 (see the sketch following this list).

2. Region SizeW: the width of the lower index of the memory region. The size of the memory region is 2^SizeW elements.

3. Region FIFO read index: the read index for a memory region acting as a FIFO.

4. Region FIFO write index: the write index for a memory region acting as a FIFO.
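The index computation from item 1 above can be sketched in C as follows: the region's upper index bits are ORed with lower bits taken from a TID, an XID, or a FIFO pointer. The field widths are illustrative assumptions.

#include <stdint.h>

typedef struct {
    uint16_t index_upper;  /* upper bits of the region's tile memory index */
    uint8_t  size_w;       /* region holds 2^size_w elements               */
    uint16_t fifo_rd;      /* region FIFO read index                       */
    uint16_t fifo_wr;      /* region FIFO write index                      */
} region_entry_t;

uint16_t region_index(const region_entry_t *r, uint16_t lower_bits) {
    uint16_t low_mask = (uint16_t)((1u << r->size_w) - 1u);
    return (uint16_t)(r->index_upper | (lower_bits & low_mask));
}

/* FIFO-mode read: the stored read pointer supplies the lower bits and is
 * then incremented and written back, wrapping within the region. */
uint16_t region_fifo_read(region_entry_t *r) {
    uint16_t idx = region_index(r, r->fifo_rd);
    r->fifo_rd = (uint16_t)((r->fifo_rd + 1u) & ((1u << r->size_w) - 1u));
    return idx;
}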

The tiles 210 perform the computing operations of the HTF circuit 200. A computing operation is performed by configuring the data path within a tile 210. Two functional blocks perform the entire computation of a tile 210: the multiply and shift operation circuit ("MS Op") 305, and the arithmetic, logical and bit operation circuit ("ALB Op") 310. The MS Op 305 and ALB Op 310 operate under the control of instructions from the instruction RAM 315 and may be configured to perform, for example and without limitation, two pipelined operations, such as multiply and add, or shift and AND. (In a representative embodiment, all devices that support the HTF 200 will have the complete supported instruction set, providing binary compatibility across all devices. However, it may be desirable to have a set of base functionality and optional instruction set classes to meet die size tradeoffs; this approach is similar to how the RISC-V instruction set has a base and multiple optional instruction subsets.) As shown in FIG. 9, the outputs of the MS Op 305 and the ALB Op 310 may be provided to the registers 312, or directly to other components, such as the output multiplexer 395, the conditional logic circuitry 370, and/or the write mask generator 375.

Various operations performed by the MS Op 305 include, for example and without limitation: integer and floating-point multiplication, shifts, passing an input through, signed and unsigned integer multiplication, signed and unsigned right shifts, signed and unsigned left shifts, bit order reversal, and permutation (also as floating-point operations), along with conversions between integer and floating-point types, such as double-precision truncation operations or conversion of floating-point to integer. Various operations performed by the ALB Op 310 include, for example and without limitation: signed and unsigned addition, absolute value, negation, logical NOT, add and negate, subtraction A minus B, reverse subtraction B minus A, signed and unsigned greater-than-or-equal, signed and unsigned less-than-or-equal, comparison (equal or not equal), and logical operations (AND, OR, exclusive OR, NAND, NOR, exclusive NOR, NOT) (also as floating-point operations), along with conversions between integer and floating-point types, such as truncation operations or conversion of floating-point to integer.

The inputs to the ALB Op 310 and MS Op 305 come from the synchronous tile inputs 270A (held in the input registers 350), from internal tile memory 325, or from a small constant value provided within the instruction RAM 315. Table 4 below shows the tile 210 data path input sources, listing typical inputs to the ALB Op 310 and MS Op 305.

Table 4: (reproduced as an image in the original publication; the table contents are not recoverable here)

Each output 270B of a tile 210 is individually enabled as part of the communication lines 270 of the synchronous mesh communication network 275, enabling clock gating of disabled outputs. The output of the ALB Op 310 can be sent to multiple destinations, as shown in Table 5.

Table 5:

Destination name: Destination description
SYNC_U: Synchronous mesh communication network 275 up link
SYNC_D: Synchronous mesh communication network 275 down link
SYNC_L: Synchronous mesh communication network 275 left link
SYNC_R: Synchronous mesh communication network 275 right link
WRMEM0_Z: Memory 0 is written; the value zero is used as the index to write to a region of memory 325.
WRMEM0_C: Memory 0 is written; the instruction constant field is used as the index to write to a region of memory 325.
WRMEM0_T: Memory 0 is written; the TID value is used as the index to write to a region of memory 325.

At a high level, and as a representative example without limitation, the general operation of a tile 210 is as follows. When a configuration is loaded into the system, the synchronous mesh communication network 275 and the synchronous domains of the various tiles 210 are all scheduled as part of program compilation. Unless paused or stopped, a tile 210 may perform its operations when all required inputs are ready, e.g., data variables are in the input registers 350 or memory 325 and available to be read or retrieved and used, as described in more detail below. In representative embodiments, each pipeline stage may operate in a single clock cycle, although in other representative embodiments additional clock cycles may be utilized per pipeline stage. In the first pipeline stage 304, data is input, for example, into the AF input queues 360 and input registers 350, and optionally directly into the memory 325. In the next pipeline stage 306, for example, AF messages are decoded by the AF state machine 345 and moved into the memory 325; the AF state machine 345 reads data from the memory 325, or receives data from the output multiplexer 395 and generates data packets for transmission over the asynchronous packet network 265; and the data in the input registers 350 is moved into the memory 325, selected as operand data (using the input multiplexers 355 and the intermediate multiplexers 365), or passed directly to the output registers 380 for output on the synchronous mesh communication network 275. In one or more of the subsequent pipeline stages 307 and 308, the computations are performed by the ALB Op 310 and/or the MS Op 305, a write mask may be generated by the write mask generator 375, and an instruction (or instruction index) may be selected based on a test condition in the conditional (branch) logic circuitry 370. In the next pipeline stage 309, the outputs are selected using the output multiplexer 395, output messages (which may have been stored in the AF output queue 390) are transmitted over the asynchronous packet network 265, and the output data in any of the output registers 380 is transmitted over the synchronous mesh communication network 275.
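The stage sequence just described can be summarized in C as follows; the enumerators simply reuse the figure's reference numerals 304-309 as illustrative values, and the stage names are assumptions.

typedef enum {
    STAGE_INPUT  = 304,  /* latch AF input queue 360, input registers 350, memory 325 */
    STAGE_DECODE = 306,  /* decode AF messages; select operands or pass data through  */
    STAGE_EXEC_A = 307,  /* ALB Op 310 / MS Op 305 computation begins                 */
    STAGE_EXEC_B = 308,  /* computation completes; write mask and branch test         */
    STAGE_OUTPUT = 309   /* select outputs; send sync data and AF messages            */
} tile_stage_t;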

FIG. 10 is a detailed block diagram of a representative embodiment of the memory control circuitry 330 (with associated control registers 340) of a hybrid thread fabric configurable computing circuit (tile) 210. FIG. 10 shows the read index logic for the tile memories 325 of the memory control circuit 330, which is replicated (not separately shown) for each memory 325. The instruction RAM 315 has a field that specifies which region of tile memory 325 is being accessed, and a field that specifies the access index mode. The memory region RAM 405 (part of the control registers 340) specifies a region read mask that provides the upper memory address bits of a particular region. The mask is ORed (OR gate 408) with the lower address bits supplied by the read index select multiplexer 403. The memory region RAM 405 also contains the read index value when the tile memory 325 is accessed in FIFO mode; when accessed in FIFO mode, the read index value in the RAM 405 is incremented and written back. In various embodiments, the memory region RAM 405 may also hold the top of the TID stack for nested loops, as described below.

Fig. 10 also shows that the control information (INSTR, XID, TID) is required by the synchronous mesh communication network 275 a few clocks earlier than the data input. For this reason, the control information is sent from the previous tile 210 several clocks before the data is sent. This early arrival of control information over the synchronous mesh communication network 275 reduces the overall pipeline stages per tile 210, but makes using a computed value as an index into tile memory 325 challenging; specifically, data arriving over the synchronous mesh communication network 275 may arrive too late to be used as an index into tile memory 325. An architectural solution to this problem is to store the index computed by the previous tile 210 in a variable index register of the control registers 340. Another input 270A then causes the variable index register to be used as the index into tile memory 325.

The asynchronous packet network 265 is used to perform operations that occur asynchronously to the synchronous domains. Each tile 210 contains an interface to the asynchronous packet network 265, as shown in FIG. 9. The incoming interface (from communication line 280A) is the AF input queue 360 (a FIFO), providing storage for messages that cannot be processed immediately. Similarly, the outgoing interface (to communication line 280B) is the AF output queue 390 (a FIFO), providing storage for messages that cannot be sent immediately. Messages passing through the asynchronous packet network 265 may be classified as data messages or control messages. A data message contains a 64-bit data value that is written to one of the tile memories 325. A control message is used to control thread creation, to release resources (a TID or XID), or to issue external memory references. Table 6 below lists the asynchronous packet network 265 outbound message operations:

Table 6: (reproduced as an image in the original publication; the table contents are not recoverable here)

The asynchronous packet network 265 allows messages to be sent to and received from tiles 210 in different synchronous domains. In some cases, it makes sense for a synchronization domain to send a message to itself, e.g., when the base tile 210 of the synchronization domain allocates a TID and that TID is to be released by the same synchronization domain.

If the synchronous domains of the tiles 210 generate and send messages faster than the asynchronous packet network 265 or a receiving endpoint can receive, route, and process them, the asynchronous packet network 265 may become congested. In that case, a backpressure mechanism is provided to alleviate any such congestion. Fig. 22 is a block diagram of a representative flow control circuit 385. Generally, there is at least one flow control circuit 385 per HTF circuit group 205. The AF output queue 390 of a tile 210 holds messages waiting to be sent on the asynchronous packet network 265. The output queue 390 is provided with a predetermined threshold which, when reached, causes the output queue 390 of the tile 210 to generate an indicator, such as setting a bit asserted as a "stop" signal 382 provided to the flow control circuit 385 on a communication line 384. A communication line 384 runs from each tile 210 in the HTF circuit group 205 to the flow control circuit 385. The flow control circuit 385 has one or more OR gates 386 which, for as long as any of the tiles 210 asserts the stop signal 382, will continue to assert the stop signal 382 distributed on the communication lines 388 to all tiles 210 within the affected HTF circuit group 205.

the stop signal 382 may be distributed through a dedicated communication pipeline 388 or through the synchronous mesh communication network 275, the dedicated communication pipeline 388 not being part of the synchronous mesh communication network 275 or the asynchronous packet network 265 as shown. In a representative embodiment, there is a single stop signal 382 that all of the tile output queues 390 within the HTF circuit group 205 may assert, and all of the tiles 210 within the HTF circuit group 205 are held (paused or stopped) while the stop signal 382 asserts. This stop signal 382 continues to allow all AF input queues 360 to receive AF messages and packets, thereby avoiding deadlock, and causing all sync domain pipelines to be held or suspended (which also prevents additional AF data packets from being generated). The stop signal 382 allows the asynchronous packet network 265 to drain the output queue 390 of the tile 210 to the extent that the number of messages in the output queue 390 (that trigger the output queue 390) falls below a threshold level. Once the size of the output queue 390 falls below a threshold level, the signal of the tile 210 through the communication pipeline 384 returns to zero (the stop signal 382 is no longer generated). When this occurs for all tiles 210 in the HTF circuit group 205, the signal on the communication pipeline 388 also returns to zero, meaning that the stop signal is no longer asserted and the stop or pause on the tile 210 is ended.

The first or "base" tile 210 of a synchronization domain is responsible for initiating worker threads through the multi-tile 210 synchronous pipeline. A new thread may be initiated at a predetermined cadence, whose interval is referred to herein as the "spoke count," as mentioned above. For example, if the spoke count is three, then a new worker thread may be launched on the base tile 210 every three clocks. If a thread start is skipped (e.g., no thread is ready to start), a full spoke count interval must elapse before another thread can be started. A spoke count greater than one allows each physical tile 210 to be used multiple times within the synchronous pipeline. As an example, if a synchronization domain is performed on a single physical tile and the spoke count is one, the synchronization domain may contain only a single tile time slice; if the spoke count is four for this example, the synchronization domain may contain four tile time slices. In general, a synchronization domain is performed by multiple tiles 210 interconnected by the synchronous links of the synchronous mesh communication network 275. A synchronization domain is not limited to a subset of the tiles 210 within a group 205, i.e., multiple synchronization domains may share the tiles 210 of a group 205. A single tile 210 may participate in multiple synchronization domains, e.g., on spoke 0 the tile 210 works in synchronization domain "A"; on spoke 1, in synchronization domain "B"; on spoke 2, in synchronization domain "A"; and on spoke 3, in synchronization domain "C." Thread control of the tiles is described below with reference to FIG. 11.
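The cadence and spoke-sharing rules just described can be sketched in C as follows; the first function uses the spoke-count-of-three example, and the second mirrors the four-spoke {A, B, A, C} sharing example, both from the text.

#include <stdbool.h>

#define SPOKE_COUNT 3

/* The base tile may launch a new worker thread only on a launch slot;
 * a skipped slot is lost until the next full interval. */
bool can_launch(unsigned clock, bool thread_ready) {
    return (clock % SPOKE_COUNT == 0) && thread_ready;
}

/* One tile shared by several synchronization domains, one per spoke. */
char domain_for_spoke(unsigned clock) {
    static const char domains[4] = { 'A', 'B', 'A', 'C' };
    return domains[clock % 4];
}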

Fig. 11 is a detailed block diagram of a representative embodiment of the thread control circuit 335 (with associated control registers 340) of the hybrid thread fabric configurable compute circuit (tile) 210. Referring to FIG. 11, a number of registers are contained within control registers 340, namely TID pool register 410, XID pool register 415, suspend table 420, and complete table 422. In various embodiments, the data of completion table 422 may be equally saved in pause table 420, and vice versa. Thread control circuitry 335 contains a continue queue 430, a re-entry queue 445, a thread control multiplexer 435, a run queue 440, an iteration increment 447, an iteration index 460, and a loop iteration count 465. Alternatively, the continue queue 430 and run queue 440 may be equivalently embodied in the control register 340.

Fig. 12 is a diagram of the tiles 210 forming the first and second synchronization domains 526, 538 and representative asynchronous packet network messaging. One difficulty with asynchronous packet network 265 is that the required data may arrive at tile 210 at different times, which may make it difficult to ensure that the started thread can run to completion with a fixed pipeline delay. In a representative embodiment, the tiles 210 that form the synchronous domain do not execute the computing thread until all resources are ready, e.g., by having available the required data, any required variables, etc., that have all been distributed to the tiles over the asynchronous packet network 265, and thus can arrive at the specified tile 210 at any of various times. In addition, data may need to be read from system memory 125 and passed through asynchronous packet network 265, and thus may also arrive at a specified tile 210 at any of a variety of times.

To run a thread to completion with a fixed pipeline delay, the representative embodiment provides a completion table 422 (or pause table 420) at the base tile 210 of the synchronization domain, indexed by the thread's TID. The completion table 422 (or pause table 420) maintains a count of dependent completions that must be received before the thread may begin execution. The completion table 422 (or pause table 420) contains a field named "completion count," which is initialized to zero upon reset. The count field is modified using two types of AF messages. The first message type is a thread start or continue message, which increments the field by a count indicating the number of dependencies that must be observed before the thread can start in the synchronization domain. The second AF message type is a completion message, which decrements the count field by one, indicating that a completion message has been received. Once the thread start message has been received and the completion count field reaches zero, the thread is ready to start.

As shown in FIG. 12, tile 210B of the first synchronization domain 526 has transmitted an AF memory load message (293) to the memory interface 215 over the asynchronous packet network 265, and the memory interface 215 will then generate another message (296) to the system memory 125 over the first interconnection network 150 to obtain the requested data (returned in message 297). However, the data will be used by tile 210E in the second synchronization domain 538 and is transferred (message 294) to tile 210E. When the first synchronization domain 526 has completed its portion of the pipeline, a tile (210C) in the first synchronization domain 526 transmits an AF continue message (291) to the base tile 210D of the second synchronization domain 538. The AF continue message (291) contains the TID of the thread of the first synchronization domain 526 (e.g., TID = 1, which is also contained in the other messages of fig. 12) and a completion count of 1, indicating that the base tile 210D of the second synchronization domain 538 will wait to start the thread on the second synchronization domain 538 until it receives one completion message.

Thus, when a tile 210 receives such data (e.g., tile 210E in FIG. 12), it acknowledges the receipt by sending a completion message (with the thread ID (TID)) back to the base tile 210, here the base tile 210D of the second synchronization domain 538. As part of the configuration provided to the base tile 210 (either at initial configuration or as part of a continue message) and stored as a completion count in the completion table 422 (or pause table 420), the base tile 210D knows the number of such completion messages that must be received before the thread can begin executing on the tiles 210 of the synchronization domain, in this case the second synchronization domain 538. When a completion message is received by the base tile 210, the completion count in the pause table is decremented for the particular thread having that TID; when the count reaches zero for that thread, indicating that all required completion messages have been received, the base tile 210 may begin executing the thread. To begin execution, the thread's TID is passed to the continue queue 430, from which it is selected to run (at the appropriate spoke count, for the appropriate time slice of the tile 210). It should be noted that data determined during the execution of a thread and passed between the tiles 210 of the synchronization domain over the synchronous mesh communication network 275 does not require a completion message.

This type of thread control has several advantages. This thread control waits for all dependencies to complete before starting a thread, allowing the started thread to have a fixed synchronized execution time. The fixed execution time allows the use of register stages in the entire pipeline, rather than a FIFO. In addition, while one thread of a tile 210 may wait to execute, other threads may execute on the tile 210, thereby providing a much higher overall throughput and minimizing idle time and unused resources.

Similar control is provided when traversing a synchronization domain, such as for executing multiple threads (e.g., for related compute threads forming a compute fabric). For example, the first synchronization domain will inform the base tile 210 of the next synchronization domain in a continue message transmitted over the asynchronous packet network 265 about the number of completion messages it needs to receive in order to start executing the next thread. As another example, for an iterative loop that spans a synchronous domain, the first synchronous domain will inform the base tile 210 of the next synchronous domain in a loop message (with a loop count and the same TID) transmitted over the asynchronous packet network 265 about the number of completion messages it needs to receive in order to begin executing the next thread.

It should also be mentioned that various delays may need to be implemented, for example, when a third tile 210 needs both first data that is already available from a first tile 210 and second data that is still being determined by a second tile 210 in order to make the next calculation. For this case, a delay may be introduced at the output register 380 of the first tile 210 creating the first data, or in the tile memory 325 of the third tile 210. This delay mechanism also applies to data that may be passed from the first tile 210 to the third tile 210 using the second tile 210 as a pass-through path.

The pause table 420 is used to hold or pause the creation of a new synchronous thread in a tile 210 until all required completion messages have been received. A thread from the previous synchronization domain sends a message to the base tile 210 containing the number of completion messages expected for the new synchronous thread and the action to take when all completion messages have been received. The actions include: call, continue, or loop. Many pause operations typically run in parallel; all messages for a particular pause operation (i.e., a set of pause and completion messages) carry the same pause index, which is the TID of the transmitting tile 210. Entries of the pause table 420 are initialized as inactive with a completion delta count of zero. Receipt of a pause message increments the delta count by the number of required completions and sets the pause table 420 entry to active. Receipt of a completion message decrements the delta count by one. It should be noted that a completion message may arrive before the associated pause message, in which case the delta count goes negative. When an entry of the pause table 420 is active and its delta count is zero, the associated activity (e.g., a new thread) is initiated (and the pause table 420 entry is deactivated).
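A C sketch of this pause table behavior follows; the table size and names are assumptions. Note that the delta count may legitimately go negative when completions outrun the pause message, exactly as the text describes.

#include <stdbool.h>
#include <stdint.h>

typedef enum { ACT_NONE, ACT_CALL, ACT_CONTINUE, ACT_LOOP } pause_action_t;

typedef struct {
    int            delta_count;  /* completions still outstanding */
    bool           active;
    pause_action_t action;
} pause_entry_t;

static pause_entry_t pause_table[256];  /* indexed by the pause index (TID) */

void on_pause_msg(uint8_t idx, int completions_required, pause_action_t act) {
    pause_table[idx].delta_count += completions_required;
    pause_table[idx].action = act;
    pause_table[idx].active = true;
}

void on_completion_msg(uint8_t idx) {
    pause_table[idx].delta_count--;
}

/* The associated activity (call, continue, or loop) starts when the entry
 * is active and its delta count has returned to zero. */
bool pause_satisfied(uint8_t idx) {
    return pause_table[idx].active && pause_table[idx].delta_count == 0;
}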

The continue (or call) queue 430 holds threads that are ready to start on the synchronization domain. When all completions for a call operation have been received, the thread is pushed into the continue queue 430. It should be noted that a thread in the continue queue 430 may require a TID and/or XID to be allocated before it can start on the synchronization domain; e.g., if all TIDs are in use, a thread in the continue queue 430 may start once a TID is released and available, i.e., the thread may wait until a TID and/or XID becomes available.

The re-entry queue 445 holds threads that are ready to start on the synchronization domain, with execution of those threads taking precedence over threads in the continue queue 430. When all completions for a continue operation have been received and the thread already has a TID, the thread is pushed into the re-entry queue 445. It should be noted that a thread in the re-entry queue 445 does not need a TID to be allocated. The separate re-entry queue 445 and continue queue 430 are provided to avoid deadlock situations. One particular type of continue operation is a loop. A loop message contains a loop iteration count, which is used to specify the number of times the thread is started once the pause operation has completed.

An optional priority queue 425 may also be implemented, such that any thread having a thread identifier in the priority queue 425 executes before any thread having a thread identifier in the continue queue 430 or the re-entry queue 445.

The state of the iteration index 460 is used when starting threads for loop operations. The iteration index 460 is initialized to zero and is incremented for each thread started. The iteration index 460 is pushed into the run queue 440 together with the thread information from the continue queue 430, and may be used as a selection control for the intermediate (data path) multiplexer 365 within the first (base) tile 210 of the synchronization domain.

Loop iteration count 465 is received as part of the loop message, saved in pause table 420, pushed into continue queue 430, and then used to determine when the appropriate number of threads for the loop operation have started.

The run queue 440 holds ready-to-run threads that have an assigned TID and/or XID and can execute when the appropriate spoke count clock occurs. When a thread starts on the synchronization domain, the TID pool 410 provides a unique thread identifier (TID) to the thread; only threads within the continue queue 430 obtain a TID. The XID pool 415 likewise provides a unique transfer identifier (XID) to a thread when the thread starts on the synchronization domain; a thread from the continue queue 430 may obtain an XID. The assigned XID becomes the XID_WR of the started thread.
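The selection order implied by the preceding paragraphs, with the priority queue 425 served first, then the re-entry queue 445 (threads already holding a TID), then the continue queue 430 (whose threads may still need a TID/XID from the pools), can be sketched in C as follows. The tiny ring-buffer FIFO is purely illustrative.

#include <stdbool.h>

typedef struct { int items[64]; unsigned head, tail; } queue_t;

static bool queue_pop(queue_t *q, int *out) {
    if (q->head == q->tail) return false;
    *out = q->items[q->head++ % 64];
    return true;
}

int pick_next_thread(queue_t *priority, queue_t *reentry, queue_t *cont) {
    int tid;
    if (queue_pop(priority, &tid) || queue_pop(reentry, &tid) ||
        queue_pop(cont, &tid))
        return tid;
    return -1;  /* nothing ready for this spoke slot */
}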

For any given or selected program to be executed, the code or instructions of that program, written or generated in any suitable or selected programming language, are compiled for the system 100 and loaded into the system 100, including instructions for the HTP 300 and the HTF circuit 200, as well as any instructions applicable to the host processor 110, to provide the selected configuration to the system 100. Thus, the respective instruction sets for one or more selected computations are loaded into the instruction RAM 315 and spoke RAM 320 of each tile 210, and into any of the respective registers maintained in the memory interfaces 215 and the HTF scheduling interface 225, thereby providing the configuration for the HTF circuit 200 and, depending on the program, also for the HTP 300.

For example and without limitation, a core is started for execution by one or more HTF circuits 200 with a work descriptor message containing zero or more arguments, typically generated by the host processor 110 or the HTP 300. The arguments are sent to the HTF scheduling interface 225 within the work descriptor AF message; these arguments provide thread-specific input values. The host processor 110 or HTP 300 may use its respective operating system ("OS") to send a "host" message to the core that initializes a location in tile memory 325, where such a host message provides non-thread-specific values. A typical example is a host message that sends the base address of a data structure used by all core threads.

Host messages sent to a core are sent to all HTF circuit groups 205 where the core is loaded. In addition, the ordering between sending host messages and sending core dispatches is maintained; in essence, a host message requires the core to be idle before the message is sent. A completion message ensures that the write to tile memory 325 has completed before a new synchronous thread is started.

Control messaging over the asynchronous packet network 265 is as follows:

(1) The HTF scheduling interface 225 receives the host message and sends an AF data message to the destination tile 210. The destination tile 210 writes the data of the AF data message to the selected memory 325.

(2) The destination tile 210 sends an AF complete message to the HTF scheduling interface 225 confirming that the tile write is complete.

(3) The HTF scheduling interface 225 keeps all new core threads from starting until all message writes have been acknowledged. Once all writes are confirmed, the HTF scheduling interface 225 transmits an AF call message to the base tile of the synchronization domain to start the thread, as sketched below.
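The hold-until-acknowledged rule of steps (1) through (3) can be captured by a simple counter. The following C sketch is illustrative only (the function and field names are hypothetical): each AF data message increments an outstanding-write count, each write acknowledgment decrements it, and the queued AF call message is released only when the count returns to zero.

    #include <stdbool.h>

    /* Stand-in for transmitting the AF call message to the base tile. */
    static void send_af_call_to_base_tile(void) { /* hardware action */ }

    typedef struct {
        int  pending_write_acks;  /* AF data messages awaiting AF complete */
        bool call_queued;         /* a core thread is waiting to be started */
    } dispatch_state;

    static void on_af_data_sent(dispatch_state *d) { d->pending_write_acks++; }

    static void on_write_acknowledged(dispatch_state *d) {
        if (--d->pending_write_acks == 0 && d->call_queued) {
            send_af_call_to_base_tile();  /* all writes confirmed: start */
            d->call_queued = false;
        }
    }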

The HTF scheduling interface 225 is responsible for managing the HTF circuit group 205, including: (1) interacting with the software of system 100 to prepare the HTF circuit group 205 for use by a process; (2) dispatching work to the tiles 210 of the HTF circuit group 205, including loading the HTF circuit group 205 with one or more core configurations; and (3) saving and restoring the context of the HTF circuit group 205 to memory 125 for breakpoints and exceptions. As mentioned above, the registers 475 of the HTF scheduling interface 225 may contain various tables to track what has been dispatched to and received from any of the various tiles 210, such as any of the messaging used in representative embodiments. The primitive operations of the HTF dispatch interface 225 for performing these operations are listed in Table 7.

Table 7:

Figs. 16 and 17 provide examples of message passing and thread control within the system 100, where an example calculation is provided to illustrate how the synchronous mesh communication network 275 and the asynchronous packet network 265 cooperate to perform a simple core, here solving the simple expression R = A + B. To illustrate such messaging, the computation has been divided across two different synchronization domains 526 and 538. The variable B is passed as a host message to all HTF circuit groups 205, and the address of A is passed as an argument to the call in the work descriptor packet. The result R is passed back through the first interconnection network 150 in a return packet. The example performs very few calculations, so the number of messages per calculation performed is extremely high. The performance of the HTF circuit 200 is much higher when a large number of computations are performed within a loop, such that the number of messages per computation is low.

Fig. 16 is a diagram of representative hybrid thread fabric configurable compute circuits (tiles) 210 forming synchronization domains and representative asynchronous packet network messaging for the HTF circuit group 205 to perform a computation. Fig. 17 is a flow diagram of representative asynchronous packet network messaging and execution by the hybrid thread fabric configurable compute circuits (tiles) for the HTF circuit group 205 to perform the computation of Fig. 16. First, at step 506, host processor 110 sends a message to all HTF circuit groups 205 within the node (504). The message is the value of variable B. The message is contained in a single packet, commonly referred to as a work descriptor packet, that is written to the dispatch queue 105 (shown in FIGS. 1 and 2) of the HIF 115 associated with the process. The HIF 115 reads messages from the dispatch queue 105 and sends a copy of the packet to each HTF circuit group 205 assigned to the process. The dispatch interface 225 of each assigned HTF circuit group 205 receives the packet. It should also be noted that the HIF 115 performs various load balancing functions for all HTP 300 and HTF 200 resources.

At step 510, host processor 110 sends a call message to one of the HTF circuit groups 205 assigned to the process (508). The host processor 110 may manually target a particular HTF circuit group 205 to execute a core or allow the HTF circuit group 205 to be selected automatically. Host processor 110 writes the call parameters to a scheduling queue associated with the process. The call parameters include the core address, the start instruction, and a single argument (the address of variable A). The host interface (HIF) 115 reads the queued messages and forwards the messages as packets over the first interconnect network 150 to the assigned HTF circuit group 205, typically the HTF circuit group 205 with the least load.

At step 514, the HTF scheduling interface 225 receives the host message (the value of variable B), waits until all previous calls to the HTF circuit group 205 are completed, and sends the value to the first selected destination tile 210D over the asynchronous packet network 265 using an AF message (512). The HTF scheduling interface 225 has a table of information, stored in registers 475, for each possible host message that indicates the destination tile 210D, the tile memory 325, and the memory region (in RAM 405) of that tile 210D. At step 518, tile 210D writes the value to its memory 325 using the message information and, once the value is written to tile memory, sends a write complete AF message (516) back to the HTF scheduling interface 225 over the asynchronous packet network 265.

The HTF scheduling interface 225 waits for all message completion messages to arrive (in this case, only a single message). Once all completion messages have arrived, at step 522 the HTF scheduling interface 225 sends the call argument (the address of variable A) in an AF message (520) to the second selected destination tile 210B to write the value into tile memory 325. The HTF scheduling interface 225 has a call argument table, stored in registers 475, that indicates the destination tile 210B, the tile memory 325, and the memory region (in RAM 405) of that tile 210B.

Subsequently, at step 528, the HTF scheduling interface 225 sends an AF call message to the base tile 210A of the first synchronization domain 526 (524). The AF call message indicates that a single completion message should be received before the call can begin execution through the tile 210 pipeline of the synchronization domain. The required completion message has not yet arrived, so the call is paused.

At step 532, once the value is written to the tile memory 325 of the tile 210B, a write complete message (530) is sent by tile 210B to the base tile 210A of the first synchronization domain 526 via the asynchronous packet network 265.

Base tile 210A has received the call message (524) and the required completion message (530), and is now ready to initiate execution on the synchronization domain 526 (the tile pipeline). At step 536, base tile 210A initiates execution by providing initial instructions and a valid signal (534) to tile 210B via the synchronous mesh communication network 275. Base tile 210A assigns an XID value from the XID pool 415 for use in the first synchronization domain 526. If the XID pool 415 is empty, base tile 210A must wait until an XID is available before beginning the synchronization pipeline.

As execution continues, at step 542, tile 210B or another tile 210E within the first synchronization domain 526 sends an AF continue message to base tile 210C of the second synchronization domain 538 (540). The continue message contains the number of required completion messages (in this case, a single completion message) that must be received before the second synchronization domain 538 can initiate execution. The continue message also contains a delivery identifier (XID). The XID is used as a write index in one synchronization domain (526) and then as a read index in the next synchronization domain (538); the XID thereby provides a tile memory index common to the different synchronization domains.
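A minimal C sketch of this write-then-read indexing, under the assumption (for illustration only) that the shared region is a simple array: the XID allocated in the first domain selects the slot written (XID_WR), and the same value, carried in the continue message, selects the slot read (XID_RD) in the next domain.

    #include <stdint.h>

    #define XID_SLOTS 16
    static uint64_t transfer_region[XID_SLOTS];  /* a tile memory region */

    /* First synchronization domain: XID serves as the write index (XID_WR). */
    static void domain1_write(uint8_t xid_wr, uint64_t value) {
        transfer_region[xid_wr] = value;
    }

    /* Second synchronization domain: the same value, delivered in the AF
     * continue message, serves as the read index (XID_RD). */
    static uint64_t domain2_read(uint8_t xid_rd) {
        return transfer_region[xid_rd];
    }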

At step 546, tile 210B or another tile 210F within first synchronization domain 526 sends an AF memory load message to memory interface 215 of HTF circuit group 205 (544). The message contains the request ID, virtual address, and XID to be used as an index to write payload data to the destination tile (210G) memory 325.

The memory interface 215 receives the AF load message and translates the virtual address to a node-local physical address or a remote virtual address. Using the request ID of the AF message, the memory interface 215 indexes into a request table, stored in registers 485, that contains the parameters of the memory request. At step 550, the memory interface 215 issues a load memory request packet (548) onto the first interconnect network 150 with the translated address and the size information from the request table.
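The request-table lookup may be sketched as follows in C; the table layout and helper names are assumptions for illustration, not the actual register format. The request ID from the AF message selects an entry whose parameters, together with the translated address, form the outgoing load packet.

    #include <stdint.h>

    typedef struct {
        uint16_t size;         /* access size for the memory request */
        uint8_t  dest_tile;    /* tile receiving the load payload */
        uint8_t  dest_region;  /* tile memory region for the payload */
    } request_entry;

    static request_entry request_table[64];   /* indexed by AF request ID */

    static uint64_t translate(uint64_t vaddr) { return vaddr; }  /* stub */
    static void issue_load_packet(uint64_t addr, uint16_t size,
                                  uint8_t req_id, uint8_t xid) { /* stub */ }

    /* On an AF load message: translate the virtual address, look up the
     * request parameters by request ID, and issue the interconnect packet. */
    static void on_af_load(uint8_t req_id, uint64_t vaddr, uint8_t xid) {
        request_entry *e = &request_table[req_id];
        issue_load_packet(translate(vaddr), e->size, req_id, xid);
    }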

Subsequently, at step 554, the memory interface 215 receives a memory response packet (552) with the load data (the value of variable A) over the first interconnection network 150. At step 558, memory interface 215 sends an AF message (556) to tile 210G within the second synchronization domain 538. The AF message contains the value of variable A, which is written to tile memory using the parameters from the request table stored in registers 485.

Once the value is written to the tile memory, at step 562 an AF write complete message (560) is sent to the base tile 210C of the second synchronization domain 538 over the asynchronous packet network 265.

Base tile 210C of the second synchronization domain 538 receives the continue message (540) and the required completion message (560), and is ready to initiate execution on the second synchronization domain 538 (the tile pipeline). At step 566, base tile 210C initiates execution by providing an initial instruction and a valid signal (564) to a tile 210 (e.g., tile 210H) of the second synchronization domain 538. Base tile 210C also allocates an XID value from the XID pool for use in the second synchronization domain 538.

At step 568, tile 210H within the second synchronization domain performs an addition operation of the B value passed from the host message and the A value read from system memory 125. The resulting value is the R value of the expression.

At step 572, tile 210J within the second synchronization domain sends an AF message containing the R value to the HTF scheduling interface 225 (570). The AF message contains the XID value assigned by base tile 210A. The XID value is used within the HTF scheduling interface 225 as an index to a table, stored in registers 475, that holds the return parameters until the value has been read and a return message has been generated for transmission over the first interconnection network 150.

At step 576, an AF message (574) from the second synchronization domain (tile 210K) sends the XID value assigned in the first synchronization domain back to base tile 210A for return to the XID pool. At step 580, a first interconnection network 150 message (578) from the HTF dispatch interface 225 is sent to the HIF 115. The HIF writes the return work descriptor to the dispatch return queue. At step 584, once the first interconnection network 150 has sent the return packet, the XID value is sent in an AF message (582) by the HTF scheduling interface 225 to base tile 210C of the second synchronization domain 538 for return to the XID pool.

It should be noted that in this example of fig. 16 and 17, to illustrate the various AF messages that may be used for thread control, a number of tiles 210 are used. In practice, particularly for such simple calculations, much fewer tiles 210 may be used in order to perform the calculations entirely within a single synchronization domain.

Another message passing example for thread control across multiple synchronous domains is provided in FIGS. 18 and 19, again using AF complete and continue messages over asynchronous packet network 265. FIG. 18 is a diagram of representative hybrid thread fabric configurable compute circuits (tiles) forming synchronous domains and representative asynchronous packet network messaging for a group of hybrid thread fabric circuits to perform computations. FIG. 19 is a flow diagram of representative asynchronous packet network messaging and execution by a hybrid thread fabric configurable compute circuit (tile) for a group of hybrid thread fabric circuits to perform the computation of FIG. 18.

For this example, the HTF scheduling interface 225 sends a message to the base tile 210A of the first synchronization domain 526. The message starts a thread on the first synchronization domain 526. The thread sends a thread continue message to the second synchronization domain 538; the continue message indicates that a thread is to be started on the second synchronization domain 538 when the specified number of completion messages have been received. The first synchronization domain 526 then sends a completion message to the second synchronization domain 538, completing the pause and starting the second synchronous thread. The second thread sends a completion message back to the HTF dispatch interface 225 indicating that the second synchronous thread is complete, completing the dispatched core. Additional messages to release the TID and XID identifiers are shown in FIG. 18.

At step 604, the HTF scheduling interface 225 receives the work descriptor packet (602), ensures that the correct core configuration has been loaded, determines that the XID and TID pools are not empty, and obtains the XID and TID values of the new work thread from the TID and XID pools stored in registers 475 within the HTF scheduling interface 225. At step 608, the HTF scheduling interface 225 begins core execution by sending an AF call message (606), with the assigned XID and TID values (e.g., XID = 3 and (first type) TID = 11), to base tile 210A of the first synchronization domain 526. At step 610, base tile 210A receives the AF call message (606), determines that the TID and XID pools (410, 415) are not empty, and assigns TID and XID values (e.g., XID_WR = 7 and (second type) TID = 1); because the spoke RAM 320 is selecting the base tile as the input to the tile data path, the base tile begins execution with the first instruction specified by the instruction index saved in its spoke RAM 320, rather than with an instruction index that might be provided by a previous tile 210 (e.g., as discussed in more detail below with respect to conditional execution).

At step 614, base tile 210A starts the first thread (612) on the first synchronization domain 526, where a TID value is assigned (e.g., (second type) TID = 1), XID_RD is assigned the value from the AF call message (606) (e.g., XID_RD = 3), XID_WR is assigned the value obtained from the XID pool (e.g., XID_WR = 7), and the (first type) TID is taken from the AF call message (606) (e.g., (first type) TID = 11).

At step 618, as the computation continues in the first synchronization domain 526, another tile 210B within the first synchronization domain 526 sends an AF continue message to base tile 210D of the second synchronization domain 538 (616). When the appropriate number of completion messages arrive, the AF continue message (616) provides the information needed to start the second thread on the second synchronization domain 538. The AF continue message (616) contains a completion count field specifying the number of required completion messages. One tile (210C) in the first synchronization domain 526 also transmits a free XID message (e.g., XID = 3) (641) to the HTF scheduling interface 225.

The AF continue message (616) may include the TID or XID_WR value as an index into pause table 420 on destination base tile 210D. At step 620, the pause table accumulates the received completion messages and determines when the desired number has been reached and a new thread can begin. Tile 210B, which sends the AF continue message (616), sends the selected TID or XID_WR value as the PID (pause table index) and changes the downstream TID value of the synchronization domain to the selected value (e.g., (first type) TID = 11, (second type) TID = 1). This new TID value is passed in all AF complete messages, to be used as an index into the pause table 420 of base tile 210D.

At step 624, an AF complete message (622) with the TID value (e.g., (first type) TID = 11) is sent to base tile 210D of the second synchronization domain 538. The AF complete message (622) decrements the delta count field of the corresponding entry in pause table 420 of base tile 210D. The AF complete message (622) and the AF continue message (616) may arrive in any order; the last arriving message will observe that the AF continue message (616) has arrived and that the delta count field in the pause table 420 has reached zero. This condition indicates that the pause has completed and that the second synchronous thread (626) may begin. At step 628, base tile 210D also determines or observes that the pause operation has completed, determines that the XID identifier pool is not empty and assigns an XID (e.g., XID = 5), and the spoke RAM selects the base tile as the input to the tile data path.
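The order-independent pause handshake can be sketched in C as follows (entry layout and function names are hypothetical): the continue message adds the required completion count and the complete messages subtract from it, so whichever message arrives last observes a zero count and starts the thread.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        int  delta_count;    /* completions still outstanding */
        bool continue_seen;  /* the AF continue message has arrived */
    } pause_entry;

    static pause_entry pause_table_model[32]; /* indexed by PID (TID or XID_WR) */

    static void start_sync_thread(uint8_t pid) { /* hardware action */ }

    static void on_af_continue(uint8_t pid, int completion_count) {
        pause_table_model[pid].delta_count += completion_count;
        pause_table_model[pid].continue_seen = true;
        if (pause_table_model[pid].delta_count == 0)
            start_sync_thread(pid);   /* completions arrived first */
    }

    static void on_af_complete_msg(uint8_t pid) {
        pause_table_model[pid].delta_count--;
        if (pause_table_model[pid].delta_count == 0 &&
            pause_table_model[pid].continue_seen)
            start_sync_thread(pid);   /* continue arrived first */
    }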

Next, at step 630, base tile 210D starts the second synchronous thread (626) through the second synchronization domain 538, where the TID and XID_RD are assigned the values obtained from the AF continue message (616) (e.g., (first type) TID = 11, (second type) TID = 1, and XID_RD = 7), and XID_WR is assigned the value obtained from the XID pool in step 628 (e.g., XID_WR = 5).

When the computation of the second synchronous thread (626) is complete, several housekeeping messages are sent by the various tiles 210 of the second synchronization domain 538. At step 634, an AF free TID message (632) is sent to base tile 210A of the first synchronization domain 526, and at step 636, the receiving base tile 210A adds the TID value back to TID pool 410 so that it is available for use again. At step 640, an AF free XID message (638) is sent to base tile 210A of the first synchronization domain 526, and at step 642, the receiving base tile 210A adds the XID value back to XID pool 415 so that it is again available for use. At step 646, an AF complete message (644) is sent to the HTF dispatch interface 225 indicating that the second synchronous thread 626 has completed; the HTF scheduling interface 225 maintains a count of expected completion messages. The AF complete message (644) carries the XID_WR and TID values (e.g., (first type) TID = 11) of the second synchronization domain 538 for the dispatch interface. Next, at step 650, the HTF scheduling interface 225 sends an AF free XID message (648) to base tile 210D of the second synchronization domain 538. Then, at step 652, the receiving base tile 210D adds the XID value back to XID pool 415 so that it is again available for use.

Data is transferred from one synchronization domain to the next using a data transfer operation. Typically, a data transfer is used in conjunction with load operations that obtain data from memory 125: in addition to the load payload data arriving in the second synchronization domain 538, calculated data from the first synchronization domain 526 is also needed in the second synchronization domain 538. In this case, a single pause is sent from the first synchronization domain 526 to the second synchronization domain 538, with a completion count equal to the total count of all load and data transfer operations.

Next, data transfer operations between the synchronization domains utilize a variation of step 624. Instead of sending the AF complete message in step 624, the first synchronization domain 526 sends an AF data message with the data to the second synchronization domain 538 (622). The destination tile 210 in the second synchronization domain 538 writes the data within the AF data message to the selected tile memory 325. The tile 210 receiving the AF data message then sends an AF complete message to the base tile 210 of the second synchronization domain 538. Once the payload data has arrived in the second synchronization domain 538, the base tile 210 of the second synchronization domain 538 may start a second thread on the second synchronization domain 538.

Control of iterative thread loops across synchronization domains utilizes a similar control messaging pattern. The loop message flow allows multiple synchronization domains to be started with a single loop message, and each of the started synchronous threads can access its iteration index. FIG. 20 is a diagram of representative hybrid thread fabric configurable compute circuits (tiles) forming synchronization domains and representative asynchronous packet network messaging for the group of hybrid thread fabric circuits to perform a loop in a computation. Fig. 21 is a flow diagram of representative asynchronous packet network messaging and execution by the hybrid thread fabric configurable compute circuits (tiles) for the group of hybrid thread fabric circuits to perform the loop in the computation of FIG. 20.

FIG. 20 shows three synchronization domains, namely a first synchronization domain 526, a second synchronization domain 538, and a third synchronization domain 654. The first synchronization domain 526 is used for pre-loop setup, the second synchronization domain 538 starts an iteration count (IterCnt) number of threads, and the final, third synchronization domain 654 is the post-loop domain. It should be noted that loops may likewise be nested using additional index levels, as discussed in more detail below.

Referring again to FIG. 11, control register 340 contains a completion table 422 (or a pause table 420). For loops, two types of completion information are maintained in completion table 422: a first completion count related to the number of completion messages that should arrive before a thread can begin, as discussed above, and a second loop or iteration (completion) count for tracking the number of loop threads that have started and completed. A loop begins by sending an AF loop message containing the loop count (and the respective TIDs, as discussed below) to the base tile 210 of the synchronization domain. The loop count is stored in completion table 422 (or pause table 420) and is used to determine the number of times a new thread is started on the synchronization domain. In one embodiment, each thread starts with a new TID obtained from TID pool 410, so that each active thread has a unique TID, allowing thread-private variables to exist. A thread of a nested loop can access the data or variables of its own TID, plus the TID of the outer loop. In a second embodiment, discussed below, the TIDs are reused by successive threads of the loop.

The TID is returned to TID pool 410 by sending an AF message from a tile within the synchronization domain at the termination of the thread, which may be an AF complete message or, for the second embodiment, an AF re-entry message. This may also be accomplished with a free TID message to the base tile 210. The AF message returning the TID to the pool (or reusing the TID) is also used by the loop base tile 210 to maintain a count of the number of active loop threads in the loop count of the completion table 422 (or the pause table 420). When the number of active loop threads reaches zero, the loop is complete. When loop completion is detected by the loop count reaching zero, an AF complete message is sent to the post-loop synchronization domain to signal completion. This mechanism provides minimal (if not zero) idle time for nested loops, resulting in better performance.
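Illustratively, this loop bookkeeping reduces to a counter in the completion table. The C sketch below is a simplification under assumed names (in hardware, iteration threads start as TIDs become available rather than all at once): the AF loop message sets the count, each terminating or re-entering thread decrements it, and the post-loop domain is notified at zero.

    /* Model of the loop count kept in completion table 422. */
    typedef struct { int active_loop_threads; } loop_entry;

    static void start_loop_thread(int iteration_index) { /* hardware action */ }
    static void notify_post_loop_domain(void) { /* AF complete to post-loop */ }

    static void on_af_loop(loop_entry *e, int iter_cnt) {
        e->active_loop_threads = iter_cnt;   /* loop count saved in the table */
        for (int i = 0; i < iter_cnt; i++)
            start_loop_thread(i);            /* each thread sees its index */
    }

    static void on_loop_thread_terminated(loop_entry *e) {
        if (--e->active_loop_threads == 0)   /* all iterations finished */
            notify_post_loop_domain();
    }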

Referring to FIGS. 20 and 21, at step 658, the first synchronization domain 526 (shown as tile 210B, but which may be any other tile in the first synchronization domain 526) sends an AF continue message (656) to the base tile 210D of the third, post-loop synchronization domain 654 to wait for a loop complete message (which will come from the second synchronization domain 538). At step 664, one tile in the first synchronization domain 526 (shown as tile 210B) sends an AF loop message with an iteration (loop) count (660) to the base tile 210C of the loop domain, which is the second synchronization domain 538. Base tile 210C starts IterCnt loop threads (662, e.g., threads 662(0), 662(1), through 662(N-1), where "N" is the iteration count (IterCnt)). Each thread 662 has the same TID and XID_RD identifiers. XID_WR identifiers are assigned by loop base tile 210C (if enabled). The iteration index (i.e., ordered from zero to IterCnt-1 (N-1)) is available as the data path multiplexer selection in base tile 210C as a loop field.

Next, at step 668, each iteration thread of the loop domain sends an AF complete message (666) back to the base tile 210C of the second synchronization (loop) domain 538. It should be noted that the second synchronization domain 538 shown in fig. 20 may actually be several synchronization domains. In the case where multiple synchronization domains form a loop, the threads of the last synchronization domain in the loop should transmit the AF complete messages (666), so that the third, post-loop synchronization domain 654 appropriately waits for all loop operations to complete. Once the base tile 210C of the second synchronization (loop) domain 538 has received all iteration AF complete messages (666), it sends a loop AF complete message (or AF continue message) to the base tile 210D of the third (post-loop) synchronization domain 654 (670).

For example, and without limitation, for loops including nested and doubly nested loops, several additional novel features are utilized in order to minimize idle time, including the re-entry queue 445 and additional sub-TIDs, e.g., TID2 for the outermost loop, TID1 for an intermediate or mediating loop, and TID0 for the innermost loop. Each thread executing in a loop also has a unique TID, e.g., TID2 values of 0 through 49 for an outer loop having fifty iterations, which are also used in the corresponding completion messages as each iteration completes execution, again by way of example and not limitation.

Referring again to FIG. 11, several novel mechanisms are provided to support efficient looping and minimize idle time. For example, a loop with a data-dependent end condition (e.g., a "while" loop) requires that the end condition be calculated as the loop executes. Also, for control and execution of loops, a potential deadlock problem may arise if all TIDs have been allocated from TID pool 410 but the thread at the head of the queue is a new loop that cannot execute due to the lack of an available TID, preventing other loop threads from completing and releasing their allocated TIDs. Thus, in a representative embodiment, control register 340 includes two separate queues for ready-to-run threads, with a first queue used to initiate new loops (the continue queue 430, also used for non-looping threads) and a separate second queue (the re-entry queue 445) used for loop continuation. The continue queue 430 allocates TIDs from TID pool 410 to start a thread, as previously discussed. The re-entry queue 445 reuses the previously allocated TID as each iteration of the loop thread executes and transmits an AF re-entry message with the previously allocated TID. Any thread (TID) in the re-entry queue 445 moves to the run queue 440 ahead of threads (TIDs) that may be in other queues (such as the continue queue 430). Thus, once a loop starts, loop iteration can proceed extremely quickly, with each subsequent thread of the loop starting quickly through the separate re-entry queue 445, and, further, without potential deadlock. The re-entry queue 445 also allows loops with data-dependent end conditions to run efficiently and without interruption through the last iteration that produces the end condition.
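A C sketch of this two-queue policy follows; the queue representation and helpers are hypothetical simplifications. Because re-entering iterations already hold a TID, they are served first and can never be blocked by an empty TID pool, which is the deadlock the text describes.

    #include <stdbool.h>

    typedef struct { int tids[32]; int head, tail; } queue;

    static bool q_empty(const queue *q) { return q->head == q->tail; }
    static int  q_pop(queue *q)         { return q->tids[q->head++ % 32]; }

    static int tid_pool_count = 32;  /* free TIDs in TID pool 410 */
    static int tid_alloc(void)      { return --tid_pool_count; }

    /* Returns the TID of the next thread to move to run queue 440, or -1.
     * Re-entry threads take priority and need no new TID. */
    static int select_ready(queue *reentry_q, queue *continue_q) {
        if (!q_empty(reentry_q))
            return q_pop(reentry_q);       /* loop continuation: reuse TID */
        if (!q_empty(continue_q) && tid_pool_count > 0) {
            (void)q_pop(continue_q);
            return tid_alloc();            /* new thread takes a fresh TID */
        }
        return -1;                         /* nothing can start this cycle */
    }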

Referring again to figs. 9 and 10, the control register 340 includes a memory area RAM 405. In various embodiments, the memory area RAM 405 may also maintain a top-of-stack TID identifier for nested loops, as described below. As mentioned above, each nested loop initiates a thread with a new (or reused) set of TIDs. A looping thread may need to access its own TID as well as the TIDs of outer loop threads. Accessing the TID of each nested loop thread allows access to the private variables of each thread, e.g., the TIDs of the different levels or types described above, TID0, TID1, and TID2. The top-of-stack TID identifier indicates the TID of the active thread and is used to select which of the three TIDs (TID0, TID1, and TID2) is used for a respective operation. The three TIDs and the top-of-stack TID identifier are included in the messaging passing through the synchronous mesh communication network 275 and are therefore known to each thread. Because multiple TIDs, together with the top-of-stack TID identifier, are included within each synchronous fabric message, threads in a nested loop can access variables from any level within the nested loop threads. A private thread variable is accessed using the selected TID and the tile memory area RAM 405 identifier.
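As an illustrative model only (the region layout and names are assumptions), the three TIDs plus the top-of-stack identifier carried in each synchronous fabric message can be represented as a small struct, with the selected level's TID indexing the per-level private variables:

    #include <stdint.h>

    /* tid[0] is the innermost TID0, tid[1] is TID1, and tid[2] is the
     * outermost TID2; 'top' is the top-of-stack TID identifier naming the
     * active thread's level. */
    typedef struct {
        uint8_t tid[3];
        uint8_t top;
    } tid_stack;

    static uint64_t tile_region[3][32];  /* hypothetical per-level regions */

    /* A nested-loop thread may read a private variable at its own level or
     * at any outer level; the selected TID indexes the memory region. */
    static uint64_t read_private(const tid_stack *s, uint8_t level) {
        return tile_region[level][s->tid[level]];
    }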

Another novel feature of the present disclosure is a mechanism for ordering loop thread execution to handle loop iteration dependencies, which also accommodates any delays in completion messages and data received over the asynchronous packet network 265. FIG. 23 is a diagram of representative tiles 210 forming synchronization domains in which a loop has been placed, and of representative asynchronous packet network messaging and synchronous messaging for execution of the computation by the hybrid thread fabric circuit group. As shown in fig. 23, a plurality of synchronization domains 682, 684, and 686, i.e., a second synchronization domain 682, a third synchronization domain 684, and a fourth synchronization domain 686, as well as a pre-loop first synchronization domain 526 and a post-loop (fifth) synchronization domain 654, are involved in performing the loop calculation. The loop computation can be any kind of loop, including nested loops, in which case there are data dependencies within each loop. These data dependencies may occur within a single iteration, for example when information from memory 125 is needed, which involves AF messaging over the asynchronous packet network 265. Thus, thread execution should proceed in a defined order, and not simply whenever any particular thread reaches a zero completion count (i.e., when all completion messages for the thread have arrived and the thread is not waiting on any data).

To provide ordered loop thread execution, in a representative embodiment, additional messaging and additional fields in completion table 422 are utilized for each loop iteration. The loop base tile 210B provides four pieces of information (for each loop iteration) that are passed through each synchronization domain 682, 684, 686 in a synchronization message 688 over the synchronous mesh communication network 275 (i.e., to each successive tile 210 in the given synchronization domain), and in an AF continue message 692 over the asynchronous packet network 265 to the base tile 210 of the successive synchronization domain (and from there in a synchronization message to each successive tile 210 in that synchronization domain). Those four information fields are then stored and indexed in completion table 422 and used for comparison as the loop execution advances. The four pieces of information are: a first flag indicating the first thread of the set of threads of the loop, a second flag indicating the last thread of the set of threads of the loop, the TID of the current thread, and the TID of the next thread. The TID of the current thread is obtained from the TID pool, and the TID of the next thread is the TID from the pool that will be provided to the next thread. These four pieces of information are used by the base tile of each successive synchronization domain to order the start of the threads: a thread may be started if its dependency count has reached zero and it is either the first thread of the loop or its TID is equal to the next TID of the previously started thread.

In other words, for each thread that has received all of its data completions (and is therefore ready to run), the thread control circuitry 330 (which generally includes various state machines) checks the completion table 422 to determine whether the thread is the next thread to run (i.e., has the next thread ID, e.g., TID = 4). If so, the thread (TID = 4) is moved into the run queue 440; if not (e.g., a thread whose data completion count has reached zero but whose TID = 5), the thread is not started, but an index of its TID is maintained so that it can be started next. When the data completion count of the thread with the next TID (here, TID = 4) decrements to zero, such that all of its completion messages have arrived, that thread is queued for execution; the executing thread (TID = 4) also carries its own next TID, in this case TID = 5. Thus, when the thread with TID = 4 has completed, the thread control circuitry 330 checks the completion table 422, determines that the thread with TID = 5 now holds the next thread ID, and queues that thread for execution. When a thread's TID is the last TID, an AF complete message (656) can be transmitted after its execution to the post-loop base tile (here, 210E). It should be noted that this use of additional fields in completion table 422 may be extended to any situation in which a particular ordering of thread execution should be maintained.
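A compact C sketch of the ordering check (field names hypothetical): a ready thread starts only if it is the loop's first thread and nothing has started yet, or its TID matches the next-TID recorded for the previously started thread.

    #include <stdbool.h>
    #include <stdint.h>

    /* The four per-iteration fields stored in completion table 422. */
    typedef struct {
        bool    first, last;       /* first/last thread of the loop set */
        uint8_t tid, next_tid;     /* this thread's TID and its successor's */
        int     completions_left;  /* data completions still outstanding */
    } ordered_entry;

    static bool may_start(const ordered_entry *e, bool any_started,
                          uint8_t prev_next_tid) {
        if (e->completions_left != 0)
            return false;              /* still waiting on data */
        if (!any_started)
            return e->first;           /* only the first thread may open */
        return e->tid == prev_next_tid;
    }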

Fig. 24 is a circuit block diagram of a representative embodiment of the conditional branch circuitry 370. A synchronization domain, such as the first, second, and third synchronization domains mentioned above, is a set of interconnected tiles connected in sequence or in series through the synchronous mesh communication network 275. Execution of a thread begins at the first tile 210 of the synchronization domain, referred to as the base tile 210, and proceeds from there to the other tiles 210 of the synchronization domain through the configured connections of the synchronous mesh communication network 275. As shown in FIG. 24, when a tile 210 has been configured as the base tile 210 for the synchronization domain (those configurations having been loaded into HTF circuit 200 before runtime), the selection 374 of configuration memory multiplexer 372 is set equal to 1, thereby selecting the spoke RAM 320 to provide the instruction index used to select an instruction from instruction RAM 315. For all other tiles 210 of the synchronization domain, the selection 374 of configuration memory multiplexer 372 is set equal to 0, thereby selecting the instruction index provided by the previous tile 210 in the sequence of tiles 210 of the synchronization domain. Thus, the base tile 210 provides the instruction index (or instruction) to be executed to the next, second tile of the domain through the designated fields (or portions) of communication lines (or wires) 270B and 270A that have been designated as the master synchronous input, as mentioned above. By default, this next tile 210 and each subsequent tile 210 of the synchronization domain will then provide the same instruction, as a static configuration, to each next connected tile 210 for execution.

However, in the representative embodiment, a mechanism is provided for dynamic self-configuration, using the spoke RAM 320, the instruction RAM 315, and the conditional branch circuitry 370. Referring to FIG. 24, for the current tile 210, the ALB Op 310 may be configured to generate an output that is the result of a test condition, e.g., whether one input is greater than a second input. The test condition output is provided to the conditional branch circuitry 370 over a communication line (or conductor) 378. When the conditional branch circuitry 370 is enabled (by one or more bits of the instruction provided on line (or wire) 379), the test condition output is used to select the next instruction index (or instruction) provided to the next tile 210 of the synchronization domain, selecting between an "X" instruction and a "Y" instruction for the next tile 210 and thereby providing a conditional branch of the data path as the first or the second instruction is selected. Such conditional branches may also be cascaded, for example, when the next tile 210 is also enabled to provide a conditional branch. By selecting the next instruction for one or more of the subsequent tiles 210, dynamic self-configuration and self-reconfiguration are enabled in each such HTF circuit group 205.

In a representative embodiment, the conditional branch circuitry 370 is arranged to select or switch between two different instructions depending on the test condition result. The branch enable is provided in a field of the current (or current next) instruction and is provided to the AND gate 362 of the conditional branch circuitry 370, where it is ANDed with the test condition output. Depending on whether the test condition output is a logic "0" or "1", AND gate 362 generates a logic "0" or "1" as an output, which is provided as an input to OR gate 364. Another designated bit of the selected field of the current next instruction index, typically the least significant bit ("LSB") of the next instruction index, is also provided to OR gate 364, where it is ORed with the output of AND gate 362. If the LSB of the next instruction index is zero and is ORed with a logic "1" from the output of AND gate 362, the output next instruction index is incremented by one, providing a different next instruction index to the next tile 210. If the LSB of the next instruction index is zero and is ORed with a logic "0" from the output of AND gate 362, the output next instruction index is unchanged, providing the same next instruction index to the next tile 210. Thus, the current tile 210 conditionally specifies an alternate instruction for execution by the connected tile 210, enabling execution of one or more case statements in the HTF circuit group 205. The alternate instruction is selected by causing the data path of the current tile to generate a Boolean condition value and using that Boolean value to select between the current tile's instruction and the alternate instruction provided as the next instruction index to the next tile 210 in the synchronization domain. As a result, the current tile 210 has dynamically configured the next tile 210, and so on, enabling dynamic self-configuration and self-reconfiguration in each HTF circuit group 205.
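The gate-level selection just described reduces to a single expression; the following C sketch restates it (the function name is hypothetical). With the LSB of the configured next instruction index at zero, a taken branch ORs in a one, selecting the alternate instruction at index + 1.

    #include <stdbool.h>
    #include <stdint.h>

    static uint16_t next_instruction_index(uint16_t next_index,
                                           bool branch_enable,
                                           bool test_condition) {
        bool taken = branch_enable && test_condition;  /* AND gate 362 */
        return next_index | (uint16_t)taken;           /* OR gate 364, LSB */
    }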

IV. Hybrid thread processor 300:

fig. 25 is a high-level block diagram of a representative embodiment of a hybrid thread processor ("HTP") 300. Fig. 26 is a detailed block diagram of a representative embodiment of the thread storage memory 720 (also referred to as thread control memory 720) of the HTP 300. Fig. 27 is a detailed block diagram of a representative embodiment of the network response store 725 of the HTP 300. Fig. 28 is a detailed block diagram of a representative embodiment of the HTP 300. Fig. 29 is a flow diagram of a representative embodiment of a method for self-scheduling and thread control of the HTP 300.

The HTP 300 generally includes one or more processor cores 705, which may be any type of processor core, such as a RISC-V processor core, an ARM processor core, and so forth, all by way of example and not limitation. A core control circuit 710 and a core control memory 715 are provided for each processor core 705, and are shown in FIG. 25 for one processor core 705. For example, when multiple processor cores 705 are implemented, such as in one or more HTPs 300, a corresponding plurality of core control circuits 710 and core control memories 715 are also implemented, where each core control circuit 710 and core control memory 715 is used to control a corresponding processor core 705. In addition, one or more of the HTPs 300 may also include data path control circuitry 795 for controlling the size of accesses (e.g., load requests to the memory 125) over the first interconnection network 150, to manage potential congestion of the data path.

Core control circuitry 710, in turn, includes control logic and thread selection circuitry 730 and network interface circuitry 735. The core control memory 715 includes a plurality of registers or other memory circuits, conceptually divided and referred to herein as a thread memory (or thread control memory) 720 and a network response memory 725. For example, and without limitation, the thread storage 720 contains a plurality of registers for storing information related to thread status and execution, while the network response storage 725 contains a plurality of registers for storing information related to data packets transmitted to and from the first memory 125 over the first interconnection network 150, such as requests to read or store data to the first memory 125.

Referring to FIG. 26, thread storage 720 includes a plurality of registers, including: a thread ID pool register 722 (storing a predetermined number of available thread IDs, e.g., populated with identification numbers 0 through 31 for a total of 32 thread IDs when the system 100 is configured, by way of example and not limitation); thread status (table) registers 724 (storing thread information such as valid, idle, paused, waiting for an instruction, first (normal) priority, second (low) priority, and temporary changes of priority when resources are not available); a program counter register 726 (e.g., storing the address or virtual address in instruction cache 740 where the thread starts next); general purpose registers 728 for storing integer and floating point data; a pending fiber return count register 732 (which tracks the number of created threads whose returns are still pending before execution can complete); a return argument buffer 734 ("RAB", e.g., a head RAB serving as the head of a linked list of return argument buffers); a thread return register 736 (e.g., storing a return address, a call identifier, and any thread identifier associated with the calling thread); a custom atomic transaction identifier register 738; a received event mask register 742 (used to specify which events to "listen" to, as discussed in more detail below); an event status register 744; and a data cache 746 (typically 4 to 8 cache lines of cache memory provided per thread). All of the various registers of thread memory 720 are indexed using the assigned thread ID of a given or selected thread.

Referring to fig. 27, network response memory 725 contains a plurality of registers, such as memory request (or command) register 748 (e.g., a command to read, write, or perform a custom atomic operation); a thread ID and transaction identifier ("transaction ID") register 752 (where the transaction ID is used to track any requests to memory and associate each such transaction ID with the thread ID of the thread that generated the request to memory 125); a request cache line index register 754 (used to specify which cache line in the data cache 746 to write when data is received from the memory of a given thread (thread ID)); register byte register 756 (specifying the number of bytes written to general register 728); and general register index and type registers 758 (indicating which general register 728 to write and whether it is a sign extension or floating point).

As described in more detail below, the HTP 300 receives a work descriptor packet. In response, the HTP 300 finds an idle or empty context, initializes a context block, assigns a thread ID to the executing thread (collectively referred to herein as a "thread") if one is available, and places the thread ID in the execution (i.e., "ready-to-run") queue 745. The threads in the execution (ready-to-run) queue 745 are typically selected for execution in a round-robin or "barrel" selection process: a single instruction of a first thread is provided to the execution pipeline 750 of the processor core 705, then a single instruction of the second thread, then a single instruction of the third thread, and so on, until all threads in the execution (ready-to-run) queue 745 have provided an instruction to the execution pipeline 750. The selection then begins again, with the next instruction of the first thread in the queue provided to the execution pipeline 750, followed by the next instruction of the second thread, and so on, cycling through all threads of the execution (ready-to-run) queue 745. Execution continues for each such thread until the thread has completed, e.g., by executing a thread return instruction, at which point a response packet (with the thread execution results) is transmitted back to the source of the work descriptor call packet. Additionally, in representative embodiments and as discussed in more detail below, the execution (ready-to-run) queue 745 optionally has different levels of priority, shown as a first priority queue 755 and a second (lower) priority queue 760, where threads in the first priority queue 755 execute more frequently than threads in the second (lower) priority queue 760.
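A minimal C sketch of one barrel pass follows, purely for illustration (the thread representation is hypothetical and the priority queues are omitted): each ready thread issues exactly one instruction per pass, and a thread remains in rotation only while its state stays valid.

    #include <stddef.h>

    typedef enum { VALID, PAUSED, DONE } tstate;
    typedef struct { int tid; tstate state; } thread;

    static void issue_one_instruction(thread *t) { /* to pipeline 750 */ }

    /* One pass over the ready-to-run queue: one instruction per thread. */
    static void barrel_pass(thread *ready[], size_t n) {
        for (size_t i = 0; i < n; i++) {
            thread *t = ready[i];
            if (t == NULL || t->state != VALID)
                continue;                 /* paused or finished threads skip */
            issue_one_instruction(t);
            /* a still-VALID thread stays queued for the next pass */
        }
    }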

Thus, the HTP 300 is an "event-driven" processor that automatically begins thread execution upon receipt of a work descriptor packet (provided that a thread ID is available, with no other requirement to initiate execution); i.e., the arrival of a work descriptor packet automatically triggers the start of local thread execution, without reference to the memory 125 or additional requests to the memory 125. This is extremely useful because the response time to start executing many threads (e.g., thousands of threads) in parallel is quite low. The HTP 300 continues thread execution until thread execution is complete, or until the thread waits for a response, whereupon the thread enters a "paused" state, as discussed in more detail below. Several different paused states are discussed in more detail below. After receiving the response, the thread returns to the valid state, in which it resumes execution, with its thread ID returned to the execution (ready-to-run) queue 745. This thread execution control is performed in hardware by the control logic and thread selection circuitry 730 in conjunction with the thread state information stored in thread memory 720.

In addition to host processor 110 generating work descriptor packets, the HTP 300 may also generate and transmit work descriptor packets to initiate work, as one or more computing threads, on another computing resource (e.g., another HTP 300 or any HTF circuit 200). Such a work descriptor packet is a "call" work descriptor packet and generally includes a source identifier or address of the host processor 110 or HTP 300 generating the call work descriptor packet, a thread ID (e.g., a 16-bit call identifier (ID)) used to associate a return with the original call, a 64-bit virtual core address (serving as the program count for locating the first instruction at which to begin executing the thread, typically stored in the instruction cache 740 of the HTP 300 (or HTF circuit 200), and which may also be in a virtual address space), and one or more call arguments, e.g., up to four call arguments.

Similarly, when the thread has completed, the HTP 300 or HTF circuit 200 generates another work descriptor packet, referred to as a "return" work descriptor packet, which is generally created when the HTP 300 or HTF circuit 200 executes the last instruction of the thread (referred to as a return instruction); the return work descriptor packet is assembled by the packet encoder 780, as discussed below. The return packet is addressed back to the source (using the identifier or address provided in the calling work descriptor packet) and includes the thread ID (or call ID) from the calling work descriptor packet (to allow the source to correlate the return with the issued call, particularly when multiple calls have been generated by the source and are pending at the same time) and one or more return values (as results), e.g., up to four return values.

Fig. 28 is a detailed block diagram of a representative embodiment of the HTP 300. For ease of illustration and discussion, it should be noted that not all registers of the thread memory 720 and the network response memory 725 are shown in FIG. 28. Referring to fig. 28, core control circuitry 710 includes control logic and thread selection circuitry 730 and network interface circuitry 735. For example, and without limitation, control logic and thread selection circuitry 730 includes circuitry formed using any one of a number of different logic gates (e.g., "NAND", "NOR", "AND", "OR", "XOR", etc.) in combination with different state machine circuits (control logic circuit 731, thread selection control circuitry 805), as well as multiplexers (e.g., input multiplexer 787, thread selection multiplexer 785). The network interface circuitry 735 includes: an AF input queue 765 for receiving data packets (including job descriptor packets) from the first interconnection network 150; an AF output queue 770 for passing data packets (including job descriptor packets) to the first interconnection network 150; a data packet decoder circuit 775 for decoding incoming data packets from the first interconnect network 150, retrieving the data (in designated fields) and passing the data provided in the packets to the thread memory 720 and the associated registers of the network response memory 725 (in conjunction with the thread ID assigned to the thread by the control logic and thread selection circuitry 730, which also provides or forms an index to the thread memory 720, as discussed in more detail below); and a packet encoder circuit 780 for encoding outgoing packets (e.g., for requests to memory 125, using the transaction ID from the thread ID and transaction identifier ("transaction ID") register 752) for transmission over the first interconnection network 150. The packet decoder circuit 775 and the packet encoder circuit 780 may each be implemented as a state machine or other logic circuitry. Depending on the selected embodiment, there may be a separate core control circuit 710 and a separate core control memory 715 for each HTP processor core 705, or a single core control circuit 710 and a single core control memory 715 may be used for multiple HTP processor cores 705.

When a work descriptor packet arrives, the control logic and thread selection circuitry 730 assigns an available thread ID from the thread ID pool register 722 to the thread of the work descriptor packet, with the assigned thread ID used as an index to the other registers in the thread memory 720, which are then populated with the corresponding data from the work descriptor packet, typically a program count and one or more arguments. For example, and without limitation, in preparation for the thread to begin executing instructions, the control logic and thread selection circuitry 730 autonomously initializes the rest of the thread context state, such as loading data cache register 746 and loading thread return register 736. As another example, an executing thread has a main memory stack space and a main memory context space; the context space is only used when the state of a thread needs to be written to memory to be accessed by the host. Each HTP 300 processor core 705 is initialized with a core stack base address and a core context base address, where the base addresses point to a block of stack space and a block of context space. The thread stack base address is obtained by taking the core stack base address and adding the thread ID multiplied by the thread stack size. The thread context base address is obtained in a similar manner.
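The address arithmetic stated above, restated directly as C (the names are illustrative only):

    #include <stdint.h>

    static uint64_t thread_stack_base(uint64_t core_stack_base,
                                      uint32_t thread_id,
                                      uint32_t thread_stack_size) {
        return core_stack_base + (uint64_t)thread_id * thread_stack_size;
    }

    /* The thread context base address is computed the same way. */
    static uint64_t thread_context_base(uint64_t core_context_base,
                                        uint32_t thread_id,
                                        uint32_t thread_context_size) {
        return core_context_base + (uint64_t)thread_id * thread_context_size;
    }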

The thread ID is given a valid status (indicating that it is ready for execution) and the thread ID is pushed to the first priority queue 755 of the execution (ready-to-run) queue 745, since the thread is typically assigned the first (or normal) priority. Selection circuitry, e.g., multiplexer 785, of control logic and thread selection circuitry 730 selects the next thread ID in execution (ready to run) queue 745, which is used as an index into thread memory 720 (program count register 726 and thread status register 724) to select instructions from instruction cache 740 that are then provided to execution pipeline 750 for execution. The execution pipeline then executes the instruction.

Upon completion of an instruction's execution, the same triplet of information (thread ID, valid status, and priority) may be returned to the execution (ready-to-run) queue 745 under the control of the control logic and thread selection circuitry 730, depending on various conditions, to remain in the round-robin selection. For example, if the last instruction of the selected thread ID is a return instruction (indicating that thread execution is complete and a return data packet is to be provided), control logic and thread selection circuitry 730 returns the thread ID to the available thread ID pool in thread ID pool register 722 for use by a different thread. As another example, the valid indicator may change, such as changing to a paused state (e.g., while a thread waits for information to be returned from memory 125, for a write to memory 125 to complete, or for another event), in which case the thread ID (now having a paused state) is not returned to the execution (ready-to-run) queue 745 until the state changes back to valid.

Continuing with the previous example, when the last instruction of the thread ID is the return instruction, the return information (thread ID and return argument) is pushed through the execution pipeline 750 to the network command queue 790, which is typically implemented as a first-in, first-out (FIFO) queue. The thread ID is used as an index into the thread return register 736 to obtain the return information, such as the transaction ID and the source (caller) address (or other identifier), and the packet encoder circuit then generates the outgoing return data packet (on the first interconnect network 150).

Continuing with the latter example, an instruction of the thread may be a load instruction, i.e., a read request to the memory 125, which is likewise pushed through the execution pipeline 750 to the network command queue 790. The packet encoder circuit then generates (on the first interconnect network 150) an outgoing data packet with the request (read or write request) to memory 125, including the request size, the assigned transaction ID (from the thread ID and transaction ID register 752, which also serves as an index into the network response memory 725), and the address of the HTP 300 (as the return address for the requested information). When a response packet is received from the first interconnect network 150 and decoded, the transaction ID is used as an index into the network response memory 725 to obtain the thread ID of the requesting thread and the location in the data cache 746 where the data returned in the response is to be written. The transaction ID is then returned to the thread ID and transaction ID register 752 for reuse, the state of the corresponding thread ID is set to valid again, and the thread ID is pushed again to the execution (ready-to-run) queue 745 to resume execution.
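The response path can be sketched in C as follows; the table layout and helper names are assumptions made for illustration. The transaction ID recovers the requesting thread and its cache-line destination, the returned data is written, the ID is recycled, and the thread becomes runnable again.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint8_t thread_id;         /* requesting thread */
        uint8_t cache_line_index;  /* destination line in data cache 746 */
    } response_entry;

    static response_entry response_table[64];  /* indexed by transaction ID */

    static void write_data_cache(uint8_t tid, uint8_t line,
                                 const void *data, size_t len) { /* stub */ }
    static void free_transaction_id(uint8_t trans_id) { /* to register 752 */ }
    static void requeue_thread(uint8_t tid) { /* set valid; push to 745 */ }

    static void on_memory_response(uint8_t trans_id,
                                   const void *data, size_t len) {
        response_entry *e = &response_table[trans_id];
        write_data_cache(e->thread_id, e->cache_line_index, data, len);
        free_transaction_id(trans_id);  /* transaction ID reused */
        requeue_thread(e->thread_id);   /* thread resumes execution */
    }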

A store request to memory 125 is performed in a similar manner, where the outgoing packet also has data to be written to memory 125, an assigned transaction ID, the source address of HTP300, and where the return packet is an acknowledgement with the transaction ID. The transaction ID is then also returned to the thread ID and transaction ID register 752 for reuse, and the state of the corresponding thread ID is again set to valid and the thread ID is again pushed to the execution (ready to run) queue 745 to resume execution.

Fig. 29 is a flowchart of a representative embodiment of a method for self-scheduling and thread control of the HTP 300, and provides a useful overview; here, the HTP 300 has been populated with instructions in the instruction cache 740 and a predetermined number of thread IDs in the thread ID pool register 722. The method begins at step 798 upon receipt of a work descriptor packet. At step 802, the work descriptor packet is decoded, and at step 804, the various registers of the thread memory 720 are filled with information received in the work descriptor packet, thereby initializing a context block. When a thread ID is available at step 806, the thread ID is assigned at step 808 (if a thread ID is not available at step 806, the thread waits until a thread ID becomes available, step 810). At step 812, valid status is initially assigned to the thread (along with any initially assigned priority, e.g., first or second priority), and at step 814, the thread ID is provided to the execution (ready-to-run) queue 745. Next, at step 816, a thread ID in the execution (ready-to-run) queue 745 is selected for execution (at a predetermined frequency, as discussed in more detail below). Using the thread ID, thread memory 720 is accessed and a program count (or address) is obtained at step 818. At step 820, the instruction corresponding to the program count (or address) is obtained from the instruction cache 740 and provided to the execution pipeline 750 for execution.

When thread execution is complete (step 822), i.e., when the instruction being executed is a return instruction, the thread ID is returned to the thread ID pool register 722 for reuse by another thread (step 824), the registers of the thread memory 720 associated with the thread ID may optionally be cleared (step 826), and thread control for that thread may end (step 834). When thread execution is not complete at step 822, and the thread state remains valid at step 828, the thread ID (with its valid state and priority) is returned to the execution (ready-to-run) queue 745, returning to step 814 to continue execution. When the thread state is no longer valid at step 828 (i.e., the thread is paused), with the paused state of the thread ID indicated in thread memory 720, execution of the thread is suspended (step 830) until the state of the thread ID returns to valid, whereupon the thread ID (with its valid state and priority) is returned to the execution (ready-to-run) queue 745 (step 832), returning to step 814 for continued execution.

Similarly, the HTP300 may generate calls to create threads on local or remote computing elements, i.e., to create threads on other HTPs 300 or HTF circuits 200. Such calls are also created as outgoing data packets, and more specifically, as outgoing work descriptor packets on the first interconnection network 150. For example, an instruction of the current thread being executed may be a "fiber creation" instruction (stored as a possible instruction in the instruction cache 740) to cause multiple threads to execute on various computing resources. As discussed in more detail below, such a fiber creation instruction specifies which computing resource (using an address or virtual address (node identifier)) is to execute the thread and also provides the associated arguments. When a fiber creation instruction is executed in the execution pipeline 750, the fiber creation command is pushed into the network command queue 790 and the next instruction is executed in the execution pipeline 750. The command is pulled from the network command queue 790, and the packet encoder circuit 780 has the information needed to create and send a work descriptor packet to the specified destination HTF200 or HTP 300.

If the created thread will have a return argument, such an instruction will also allocate and reserve the associated memory space, such as in the return argument buffer 734. If there is insufficient space in the return argument buffer 734, the instruction will stall until space in the return argument buffer 734 becomes available. The number of fibers or threads that can be created is limited only by the amount of space available to hold the response arguments. Creating threads without return arguments avoids reserving return argument space, and thereby avoids a possible paused state. This mechanism ensures that returns from completed threads always have a location to store their arguments. When returns come back to the HTP300 as data packets on the first interconnection network 150, those packets are decoded, as discussed above, and the return data is stored in the associated reserved space in the return argument buffer 734 of the thread memory 720, indexed by the thread ID associated with the fiber creation instruction. Because many registers are available for return arguments, the return argument buffer 734 may be provided as a linked list across all created threads, or as a return argument buffer or register assigned per thread ID. Notably, this mechanism makes it possible to create thousands of threads extremely quickly, effectively minimizing the time involved in transitioning from single-thread execution to high-thread-count parallelism.

As discussed in more detail below, various types of fiber join instructions are utilized to determine when all spawned threads are complete, and may be instructions with or without a wait. A count of the number of spawned threads is held in the pending fiber return count register 732, and is decremented each time the HTP300 receives a thread return. The join operation may be implemented by copying the return into a register associated with the created thread ID. If the join instruction is a wait instruction, the thread remains in a paused state until the return for the specified spawned thread arrives. During this time, other instructions are executed by the execution pipeline 750, until the joining thread's state becomes valid again and its thread ID returns to the execution (ready-to-run) queue 745.

A thread return instruction may also be used as the instruction following a fiber creation instruction, rather than a join instruction. A thread return instruction may also be executed when the count in the pending fiber return count register 732 reaches zero and the last thread return packet has been received, indicating that the fiber creation operation has completed and all returns have been received, allowing the thread ID, the return argument buffer 734, and the linked list to be freed for other purposes. In addition, it may also generate and transmit a work descriptor return packet (e.g., with result data) to the calling source, such as the primary thread (e.g., using an identifier or address of the source that generated the call).

A join instruction does not need to return arguments; an acknowledgement alone suffices to decrement the count in the pending fiber return count register 732. When the count reaches zero, the thread restarts, because all joins are then complete.
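A minimal C sketch of the pending-return counting described above; the structure and function names are illustrative assumptions, not from the source.

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t pending_fiber_returns; /* models the pending fiber return count register 732 */
    bool     paused;                /* thread is paused on a join */
} thread_ctx_t;

/* Called when a thread return packet arrives for this parent thread. */
void on_fiber_return(thread_ctx_t *t) {
    if (t->pending_fiber_returns > 0)
        t->pending_fiber_returns--;
    /* An all-join-style wait completes only when every spawned fiber has
     * returned; the parent thread then becomes ready to run again. */
    if (t->paused && t->pending_fiber_returns == 0)
        t->paused = false;
}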

Communication between processing elements is required to facilitate the processing of parallel algorithms. The representative embodiments provide an efficient means for threads of a set of processing resources to communicate using various event messages, which may also include data (e.g., arguments or results). Event messaging allows any host processor 110 with hardware-maintained cache coherency and any acceleration processor (e.g., HTP 300) with software-maintained cache coherency to efficiently participate in event messaging.

Event messaging supports point-to-point and broadcast event messages. Each processing resource (HTP 300) may determine when a received event operation is complete and the processing resource should be notified. The event receive modes include simple (a single received event completes the operation), collect (a counter determines when enough events have been received to complete the operation), and broadcast (the operation completes when an event is received on a particular channel). Additionally, an event may be sent with an optional 64-bit data value.

The HTP300 has a set of event receive states, each consisting of a 2-bit receive mode, a 16-bit counter/channel number, and a 64-bit event data value, stored in the event status register 744. The HTP300 may have multiple sets of event receive state per thread context, with each set indexed by an event number. Thus, events may be directed to a particular thread (thread ID) and event number. A sent event may be a point-to-point message with a single destination thread, or a broadcast message sent to all threads within a set of processing resources belonging to the same process. When such an event is received, a paused or sleeping thread may be reactivated to resume processing.

This use of the event status register 744 is much more efficient than a standard Linux-based host processor sending and receiving events over an interface that the host processor 110 must periodically poll for completed receive events. Threads waiting for event messages may suspend execution until the receive operation is complete; i.e., rather than wasting resources on polling, the HTP300 may suspend execution of threads pending completion of a receive event, allowing other threads to execute during these intervals. Each HTP300 also maintains a list of the processing resources that should participate in receiving events, to avoid process security issues.

A point-to-point message will specify the event number and destination (e.g., node number, which HTP300, which core, and which thread ID). On the receive side, the HTP300 is configured or programmed with one or more event numbers that are saved in the event status register 744. If the HTP300 receives event information with a matching event number, the thread is triggered and transitions from the paused state to the valid state to resume execution, e.g., to execute an event receive instruction (e.g., EER, infra). The instruction then determines whether the correct event number was received and, if so, writes any associated 64-bit data into the general purpose registers 728 for use by another instruction. If the event receive instruction executes before the correct event number has been received, the thread pauses until that particular event number is received.

An event listen (EEL) instruction may also be utilized, where an event mask stored in the event received mask register 742 indicates one or more events to be used to trigger or wake a thread. When event information arrives for any of those specified events, the receiving HTP300 will know which event number triggered, e.g., which other process may have completed, and will receive the event data from those completed events. The event listen instruction also has wait and no-wait variants, as discussed in more detail below.

For event messaging in the collect mode, the receiving HTP300 gathers (waits for) a set of received events before triggering: the count in the event status register 744 is set to the required value, is decremented each time a required event message is received, and triggers when the count reaches zero.

In the broadcast mode, the sending processing resource may transmit a message to any thread within the node. For example, a sending HTP300 may transmit a series of point-to-point messages to every other HTP300 within the node, and each receiving HTP300 then passes the message to each internal core 705. Each core control circuit 710 examines its thread list to determine whether the message corresponds to an event number a thread was initialized to receive, and to determine which channel may have been designated on the first interconnection network 150.

This broadcast mode is particularly useful when thousands of threads may be executing in parallel, with the last thread to execute transmitting broadcast event information indicating completion. For example, a first count of all threads that need to complete may be saved in the event status register 744, while a second count of all threads that have already executed may be saved in the memory 125. As each thread completes, it performs a fetch-and-increment atomic operation on the second count (e.g., an atomic operation through the memory 125), compares the fetched value to the first count, and sets its mode to receive broadcast messages by executing an EER instruction that waits until a broadcast message is received. The last thread to execute sees the fetched value of the second count equal to the required first count minus one, indicating that it is the last thread to execute, and therefore sends the broadcast message. This is a very fast and efficient way to indicate the completion of a large amount of parallel processing.
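A behavioral C sketch of this last-thread broadcast pattern, using C11 atomics for the fetch-and-increment; event_broadcast() and event_receive_wait() are hypothetical stand-ins for the EEB and EER instructions, not an actual API.

#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-ins for the EEB (broadcast) and EER (receive, with
 * wait) event instructions described in this section. */
void event_broadcast(uint16_t channel, uint64_t data);
uint64_t event_receive_wait(uint16_t channel);

static atomic_uint_fast64_t completed; /* the "second count", held in memory 125 */

void worker_done(uint64_t total_threads, uint16_t done_channel) {
    /* Fetch-and-increment the shared completion count. */
    uint64_t before = atomic_fetch_add(&completed, 1);
    if (before == total_threads - 1) {
        /* Last thread to finish: broadcast completion to all waiters. */
        event_broadcast(done_channel, before + 1);
    } else {
        /* Not last: pause until the completion broadcast arrives. */
        (void)event_receive_wait(done_channel);
    }
}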

As mentioned above, while the HTP300 may utilize standard RISC-V instructions, an extended instruction set may be provided to utilize all of the computing resources of the system 100, as discussed in more detail below. A thread created from the host processor 110 is generally referred to as a primary thread, and a thread created from an HTP300 is generally referred to as a fiber or fiber thread; all of these execute identically on the destination HTP300 and HTF200, without going through the memory 125.

New load instructions:

The HTP300 has a relatively small number of read/write buffers, also referred to as data cache registers 746, for each thread. A buffer (data cache register 746) temporarily stores shared memory data for use by its own thread. The data cache 746 is managed by a combination of hardware and software. Hardware automatically allocates buffers and evicts data as needed. Using RISC-V instructions, software determines which data should be cached (read and write data), and when a data cache register 746 should be invalidated (if clean) or written back to memory (if dirty). The RISC-V instruction set provides the FENCE instruction, as well as acquire and release indicators on the atomic instructions, for this purpose.

A standard RISC-V load instruction automatically uses the read data cache registers 746. The standard load checks whether the required data is in an existing data cache register 746. If so, the data is obtained from the data cache register 746 and the executing thread can continue without pausing. If the desired data is not in a data cache register 746, the HTP300 finds an available data cache register 746 (evicting data from a buffer if necessary) and reads 64 bytes from memory into the data cache register 746. The executing thread is paused until the memory read completes and the load data is written into the RISC-V register.

Read buffering has two main benefits: 1) larger accesses to the memory controller 120 are more efficient, and 2) hits in the buffer allow the executing thread to avoid stalling. However, there are cases where use of the buffer causes problems. An example is a gather operation, where the accesses typically cause thrashing of the data cache 746. For this reason, a special set of load instructions is provided which forces the load instruction to check for a cache hit but, on a cache miss, to issue a memory request only for the requested operand, placing the obtained data not in a data cache register 746 but rather in a general purpose register 728.

The new load instructions provide "probabilistic" caching based on expected access frequency, distinguishing frequently used data from infrequently used data. This is particularly important with sparse data sets, which, if placed into the data cache registers 746, would overwrite other data needed more frequently, effectively polluting the data cache registers 746. The new load instruction (NB or NC) allows frequently used data to be held in the data cache registers 746, while infrequently used (sparse) data that would normally be cached is instead designated for uncached storage in the general purpose registers 728.

This type of instruction has an NB suffix (non-buffered) or, equivalently, an NC suffix (non-cached):

LB.NB RA,40(SP)

NB (NC) load instructions are intended for use in hand-assembled runtime library routines.

In Table 8, the following load instructions are added as 32-bit instructions, where Imm is the immediate field, RA is the register name, rs1 is the source register index, rd is the destination register index, and the bits in fields 14-12 and 6-0 specify the instruction.

Table 8:

Bandwidth to memory is often the primary factor limiting application performance. The representative embodiments provide a means to inform the HTP300 of the size of the memory load request that should be issued to the memory 125. By not fetching memory data that the application will not use, the representative embodiments reduce wasted memory 125 bandwidth and first interconnection network 150 bandwidth.

There is another optimization where the application knows the size of the data structure being accessed and may specify the amount of data to load into the data cache registers 746. As an example, if an algorithm uses a 16-byte structure, and the structures are scattered in memory, it would be optimal to issue a 16-byte memory read and place the data into a data cache register 746. The representative embodiment defines a set of memory load instructions that specify both the size of the operand to be loaded into the HTP300 registers and the size of the access to memory in the event of a miss in the data cache registers 746. The actual load from the memory 125 may be smaller than the size specified by the instruction if the memory access crosses a cache line boundary; in this case, the access size is reduced to ensure that the response data is written to a single cache line of the data cache registers 746.
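A small C sketch of the access-size reduction just described, assuming a 64-byte cache line; the function name is illustrative.

#include <stdint.h>

/* Reduce a requested memory access so the response fits in a single
 * 64-byte line of the data cache register 746. */
uint64_t clamp_access_size(uint64_t addr, uint64_t req_size) {
    const uint64_t line = 64;
    uint64_t line_end = (addr & ~(line - 1)) + line; /* end of the current line */
    uint64_t max_in_line = line_end - addr;          /* bytes left in this line */
    return req_size < max_in_line ? req_size : max_in_line;
}

For example, a 16-byte request starting 8 bytes before a line boundary would be reduced to 8 bytes, so the response lands entirely in one line.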

When the requested data is less than a cache line, the load instruction may also request additional data that is not currently needed by the HTP300 but may be needed in the future and is worth obtaining at the same time (e.g., as a prefetch), optimizing read-size access to the memory 125. This instruction may also override (as discussed in more detail below, with reference to fig. 32) any reduction in access size that may have been applied for bandwidth management.

Thus, the representative embodiments minimize wasted bandwidth by requesting only memory data that is known to be needed. The result is an increase in application performance.

A set of load instructions has been defined that allows the amount of data to be accessed to be specified. Data is written to the buffer and invalidated by eviction, a FENCE, or an atomic instruction with acquire specified. The load instruction provides a hint as to how much additional data (in 8-byte increments) will be accessed from memory and written to the memory buffer. The load will only access additional data up to the next 64-byte boundary. The load instruction specifies the number of additional 8-byte elements to be loaded using the operation suffixes RB0-RB7:

LD.RB7 RA,40(SP)

The instruction format is shown in table 9. The number of 8-byte data elements to be loaded into the buffer is specified by bits 6 and 4:3 in the 32-bit instruction. These load instructions may be used in hand-assembled routines or, ideally, generated by a compiler. It is expected that initially only hand-assembled routines will utilize these instructions.

Table 9:

New store instructions:

The HTP300 has a small number of store buffers that temporarily hold shared memory data. A store buffer allows multiple writes to memory to be merged into a smaller number of memory write requests. This has two benefits: 1) fewer write requests are more efficient for the first interconnect network 150 and the memory controller 120, and 2) the HTP300 suspends a thread performing a memory store only until the data has been accepted by the HTP300 store buffer or the memory controller 120. A store to the HTP300 store buffer is extremely fast and does not typically cause the thread to suspend execution. When the buffer is written through to the memory controller 120, the thread pauses until completion is acknowledged, in order to ensure coherency of the memory 125.

Standard RISC-V store instructions write data to the HTP300 store buffers. However, there are situations where it is known to be preferable to write data directly to memory, bypassing the store buffers. One such situation is a scatter operation. A scatter operation typically writes only a single data value to a store buffer; writing through the buffer thrashes the buffers and forces out other stored data that would benefit from write merging before being written back to memory. A set of store instructions is therefore defined for the HTP300 indicating that write buffering should not be used. These instructions write data directly to the memory 125 and cause the executing thread to pause until the write completes.

Store unbuffered instructions are intended for manually assembled libraries and are indicated with an NB suffix:

ST.NB RA,40(SP)

the following store instructions are added as shown in table 10.

Table 10:


custom atomic store and Clear Lock (CL) instruction:

When the memory controller 120 observes a custom atomic operation, it sets a lock on the provided address. The atomic operation is performed by the HTP300 associated with the memory controller 120. When the lock should be cleared, the HTP300 must notify the memory controller 120. For custom atomic operations, this should occur on the last store operation performed by the HTP300 (or, if no store is needed, on the fiber terminate instruction). The HTP300 indicates that the lock is to be cleared by performing a special store operation: the store and clear lock (CL) instruction.

The following instruction sequence may be used to implement a custom atomic DCAS operation:

// a0 - atomic address
// a1 - 64-bit memory value at a0
// a2 - DCAS compare value 1
// a3 - DCAS compare value 2
// a4 - DCAS swap value 1
// a5 - DCAS swap value 2

atomic_dcas:
bne a1, a2, fail // first 8-byte comparison
ld.nb a6, 8(a0) // load the second 8-byte memory value - should hit in the memory cache
bne a6, a3, fail // second 8-byte comparison
sd a4, 0(a0) // store the first 8-byte swap value to the thread store buffer
sd.cl a5, 8(a0) // store the second 8-byte swap value and clear the memory lock
eft x0 // AMO success response
fail:
li a1, 1
sd.cl a1, (a0) // AMO failure response (and clear the memory lock)

The store instructions indicating that the lock should be cleared are:

SB.CL RA,40(SP)

SH.CL RA,40(SP)

SW.CL RA,40(SP)

SD.CL RA,40(SP)

FSW.CL RA,40(SP)

FSD.CL RA,40(SP)

the format of these store instructions is shown in table 11.

Table 11:

atomic_float_add:
fadd.d a2, a1, a2 // a1 contains the memory value, a2 contains the value to be added
fsd.cl a2, 0(a0) // a0 contains the memory address; store the result and clear the lock
eft // evict all lines from the buffer, terminate the atomic thread

A thread creation instruction:

a fiber creation ("EFC") instruction initiates a thread on HTP300 or HTF 200.

EFC.HTP.A4

EFC.HTF.A4

This instruction executes a call on the HTP300 (or HTF200), starting execution at the address in register a0. Optionally, the instruction suffix DA may be utilized to indicate that the target HTP300 is determined by the virtual address in register a1; if the DA suffix is not present, then the target is an HTP300 on the local system 100. The argument-count suffixes (e.g., A1, A2, and A4) specify the number of additional arguments to be passed to the HTP300 or HTF200. The argument count is limited to a value of 0, 1, 2, or 4 (e.g., so that the packet fits in 64B). The additional arguments come from register state (a2-a5).

It should be noted that if the return argument buffer is not available at the time the instruction is executed, the EFC instruction will wait until the return argument buffer is available before beginning execution. Once the EFC instruction successfully creates a fiber, the thread continues at the instruction immediately following the EFC instruction.

It should also be noted that a thread created by the host processor 110 is capable of executing an EFC instruction and creating a fiber. Optionally, a fiber created by an EFC instruction cannot itself execute an EFC instruction, and doing so generates an exception. The format of these fiber creation instructions is shown in table 12.

Table 12:

thread return instruction:

the thread return (ETR) instruction passes the argument back (either through thread creation by host processor 110 or fiber creation by HTP 300) to the parent thread that initiated the current thread. Once the thread completes the return instruction, the thread is terminated.

ETR.A2

This instruction executes a return to the HTP300 or the host processor 110. The AC suffix specifies the number of additional arguments to be passed to the HTP or host. The argument count may be 0, 1, 2, or 4. The arguments come from register state (a0-a3). The format of these thread return instructions is shown in table 13.

Table 13:


A fiber join instruction:

the fiber join (EFJ) instruction checks whether the created fiber has returned. The instructions have two variations: join wait and not wait. Waiting for the change will cause thread execution to pause until the fiber has returned. Joining does not wait to not suspend thread execution, but rather provides a success/failure status. For both variants, if the instruction is executed without a pending fibre return, an exception is generated.

The arguments from the returning fiber (up to four) are written to registers a0-a3.

EFJ

EFJ.NW

The format of these fiber join instructions is shown in table 14.

Table 14:

An all fiber join instruction:

all of the fibre join instructions (efj. all) are pending until all pending fibres are returned. The instruction may be invoked with zero or more pending fibre returns. Instruction states and exceptions are not generated. All return arguments from the fibre return are ignored.

EFJ.ALL

The format of this fiber join instruction is shown in table 15.

Table 15:

atomic return instruction:

An atomic return (EAR) instruction of the system 100 is used to complete the execution thread of a custom atomic operation and, possibly, to provide a response back to the source that issued the custom atomic request.

The EAR instruction may send zero, one, or two 8-byte argument values back to the issuing compute element. The number of arguments to send back is determined by the AC suffix (A1 or A2). No suffix means zero arguments, A1 means a single 8-byte argument, and A2 means two 8-byte arguments. The arguments are obtained from the X registers a1 and a2, as needed.

The EAR instruction is also capable of clearing the memory line lock associated with the atomic instruction. The EAR uses the value in the a0 register as the address for the clear lock operation. If the instruction contains the suffix CL, a clear lock operation is issued.

The following DCAS example uses the EAR instruction to send back a success or failure indication to the requesting processor:

// a0 - atomic address
// a1 - 64-bit memory value at a0
// a2 - DCAS compare value 1
// a3 - DCAS compare value 2
// a4 - DCAS swap value 1
// a5 - DCAS swap value 2

atomic_dcas:
bne a1, a2, fail // first 8-byte comparison
ld.nb a6, 8(a0) // load the second 8-byte memory value - should hit in the memory cache
bne a6, a3, fail // second 8-byte comparison
sd a4, 0(a0) // store the first 8-byte swap value to the thread store buffer
sd.cl a5, 8(a0) // store the second 8-byte swap value and clear the memory lock
li a1, 0
ear.a1 // AMO success response
fail:
li a1, 1
ear.cl.a1 // AMO failure response (and clear the memory lock)

The EAR instruction has two variations, which allow it to also clear the memory lock associated with the atomic operation. The format of the supported instructions is shown in table 16.

Table 16:


first and second priority instructions:

The second (or low) priority (ELP) instruction transitions the current thread from the first priority to the second, low priority. The instruction is typically used when a thread is polling for an event to occur (e.g., at a barrier).

ELP

The format of the ELP instruction is shown in Table 17.

Table 17:

The first (or high) priority (ENP) instruction transitions the current thread from the second (or low) priority to the first (or high, or normal) priority. The instruction is typically used when a thread has been polling and the awaited event has occurred (e.g., at a barrier).

ENP

The format of the ENP instruction is shown in Table 18.

Table 18:


floating point atomic memory operation:

Floating point atomic memory operations are performed by the HTP300 associated with the memory controller 120. The floating point operations performed are MIN, MAX, and ADD, for both 32-bit and 64-bit data types.

The aq and rl bits in the instruction specify whether all previously written data should be visible to other threads before the atomic operation is issued (rl), and whether data written by other threads should be visible to this thread after the atomic operation completes (aq). In other words, the rl bit forces all write buffers to be written back to memory, and the aq bit forces all read buffers to be invalidated. Note that rs1 is an X register value, and rd and rs2 are F register values.

AMOFADD.S rd,rs2,(rs1)

AMOFMIN.S rd,rs2,(rs1)

AMOFMAX.S rd,rs2,(rs1)

AMOFADD.D rd,rs2,(rs1)

AMOFMIN.D rd,rs2,(rs1)

AMOFMAX.D rd,rs2,(rs1)

The format of these floating point atomic memory operation instructions is shown in table 19.

Table 19:


custom atomic memory operation:

Custom atomic operations are performed by the HTP300 associated with the memory controller 120. An operation is performed by executing RISC-V instructions.

Up to 32 custom atomic operations may be defined within the memory controllers 120 of the system 100. Custom atomics are a system-level resource, available to any process on the system 100.

The aq and rl bits in the instruction specify whether all previously written data should be visible to other threads before the atomic operation is issued (rl), and whether data written by other threads should be visible to this thread after the atomic operation completes (aq). In other words, the rl bit forces all write buffers to be written back to memory, and the aq bit forces all read buffers to be invalidated.

A custom atomic specifies the memory address using the a0 register. The number of arguments is specified by the suffix (A0, A1, A2, or A4), and the arguments are obtained from registers a1-a4. The number of result values returned from memory may be 0-2 and is defined by the custom memory operation. The result values are written to registers a0-a1.

AMOCUST0.A4

As shown in table 20, the following custom atomic instructions are defined.

Table 20:


the ac field is used to specify the number of arguments (0, 1, 2, or 4). Table 21 below shows the encoding.

Table 21:


Eight custom atomic instructions are defined, each having four argument count variations, resulting in a total of 32 possible custom atomic operators.

Event management:

The system 100 is an event-driven architecture. Each thread has a set of events that can be monitored using the event received mask register 742 and the event status register 744. Event 0 is reserved for returns from created fibers (HTP 300 or HTF 200). The remaining events may be used for event signaling: thread-to-thread, broadcast, or collect. Thread-to-thread allows a thread to send an event to a particular destination thread on the same or a different node. Broadcast allows a thread to send a named event to a subset of threads on its node; a receiving thread must specify the named broadcast event it expects. Collect refers to the ability to specify the number of events to be received before the event becomes active.

An event's triggered bit may be cleared (using the EEC instruction), and all events may be listened for (using the EEL instruction). The listen operation may suspend the thread until an event is triggered or, in the no-wait mode (.NW), allow the thread to periodically poll while other execution continues.

A thread can send an event to a particular thread using the event send instruction (EES), or broadcast an event to all threads within a node using the event broadcast instruction (EEB). A broadcast event is a named event in which the sending thread specifies an event name (a 16-bit identifier) and the receiving threads screen received broadcast events for a pre-specified event identifier. Once received, an event should be explicitly cleared (EEC) to avoid receiving the same event again. Note that all event triggered bits are cleared when a thread begins execution.

An event mode instruction:

An event mode (EEM) instruction sets the operating mode of an event. Event 0 is reserved for thread return events; each of the remaining events can be in one of three receive modes: simple, broadcast, or collect.

In simple mode, a received event immediately sets the triggered bit and increments the received event count by one. Each newly received event increments the received event count. The event receive instruction (EER) decrements the received event count by one. The triggered bit is cleared when the count returns to zero.

In broadcast mode, the channel of a received event is compared with the broadcast channel of the event number. If the channels match, the triggered bit is set. The EER instruction causes the triggered bit to be cleared.

In collect mode, each received event causes the event trigger count to be decremented by one. When the count reaches zero, the triggered bit is set. The EER instruction causes the triggered bit to be cleared.

The EEM instruction prepares an event number for the selected mode of operation. In simple mode, the 16-bit event counter is set to zero. For broadcast mode, the 16-bit event channel number is set to the value specified by the EEM instruction. For collect mode, the 16-bit event counter is set to the value specified by the EEM instruction. Each of the three modes uses the same 16-bit value in a different manner.

EEM.BM rs1, rs2 ; rs1 = event number, rs2 = broadcast channel

EEM.CM rs1, rs2 ; rs1 = event number, rs2 = collect count

EEM.SM rs1 ; rs1 = event number
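A behavioral C sketch of the three receive modes, modeling one entry of the event status register 744 as described above; the names and types are illustrative assumptions.

#include <stdint.h>
#include <stdbool.h>

typedef enum { MODE_SIMPLE, MODE_BROADCAST, MODE_COLLECT } event_mode_t;

typedef struct {               /* models one entry of event status register 744 */
    event_mode_t mode;         /* 2-bit receive mode */
    uint16_t count_or_channel; /* 16-bit counter (simple/collect) or channel */
    uint64_t data;             /* 64-bit event data value */
    bool triggered;
} event_state_t;

/* Apply one received event message to the per-event state. */
void on_event(event_state_t *e, uint16_t channel, uint64_t data) {
    switch (e->mode) {
    case MODE_SIMPLE:          /* each event triggers and bumps the count */
        e->count_or_channel++;
        e->triggered = true;
        e->data = data;
        break;
    case MODE_BROADCAST:       /* trigger only on a matching channel */
        if (e->count_or_channel == channel) {
            e->triggered = true;
            e->data = data;
        }
        break;
    case MODE_COLLECT:         /* trigger after the programmed number of events */
        if (e->count_or_channel > 0 && --e->count_or_channel == 0)
            e->triggered = true;
        e->data = data;
        break;
    }
}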

The format of the event mode instruction is shown in table 22.

Table 22:

event destination instruction:

An event destination (EED) instruction provides an identifier for an event within the executing thread. The identifier is unique among all executing threads within the node. The identifier may be used with the event send (EES) instruction to send an event to that thread. The identifier is an opaque value that contains the information needed to send an event from a source thread to a particular destination thread.

The identifier may also be used to obtain a unique value for transmitting a broadcast event. The identifier contains space for an event number. The input register rs1 specifies the event number to encode within the destination thread identifier. After the instruction executes, the output register rd contains the identifier.

EED rd,rs1

The format of the event destination instruction is shown in table 23.

Table 23:

The event destination instruction may also be used by a process to obtain its own address, which may then be used for other messages, e.g., to enable the process to receive other event messages as a destination, such as receiving return messages when the process is the primary thread.

An event send instruction:

An event send (EES) instruction sends an event to a particular thread. Register rs1 provides the destination thread and event number. Register rs2 provides optional 8 bytes of event data.

EES rs1

EES.A1 rs1,rs2

As noted, register rs1 provides the target thread identifier and the event number for the event send operation, with normal event number values of 2-7; register rs2 provides the optional event data. The format of the event send instruction is shown in table 24.

Table 24:

event broadcast instructions:

an event broadcast (EEB) instruction broadcasts an event to all threads within a node. Register rs1 provides the broadcast channel (0-65535) to be sent. Register rs2 provides optional 8 bytes of event data.

EEB rs1

EEB.A1 rs1,rs2

The format of the event broadcast instructions is shown in table 25.

Table 25:


An event listen instruction:

an event listen (EEL) instruction allows a thread to monitor the status of a received event. The instructions may operate in one of two modes: wait and not wait. The wait mode suspends the thread until an event is received, and the wait mode provides the received event while executing the instruction.

EEL rd,rs1

EEL.NW rd,rs1

The output of the listen operation is a mask of triggered events, returned in register rd. If no events are available, the no-wait mode returns a value of zero in rd. The format of the event listen instruction is shown in table 26.

Table 26:

An event receive instruction:

an event receive (EER) instruction is used to receive an event. Receiving an event includes confirming that an event was observed and receiving optional 8 bytes of event data. Register rs1 provides the event number. Register rd contains the optional 8 bytes of event data.

EER rs1

EER.A1 rd,rs1

The format of the event reception instruction is shown in table 27.

Table 27:

HTP300 instruction formats are also provided for the call, fork, and pass instructions, as previously discussed.

Thread send call instruction:

The thread send call instruction initiates a thread on an HTP300 or HTF200 and suspends the current thread until the remote thread performs a return operation:

HTSENDCALL.HTP.DA Ra,Rb,Args.

The thread send call instruction executes a call on the HTP300, beginning execution at the address in register Ra. The instruction suffix DA indicates that the target HTP300 is determined by the virtual address in register Rb; if the DA suffix is not present, then the target is an HTP300 on the local node. The constant integer value Args identifies the number of additional arguments to be passed to the remote HTP300, and is limited to values of 0 to 4 (e.g., so that the packet fits in 64B). The additional arguments come from register state. It should be noted that if the return argument buffer is not available at the time the HTSENDCALL instruction is executed, the HTSENDCALL instruction will wait until the buffer is available before beginning execution. Once HTSENDCALL is complete, the thread is paused until a return is received. When the return is received, the thread resumes at the instruction immediately following the HTSENDCALL instruction. The instruction sends a packet on the first interconnection network 150 containing the following values, as shown in table 28:

table 28:

thread fork instruction:

the thread fork instruction initiates a thread on the HTP300 or HTF200 and continues with the current thread:

HTSENDFORK.HTF.DA Ra,Rb,Args.

The thread fork instruction executes a call on the HTF200 (or HTP300), starting execution at the address in register Ra. The instruction suffix DA indicates that the target HTF200 is determined by the node ID within the virtual address in register Rb; if the DA suffix is not present, then the target is an HTF200 on the local node. The constant integer value Args identifies the number of additional arguments to be passed to the remote HTF, and is limited to values of 0 to 4 (e.g., so that the packet fits in 64B). The additional arguments come from register state. It should be noted that if the return argument buffer is not available when the HTSENDFORK instruction is executed, the HTSENDFORK instruction will wait until the buffer is available before beginning execution. Once HTSENDFORK is complete, execution of the thread continues at the instruction immediately following the HTSENDFORK instruction. The thread fork instruction sends a packet on the first interconnection network 150 containing the following values, as shown in table 29:

table 29:

Thread pass instruction:

the thread pass instruction initiates a thread on the HTP300 or HTF200 and terminates the current thread:

HTSENDXFER.HTP.DA Ra,Rb,Args.

The thread pass instruction performs a pass to the HTP300, beginning execution at the address in register Ra. The instruction suffix DA indicates that the target HTP300 is determined by the virtual address in register Rb; if the DA suffix is not present, then the target is an HTP300 on the local node. The constant integer value Args identifies the number of additional arguments to be passed to the remote HTP300, and is limited to values of 0 to 4 (the packet must be able to hold 64B). The additional arguments come from register state. Once HTSENDXFER is complete, the thread is terminated. The thread pass instruction sends a packet on the first interconnection network 150 containing the following values, as shown in table 30:

table 30:


Receive return instruction:

The thread receive return instruction, HTRECVRTN.WT, checks whether a thread return has been received. If the WT suffix is present, the receive return instruction waits until a return is received; otherwise, a testable condition code is set to indicate the result. When a return is received, the returned arguments are loaded into registers. The instruction immediately following the HTRECVRTN instruction is executed after the return instruction completes.

Fig. 30 is a detailed block diagram of a representative embodiment of the control logic and thread selection control circuitry 805 of the thread selection circuitry 730 of the HTP300. As mentioned above, a second or low priority queue 760 is provided, and a thread ID is selected from the first (or high) priority queue 755 or the second (or low) priority queue 760 using the thread selection multiplexer 785, under the control of the thread selection control circuitry 805. Threads in the second priority queue 760 are pulled from the queue and executed at a lower rate than threads in the first priority queue 755.

As mentioned above, the pair of instructions ENP and ELP is used to transition a thread from the first priority to the second priority (ELP) and vice versa (ENP).

A thread in a parallel application typically has to wait for other threads to complete before resuming execution (i.e., a barrier operation). The wait is accomplished by communication between threads. This communication may be supported by an event that wakes a paused thread, or by the waiting thread polling a memory location. When threads poll, they consume processing resources that threads with productive work could otherwise use, wasting those resources. The second or low priority queue 760 allows waiting threads to enter a low priority mode, which reduces the overhead of polling threads, so that threads performing high-productivity work receive the bulk of the available processing resources.

A configuration register determines the number of high priority threads to be run for each low priority thread, shown in FIG. 30 as the low priority skip count, which is provided to the thread selection control circuitry 805; the thread selection control circuitry 805 then selects threads from the second priority queue 760 at the resulting intervals. As shown, the thread selection control circuitry 805 decrements the skip count (register 842, multiplexer 844, and adder 846) until it equals zero (logic block 848), at which point the select input of the thread selection multiplexer 785 switches to select a thread from the second or low priority queue 760.
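A behavioral C sketch of this skip-count selection; the names are illustrative, and the actual logic is implemented in hardware (registers 842-848) rather than software.

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t skip_count; /* configured low priority skip count */
    uint32_t remaining;  /* counts down toward zero */
} sel_state_t;

/* One selection step: returns true when the low priority queue 760
 * should be selected instead of the high priority queue 755. */
bool select_low_priority(sel_state_t *s, bool low_queue_nonempty) {
    if (!low_queue_nonempty)
        return false;                 /* nothing waiting at low priority */
    if (s->remaining == 0) {
        s->remaining = s->skip_count; /* reload and pick one low priority thread */
        return true;
    }
    s->remaining--;                   /* otherwise keep draining high priority */
    return false;
}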

Fig. 32 is a detailed block diagram of a representative embodiment of the data path control circuitry 795 of the HTP300. As mentioned above, one or more of the HTPs 300 may also include data path control circuitry 795 for controlling the size of accesses (e.g., load requests to the memory 125) through the first interconnection network 150 to manage potential congestion, providing adaptive bandwidth.

Application performance is typically limited by the memory bandwidth available to the processor. This limitation may be alleviated by ensuring that only data needed by the application is brought into the HTP300. The data path control circuitry 795 automatically (i.e., without user intervention) reduces the size of requests to main memory 125, reducing the load on the processor interface and the memory 125 subsystem.

As mentioned above, the computing resources of the system 100 may run many applications that use sparse data sets, frequently accessing small blocks of data distributed throughout the data set. Thus, if the amount of data accessed per request is large, much of that data may go unused, wasting bandwidth; for example, a 64-byte cache line may be fetched but not fully used. At other times, it may be beneficial to use all available bandwidth, e.g., for efficient power usage. The data path control circuitry 795 therefore provides dynamic adaptive bandwidth through the first interconnection network 150, adjusting the size of data path loads to optimize the performance of any given application, e.g., reducing a load to 8-32 bytes (as an example) based on utilization of the receive (e.g., response) channel of the first interconnection network 150 back to the HTP300.

The data path control circuitry 795 monitors the utilization level on the first interconnection network 150 and reduces the size of memory 125 load (i.e., read) requests from the network interface circuitry 735 as the utilization level increases. In a representative embodiment, the data path control circuitry 795 computes a time-averaged weighting of the utilization level of the response channels of the first interconnection network 150 (time-averaged utilization block 764). If, after a fixed period of time (adjust interval timer 762), the utilization level is above a threshold (and the load request size is greater than the minimum), as determined by the threshold logic 766 (having multiple comparators 882 and select multiplexers 884, 886), the size of the load request is reduced by the load request access size logic 768, using the negative increment 892, generally by a factor of 2 (e.g., 8 bytes), so that (a) fewer packets 162 are included in a series of packets 162, freeing bandwidth for routing packets to another location or for another process, or (b) utilization of the memory 125 is more efficient (e.g., 64 bytes are not requested when only 16 bytes will be used). If, after a fixed period of time, the utilization level is below the threshold (and the load request size is less than the maximum), as determined by the threshold logic 766, the size of the load request is increased by the load request access size logic 768, using the positive increment 888, again generally by a factor of 2 (e.g., 8 bytes). The minimum and maximum load request sizes may be configurable by the user; however, the minimum size is typically the size of the issued load instruction (e.g., the maximum operand size of the HTP300, e.g., 8 bytes), and the maximum size is the cache line size (e.g., 32 or 64 bytes). In an alternative embodiment, the data path control circuitry 795 may be located at the memory controller 120 to accommodate bandwidth pressure from multiple HTPs 300.
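A behavioral C sketch of one adjustment interval, assuming factor-of-2 steps as described above; the thresholds and field names are illustrative assumptions.

#include <stdint.h>

typedef struct {
    uint32_t req_size;     /* current load request size, bytes */
    uint32_t min_size;     /* e.g., 8 (maximum operand size) */
    uint32_t max_size;     /* e.g., 64 (cache line size) */
    uint32_t util_avg;     /* time-averaged response channel utilization, percent */
    uint32_t hi_threshold; /* shrink requests above this utilization */
    uint32_t lo_threshold; /* grow requests below this utilization */
} bw_ctl_t;

/* Called once per adjustment interval (timer 762 in Fig. 32). */
void bw_adjust(bw_ctl_t *c) {
    if (c->util_avg > c->hi_threshold && c->req_size > c->min_size)
        c->req_size /= 2;  /* congested: issue smaller memory loads */
    else if (c->util_avg < c->lo_threshold && c->req_size < c->max_size)
        c->req_size *= 2;  /* idle: allow larger loads again */
}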

Fig. 33 is a detailed block diagram of representative embodiments of the system call circuitry 815 of the HTP300 and the host interface circuitry 115. The representative system 100 embodiments allow user-mode compute-only elements, such as the HTP300, to perform system calls, breakpoints, and other privileged operations without running an operating system, in order to open files, print, and the like. To do so, any of these system operations is initiated by the HTP300 executing a user-mode instruction. Instruction execution on the processor identifies that the request must be forwarded to the host processor 110 for execution. The system request from the HTP300 takes the form of a system call work descriptor packet sent to the host processor 110; in response, the HTP300 receives a system call return work descriptor packet.

The system call work descriptor packet assembled and transmitted by the packet encoder 780 contains the system call identifier (e.g., thread ID, core 705 number), the virtual address indicated by the program counter, the system call arguments or parameters (which are typically stored in the general purpose registers 728), and return information. The packet is sent to the host interface 115 (SRAM FIFO 864), which writes and queues the system call work descriptor packet in a main memory queue, such as the DRAM FIFO 866 in the main memory of the host processor 110, and increments the write pointer; the host interface 115 further sends an interrupt to the host processor 110 to cause the host processor 110 to poll for system call work descriptor packets in memory. The host processor's operating system accesses the queue (DRAM FIFO 866) entry, performs the requested operation, places the return work descriptor data in a main memory queue (DRAM FIFO 868), and may signal the host interface 115. The host interface 115 monitors the status of the return queue (DRAM FIFO 868) and, when an entry exists, moves the data into the output queue (SRAM output queue 872), formats a return work descriptor packet with the provided work descriptor data, and sends the return work descriptor packet to the HTP300 that originated the system call packet.

The packet decoder 775 of the HTP300 receives the return work descriptor packet and places the returned arguments in the general purpose registers 728, as if the local processor (HTP300) had performed the operation itself. This execution, transparent to applications running on the user-mode HTP300, enables use of the same programming environment and runtime libraries as when the processor has a local operating system, and is well suited to various scenarios such as program debugging with inserted breakpoints.

However, the host interface 115 typically has limited FIFO space, which can be problematic when utilizing multiple HTPs 300, each HTP300 having a large number of cores (e.g., 96), where each core may run a large number of threads (e.g., 32 per core). To avoid adding a large amount of memory to the host interface 115, a system call credit mechanism is used for each HTP300 and each processor core 705 within the HTP300 to limit the total number of system calls that can be submitted.

Each processor core 705 includes a first register 852, as part of the system call circuitry 815, for maintaining a first credit count. The system call circuitry 815 provided for each HTP300 also includes a second register 858 that holds a second credit count, serving as a pool of available credits. When a system call work descriptor packet is generated, it may be transmitted if sufficient credits are available in the first register 852; if not, it is queued in the system call work descriptor packet table 862, possibly with other system call work descriptor packets from other processor cores 705 of the given HTP300 (via multiplexer 854). If sufficient credits are available in the second register 858, which provides an additional pool of credits for system call bursts shared among all processor cores 705 of the HTP300, the next queued system call work descriptor packet may be transmitted; otherwise, it remains in the table.

When those system call work descriptor packets are processed by the host interface 115 and read from the FIFO 864, the host interface 115 generates an acknowledgement back to the system call circuitry 815, causing the corresponding register 856 (shown as registers 856₀ and 856₁) to be incremented, after which the first credit count in the first register 852 may be incremented for each processor core 705.

Alternatively, the registers 856 may be used in place of the first registers 852, without a separate first register 852 per core; the first count is instead held in the register 856 for each core 705. As another alternative, for each core 705, all system call work descriptor packets may be queued in the system call work descriptor packet table 862 and transmitted when the core has a sufficient first credit count in its corresponding register 856 or sufficient available credits in the second register 858.
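A behavioral C sketch of the two-level credit check, under the assumption that per-core credits are consumed before the shared pool; the names are illustrative.

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t core_credits;  /* per-core first credit count (register 852) */
    uint32_t *pool_credits; /* shared second credit pool (register 858) */
} syscall_credits_t;

/* Decide whether a queued system call work descriptor packet may be sent.
 * A packet consumes a per-core credit if one is available; otherwise it
 * may borrow from the shared pool that absorbs bursts. */
bool try_send_syscall(syscall_credits_t *c) {
    if (c->core_credits > 0) {
        c->core_credits--;
        return true;
    }
    if (*c->pool_credits > 0) {
        (*c->pool_credits)--;
        return true;
    }
    return false; /* stay queued in table 862 until credits return */
}

/* Host interface acknowledgement: return a credit to the issuing core. */
void on_syscall_ack(syscall_credits_t *c) {
    c->core_credits++;
}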

A mechanism for thread state monitoring is also provided, to collect in hardware the status of a set of threads running on the HTP300, thereby enabling a programmer to see how the application is working. For example, if this feature is present, the host processor 110 may periodically access and store the information for subsequent use in generating user analysis reports. With this visibility, the programmer can change the application to improve its performance.

All thread state changes may be monitored, and statistics may be kept on the amount of time spent in each state. The processor (110 or 300) collecting the statistical data provides a means for a separate second processor (110 or 300) to access and store the data. Data is collected while the application is running, so that periodic reports showing the amount of time spent in each state can be provided to the application analyst, giving detailed visibility into the running application for subsequent use.

According to a representative embodiment, which may be implemented in hardware or software, all information related to a thread is stored in the various registers of the thread memory 720 and may be regularly copied and saved to another location. A counter may be utilized to capture the amount of time any given thread spends in a selected state (e.g., a paused state). For example, the host processor 110 may record or capture the current state of all threads together with the thread counters (the amount of time spent in each state), or the difference (delta) between states and counts over time, and write this to a file or otherwise save it in memory. As another example, a program may contain a barrier, at which all threads must arrive before anything else can begin, and it is helpful to monitor which thread is in which state as threads travel through the various barriers or change states. The code shown below is an example of simulator code to be executed as, or converted to, hardware:

InStateCount[N]–6b
InStateTimeStamp[N]–64b
InStateTotalTime[N]–64b
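A C sketch of how these fields might be updated on a thread state change; this per-state version is a simplification (a full implementation would also track a timestamp per thread), and all names beyond the three declared arrays are illustrative assumptions.

#include <stdint.h>

#define NUM_STATES 16 /* illustrative; "N" in the declarations above */

static uint8_t  InStateCount[NUM_STATES];     /* 6b: threads currently in each state */
static uint64_t InStateTimeStamp[NUM_STATES]; /* 64b: time of last entry into the state */
static uint64_t InStateTotalTime[NUM_STATES]; /* 64b: accumulated time in the state */

/* On a thread state change, fold the elapsed time into the old state's
 * total and start timing the new state; "now" is a free-running counter. */
void on_state_change(unsigned old_state, unsigned new_state, uint64_t now) {
    InStateTotalTime[old_state] += now - InStateTimeStamp[old_state];
    if (InStateCount[old_state] > 0)
        InStateCount[old_state]--;
    InStateCount[new_state]++;
    InStateTimeStamp[new_state] = now;
}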

v. system memory and virtual addressing:

the system 100 architecture provides a partitioned global address space across all nodes within the system 100. Each node has a portion of the memory of the shared physical system 100. The physical memory of each node is partitioned into local private memory and globally shared distributed memory.

The local private memory 125 of a node is accessible by all computing elements within the node. The computing elements within a node participate in a hardware-based cache coherency protocol. The host processor 110 and the HTP300 each maintain a small data cache to speed up references to private memory. The HTF200 does not have a private memory cache (other than the memory 325 and the configuration memory 160), but instead relies on the memory subsystem cache to hold frequently accessed values. Read and write requests of the HTF200 are made coherent at the time of access: a directory-based cache coherency mechanism ensures that an HTF200 read access obtains the most recently written memory value, and that an HTF200 write invalidates the shared processor cache entries before the HTF200 value is written to memory.

The distributed shared memory of the system 100 is accessible by all computing elements (e.g., HTF200 and HTP300) within all nodes of the system 100. The processing elements of the system 100 do not have caches for shared memory, but may have read/write buffers, with invalidation/flushing controlled by software, to minimize accesses to the same memory line. The RISC-V ISA provides fence (FENCE) instructions that can be used to indicate when memory buffer invalidation/flushing is required. Similarly, the HTF200 supports write pause operations to indicate that all write operations to memory have completed. These write pause operations may be used to empty the read/write buffers.

The external host processor 110 will have its own system memory. The node private virtual address space of the application may include the host processor system memory and the node private memory of the system 100. Access to system memory by the external host processor 110 may be kept coherent by the host processor's cache coherency protocol. Access by the external host processor 110 to the system 100's node private memory across the PCIe or other communication interface 130 may be kept coherent by not allowing the host processor 110 to cache the data. Other host-to-node interfaces of the system 100 (e.g., CCIX or OpenCAPI) may allow the host processor to cache accessed data. Access to host processor system memory by the node computing elements of the system 100 across the PCIe interface may be kept coherent by not allowing the computing elements to cache the data. Other host-to-node interfaces of the system 100 (e.g., CCIX or OpenCAPI) may allow the computing elements of the system 100 to cache data.

The external host processor 110 may access the private memory of a node through the PCIe or other communication interface 130. These accesses may not be cached by the external host processor 110. Similarly, all node processing elements may access the memory of an external processor through the PCIe or other communication interface 130. Performance is typically much higher when a node's processing element accesses the memory of the external host, rather than having the host push data to the node. The architecture of the node compute elements is able to handle a large number of pending requests and to tolerate long access latencies.

As mentioned above, in representative embodiments, the process virtual address space of the system 100 maps to physical memory on one or more physical nodes of the system 100. The architecture of the system 100 includes the concept of "virtual" nodes. A virtual address of the system 100 contains a virtual node identifier. The virtual node identifier allows a requesting computing element to determine whether the virtual address refers to local node memory or remote node memory. A virtual address referring to local node memory is translated to a local node physical address by the requesting computing element. A virtual address referring to remote node memory is sent to the remote node, where, upon entering the node, the virtual address is translated to a remote node physical address.

The concept of virtual nodes allows a process to use the same set of virtual node identifiers regardless of which physical nodes the application is actually executing on. The range of virtual node identifiers for a process starts at zero and extends to N-1, where N is the number of virtual nodes in the process. The number of virtual nodes in a process is determined at run time. The application makes a system call to obtain physical nodes, and the operating system then determines how many virtual nodes the process will have. The number of physical nodes given to a process is limited by the number of physical nodes in the system 100. The number of virtual nodes may be equal to or greater than the number of physical nodes, but must be a power of two. Having a large number of virtual nodes allows the memory 125 to be more uniformly distributed across the physical nodes. By way of example, if there are 5 physical nodes and the process is set to use 32 virtual nodes, then the shared distributed memory may be distributed across the physical nodes in increments of 1/32; the five nodes will then hold 7/32, 7/32, 6/32, 6/32, and 6/32 of the shared distributed memory, respectively. The uniformity of the memory distribution also gives these five nodes more uniform bandwidth demands.
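The distribution arithmetic can be sketched in C as follows; the function is illustrative and simply reproduces the 7/32, 7/32, 6/32, 6/32, 6/32 example above.

#include <stdio.h>

/* Distribute V virtual nodes as evenly as possible over P physical nodes. */
void distribute_virtual_nodes(unsigned V, unsigned P) {
    for (unsigned p = 0; p < P; p++) {
        /* The first (V % P) physical nodes receive one extra virtual node. */
        unsigned share = V / P + (p < V % P ? 1 : 0);
        printf("physical node %u: %u/%u of shared memory\n", p, share, V);
    }
}

/* distribute_virtual_nodes(32, 5) prints shares of 7, 7, 6, 6, 6 (out of 32). */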

Having more virtual nodes than physical nodes within a process implies that multiple virtual nodes are assigned to one physical node. Each compute element of a node will have a small, per-process local virtual node ID table. There is a maximum number of virtual node IDs per physical node ID. For example, the maximum number of virtual node IDs per physical node ID may be eight, allowing memory and bandwidth to be fairly uniform across the different physical nodes without oversizing the virtual node ID table of each computing element.
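By way of example, and not limitation, a minimal C sketch of such a per-process table follows. The field names, widths, and table capacity are assumptions; only the eight-virtual-node-IDs-per-physical-node bound is taken from the description above.

    #include <stdbool.h>
    #include <stdint.h>

    /* Bound from the description above; the rest of this sketch is
     * hypothetical. */
    #define MAX_VNODES_PER_PNODE 8

    /* Hypothetical per-compute-element, per-process table mapping
     * virtual node IDs to physical node IDs. */
    typedef struct {
        uint16_t n_virtual_nodes;       /* power of two, set at run time */
        uint16_t physical_node_id[256]; /* indexed by virtual node ID */
    } vnode_table_t;

    /* Translate a virtual node ID to a physical node ID; returns
     * false for an out-of-range virtual node ID. */
    static bool vnode_to_pnode(const vnode_table_t *t, uint16_t vnode,
                               uint16_t *pnode)
    {
        if (vnode >= t->n_virtual_nodes)
            return false;
        *pnode = t->physical_node_id[vnode];
        return true;
    }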

The architecture of system 100 defines a single common virtual address space for use by all computing elements. This common virtual address space is used by all threads executing on the computing elements (host processor 110, HTP300, and HTF200) of system 100 on behalf of the application. The virtual-to-physical address translation process for a scalable multi-node system is carefully defined to minimize performance degradation as system 100 is scaled. To solve this scaling problem, the architecture of system 100 pushes virtual-to-physical address translation to the node where the physical memory resides. Deferring the translation in this way means that the referenced virtual address must be carried in the request packet sent to the destination node, and that the request packet must be routed using information in the virtual address (since the physical address is not available until the packet reaches the destination node). For this purpose, a destination node ID is embedded in the virtual address. The exception is external host virtual addresses referring to node-local private memory; this exception is required due to the limitations of the x86 processor virtual address space.
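By way of example, and not limitation, the following C struct sketches one possible layout for such a request packet. Every field name and width here is an assumption; the description above specifies only that the untranslated virtual address (and, for the shared formats discussed below, the global space ID) travels in the packet and that routing uses the node ID embedded in the address.

    #include <stdint.h>

    /* Hypothetical remote memory request packet (illustrative only). */
    typedef struct {
        uint16_t dest_physical_node; /* from translating the embedded node ID */
        uint16_t gsid;               /* global space ID of the process */
        uint64_t virtual_address;    /* translated only at the destination node */
        uint8_t  op;                 /* e.g., read or write */
        uint8_t  size_log2;          /* log2 of the access size in bytes */
    } mem_request_packet_t;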

The virtual addresses of the current generation of x86 processors are 64 bits wide. However, of these 64 bits, only the lower 48 are implemented; the upper 16 bits must contain the sign extension of bit 47 of the lower 48 bits. Because of this processor limitation, the virtual address space of an application running on a standard Linux operating system is split into virtual addresses whose upper bits are all zeros or all ones. FIG. 37 illustrates the virtual address space format supported by the architecture of the system 100.
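By way of illustration, this sign-extension ("canonical address") rule may be checked in a few lines of C; the function name below is an assumption.

    #include <stdbool.h>
    #include <stdint.h>

    /* Minimal sketch: a 64-bit value is a canonical x86 virtual
     * address if bits 63:48 equal the sign extension of bit 47.
     * Assumes the usual two's-complement arithmetic right shift. */
    static bool is_x86_canonical(uint64_t va)
    {
        int64_t extended = ((int64_t)(va << 16)) >> 16;
        return (uint64_t)extended == va;
    }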

The virtual addresses of the system 100 are defined to support a full 64-bit virtual address space. The upper three bits of the virtual address specify the address format. The formats are defined in Table 31.

Table 31:

Virtual address format ID and description:

0: external host memory and local node private memory (upper address bits all zeros)
1: local node private memory with embedded node ID
2: non-interleaved distributed shared memory
3: interleaved distributed shared memory
4: not used (illegal; may generate a reference exception)
5: not used (illegal; may generate a reference exception)
6: local node private memory with embedded node ID (sign-extended form)
7: external host memory and local node private memory (upper address bits all ones)
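By way of illustration, selecting the format is a single shift of the upper three bits; the function name below is an assumption.

    #include <stdint.h>

    /* Minimal sketch: the upper three bits (63:61) of a system 100
     * virtual address select the Table 31 format. */
    static inline unsigned va_format(uint64_t va)
    {
        return (unsigned)(va >> 61);
    }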

The exceptions noted in Table 31 may occur in two situations: (1) a private address is sent to a remote node HTP or HTF computing element as an argument of a call or return operation sent to that node, or (2) a data structure in shared memory contains a pointer to private memory.

Fig. 38 shows the translation process for each virtual address format. Referring to FIGS. 37 and 38:

(a) Formats 0 and 7 are used by the external host processor 110 and by the local node host processor 110, HTP300, and HTF200 computing elements to access external host memory as well as local node private memory. The source computing element of the memory request translates the virtual address to a physical address.

(b) Formats 1 and 6 are used by the local node host processor 110, HTP300, and HTF200 computing elements to access local node private memory as well as external host memory. It should be noted that using this format allows a remote node device to verify that a local node private memory reference is actually intended for the local node. This applies to the case where a private virtual address of the local node is used by a remote node: the remote node may compare the embedded node ID to the local node ID and detect a memory reference error. Note that this detection capability is not available for format 0.

(c) Format 2 is used by all node host processors 110, HTP300, and HTF200 computing elements to access non-interleaved distributed shared memory. An allocation in this memory format allocates a contiguous block of physical memory on the node where the allocation is made. Each node of the process is numbered with a virtual node ID that starts at zero and increases to one less than the number of nodes in the process. Virtual-to-physical address translation first translates the virtual node ID in the virtual address to a physical node ID. Node ID translation is performed at the source node. Once translated, the physical node ID is used to route the request to the destination node. It should be noted that the global space ID (GSID) and the virtual address are both sent in the packet to the destination node. Once at the destination node, the remote node interface receives the request packet and translates the virtual address into a physical address of that node (see the sketch following this list).

(d) Format 3 is used by all node host processors 110, HTP300, and HTF200 computing elements to access interleaved distributed shared memory. An allocation in this memory format allocates a block of memory on each node participating in the interleaving (a power of two number of nodes, up to the number of nodes in the process). References in this format are interleaved at a 4 Kbyte granularity (the actual interleave granularity is under investigation). The first step of the translation process is to swap the virtual node ID in the virtual address from the lower bits to the upper bits (into the position starting at bit 48). After the node ID is swapped, the virtual node ID is translated to a physical node ID. The node ID swap and translation occur at the source node. The physical node ID is used to route the request to the destination node. It should be noted that the global space ID (GSID) and the virtual address are both sent in the packet to the destination node. Once at the destination node, the remote node interface receives the request packet and translates the virtual address into a node physical address (see the sketch following this list).

(e) Formats 4 and 5 are not used; in representative embodiments these formats are illegal and may generate a reference exception.
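By way of example, and not limitation, the following C sketch illustrates the source-node steps for formats 2 and 3 described above. The bit positions (format field in bits 63:61, upper node ID field at bit 48, 4 Kbyte interleave offset below bit 12) and all identifiers are assumptions for illustration, not a definitive address layout.

    #include <stdint.h>

    /* Assumed field positions (illustrative only). */
    #define NODE_ID_SHIFT_HI 48   /* upper node ID field */
    #define INTERLEAVE_SHIFT 12   /* 4 Kbyte interleave granularity */

    /* Format 2: the virtual node ID already sits in the upper field. */
    static uint16_t format2_vnode(uint64_t va, unsigned node_id_bits)
    {
        uint64_t mask = (1ull << node_id_bits) - 1;
        return (uint16_t)((va >> NODE_ID_SHIFT_HI) & mask);
    }

    /* Format 3: swap the virtual node ID from the low field (just
     * above the 4 Kbyte interleave offset) up to bit 48. The returned
     * address is what travels in the request packet; *vnode receives
     * the virtual node ID, to be translated to a physical node ID
     * (e.g., by a table lookup such as the hypothetical
     * vnode_to_pnode() sketched earlier) for routing. */
    static uint64_t format3_swap(uint64_t va, unsigned node_id_bits,
                                 uint16_t *vnode)
    {
        uint64_t mask = (1ull << node_id_bits) - 1;
        uint64_t lo   = (va >> INTERLEAVE_SHIFT) & mask;
        uint64_t hi   = (va >> NODE_ID_SHIFT_HI) & mask;

        *vnode = (uint16_t)lo;

        /* Exchange the two fields, leaving all other bits unchanged. */
        va &= ~((mask << INTERLEAVE_SHIFT) | (mask << NODE_ID_SHIFT_HI));
        va |=  (hi   << INTERLEAVE_SHIFT)  | (lo   << NODE_ID_SHIFT_HI);
        return va;
    }

Because the swap is its own inverse, applying format3_swap a second time with the same assumed field positions restores the original address, consistent with routing on the upper field while carrying the full virtual address in the packet.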

Many of the advantages of the representative embodiments are readily apparent. The representative apparatus, systems, and methods provide a computing architecture capable of providing high-performance and energy-efficient solutions for compute-intensive kernels, for example, computing Fast Fourier Transforms (FFTs) and Finite Impulse Response (FIR) filters for sensing, communication, and analytic applications such as synthetic aperture radar and 5G base stations, and for graph analytics applications such as, but not limited to, graph clustering using spectral techniques, machine learning, 5G networking algorithms, and large stencil codes.

Notably, the various representative embodiments provide a multi-threaded, coarse-grained, configurable computing architecture that can be configured for any of these various applications and, most importantly, is also capable of self-scheduling; dynamic self-configuration and self-reconfiguration; conditional branching; backpressure control for asynchronous signaling; ordered and loop thread execution (including data dependencies); automatic start of thread execution upon completion of data dependencies and/or ordering; providing loop access to private variables; providing fast execution of loop threads using reentrant queues; and advanced loop execution using various thread identifiers, including for nested loops.

Also, the representative devices, systems, and methods provide a processor architecture capable of self-scheduling and massively parallel processing, and further capable of interacting with and controlling the configurable computing architecture to execute any of these various applications.

As used herein, a "processor core" 705 may be any type of processor core and may be embodied as one or more processor cores configured, designed, programmed, or otherwise adapted to perform the functionality discussed herein. As used herein, a "processor" 110 may be any type of processor and may be embodied as one or more processors configured, designed, programmed or otherwise adapted to perform the functionality discussed herein. When the term processor is used herein, the processor 110 or 300 may include the use of a single integrated circuit ("IC"), or may include the use of multiple integrated circuits or other components connected, arranged, or grouped together, such as controllers, microprocessors, digital signal processors ("DSPs"), array processors, graphics or image processors, parallel processors, multi-core processors, custom ICs, application specific integrated circuits ("ASICs"), field programmable gate arrays ("FPGAs"), adaptive logic, or the likeThe IC, associated memory (e.g., RAM, DRAM, and ROM), and other ICs and components, whether analog or digital, should be computed. Thus, as used herein, the term processor or controller should be understood to mean and include a single IC, or the following arrangement, equally: a custom IC, ASIC, processor, microprocessor, controller, FPGA, adaptive computing IC, or some other grouping of integrated circuits that perform the functions discussed herein, and having associated memory, e.g., microprocessor memory or additional RAM, DRAM, SDRAM, SRAM, MRAM, ROM, flash, EPROM, or E2And (7) PROM. The processor 110 or 300 with associated memory may be adapted or configured (by programming, FPGA interconnection, or hard-wiring) to perform the methods of the present invention, as discussed herein. For example, the method may be programmed and stored in the processor 300 and other equivalent components with associated memory (and/or memory 125) as a set of program instructions or other code (or equivalent configuration or other program) for later execution when the processor 110 or 300 is operational (i.e., powered on and running). Equally, while processor 300 may be implemented in whole or in part as an FPGA, custom IC, and/or ASIC, the FPGA, custom IC, or ASIC may also be designed, configured, and/or hardwired to implement the methods of the present invention. For example, the processor 110 or 300 may be implemented as the following arrangement: analog and/or digital circuits, controllers, microprocessors, DSPs and/or ASICs, which are collectively referred to as "processors" or "controllers," and which are respectively hardwired, programmed, designed, adapted or configured to implement the methods of the present invention, including possibly implemented in conjunction with the memory 125.

Depending upon the selected embodiment, the memory 125, 325, which may include a data repository (or database), may be embodied in any number of forms, including within any computer or other machine-readable data storage medium, memory device, or other storage or communication device for the storage or communication of information, currently known or which becomes available in the future, including, but not limited to, a memory integrated circuit ("IC"), or a memory portion of an integrated circuit (such as resident memory within the processor 110 or 300, or within another processor IC), whether volatile or non-volatile, whether removable or non-removable, including, but not limited to, RAM, flash, DRAM, SDRAM, SRAM, MRAM, FeRAM, ROM, EPROM, or E²PROM, or any other form of memory device, such as a magnetic hard drive, an optical drive, a magnetic disk or tape drive, a hard disk drive, or other machine-readable storage or memory media, such as a floppy disk, a CDROM, a CD-RW, a digital versatile disk (DVD), or other optical memory, or any other type of memory, storage medium, or data storage apparatus or circuit, which is known or which becomes known. The memory 125, 325 may be used to store various look-up tables, parameters, coefficients, other information and data, programs or instructions (of the software of the present invention), and other types of tables, such as database tables.

As indicated above, the processors 110, 300 are hard-wired or programmed, using the software and data structures of the present invention, for example, to perform the methods of the present invention. Accordingly, the systems and related methods of the present invention, including the various instructions, may be embodied as software which provides such programming or other instructions, such as instruction sets and/or metadata embodied within a non-transitory computer-readable medium, as discussed above. In addition, metadata may also be utilized to define the various data structures of a look-up table or a database. Such software may be in the form of source or object code, by way of example and without limitation. Source code further may be compiled into some form of instructions or object code, including assembly language instructions or configuration information. The software, source code, or metadata of the present invention may be embodied as any type of code, such as C, C++, Matlab, SystemC, LISA, XML, Java, Brew, SQL and its variations (e.g., SQL 99 or proprietary versions of SQL), DB2, Oracle, or any other type of programming language which performs the functionality discussed herein, including various hardware definition or hardware modeling languages (e.g., Verilog, VHDL, RTL) and resulting database files (e.g., GDSII). As a consequence, a "construct," "program construct," "software construct," or "software," as used equivalently herein, means and refers to any programming language, of any kind, with any syntax or signatures, which provides or can be interpreted to provide the associated functionality or methodology specified (when, for example, instantiated or loaded into a processor or computer, including the processors 110, 300, and executed).

The software, metadata, or other source code of the present invention, and any resulting bit files (object code, databases, or look-up tables), may be embodied within any tangible, non-transitory storage medium, such as any of the computer or other machine-readable data storage media, as computer-readable instructions, data structures, program modules, or other data, such as discussed above with respect to the memory 125, 325 (e.g., a floppy disk, a CDROM, a CD-RW, a DVD, a magnetic hard drive, an optical drive, or any other type of data storage apparatus or medium, as mentioned above).

The communication interface 130 is utilized for appropriate connection to a relevant channel, network, or bus; for example, the communication interface 130 may provide impedance matching, drivers, and other functions for a wireline interface, may provide demodulation and analog-to-digital conversion for a wireless interface, and may provide a physical interface, respectively, for the processors 110, 300 and/or the memory 125, with other devices. In general, the communication interface 130 is used to receive and transmit data, depending upon the selected embodiment, such as program instructions, parameters, configuration information, control messages, data, and other pertinent information.

The communication interface 130 may be implemented as known or may become known in the art, to provide data communication between the HTF200 and/or the processors 110, 300 and any type of network or external device (e.g., wireless, optical, or wireline), using any applicable standard, such as, but not limited to, one of the various PCI, USB, RJ45, Ethernet (Fast Ethernet, Gigabit Ethernet, 100Base-TX, 100Base-FX, etc.), IEEE 802.11, Bluetooth, WCDMA, WiFi, GSM, GPRS, EDGE, and 3G standards and the other standards and systems mentioned above, and may include impedance matching capability, voltage translation for a low-voltage processor to interface with a higher-voltage control bus, wireline or wireless transceivers, and various switching mechanisms (e.g., transistors) to turn various lines or connectors on or off in response to signaling from the processor 110 or 300. In addition, the communication interface 130 may also be configured and/or adapted to receive and/or transmit signals externally to the system 100, such as through hard-wiring or RF or infrared signaling, for example, to receive information in real time for output, such as on a display. The communication interface 130 may provide connection to any type of bus or network structure or medium, using any selected architecture. By way of example and without limitation, such architectures include Industry Standard Architecture (ISA) bus, Enhanced ISA (EISA) bus, Micro Channel Architecture (MCA) bus, Peripheral Component Interconnect (PCI) bus, SAN bus, or any other communication or signaling medium, such as Ethernet, ISDN, T1, satellite, wireless, and so on.

The present disclosure is to be considered as an exemplification of the principles of the invention and is not intended to limit the invention to the specific embodiments illustrated. In this respect, it is to be understood that the invention is not limited in its application to the details of construction and to the arrangements of components set forth above and below, illustrated in the drawings, or as described in the examples. Systems, methods, and apparatuses consistent with the present invention are capable of other embodiments and of being practiced and carried out in various ways.

Although the present invention has been described with respect to specific embodiments thereof, these embodiments are merely illustrative, and not restrictive of the invention. In the description herein, numerous specific details are provided, such as examples of electronic components, electronic and structural connections, materials, and structural changes, to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that an embodiment of the invention can be practiced without one or more of the specific details, or with other apparatus, systems, assemblies, components, materials, parts, and/or the like. In other instances, well-known structures, materials, or operations are not specifically shown or described in detail to avoid obscuring aspects of embodiments of the invention. Furthermore, the various drawings are not to scale and should not be taken as limiting.

Reference throughout this specification to "one embodiment," "an embodiment," or a "specific embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention and not necessarily in all embodiments, and, further, does not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics of any specific embodiment of the present invention may be combined in any suitable manner and in any suitable combination with one or more other embodiments, including the use of selected features without corresponding use of other features. In addition, many modifications may be made to adapt a particular application, situation, or material to the essential scope and spirit of the present invention. It is to be understood that other variations and modifications of the embodiments of the present invention described and illustrated herein are possible in light of the teachings herein and are to be considered part of the spirit and scope of the present invention.

Where a range of values is recited herein, each intervening value, to the same degree of precision, is explicitly contemplated. For example, for the range 6-9, the values 7 and 8 are contemplated in addition to 6 and 9, and for the range 6.0-7.0, the values 6.0, 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 6.9, and 7.0 are explicitly contemplated. In addition, every intervening sub-range within a range, in any combination, is contemplated to be within the scope of the disclosure. For example, for the range 5-10, the sub-ranges 5-6, 5-7, 5-8, 5-9, 6-7, 6-8, 6-9, 6-10, 7-8, 7-9, 7-10, 8-9, 8-10, and 9-10 are contemplated to be within the scope of the disclosed range.

It will also be appreciated that one or more of the elements depicted in the drawings/figures may also be implemented in a more separated or integrated manner, or even removed or rendered inoperable in certain cases, as may be useful in accordance with a particular application. Integrally formed combinations of components are also within the scope of the invention, particularly for embodiments in which a separation or combination of discrete components is unclear or indiscernible. In addition, the term "coupled," as used herein, including in its various forms such as "coupling" or "couplable," means and includes any direct or indirect electrical, structural, or magnetic coupling, connection, or attachment, or adaptation or capability for such a direct or indirect electrical, structural, or magnetic coupling, connection, or attachment, including integrally formed components and components which are coupled via or through another component.

With respect to signals, reference is made herein to parameters that "represent" a given metric or are "representative of" a given metric, where a metric is a measure of the state of at least a portion of the regulator or its inputs or outputs. A parameter is considered to represent a metric if it is related to the metric directly enough that regulating the parameter will satisfactorily regulate the metric. A parameter may be considered an acceptable representation of a metric if it represents a multiple of, or a fraction of, the metric.

Moreover, any signal arrows in the drawings/figures should be considered only exemplary, and not limiting, unless otherwise specifically noted. Combinations of components or steps will also be considered within the scope of the present invention, particularly where the ability to separate or combine is unclear or foreseeable. The disjunctive term "or," as used herein and throughout the claims that follow, is generally intended to mean "and/or," having both conjunctive and disjunctive meanings (and is not confined to an "exclusive or" meaning), unless otherwise indicated. As used in the description herein and throughout the claims that follow, "a," "an," and "the" include plural references unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of "in" includes "in" and "on" unless the context clearly dictates otherwise.

The foregoing description of illustrated embodiments of the present invention, including what is described in the summary or in the abstract, is not intended to be exhaustive or to limit the invention to the precise forms disclosed herein. From the foregoing, it will be observed that numerous variations, modifications, and substitutions are contemplated and may be effected without departing from the spirit and scope of the novel concepts of the present invention. It is to be understood that no limitation with respect to the specific methods and apparatus illustrated herein is intended or should be inferred. It is intended, of course, to cover by the appended claims all such modifications as fall within the scope of the claims.
