Data transmission method, processor system, readable storage medium and electronic device

Document No.: 1875397    Publication date: 2021-11-23

Description: This technique, "Data transmission method, processor system, readable storage medium and electronic device", was designed and created by Pei Jing, Wang Song, Ma Cheng, and Xu Haizheng on 2021-08-25. Its main content is as follows: The present disclosure relates to a data transmission method, a processor system, a readable storage medium, and an electronic device, the method including: in a first working cycle, the computing core cluster transmits computation result data to the cache core cluster, and the cache core cluster transmits first data of the computation result data to the receiving core cluster; in a second working cycle, the computing core cluster executes the computing task, and the cache core cluster transmits second data of the computation result data to the receiving core cluster, the second data comprising the data in the computation result data other than the first data. The method can reduce the idle running and waiting time of the functional cores of each chip in the processor system, and improves the efficiency and execution speed of the processor system.

1. A data transmission method, applied to a processor system, wherein the processor system comprises a plurality of chips, each chip comprises a plurality of functional cores, and the data transmission speed between functional cores within the same chip is higher than the data transmission speed between functional cores in different chips,

wherein the plurality of functional cores of each chip are divided into a plurality of core clusters according to different tasks; in a case where a first chip transmits data to a second chip, the first chip comprises a computing core cluster for executing a computing task and a cache core cluster for executing data caching and transmission tasks, the second chip comprises a receiving core cluster for executing a data receiving task, and the first chip and the second chip are any two chips of the processor system,

wherein the method comprises the following steps:

in a first working cycle, the computing core cluster transmits computation result data to the cache core cluster, and the cache core cluster transmits first data of the computation result data to the receiving core cluster;

in a second working cycle, the computing core cluster executes a computing task, and the cache core cluster transmits second data of the computation result data to the receiving core cluster, wherein the second data comprises the data in the computation result data other than the first data.

2. The method of claim 1, wherein, in the first working cycle, the timing among the computing core cluster, the cache core cluster, and the receiving core cluster is synchronous;

in the second working cycle, the timing between the computing core cluster and the cache core cluster and between the computing core cluster and the receiving core cluster is asynchronous, and the timing between the cache core cluster and the receiving core cluster is synchronous.

3. The method of claim 1, wherein, in the first working cycle, the computing core cluster transmitting the computation result data to the cache core cluster comprises:

ending the first working cycle in a case where transmission of the computation result data is completed.

4. The method of claim 1, wherein data is transferred between different chips of the processor system in a serial manner,

wherein, in the first working cycle, the transmitting, by the cache core cluster, of the first data of the computation result data to the receiving core cluster comprises:

a first functional core in the cache core cluster transmits first sub-data of the first data to the receiving core cluster;

a second functional core in the cache core cluster transmits second sub-data of the first data to a multicast core cluster of a third chip, so that the multicast core cluster transmits the second sub-data to the receiving core cluster,

wherein the third chip is at least one chip of the processor system except the first chip and the second chip.

5. The method of claim 4, wherein, in the second working cycle, the cache core cluster transmitting the second data of the computation result data to the receiving core cluster comprises:

a first functional core in the cache core cluster transmits third sub-data of the second data to the receiving core cluster;

and the second functional core in the cache core cluster transmits fourth sub-data of the second data to the multicast core cluster of the third chip, so that the multicast core cluster transmits the fourth sub-data to the receiving core cluster.

6. The method of claim 5, further comprising:

in a first working cycle, the multicast core cluster of the third chip transmits the second sub-data to the receiving core cluster of the second chip;

in a second working cycle, the multicast core cluster of the third chip transmits the fourth sub-data to the receiving core cluster of the second chip.

7. The method of claim 1, wherein the first working cycle comprises N first sub-cycles, N being an integer greater than 1,

wherein, in the first working cycle, the transmitting, by the cache core cluster, of the first data of the computation result data to the receiving core cluster comprises:

in the 1st first sub-cycle, the cache core cluster does not transmit data to the receiving core cluster;

in the 2nd first sub-cycle to the Nth first sub-cycle, the cache core cluster transmits the first data to the receiving core cluster.

8. A processor system comprising a plurality of chips, each chip comprising a plurality of functional cores, the data transfer speed between the functional cores located within the same chip being higher than the data transfer speed between the functional cores located in different chips,

wherein the plurality of functional cores of each chip are divided into a plurality of core clusters according to different tasks; in a case where a first chip transmits data to a second chip, the first chip comprises a computing core cluster for executing a computing task and a cache core cluster for executing data caching and transmission tasks, the second chip comprises a receiving core cluster for executing a data receiving task, and the first chip and the second chip are any two chips of the processor system,

wherein the first chip is configured to:

in a first working cycle, the computing core cluster transmits computation result data to the cache core cluster, and the cache core cluster transmits first data of the computation result data to the receiving core cluster;

in a second working cycle, the computing core cluster executes a computing task, and the cache core cluster transmits second data of the computation result data to the receiving core cluster, wherein the second data comprises the data in the computation result data other than the first data.

9. A non-transitory computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method of any of claims 1 to 7.

10. An electronic device, characterized in that the electronic device comprises a processor system as claimed in claim 8.

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a data transmission method, a processor system, a readable storage medium, and an electronic device.

Background

With the continuous development of artificial intelligence technology, the requirement of a user on the processing capacity of a chip is higher and higher, and the use of many-core chips is wider and wider due to the limited processing capacity of single-core chips. The chip processing capacity is improved by distributing the computing tasks to the combination of a plurality of cores to be executed in parallel. In many-core architecture chips, not only the computation tasks assigned in each core need to be performed, but also data transmission between cores and data transmission between chips needs to be performed. The execution mechanism of the computation tasks and the data transmission among the many-core chips affects the overall operation performance of the chips.

Disclosure of Invention

In view of the above, the present disclosure provides a data transmission method, a processor system, a readable storage medium, and an electronic device.

According to an aspect of the present disclosure, there is provided a data transmission method, which is applied to a processor system, the processor system includes a plurality of chips, each chip including a plurality of functional cores, data transfer speed between the functional cores located within the same chip being higher than data transfer speed between the functional cores located in different chips, wherein the plurality of functional cores of each chip are divided into a plurality of core clusters according to different tasks, and in the case of transmitting data from a first chip to a second chip, the first chip includes a computing core cluster for executing a computing task, and a cache core cluster for performing a data caching and transmitting task, the second chip including a receive core cluster for performing a data receiving task, the first chip and the second chip are any two chips of the processor system, wherein the method comprises the following steps:

in a first working cycle, the computing core cluster transmits computation result data to the cache core cluster, and the cache core cluster transmits first data of the computation result data to the receiving core cluster; in a second working cycle, the computing core cluster executes a computing task, and the cache core cluster transmits second data of the computation result data to the receiving core cluster, wherein the second data comprises the data in the computation result data other than the first data.

In a possible implementation manner, in the first working cycle, the timing among the computing core cluster, the cache core cluster, and the receiving core cluster is synchronous; in the second working cycle, the timing between the computing core cluster and the cache core cluster and between the computing core cluster and the receiving core cluster is asynchronous, and the timing between the cache core cluster and the receiving core cluster is synchronous.

In one possible implementation manner, in the first working cycle, the transmitting, by the computing core cluster, of the computation result data to the cache core cluster includes: ending the first working cycle in a case where transmission of the computation result data is completed.

In one possible implementation manner, data is transmitted between different chips of the processor system in a serial manner, where, in the first working cycle, the transmitting, by the cache core cluster, of the first data of the computation result data to the receiving core cluster includes: a first functional core in the cache core cluster transmits first sub-data of the first data to the receiving core cluster; and a second functional core in the cache core cluster transmits second sub-data of the first data to a multicast core cluster of a third chip, so that the multicast core cluster transmits the second sub-data to the receiving core cluster, wherein the third chip is at least one chip of the processor system other than the first chip and the second chip.

In one possible implementation manner, in the second working cycle, the transmitting, by the cache core cluster, of the second data of the computation result data to the receiving core cluster includes: a first functional core in the cache core cluster transmits third sub-data of the second data to the receiving core cluster; and the second functional core in the cache core cluster transmits fourth sub-data of the second data to the multicast core cluster of the third chip, so that the multicast core cluster transmits the fourth sub-data to the receiving core cluster.

In one possible implementation, the method further includes: in the first working cycle, the multicast core cluster of the third chip transmits the second sub-data to the receiving core cluster of the second chip; in the second working cycle, the multicast core cluster of the third chip transmits the fourth sub-data to the receiving core cluster of the second chip.

In a possible implementation manner, the first working cycle includes N first sub-cycles, where N is an integer greater than 1, and in the first working cycle, the transmitting, by the cache core cluster, of the first data of the computation result data to the receiving core cluster includes: in the 1st first sub-cycle, the cache core cluster does not transmit data to the receiving core cluster; in the 2nd first sub-cycle to the Nth first sub-cycle, the cache core cluster transmits the first data to the receiving core cluster.

According to another aspect of the present disclosure, there is provided a processor system, the processor system including a plurality of chips, each chip including a plurality of functional cores, a data transmission speed between the functional cores located within the same chip being higher than a data transmission speed between the functional cores located in different chips, wherein the plurality of functional cores of each chip are divided into a plurality of core clusters according to task differences, in a case where data is transmitted from a first chip to a second chip, the first chip includes a computing core cluster for performing a computing task, and a cache core cluster for performing a data cache and transmission task, the second chip includes a receiving core cluster for performing a data reception task, the first chip and the second chip are any two chips of the processor system, wherein the first chip is configured to:

in a first working cycle, the computing core cluster transmits computation result data to the cache core cluster, and the cache core cluster transmits first data of the computation result data to the receiving core cluster; in a second working cycle, the computing core cluster executes a computing task, and the cache core cluster transmits second data of the computation result data to the receiving core cluster, wherein the second data comprises the data in the computation result data other than the first data.

In a possible implementation manner, in the first working cycle, the timing among the computing core cluster, the cache core cluster, and the receiving core cluster is synchronous; in the second working cycle, the timing between the computing core cluster and the cache core cluster and between the computing core cluster and the receiving core cluster is asynchronous, and the timing between the cache core cluster and the receiving core cluster is synchronous.

In one possible implementation manner, in the first working cycle, the transmitting, by the computing core cluster, of the computation result data to the cache core cluster includes: ending the first working cycle in a case where transmission of the computation result data is completed.

In one possible implementation manner, data is transmitted between different chips of the processor system in a serial manner, where, in the first working cycle, the transmitting, by the cache core cluster, of the first data of the computation result data to the receiving core cluster includes: a first functional core in the cache core cluster transmits first sub-data of the first data to the receiving core cluster; and a second functional core in the cache core cluster transmits second sub-data of the first data to a multicast core cluster of a third chip, so that the multicast core cluster transmits the second sub-data to the receiving core cluster, wherein the third chip is at least one chip of the processor system other than the first chip and the second chip.

In one possible implementation manner, in the second working cycle, the transmitting, by the cache core cluster, of the second data of the computation result data to the receiving core cluster includes: a first functional core in the cache core cluster transmits third sub-data of the second data to the receiving core cluster; and the second functional core in the cache core cluster transmits fourth sub-data of the second data to the multicast core cluster of the third chip, so that the multicast core cluster transmits the fourth sub-data to the receiving core cluster.

In one possible implementation, the processor system is further configured such that: in the first working cycle, the multicast core cluster of the third chip transmits the second sub-data to the receiving core cluster of the second chip; in the second working cycle, the multicast core cluster of the third chip transmits the fourth sub-data to the receiving core cluster of the second chip.

In a possible implementation manner, the first working cycle includes N first sub-cycles, where N is an integer greater than 1, and in the first working cycle, the transmitting, by the cache core cluster, of the first data of the computation result data to the receiving core cluster includes: in the 1st first sub-cycle, the cache core cluster does not transmit data to the receiving core cluster; in the 2nd first sub-cycle to the Nth first sub-cycle, the cache core cluster transmits the first data to the receiving core cluster.

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the above-described method.

According to another aspect of the present disclosure, there is provided an electronic device comprising the processor system described above.

According to the data transmission method of the embodiment of the present disclosure, the functional cores of each chip can be divided into a plurality of core clusters according to different tasks; in a first working cycle, the computation result data stored in the computing core cluster is transmitted to the cache core cluster, and the cache core cluster then transmits part of the computation result data to the receiving core cluster; then, in a second working cycle, the computing core cluster may perform the computing task while the cache core cluster transmits the remaining data of the computation result data to the receiving core cluster. The method avoids the situation in which the many-core chips in the processor system must execute the computing task of each core and the inter-core data transmission according to a globally synchronized timing, effectively reduces the idle running and waiting time of the functional cores of each chip in the processor system, and improves the computing efficiency and execution speed of the processor system.

Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.

FIG. 1 is a schematic diagram illustrating a data transfer execution mechanism based on a many-core architecture in the related art;

FIG. 2 shows a schematic diagram of a processor system, according to an embodiment of the present disclosure;

FIG. 3 shows a schematic diagram of a cluster of chip cores performing a computational task, according to an embodiment of the present disclosure;

fig. 4 shows a flow diagram of a data transmission method according to an embodiment of the present disclosure;

fig. 5 is a diagram illustrating a data transmission method in the related art;

fig. 6 shows a schematic diagram of a data transmission method according to an embodiment of the present disclosure;

fig. 7 is a diagram illustrating an inter-chip data transmission method in the related art;

FIG. 8 shows a schematic diagram of an inter-chip data transmission method according to an embodiment of the present disclosure;

FIG. 9 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure;

fig. 10 shows a block diagram of an electronic device according to an embodiment of the disclosure.

Detailed Description

Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.

In a many-core chip system, the execution mechanism for the computation tasks within the functional cores and the data transmission tasks between the functional cores influences the overall operating performance of the many-core chip. To address this, a Bulk Synchronous Parallel (BSP) mechanism is provided in the related art. In this mechanism, the basic operation of a many-core chip mainly includes three phases: (1) a local computation phase, in which each functional core only performs local computation on data stored in its local memory; (2) a global communication phase, in which each functional core operates on non-local data, including data transmission between functional cores; (3) a barrier synchronization phase, which waits for all communication to end.
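For illustration only, the three phases of a BSP super-step can be sketched as follows (a minimal Python sketch; the Core class and its method names are hypothetical placeholders, not the API of any particular chip):

```python
# Minimal sketch of one Bulk Synchronous Parallel (BSP) super-step.
# The Core class and its method names are illustrative placeholders only.
class Core:
    def local_compute(self):
        # phase (1): local computation on data held in the core's own memory
        pass

    def global_communicate(self):
        # phase (2): operate on non-local data, incl. inter-core data transfers
        pass


def bsp_superstep(cores):
    for core in cores:
        core.local_compute()
    for core in cores:
        core.global_communicate()
    # phase (3): barrier synchronization; in this serial sketch, returning from
    # the loops above plays the role of "all communication has ended"


bsp_superstep([Core() for _ in range(4)])
```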

Another execution mechanism is provided in the related art, in which the execution of many-core chips is divided into multiple synchronous execution cycles by a global synchronization signal. In each execution cycle, each functional core simultaneously executes a local computation task and an inter-core communication task, and each functional core transmits an obtained computation result while performing local computation.

Fig. 1 is a schematic diagram illustrating a data transfer execution mechanism based on a many-core architecture in the related art. As shown in FIG. 1, the functional cores of the many-core chip need to perform local computation and inter-core communication tasks according to a globally synchronized working cycle. In one example, FIG. 1 includes three timing clusters, each of which may perform inter-core data transfer (including data transmission and data reception) and perform computation tasks. For example, timing cluster 1 may include three parts: data reception A1, data transmission B1, and data computation C1; timing cluster 2 may include data reception A2, data transmission B2, and data computation C2; and timing cluster 3 may include data reception A3, data transmission B3, and data computation C3. The data transmission part and the data computation part of the same timing cluster may be completed in different working cycles.

As shown in fig. 1, timing cluster 1, timing cluster 2, and timing cluster 3 start the data transmission task synchronously and end it synchronously after the data transmission is completed. When the functional cores in timing cluster 1 have completed data reception A1 and data transmission B1, they need to idle or wait until data transmission B2 in timing cluster 2 and data reception A3 in timing cluster 3 are completed and synchronization is performed, and only then can they start data computation C1. Because the functional cores in timing cluster 1, timing cluster 2, and timing cluster 3 start and end data transmission synchronously, a timing cluster idles or waits before performing data computation, which wastes the limited resources of the many-core processor and reduces the utilization rate and execution efficiency of the many-core processor.

Therefore, to address the technical problems that a many-core chip needs to execute local computation and inter-chip communication tasks according to a globally synchronized working cycle, and that local computation and inter-chip communication are executed serially, which reduces execution efficiency, the present disclosure provides the following data transmission method.

FIG. 2 shows a schematic diagram of a processor system, according to an embodiment of the present disclosure. The data transmission method of the embodiment of the disclosure is applied to a processor system, and the processor system comprises a plurality of chips, wherein each chip comprises a plurality of functional cores.

In one possible implementation, as shown in fig. 2, each chip may include a plurality of functional cores, and data transmission may be performed between the functional cores inside each chip and between the functional cores of different chips; wherein each functional core may include a storage component for storing data for transmission with other functional cores.

In one possible implementation, as shown in FIG. 2, each functional core may include a processing component and a storage component. The processing component may comprise a dendrite unit, an axon unit, a soma (cell body) unit, and a routing unit; the storage component may include a plurality of storage units.

In a possible implementation manner, the plurality of chips included in the processor system may be brain-like computing chips, that is, neuromorphic circuits that integrate computation; by taking the processing mode of the brain as a reference and simulating the transmission and processing of information by neurons in the brain, processing efficiency is improved and power consumption is reduced. Each brain-like computing chip may include a plurality of functional cores, and the functional cores may independently process different tasks or process the same task in parallel, thereby improving processing efficiency. The functional cores may transmit inter-core information through the routing units in the functional cores.

For example, a storage component and a processing component may be provided within the functional core. The storage component may include a plurality of storage units, where the storage units may be static random-access memories (SRAMs), for example an SRAM with a read/write width of 16B and a capacity of 12KB. The present disclosure does not limit the capacity or bit width of the storage units.

The processing component may comprise a dendrite unit, an axon unit, a soma unit, and a routing unit. The processing component may simulate the way neurons of the brain process information: the dendrite unit is used for receiving signals, the axon unit is used for sending spike signals, the soma unit is used for integrating and transforming signals, and the routing unit is used for transmitting information to other functional cores. The processing component in a functional core can read from and write to the plurality of storage units of the storage component to exchange data with the storage component within the functional core, and can undertake its own data processing tasks and/or data transmission tasks to obtain data processing results, or communicate with other functional cores. Communication with other functional cores includes communication with other functional cores in the same chip and communication with functional cores in other chips.
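As a rough illustration of the structure described above (a minimal Python sketch under the stated example sizes; the class layout, the number of storage units, and the method names are assumptions for illustration, not the chip's actual data structures):

```python
# Illustrative model of a functional core: a storage component made of several
# SRAM storage units plus a processing component with dendrite/axon/soma/routing units.
class FunctionalCore:
    def __init__(self, n_units=4, unit_capacity_kb=12, rw_width_bytes=16):
        # storage component: a few storage units, e.g. 12KB each with a 16B
        # read/write width (sizes follow the example above and are not limited)
        self.storage = [bytearray(unit_capacity_kb * 1024) for _ in range(n_units)]
        self.rw_width_bytes = rw_width_bytes
        # processing component: units modeled here only by name
        self.units = ("dendrite", "axon", "soma", "routing")

    def route(self, data, destination):
        # the routing unit handles inter-core communication, both within the
        # same chip and with functional cores of other chips
        destination.receive(data)

    def receive(self, data):
        self.storage[0][:len(data)] = data


a, b = FunctionalCore(), FunctionalCore()
a.route(b"\x01\x02\x03", b)
```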

In one possible implementation, the data transfer speed between functional cores located within the same chip is higher than the data transfer speed between functional cores located in different chips.

For example, assume that a processor system includes a chip E containing functional cores E1 and E2, and a chip F containing functional cores F1 and F2. Data transmission between functional cores E1 and E2, and between functional cores F1 and F2, is transmission between functional cores of the same chip and is relatively fast. In contrast, data transmission between functional core E1 and functional core F1 (or F2), and between functional core E2 and functional core F1 (or F2), takes place across different chips and is relatively slow. For example, the data transmission speed between functional cores within the same chip may be 3 times the data transmission speed between functional cores of different chips.

It should be understood that the data transfer speed between functional cores within the same chip, and between functional cores of different chips, is related to the performance of the particular processor system hardware, and the present disclosure does not limit the data transfer speed between particular cores.

In one possible implementation manner, the plurality of functional cores of each chip are divided into a plurality of core clusters according to different tasks.

For example, the functional cores may be divided into different asynchronous core clusters according to the tasks to be executed or the execution times of those tasks, and each core cluster may include one or more functional cores. Functional cores belonging to the same asynchronous core cluster have synchronized local processing and inter-core communication cycles. Each core cluster can cyclically and continuously execute tasks over a plurality of cycles, with each task corresponding to an execution cycle, and the durations (numbers of clock cycles) of different task cycles may or may not be equal. Different core clusters each execute according to their own internal synchronization signals and are asynchronous with each other in execution cycle.

Fig. 3 is a schematic diagram illustrating core clusters of a chip executing computation tasks according to an embodiment of the present disclosure. As shown in fig. 3, the chip acquires a total computation task including 4 computation tasks (computation task 1, computation task 2, computation task 3, and computation task 4), and splits the 16 cores in the chip into two core clusters, core cluster 1 and core cluster 2, according to the total computation task to be executed. Core cluster 1 includes 9 cores and is used for executing computation task 1 and computation task 2: in time, core cluster 1 executes computation task 1 in its first working cycle and computation task 2 in its second working cycle. Core cluster 2 includes 7 cores and is used for executing computation task 3 and computation task 4: in time, core cluster 2 executes computation task 3 in its first working cycle and computation task 4 in its second working cycle. The two core clusters execute asynchronously.
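The division of fig. 3 can be illustrated with a minimal sketch (the dict-based representation and the printed task names are assumptions for illustration only):

```python
# Sketch of splitting the 16 cores of a chip into two asynchronous core
# clusters, each executing its own sequence of tasks cycle by cycle.
core_clusters = {
    "cluster_1": {"cores": 9, "tasks": ["computation task 1", "computation task 2"]},
    "cluster_2": {"cores": 7, "tasks": ["computation task 3", "computation task 4"]},
}

for name, cluster in core_clusters.items():
    # each cluster runs its tasks according to its own internal synchronization
    # signal; the two clusters are asynchronous with respect to each other
    for cycle, task in enumerate(cluster["tasks"], start=1):
        print(f"{name}: working cycle {cycle} -> {task} on {cluster['cores']} cores")
```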

It should be understood that the tasks in the above process may be computing tasks, and may also be transmission tasks, cache tasks, and other tasks, and the present disclosure does not limit the specific types of tasks.

For any two chips of the processor system (which may be referred to as a first chip and a second chip), in the case of transmitting data from the first chip to the second chip, the first chip includes a computing core cluster for executing a computing task and a cache core cluster for executing data caching and transmission tasks, and the second chip includes a receiving core cluster for executing a data receiving task.

Fig. 4 shows a flow diagram of a data transmission method according to an embodiment of the present disclosure. As shown in fig. 4, the method includes:

In step S11, in a first working cycle, the computing core cluster transmits computation result data to the cache core cluster, and the cache core cluster transmits first data of the computation result data to the receiving core cluster;

In step S12, in a second working cycle, the computing core cluster executes a computing task, and the cache core cluster transmits second data of the computation result data to the receiving core cluster, where the second data includes the data of the computation result data other than the first data.

For example, the first chip may cluster the plurality of functional cores it includes according to the computing task and transmission task it receives; for example, 9 functional cores may be used as a computing core cluster for processing the computing task, and 3 functional cores may be used as a cache core cluster for processing the transmission task. The second chip may cluster the plurality of functional cores it includes according to the data receiving task it receives; for example, 3 functional cores may be used as a receiving core cluster for processing the data receiving task.

The first chip and the second chip may be any two different chips in the processor system, and the specific number of the functional cores included in each divided core cluster may be determined according to the specific task and the performance of the functional cores, which is not limited in this disclosure.

In step S11, in the first working cycle, the computing core cluster in the first chip does not perform the computing task, but transmits the computation result data obtained before the first working cycle to the second chip. Specifically, the computation result data stored in the computing core cluster of the first chip may be transmitted to the cache core cluster of the first chip, and the cache core cluster of the first chip transmits first data of the computation result data to the receiving core cluster of the second chip.

according to the limitation of the hardware structure and the working mode of the processor system, the data transmission speed between the functional cores in the chip is higher than that between the functional cores in different chips. Therefore, the data transmission speed between the computing core cluster of the first chip and the cache core cluster of the first chip is higher than the data transmission speed between the cache core cluster of the first chip and the receiving core cluster of the second chip.

When all the computation result data stored in the computing core cluster of the first chip has been transmitted to the cache core cluster of the first chip, the first working cycle ends. In this case, since the intra-chip transmission speed is greater than the inter-chip transmission speed, the cache core cluster of the first chip cannot have transmitted all the computation result data to the receiving core cluster of the second chip; it can only have transmitted part of the computation result data, namely the first data, to the receiving core cluster of the second chip.

In step S12, in the second working cycle, since the computing core cluster of the first chip has already transmitted all the computation result data to the cache core cluster of the first chip in the first working cycle, the computing core cluster can end the transmission task and execute the computing task, so as to obtain the computation result data to be transmitted in the next cycle.

The cache core cluster of the first chip transmitted the first data to the receiving core cluster of the second chip in the first working cycle, but part of the computation result data that has not yet been sent to the receiving core cluster of the second chip, namely the second data, remains stored in the cache core cluster. The cache core cluster of the first chip may therefore continue the transmission task and transmit the second data, i.e., the computation result data other than the first data, to the receiving core cluster of the second chip.

After the computing core cluster of the first chip has executed the computing task and obtained the next round of computation result data to be sent, and the cache core cluster of the first chip has transmitted the second data to the receiving core cluster of the second chip, the second working cycle ends and the next round of the cycle begins.

In this way, different core clusters complete different tasks, while the same task is completed within one core cluster: in the first working cycle, the computation result data stored in the computing core cluster can be transmitted to the cache core cluster, and the cache core cluster transmits part of the computation result data to the receiving core cluster; then, in the second working cycle, the computing core cluster may perform the computing task while the cache core cluster transmits the remaining data of the computation result data to the receiving core cluster. The method can effectively reduce the idle running and waiting time of the functional cores of each chip in the processor system, and improve the computing efficiency and execution speed of the processor system.
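The interaction of steps S11 and S12 can be summarized with a small sketch (illustrative Python only; the bandwidth figures reuse the example given later for fig. 6, and the function and parameter names are assumptions, not part of the disclosure):

```python
# Sketch of the two-working-cycle pipeline: in the first cycle the computing
# core cluster flushes its result data to the cache core cluster (fast,
# intra-chip) while the cache core cluster forwards a first portion to the
# receiving core cluster (slow, inter-chip); in the second cycle the computing
# core cluster computes again while the cache core cluster drains the rest.
def run_two_cycles(result_kb, intra_kb_per_step, inter_kb_per_step, steps_per_cycle):
    cached = 0   # KB held by the cache core cluster so far
    sent = 0     # KB already forwarded to the receiving core cluster
    # first working cycle
    for step in range(steps_per_cycle):
        cached += min(intra_kb_per_step, result_kb - cached)   # compute -> cache (intra-chip)
        if step > 0:  # the cache core cluster idles in the 1st sub-cycle
            sent += min(inter_kb_per_step, cached - sent)      # cache -> receive (inter-chip)
    first_data = sent
    # second working cycle: the computing core cluster is free to compute,
    # while the cache core cluster keeps draining the remaining (second) data
    while sent < result_kb:
        sent += min(inter_kb_per_step, result_kb - sent)
    return first_data, result_kb - first_data


print(run_two_cycles(result_kb=28, intra_kb_per_step=12,
                     inter_kb_per_step=4, steps_per_cycle=3))
```

With the example figures of fig. 6 (28KB of results, 12KB per sub-cycle on-chip, 4KB per sub-cycle between chips, 3 first sub-cycles), this sketch yields 8KB of first data and 20KB of second data, matching the description below.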

A data transmission method according to an embodiment of the present disclosure is explained below.

Fig. 5 is a diagram illustrating a data transmission method in the related art. As shown in fig. 5, it is assumed that the computing core cluster of the first chip includes functional cores (2,7), (2,8), (3,7), and (3,8), and that the receiving core cluster of the second chip includes functional core (0,11). The above four functional cores of the computing core cluster of the first chip need to send 28KB of computation result data to functional core (0,11) of the receiving core cluster of the second chip.

It should be understood that the computing core cluster and the receiving core cluster may each include a plurality of functional cores; the present disclosure merely takes functional cores (2,7), (2,8), (3,7), and (3,8) in the computing core cluster and functional core (0,11) in the receiving core cluster as examples, and the computing core cluster and the receiving core cluster may further include other functional cores.

As shown in fig. 5, the four functional cores (2,7), (2,8), (3,7), and (3,8) of the computing core cluster of the first chip transmit the computation result data to functional core (0,11) of the receiving core cluster of the second chip at a transmission rate of 4KB per working sub-cycle, over 7 working sub-cycles (P1 to P7).

Then, in working sub-cycles P8 to P10, the four functional cores of the computing core cluster of the first chip execute the computing task and do not perform the data transmission task, while functional core (0,11) of the receiving core cluster of the second chip idles and waits.

Therefore, when the computing core cluster of the first chip executes the computing task, the receiving core cluster of the second chip needs to idle and wait, and it can start receiving only when the computing core cluster of the first chip finishes the computing task. In addition, while the receiving core cluster of the second chip is receiving the computation result data sent by the computing core cluster of the first chip, the computing core cluster cannot perform other tasks. Therefore, as shown in fig. 5, it is necessary to wait for the global synchronization signal to perform cycle switching and data transmission, which causes idle running and waiting of some functional cores and greatly reduces the utilization rate and execution efficiency of the chip.
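As a back-of-the-envelope check of the figures above (illustrative Python only; the variable names are assumptions):

```python
# Related-art scheme of fig. 5: the computing core cluster sends directly to
# the off-chip receiving core cluster at the inter-chip rate, then computes
# while the receiver idles.
result_kb = 28
inter_kb_per_subcycle = 4
transmit_subcycles = result_kb // inter_kb_per_subcycle   # 28 / 4 = 7 (P1..P7)
compute_subcycles = 3                                      # P8..P10, receiver idles
print(transmit_subcycles, transmit_subcycles + compute_subcycles)  # 7, 10
```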

In view of the low efficiency of the data transmission method illustrated in fig. 5, fig. 6 illustrates a schematic diagram of a data transmission method according to an embodiment of the present disclosure. As shown in fig. 6, it is assumed that the computing core cluster of the first chip includes functional cores (2,7), (2,8), (3,7), and (3,8); the cache core cluster of the first chip includes functional core (2,9); and the receiving core cluster of the second chip includes functional core (0,11). The above four functional cores of the computing core cluster of the first chip need to send 28KB of computation result data to functional core (0,11) of the receiving core cluster of the second chip.

In step S11, as shown in fig. 6, in the first working cycle (T1 to T3), the four functional cores (2,7), (2,8), (3,7), and (3,8) of the computing core cluster of the first chip transmit the 28KB of computation result data (i.e., 12KB + 12KB + 4KB) to functional core (2,9) of the on-chip cache core cluster, and functional core (2,9) of the cache core cluster of the first chip transmits the first data of the computation result data, 8KB (4KB + 4KB), to functional core (0,11) of the receiving core cluster of the second chip.

it should be understood that the computing core cluster, the cache core cluster and the receiving core cluster may include a plurality of functional cores, and the present disclosure only takes the functional cores 2,7, 2,8, 3,7, 3,8 included in the computing core cluster, 2,9 included in the cache core cluster, and 0,11 included in the receiving core cluster as examples, and the computing core cluster, the cache core cluster and the receiving core cluster may further include other functional cores.

In a possible implementation manner, the first working cycle includes N first sub-cycles, where N is an integer greater than 1, and in the first working cycle, the transmitting, by the cache core cluster, of the first data of the computation result data to the receiving core cluster includes:

in the 1st first sub-cycle, the cache core cluster does not transmit data to the receiving core cluster;

in the 2nd first sub-cycle to the Nth first sub-cycle, the cache core cluster transmits the first data to the receiving core cluster.

For example, as shown in FIG. 6, the first working cycle includes 3 first sub-cycles, T1 to T3. The present disclosure does not limit the number of first sub-cycles included in the first working cycle; the number of first sub-cycles may be determined according to the specific task to be processed.

In the 1st first sub-cycle T1, functional core (2,9) of the cache core cluster of the first chip runs idle and does not transmit data to functional core (0,11) of the receiving core cluster of the second chip.

In the 2nd first sub-cycle T2, functional core (2,9) of the cache core cluster of the first chip transmits 4KB of data to functional core (0,11) of the receiving core cluster of the second chip.

In the 3rd first sub-cycle T3, functional core (2,9) of the cache core cluster of the first chip transmits 4KB of data to functional core (0,11) of the receiving core cluster of the second chip.

Thus, in the 2nd first sub-cycle and the 3rd first sub-cycle, functional core (2,9) of the cache core cluster of the first chip transmits the 8KB of first data to functional core (0,11) of the receiving core cluster of the second chip.

In this way, the cache core cluster of the first chip and the receiving core cluster of the second chip run idle only in the 1st first sub-cycle of the first working cycle; and because the intra-chip data transmission rate is greater than the inter-chip data transmission rate, the computing core cluster of the first chip can enter the second working cycle as soon as possible to execute the computing task, while the data transmission task continues asynchronously between the cache core cluster and the receiving core cluster. This effectively reduces the idle running and waiting time of the functional cores of each chip in the processor system, and improves the computing efficiency and execution speed of the processor system.

In one possible implementation manner, in the first working cycle, the transmitting, by the computing core cluster, of the computation result data to the cache core cluster includes:

ending the first working cycle in a case where transmission of the computation result data is completed.

For example, as shown in fig. 6, in the 3rd first sub-cycle T3 of the first working cycle, the four functional cores (2,7), (2,8), (3,7), and (3,8) of the computing core cluster of the first chip transmit the last 4KB of the 28KB of computation result data to functional core (2,9) of the on-chip cache core cluster, and the first working cycle ends.

In this way, the part of the computation result data that has not yet been transmitted to the second chip (namely, the second data) is stored in the cache core cluster of the first chip. Because the computing core cluster has transmitted all of the computation result data, the computing core cluster of the first chip can enter the second working cycle as soon as possible to execute the computing task, and in the second working cycle the data transmission task can proceed asynchronously between the cache core cluster and the receiving core cluster. This effectively reduces the idle running and waiting time of the functional cores of each chip in the processor system, and improves the computing efficiency and execution speed of the processor system.

In a possible implementation manner, in the first working cycle, timing synchronization is performed among the computing core cluster, the cache core cluster, and the receiving core cluster;

for example, as shown in fig. 6, the timing between the four functional cores (2,7, 2,8, 3,7, 3,8) of the first chip's compute core cluster, the functional cores 2,9 of the first chip's cache core cluster, and the functional cores 0,11 of the second chip's receive core cluster are synchronized.

In the 1st first sub-cycle T1, the four functional cores (2,7), (2,8), (3,7), and (3,8) of the computing core cluster of the first chip send a total of 12KB of computation result data to functional core (2,9) of the on-chip cache core cluster; at the same time, functional core (2,9) of the cache core cluster runs idle and does not transmit data to functional core (0,11) of the receiving core cluster of the second chip.

In the 2nd first sub-cycle T2, the four functional cores (2,7), (2,8), (3,7), and (3,8) of the computing core cluster of the first chip continue to send 12KB of computation result data to functional core (2,9) of the on-chip cache core cluster; at the same time, functional core (2,9) of the cache core cluster of the first chip transmits 4KB of computation result data to functional core (0,11) of the receiving core cluster of the second chip.

This 4KB of computation result data is part of the 12KB received by functional core (2,9) of the cache core cluster in the 1st first sub-cycle T1. Together with the 12KB of computation result data received in the 2nd first sub-cycle T2, 20KB of computation result data is cached in functional core (2,9) of the cache core cluster at the end of T2.

In the 3rd first sub-cycle T3, the four functional cores (2,7), (2,8), (3,7), and (3,8) of the computing core cluster of the first chip send the last 4KB of computation result data to functional core (2,9) of the on-chip cache core cluster; at the same time, functional core (2,9) of the cache core cluster of the first chip transmits 4KB of computation result data to functional core (0,11) of the receiving core cluster of the second chip.

At the end of the 3rd first sub-cycle T3, functional core (2,9) of the cache core cluster still has 20KB of computation result data cached in it.

In this way, the computing core cluster, the cache core cluster, and the receiving core cluster can be synchronized in timing; because the intra-chip transmission speed between the computing core cluster and the cache core cluster is far higher than the inter-chip transmission speed between the cache core cluster and the receiving core cluster, the intra-chip data transmission between the computing core cluster and the cache core cluster is completed earlier than the inter-chip transmission between the cache core cluster and the receiving core cluster. This helps the computing core cluster enter the computing state in advance, improving the efficiency of the processor system.

In step S12, as shown in fig. 6, in the second working cycle (i.e., T4 to T7), the four functional cores (2,7), (2,8), (3,7), and (3,8) of the computing core cluster of the first chip execute the computing task, and functional core (2,9) of the cache core cluster of the first chip transmits the remaining 20KB of computation result data (28KB - 8KB), i.e., the second data, to functional core (0,11) of the receiving core cluster of the second chip.

In a possible implementation manner, in the second working cycle, the timing between the computing core cluster and the cache core cluster and between the computing core cluster and the receiving core cluster is asynchronous, and the timing between the cache core cluster and the receiving core cluster is synchronous.

For example, as shown in fig. 6, after the 3rd first sub-cycle T3 ends, the intra-chip data transmission is complete, and in the second working cycle (i.e., second sub-cycles T4 to T7), the computing core cluster of the first chip enters the computing state while the cache core cluster and the receiving core cluster continue the inter-chip transmission task. That is, in the second working cycle, the cache core cluster of the first chip and the receiving core cluster of the second chip remain synchronous with each other, and both are asynchronous with the computing core cluster.

For example, as shown in fig. 6, the durations of the 1st second sub-cycle T4 and the 2nd second sub-cycle T5 may be set to be shorter, and in each of them the cache core cluster of the first chip may transmit 4KB of computation result data to the receiving core cluster of the second chip; the durations of the 3rd second sub-cycle T6 and the 4th second sub-cycle T7 may be set to be longer, and in each of them the cache core cluster of the first chip may transmit 6KB of computation result data to the receiving core cluster of the second chip. Meanwhile, the timing of the computing core cluster is asynchronous with that of the cache core cluster (or the receiving core cluster), and the working timing of the computing core cluster can be set according to the needs of the computing task.

It should be understood that the cache core cluster and the receiving core cluster on different chips may be set to a synchronous timing according to the needs of the transmission task, and the computing core cluster may be set to a timing asynchronous with the cache core cluster (or the receiving core cluster) according to the needs of the computing task itself; the present disclosure does not limit the specific timing settings of each core cluster.

In this way, in the second working cycle, the computing core cluster can asynchronously execute the computing task while the cache core cluster and the receiving core cluster synchronously continue the transmission task, which effectively reduces the idle running and waiting time of the functional cores of each chip in the processor system and improves the computing efficiency and execution speed of the processor system.
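A quick consistency check of the per-sub-cycle figures described for fig. 6 (illustrative Python only; the per-sub-cycle amounts are taken from the description above):

```python
# Fig. 6 bookkeeping: intra-chip transfers in T1-T3, inter-chip transfers in
# T1-T7, and the amount buffered in cache core cluster functional core (2,9).
intra = {"T1": 12, "T2": 12, "T3": 4}            # compute cluster -> cache cluster (KB)
inter = {"T1": 0, "T2": 4, "T3": 4,              # cache cluster -> receiving cluster (KB)
         "T4": 4, "T5": 4, "T6": 6, "T7": 6}

buffered = 0
for t in ["T1", "T2", "T3", "T4", "T5", "T6", "T7"]:
    buffered += intra.get(t, 0) - inter[t]
    print(t, "buffered:", buffered, "KB")        # 20KB at the end of T2 and T3

# first data (sent in the first working cycle T1-T3) and second data (T4-T7)
first_data = inter["T1"] + inter["T2"] + inter["T3"]            # 8 KB
second_data = sum(inter[t] for t in ["T4", "T5", "T6", "T7"])   # 20 KB
assert first_data + second_data == sum(intra.values()) == 28
```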

By contrast, in the related-art data transmission method shown in fig. 5, the computing core cluster sends data directly to the off-chip receiving core cluster, so the computing core cluster always performs data transmission at the low inter-chip speed; moreover, while the computing core cluster performs the computing task, the receiving core cluster has to idle and wait, resulting in low efficiency.

According to the data transmission method of the embodiment of the present disclosure, by providing the cache core cluster, the computation result data of the computing core cluster can first be transmitted to the cache core cluster and then transmitted to the receiving core cluster through the cache core cluster. Because the intra-chip data transmission speed is high, the computing core cluster can enter the computing state in advance; and while the computing core cluster performs the computing task, the receiving core cluster does not need to idle and wait, but can continuously receive the computation result data from the cache core cluster, which improves the computing efficiency and execution speed of the processor system.

On the basis of the data transmission method shown in fig. 6, the present disclosure may further use a multicast transmission method to further improve inter-chip data transmission efficiency. The multicast transmission method according to the embodiment of the present disclosure is described below by comparing the related-art method with the method of the embodiment of the present disclosure.

Fig. 7 is a diagram illustrating an inter-chip data transmission method in the related art. As shown in FIG. 7, the computing core cluster in chip 1 may include 28 functional cores ((2,7) to (15,7) and (2,8) to (15,8)), and the receiving core cluster in chip 2 may include 7 functional cores ((15,0) to (15,6)).

The computing core cluster of chip 1 directly sends computation result data to the off-chip receiving core cluster of chip 2, where the computation result data of every 4 functional cores in chip 1 corresponds to 1 functional core in chip 2; for example, functional cores (2,7), (2,8), (3,7), and (3,8) may transmit their computation results to the same off-chip functional core (15,0).

Because the inter-chip transmission throughput is small and data transmission between two chips is serial in a single direction, the computing core cluster can only transmit through one path at a very low speed at all times; the transmission time is enormous, and the chip efficiency is reduced.

In view of the problem of low inter-chip transmission efficiency shown in fig. 7, fig. 8 shows a schematic diagram of an inter-chip data transmission method according to an embodiment of the present disclosure. As shown in fig. 8, the functional cores included in chips 1 to 4 are divided into a plurality of parts and may be clustered according to, for example, computing tasks, caching tasks, receiving tasks, and multicast tasks.

Different core clusters complete different tasks, and the same task is completed within one core cluster. Data transmission can be carried out between different core clusters, and the data transmission of a core cluster relates to its upstream and downstream core clusters; the data transmission task may include data receiving and data sending, and each core cluster may first carry out data transmission and then carry out data computation.

As shown in fig. 8, chip 1 (i.e., the first chip) may include a computing core cluster and a cache core cluster: the computing core cluster may include 28 functional cores ((2,7) to (15,7) and (2,8) to (15,8)), and the cache core cluster may include 7 functional cores ((2,9), (4,9), (6,9), (8,9), (10,9), (12,9), and (14,9)). The receiving core cluster in chip 2 (i.e., the second chip) may include 7 functional cores ((15,0) to (15,6)); the multicast core cluster in chip 3 (i.e., a third chip) includes 2 functional cores ((0,11) and (1,11)); and the multicast core cluster in chip 4 (i.e., another third chip) includes 3 functional cores ((0,11), (1,11), and (2,11)).

The 7 functional cores in the cache core cluster of chip 1 can send data directly to the core clusters of chips 2 to 4, and part of the data is multicast to the receiving core cluster of chip 2 through the multicast core clusters of chips 3 and 4. As shown in fig. 8, chip 2 can receive partial data from three sides (below, left, and right), and these three partial data transmissions proceed in parallel.

In one possible implementation manner, the data is transmitted between different chips of the processor system in a serial manner, where in a first working cycle, the cache core cluster transmits first data of the computation result data to the receiving core cluster, including:

a first functional core in the cache core cluster transmits first subdata of the first data to the receiving core cluster; and a second functional core in the cache core cluster transmits second subdata of the first data to a multicast core cluster of a third chip, so that the multicast core cluster transmits the second subdata to the receiving core cluster, wherein the third chip is at least one chip except the first chip and the second chip in the processor system.

For example, in the first working cycle, as shown in fig. 8, the first functional cores (6,9 and 8,9) in the cache core cluster transmit the first sub-data of the first data to the functional cores (15,2 and 15,3) of the receiving core cluster.

The first sub-data represents the part of the first data transmitted by the first functional cores (6,9 and 8,9) in the first working cycle, that is, a part of the computation result data that is sent by the functional cores (6,7, 6,8, 7,8, 9,7, 9,8) in the compute core cluster and received by the first functional cores (6,9 and 8,9).

At the same time, the second functional cores (2,9, 4,9, 10,9, 12,9, 14,9) in the cache core cluster transmit the second sub-data of the first data to the multicast core clusters of the third chips (the functional cores 0,11 and 1,11 of chip 3, and the functional cores 0,11, 1,11 and 2,11 of chip 4), so that the multicast core clusters can transmit the second sub-data to the receiving core cluster.

The second sub-data represents the data in the first data other than the first sub-data. The third chip may include a plurality of chips, and the number of third chips is not limited by the present disclosure. For example, the third chips in fig. 8 may include chip 3 and chip 4, and each third chip may include a multicast core cluster.

The multicast core cluster (0,11, 1,11) in the chip 3 may receive part of the calculation result data (i.e., the second sub-data corresponding to the multicast core cluster of the chip 3) sent by the second functional core (2,9, 4,9) of the cache core cluster, and may transmit the data to the functional core (15,0, 15,1) of the receiving core cluster of the chip 2;

the multicast core cluster (0,11, 1,11, 2,11) in the chip 4 may receive part of the computation result data (i.e., the second sub-data corresponding to the multicast core cluster of the chip 4) sent by the second functional core (10,9, 12,9, 14,9) of the cache core cluster, and may transmit the data to the functional core (15,4, 15,5, 15,6) of the receiving core cluster of the chip 2.
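
A small, hypothetical sketch of the first-working-cycle routing just described: the first functional cores forward their share of the first data directly to chip 2, while the second functional cores hand their share to the multicast core clusters of chips 3 and 4, which relay it to chip 2. The payload strings and the relay assignment are illustrative only.

```python
# Illustrative routing of the first data during the first working cycle (fig. 8).
# First cache cores send directly to chip 2; second cache cores send via the
# multicast core clusters of the third chips (chip 3 and chip 4).

first_cores = [(6, 9), (8, 9)]                               # send first sub-data directly
second_cores = [(2, 9), (4, 9), (10, 9), (12, 9), (14, 9)]   # send second sub-data via relays
relay_of = {(2, 9): "chip 3", (4, 9): "chip 3",
            (10, 9): "chip 4", (12, 9): "chip 4", (14, 9): "chip 4"}

def route(core):
    """Return the sequence of hops for one cache core's share of the first data."""
    payload = f"data cached by core {core}"
    if core in first_cores:
        return [("chip 1", "chip 2", payload)]                       # one hop, direct
    relay = relay_of[core]
    return [("chip 1", relay, payload), (relay, "chip 2", payload)]  # two hops, relayed

for core in first_cores + second_cores:
    for src, dst, payload in route(core):
        print(f"{src} -> {dst}: {payload}")
```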

In this case, compared with chip 1 in fig. 7 directly transmitting the calculation result data to chip 2, providing the multicast core clusters in fig. 8 allows chip 1, chip 3 and chip 4 to transmit the calculation result data to chip 2 in parallel in the first working cycle, which can increase the data transmission speed between the chips by a factor of about 3.

In this way, by providing the multicast core clusters, parallel transmission among the chips can be realized in the first working cycle, and the inter-chip data transmission speed is improved.
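
As a rough sanity check of the speed-up claim (a sketch under assumed equal link throughputs, ignoring the extra relay hop through chips 3 and 4): when the first data is split over three inter-chip links, the slowest path bounds the cycle, so the transfer time approaches one third of the single-link time as the split becomes balanced.

```python
# Back-of-the-envelope comparison of single-link vs. three-link transmission.
# Assumptions: equal link throughput on all three paths, relay hop cost ignored.

total_units = 28   # hypothetical amount of data to move in one cycle
link_rate = 1      # units per time step on each serial inter-chip link

single_link_time = total_units / link_rate

# split proportional to the receive cores served by each path (2 : 2 : 3 of 7)
split = [total_units * 2 / 7, total_units * 2 / 7, total_units * 3 / 7]
three_link_time = max(units / link_rate for units in split)

print(f"single link : {single_link_time:.1f} steps")
print(f"three links : {three_link_time:.1f} steps")
print(f"speed-up    : {single_link_time / three_link_time:.2f}x "
      f"(approaches 3x as the split becomes balanced)")
```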

In one possible implementation manner, in the second work cycle, the transmitting, by the cache core cluster, the second data of the computation result data to the receiving core cluster includes:

a first functional core in the cache core cluster transmits third subdata of the second data to the receiving core cluster;

and the second functional core in the cache core cluster transmits fourth sub-data of the second data to the multicast core cluster of the third chip, so that the multicast core cluster transmits the fourth sub-data to the receiving core cluster.

For example, in the second working cycle, as shown in fig. 8, the first functional cores (6,9 and 8,9) in the cache core cluster transmit the third sub-data of the second data to the functional cores (15,2 and 15,3) of the receiving core cluster.

The third sub-data represents the part of the second data transmitted by the first functional cores (6,9 and 8,9) in the second working cycle, that is, the data other than the first sub-data in the computation result data sent by the functional cores (6,7, 6,8, 7,8, 9,7, 9,8) in the compute core cluster.

At the same time, the second functional cores (2,9, 4,9, 10,9, 12,9, 14,9) in the cache core cluster transmit the fourth sub-data of the second data to the multicast core clusters of the third chips (the functional cores 0,11 and 1,11 of chip 3, and the functional cores 0,11, 1,11 and 2,11 of chip 4), so that the multicast core clusters can transmit the fourth sub-data to the receiving core cluster.

The fourth sub-data represents the part of the second data other than the third sub-data, and the third chip may include a plurality of chips. For example, the third chips in fig. 8 may include chip 3 and chip 4, and each third chip may include a multicast core cluster.

The multicast core cluster (0,11, 1,11) in the chip 3 may receive the remaining part of the calculation result data (i.e., the fourth sub-data corresponding to the multicast core cluster of the chip 3) sent by the second functional core (2,9, 4,9) of the cache core cluster, and may transmit the data to the functional core (15,0, 15,1) of the receiving core cluster of the chip 2;

the multicast core cluster (0,11, 1,11, 2,11) in the chip 4 may receive the remaining part of the computation result data (i.e., the fourth sub-data corresponding to the multicast core cluster of the chip 4) sent by the second functional core (10,9, 12,9, 14,9) of the cache core cluster, and may transmit the data to the functional core (15,4, 15,5, 15,6) of the receiving core cluster of the chip 2.

In this case, compared with chip 1 in fig. 7 directly transmitting the calculation result data to chip 2, providing the multicast core clusters in fig. 8 allows chip 1, chip 3 and chip 4 to transmit the calculation result data to chip 2 in parallel in the second working cycle, which can increase the data transmission speed between the chips by a factor of about 3.

In this way, by providing the multicast core clusters, parallel transmission among the chips can be realized in the second working cycle, and the inter-chip data transmission speed is improved.

In one possible implementation, the method further includes: in a first working period, the multicast core cluster of the third chip transmits the second subdata to the receiving core cluster of the second chip; in a second working period, the multicast core cluster of the third chip transmits the fourth sub-data to the receiving core cluster of the second chip.

For example, as shown in fig. 8, the third chip may include chips 3 and 4, and the number of the third chips is not limited by the present disclosure.

In the first working period, the multicast core cluster of each third chip can transmit the second sub-data to the receiving core cluster of the second chip in parallel.

For example, the multicast core cluster (0,11, 1,11) in the chip 3 may transmit the corresponding second sub-data (i.e., sent by the second functional core 2,9, 4,9) to the functional core (15,0, 15,1) of the receiving core cluster of the chip 2; the multicast core cluster (0,11, 1,11, 2,11) in the chip 4 may transmit the corresponding second sub-data (i.e., sent by the second functional core 10,9, 12,9, 14,9) to the functional core (15,4, 15,5, 15,6) of the receiving core cluster of the chip 2.

In the second working period, the multicast core cluster of each third chip can transmit the fourth sub-data to the receiving core cluster of the second chip in parallel.

For example, the multicast core cluster (0,11, 1,11) in the chip 3 may transmit the corresponding fourth sub-data (i.e., sent by the second functional core 2,9, 4,9) to the functional core (15,0, 15,1) of the receiving core cluster of the chip 2; the multicast core cluster (0,11, 1,11, 2,11) in the chip 4 may transmit the corresponding fourth sub-data (i.e., sent by the second functional core 10,9, 12,9, 14,9) to the functional core (15,4, 15,5, 15,6) of the receiving core cluster of the chip 2.

In this way, parallel transmission among the chips can be realized through the multicast core clusters, and the inter-chip data transmission speed is improved.

Therefore, according to the data transmission method of the embodiment of the present disclosure, by providing the cache core cluster within the chip, the local computation task of each functional core and the inter-core data interaction no longer need to be executed according to a globally synchronous time sequence: the data transmission between the compute core cluster and the cache core cluster can start synchronously and end asynchronously, and the computation task starts and executes asynchronously after the transmission task ends. This can greatly improve the execution efficiency of the chip and reduce the idle waiting time of the compute cores in the processor system. In addition, by providing the multicast core clusters on other chips, data can be transmitted among the chips in parallel, which further improves the inter-chip data transmission efficiency.
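
To make the overlap concrete, the following hypothetical sketch alternates the two working cycles described in this disclosure for a few batches: in each second working cycle the compute core cluster already works on the next batch while the cache core cluster drains the remaining second data of the previous batch. Batch indexing and message strings are illustrative only.

```python
# Hypothetical timeline of the two alternating working cycles: computation of
# the next batch overlaps with transmission of the remaining (second) data of
# the current batch, so the compute cores do not idle while data is in flight.

def timeline(num_batches):
    events = []
    for k in range(num_batches):
        events.append((f"first working cycle, batch {k}", [
            f"compute core cluster -> cache core cluster: computation result data of batch {k}",
            f"cache core cluster -> receiving core cluster: first data of batch {k}",
        ]))
        events.append((f"second working cycle, batch {k}", [
            f"compute core cluster: execute computation task of batch {k + 1}",
            f"cache core cluster -> receiving core cluster: second data of batch {k}",
        ]))
    return events

for title, actions in timeline(2):
    print(title)
    for action in actions:
        print(f"  - {action}")
```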

It is understood that the above-mentioned method embodiments of the present disclosure can be combined with each other to form combined embodiments without departing from the principle and logic, which, due to space limitations, are not described in detail in the present disclosure. Those skilled in the art will appreciate that, in the above methods of the specific embodiments, the specific order of execution of the steps should be determined by their functions and possible inherent logic.

In one possible implementation, as shown in fig. 2, a processor system is provided, the processor system comprising a plurality of chips, each chip comprising a plurality of functional cores, the data transfer speed between functional cores located within the same chip being higher than the data transfer speed between functional cores located in different chips, wherein the plurality of functional cores of each chip are divided into a plurality of core clusters according to different tasks, and in the case of transmitting data from a first chip to a second chip, the first chip includes a computing core cluster for executing a computing task, and a cache core cluster for performing a data caching and transmitting task, the second chip including a receive core cluster for performing a data receiving task, the first chip and the second chip are any two chips of the processor system, wherein the first chip is configured to:

in a first working period, the computing core cluster transmits computing result data to the cache core cluster, and the cache core cluster transmits first data of the computing result data to the receiving core cluster; in a second working cycle, the computing core cluster executes a computing task, and the caching core cluster transmits second data of the computing result data to the receiving core cluster, wherein the second data comprises data except the first data in the computing result data.

In a possible implementation manner, in the first working cycle, timing synchronization is performed among the computing core cluster, the cache core cluster, and the receiving core cluster; in the second working cycle, the time sequence between the computing core cluster and the cache core cluster and between the computing core cluster and the receiving core cluster are asynchronous, and the time sequence between the cache core cluster and the receiving core cluster is synchronous.

In one possible implementation manner, in a first working cycle, the transmitting, by the compute core cluster, the compute result data to the cache core cluster includes: and under the condition that the data transmission of the calculation result is completed, ending the first work cycle.

In one possible implementation manner, the data is transmitted between different chips of the processor system in a serial manner, where in a first working cycle, the cache core cluster transmits first data of the computation result data to the receiving core cluster, including: a first functional core in the cache core cluster transmits first subdata of the first data to the receiving core cluster; and a second functional core in the cache core cluster transmits second subdata of the first data to a multicast core cluster of a third chip, so that the multicast core cluster transmits the second subdata to the receiving core cluster, wherein the third chip is at least one chip except the first chip and the second chip in the processor system.

In one possible implementation manner, in the second work cycle, the transmitting, by the cache core cluster, the second data of the computation result data to the receiving core cluster includes: a first functional core in the cache core cluster transmits third subdata of the second data to the receiving core cluster; and the second functional core in the cache core cluster transmits fourth sub-data of the second data to the multicast core cluster of the third chip, so that the multicast core cluster transmits the fourth sub-data to the receiving core cluster.

In one possible implementation, the system further includes: in a first working period, the multicast core cluster of the third chip transmits the second subdata to the receiving core cluster of the second chip; in a second working period, the multicast core cluster of the third chip transmits the fourth sub-data to the receiving core cluster of the second chip.

In a possible implementation manner, the first working cycle includes N first sub-cycles, where N is an integer greater than 1, and in the first working cycle, the transmitting, by the cache core cluster, the first data of the computation result data to the receiving core cluster includes: in the 1st first sub-cycle, the cache core cluster does not transmit data to the receiving core cluster; in the 2nd first sub-cycle to the Nth first sub-cycle, the cache core cluster transmits the first data to the receiving core cluster.
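
A sketch of the N first sub-cycles, under the assumption (not stated in the disclosure beyond the rule for the 1st sub-cycle) that in sub-cycle i the cache core cluster forwards the slice of data it received from the compute core cluster during sub-cycle i-1.

```python
# Sketch of the first working cycle divided into N first sub-cycles.
# Assumption: the slice forwarded in sub-cycle i is the slice received from
# the compute core cluster in sub-cycle i - 1; the disclosure only states
# that nothing is sent to the receiving core cluster in the 1st sub-cycle.

def first_cycle_sub_cycles(n):
    for i in range(1, n + 1):
        received = f"slice {i} of the computation result data"
        forwarded = None if i == 1 else f"slice {i - 1} of the first data"
        yield i, received, forwarded

for i, received, forwarded in first_cycle_sub_cycles(4):
    sent = forwarded or "(no transmission to the receiving core cluster)"
    print(f"first sub-cycle {i}: receive {received}; forward {sent}")
```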

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the above-described method.

According to another aspect of the present disclosure, there is provided an electronic device comprising the processor system described above.

Fig. 9 illustrates a block diagram of an electronic device 1200 in accordance with an embodiment of the disclosure. As shown in fig. 9, the electronic device 1200 includes a computing processing means 1202 (e.g., the processor system described above including a plurality of artificial intelligence chips), an interface means 1204, other processing means 1206, and a storage means 1208. Depending on the application scenario, one or more computing devices 1210 may be included in a computing processing device (e.g., artificial intelligence chips, where each chip may include multiple functional cores).

In one possible implementation, the computing processing device of the present disclosure may be configured to perform operations specified by a user. In an exemplary application, the computing processing device may be implemented as a single chip artificial intelligence processor or a multi-chip artificial intelligence processor. Similarly, one or more computing devices included within the computing processing device may be implemented as an artificial intelligence chip or as part of a hardware structure of an artificial intelligence chip. When a plurality of computing devices are implemented as artificial intelligence chips or as part of the hardware structure of artificial intelligence chips, the computing processing device of the present disclosure may be considered as having a single chip structure or a homogeneous multi-chip structure.

In an exemplary operation, the computing processing device of the present disclosure may interact with other processing devices through the interface device to collectively perform operations specified by a user. Depending on the implementation, the other processing devices of the present disclosure may include one or more types of general-purpose and/or special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), an artificial intelligence processor, and the like. These processors may include, but are not limited to, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, etc., and their number may be determined according to actual needs. As previously mentioned, the computing processing device of the present disclosure, considered alone, may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing processing device and the other processing devices are considered together, they may be considered to form a heterogeneous multi-core structure.

In one or more embodiments, the other processing device may serve as an interface between the computing processing device of the present disclosure (which may be embodied as an artificial intelligence computing device, for example one related to neural network computation) and external data and control, and may perform basic control including, but not limited to, data handling and starting and/or stopping of the computing device. In further embodiments, the other processing device may also cooperate with the computing processing device to collectively perform computing tasks.

In one or more embodiments, the interface device may be used to transfer data and control instructions between the computing processing device and the other processing devices. For example, the computing processing device may obtain input data from the other processing devices via the interface device and write the input data into a storage device (or memory) on the computing processing device. Further, the computing processing device may obtain control instructions from the other processing devices via the interface device and write them into an on-chip control cache of the computing processing device. Alternatively or optionally, the interface device may also read data from the storage device of the computing processing device and transmit the data to the other processing devices.

Additionally or alternatively, the electronic device of the present disclosure may further comprise a storage means. As shown in the figure, the storage means is connected to the computing processing means and the further processing means, respectively. In one or more embodiments, the storage device may be used to hold data for the computing processing device and/or the other processing devices. For example, the data may be data that is not fully retained within internal or on-chip storage of a computing processing device or other processing device.

According to different application scenarios, the artificial intelligence chip disclosed by the disclosure can be used for a server, a cloud server, a server cluster, a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a PC device, a terminal of the internet of things, a mobile terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an automatic driving terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.

Fig. 10 shows a block diagram of an electronic device 1900 according to an embodiment of the present disclosure. For example, the electronic device 1900 may be provided as a server. Referring to fig. 10, the electronic device 1900 includes a processing component 1922 (e.g., a processor system including multiple functional cores), which further includes one or more functional cores, and memory resources, represented by a memory 1932, for storing instructions executable by the processing component 1922, such as application programs. The application programs stored in the memory 1932 may include one or more modules, each corresponding to a set of instructions. Further, the processing component 1922 is configured to execute the instructions to perform the above-described method.

The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed over multiple network units. In addition, according to actual needs, some or all of the units may be selected to achieve the purpose of the solutions of the embodiments of the present disclosure. In addition, in some scenarios, multiple units in the embodiments of the present disclosure may be integrated into one unit, or each unit may exist physically separately.

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments. The technical features of the embodiments may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The electronic device or processor of the present disclosure may also be applied to the fields of the internet, internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical, and the like. Further, the electronic device or the processor disclosed by the disclosure can also be used in application scenes such as a cloud end, an edge end and a terminal which are related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, a computationally powerful electronic device or processor according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power-consuming electronic device or processor may be applied to a terminal device and/or an edge-end device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.

Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
