Integrated circuit device, electronic equipment, board card and calculation method

Document No. 1921446, published 2021-12-03

Reading note: The invention "Integrated circuit device, electronic equipment, board card and calculation method" was created on 2021-09-10 by an inventor who elected not to be disclosed. Its main content is as follows: The present disclosure relates to an integrated circuit device, an electronic apparatus, a board card, and a method of performing calculations using the foregoing integrated circuit device. The integrated circuit device may be included in the computing device of a combined processing apparatus, and the computing device may include one or more integrated circuit devices. The aforementioned combined processing apparatus may further comprise an interface device and a processing device. The computing device interacts with the processing device to jointly complete a user-specified computing operation. The combined processing apparatus may further comprise a storage device, connected to the computing device and the processing device respectively, for storing data of the computing device and the processing device. The scheme of the present disclosure can reduce the volume of data transferred between internal devices and the external storage device, thereby minimizing the I/O bottleneck caused by bandwidth limitations and improving the overall performance of the integrated circuit device.

1. An integrated circuit device, comprising:

a data interface configured to transfer data between the integrated circuit device and an external memory;

a storage unit configured to:

obtaining first data from the external memory via the data interface and sending the first data to a plurality of computing units; and

storing a plurality of calculation results received from the plurality of computing units;

a plurality of computing units, wherein each computing unit is configured to:

receiving, from the storage unit, the first data obtained from the external memory; and

calculating according to the first data and pre-stored second data to obtain a calculation result, and sending the calculation result to the storage unit.

2. The integrated circuit device of claim 1, wherein each compute unit includes a master compute unit and a plurality of slave compute units, wherein:

the master computing unit is configured to:

receiving the first data from the storage unit;

dividing the first data into a plurality of pieces of first sub-data and sending them respectively to a corresponding plurality of slave computing units, wherein the number of pieces of first sub-data corresponds to the number of slave computing units contained in the computing unit;

receiving intermediate results from a corresponding plurality of slave computing units; and

calculating according to the intermediate results to obtain the calculation result and sending the calculation result to the storage unit; each slave computing unit is configured to:

calculating according to the first sub-data and pre-stored second sub-data to obtain the intermediate result, wherein the second sub-data is one of a plurality of pieces of sub-data formed by dividing the second data pre-stored in the computing unit, and the number of pieces of sub-data formed by dividing the second data corresponds to the number of slave computing units contained in the computing unit; and

sending the intermediate result to the master computing unit.

3. The integrated circuit device of claim 2, wherein each of the plurality of slave computing units comprises a master computing subunit and a plurality of slave computing subunits, wherein:

the master computing subunit is configured to:

receiving the first sub-data from the corresponding master computing unit;

dividing the first sub-data into a plurality of pieces of first atomic data and sending them respectively to a corresponding plurality of slave computing subunits, wherein the number of pieces of first atomic data corresponds to the number of slave computing subunits contained in the slave computing unit;

receiving intermediate sub-results from a corresponding plurality of slave computing sub-units; and

calculating according to the plurality of intermediate sub-results to obtain the intermediate result and sending the intermediate result to the corresponding master computing unit, wherein the corresponding master computing unit is the master computing unit that sent the first sub-data to the master computing subunit; each slave computing subunit is configured to:

calculating according to the first atomic data and pre-stored second atomic data to obtain the intermediate sub-result, wherein the second atomic data is one of a plurality of pieces of atomic data formed by dividing the second sub-data pre-stored in the slave computing unit, and the number of pieces of atomic data formed by dividing the second sub-data corresponds to the number of slave computing subunits contained in the slave computing unit; and

sending the intermediate sub-result to the master computing subunit.

4. The integrated circuit device according to claim 2, further comprising a control unit configured to control the exchange of information between the master computing unit and the plurality of slave computing units within each computing unit, as well as the data calculations performed by the master computing unit and/or the plurality of slave computing units.

5. The integrated circuit device according to claim 4, wherein the control unit is further configured to control the master computing unit in each computing unit to receive the first data from the storage unit and to send the calculation result to the storage unit.

6. The integrated circuit device according to any one of claims 1-5, wherein each of the plurality of computing units is located on a different output channel of the first data, such that the second data is not multiplexed among the plurality of computing units.

7. The integrated circuit device according to any one of claims 1-5, wherein the first data is neuron data of a neural network, the second data is weights of the neural network, and each of the plurality of computing units is further configured to perform a convolution calculation according to the neuron data and pre-stored weights to obtain the calculation result.

8. An electronic device comprising an integrated circuit device according to any one of claims 1-7.

9. A board card comprising an integrated circuit device according to any one of claims 1-7.

10. A method of performing a calculation using an integrated circuit device, wherein the integrated circuit device comprises a data interface for transferring data between the integrated circuit device and an external memory, a storage unit for obtaining first data from the external memory via the data interface and sending the first data to a plurality of computing units, and the plurality of computing units; the method comprising performing the following operations at each computing unit:

receiving, from the storage unit, the first data obtained from the external memory;

calculating according to the first data and second data stored in advance to obtain a calculation result; and

sending the calculation result to the storage unit so that the calculation result is stored.
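Taken together, claims 2 and 3 describe a split-compute-reduce pattern: a master unit partitions the first data among its slave units, each slave computes against its own pre-stored slice of the second data, and the master reduces the intermediate results. The sketch below models one level of that hierarchy; it is purely illustrative (all function and variable names are assumptions, and an elementwise multiply-accumulate stands in for whatever operation the hardware actually performs):

```python
def split(data, n):
    """Partition data into n contiguous chunks; the division count tracks n,
    mirroring 'the number of pieces corresponds to the number of slave units'."""
    k = len(data) // n
    return [data[i * k:(i + 1) * k] for i in range(n)]

def slave_compute(first_sub, second_sub):
    # Stand-in for the slave unit's operation on its pre-stored second sub-data.
    return sum(a * b for a, b in zip(first_sub, second_sub))

def master_compute(first_data, second_data, n_slaves):
    first_parts = split(first_data, n_slaves)    # master divides the first data
    second_parts = split(second_data, n_slaves)  # each slave's pre-stored share
    intermediates = [slave_compute(f, s)
                     for f, s in zip(first_parts, second_parts)]
    return sum(intermediates)                    # master reduces the intermediates

result = master_compute([1, 2, 3, 4], [4, 3, 2, 1], n_slaves=2)
# the split computation agrees with the full dot product done in one unit
assert result == 1*4 + 2*3 + 3*2 + 4*1
```

The same pattern recurses one level down in claim 3, with the slave unit playing master to its subunits.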

Technical Field

The present disclosure relates generally to the field of data processing. More particularly, the present disclosure relates to an integrated circuit device, an electronic apparatus, a board card, and a calculation method.

Background

With the development of artificial intelligence, the volume of data involved in large-scale neural network operations, such as the convolution weight gradients computed during back propagation, keeps growing, along with the corresponding demands on storage capacity. Conventionally, such operations are performed by a processor such as a central processing unit ("CPU") or a graphics processing unit ("GPU"). However, even with parallel computing, the processor is limited by the capacity of its internal register resources, so huge volumes of data computation cause a large amount of data interaction between the processor and external storage, reducing the device's computation and processing efficiency. Because the bandwidth of the input/output ("I/O") bus is limited, parallel operation efficiency drops greatly, presenting a serious I/O bottleneck. Moreover, not only can the I/O bus bandwidth become a performance bottleneck, but the large volume of I/O accesses between the processor and external storage also incurs significant computation and power overhead.

Disclosure of Invention

To solve at least the above technical problems, the present disclosure provides a solution that can reduce the amount of data transferred to and from an external storage device, minimizing the I/O bottleneck caused by bus bandwidth limitations. In particular, the present disclosure provides the aforementioned solution in the following aspects.

In a first aspect, the present disclosure provides an integrated circuit device comprising:

a data interface configured to transfer data between the integrated circuit device and an external memory;

a storage unit configured to:

obtaining first data from the external memory via the data interface and sending the first data to a plurality of computing units; and

storing a plurality of calculation results received from the plurality of computing units;

a plurality of computing units, wherein each computing unit is configured to:

receiving, from the storage unit, the first data obtained from the external memory; and calculating according to the first data and pre-stored second data to obtain a calculation result, and sending the calculation result to the storage unit.
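The flow of this first aspect, in which the storage unit fetches the first data once, each computing unit combines it with its own pre-stored second data, and the results return to the storage unit, can be modeled minimally as follows. This is a hypothetical sketch only: the class and variable names are not from the disclosure, and a dot product stands in for the real calculation.

```python
class StorageUnit:
    """Models the on-chip storage unit: buffers the first data and results."""
    def __init__(self, first_data):
        self.first_data = first_data   # obtained from external memory via the data interface
        self.results = []

    def store_result(self, result):
        self.results.append(result)

class ComputeUnit:
    """Each computing unit holds its own pre-stored second data."""
    def __init__(self, second_data):
        self.second_data = second_data

    def run(self, storage):
        first = storage.first_data     # receive first data from the storage unit
        result = sum(a * b for a, b in zip(first, self.second_data))
        storage.store_result(result)   # send the calculation result back

storage = StorageUnit([1.0, 2.0, 3.0])
units = [ComputeUnit([1, 0, 0]), ComputeUnit([0, 1, 0]), ComputeUnit([0, 0, 1])]
for u in units:
    u.run(storage)
print(storage.results)  # [1.0, 2.0, 3.0]
```

Note that the external memory is touched only once, when the storage unit is filled; everything afterwards stays on-chip, which is the point of the scheme.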

In a second aspect, the present disclosure provides an electronic device comprising the integrated circuit device according to the embodiments described above and below.

In a third aspect, the present disclosure provides a board card comprising an integrated circuit device according to the various embodiments described before and later on.

In a fourth aspect, the present disclosure provides a method of performing a calculation using an integrated circuit device, wherein the integrated circuit device comprises a data interface for transferring data between the integrated circuit device and an external memory, a storage unit for obtaining first data from the external memory via the data interface and sending the first data to a plurality of computing units, and the plurality of computing units; the method comprises performing the following operations at each computing unit:

receiving, from the storage unit, the first data obtained from the external memory;

calculating according to the first data and second data stored in advance to obtain a calculation result; and

sending the calculation result to the storage unit so that the calculation result is stored.

With the integrated circuit device, electronic device, board card, and calculation method of the present disclosure, both the received data (for example, the first data) and the computed results can be stored in the storage unit, so that when the data and results are needed for subsequent operations, only the storage unit, rather than the external memory, needs to be accessed. This reduces the amount of data the integrated circuit device exchanges with the external memory over the I/O bus, alleviating the I/O bus bandwidth bottleneck and improving the overall performance of the integrated circuit device, electronic device, or board card.

Drawings

The above and other objects, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar or corresponding parts:

fig. 1 is a block diagram illustrating a board card according to an embodiment of the present disclosure;

FIG. 2 is a block diagram illustrating an integrated circuit device according to an embodiment of the disclosure;

FIG. 3 is a schematic diagram illustrating an internal architecture of a single core computing device, according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram illustrating an internal architecture of a multi-core computing device according to an embodiment of the present disclosure;

FIG. 5 is a schematic diagram showing the internal structure of a processor core according to an embodiment of the present disclosure;

FIG. 6 is a schematic diagram illustrating a structure of an integrated circuit device according to an embodiment of the present disclosure;

FIG. 7 is a schematic diagram illustrating data transfer of an integrated circuit device according to an embodiment of the disclosure;

FIG. 8 is a schematic block diagram illustrating a computing unit in accordance with an embodiment of the present disclosure;

FIG. 9 is a schematic architectural diagram illustrating a slave computing unit in accordance with an embodiment of the present disclosure;

FIG. 10 is a schematic diagram illustrating another integrated circuit device in accordance with an embodiment of the present disclosure; and

FIG. 11 is a simplified flow diagram illustrating a method of performing a calculation using an integrated circuit device according to an embodiment of the present disclosure.

Detailed Description

The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments obtained by a person skilled in the art from the disclosed embodiments without creative effort shall fall within the protection scope of the present disclosure.

Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.

Fig. 1 is a schematic structural diagram illustrating a board card 10 according to an embodiment of the disclosure. It should be understood that the configuration and composition shown in FIG. 1 is merely an example, and is not intended to limit the aspects of the present disclosure in any way.

As shown in fig. 1, the board card 10 includes a chip 101, which may be a system-on-chip (SoC) as described in the context of the present disclosure. In one implementation scenario, the chip may integrate one or more combined processing devices. A combined processing device may be an artificial intelligence computing unit that supports various deep learning and machine learning algorithms and meets the intelligent processing demands of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining; deep learning technology in particular is widely applied in the field of cloud intelligence. A notable characteristic of cloud intelligence applications is the large volume of input data, which places high demands on the storage and computing capabilities of the platform. The board card 10 of this embodiment is suited to cloud intelligence applications, offering huge off-chip storage, substantial on-chip storage, and strong computing power.

As further shown in the figure, the chip 101 is connected to an external device 103 through an external interface device 102. Depending on the application scenario, the external device 103 may be, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface. Data to be processed may be transferred from the external device 103 to the chip 101 through the external interface device 102, and the calculation results of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may take different forms, such as a PCIe interface, depending on the application scenario.

The board card 10 may also include a storage device 104 for storing data, which includes one or more storage units 105. The storage device 104 is connected to, and exchanges data with, the control device 106 and the chip 101 through a bus. The control device 106 on the board card 10 may be configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a microcontroller unit (MCU).

Fig. 2 is a structural diagram showing a combined processing device in the chip 101 of the above embodiment. As shown in fig. 2, the combined processing device 20 may include a computing device 201, an interface device 202, a processing device 203, and a dynamic random access memory (DRAM) 204.

The computing device 201 may be configured to perform user-specified operations and is mainly implemented as a single-core or multi-core intelligent processor. It may be used to perform deep learning or machine learning calculations, and it may also interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations. In aspects of the present disclosure, the computing device may be configured to perform various tasks of the optimized neural network model, such as the operations described later in this disclosure.

The interface device 202 may be used to transfer data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, and write to a storage device on the computing device 201. Further, the computing device 201 may obtain the control instruction from the processing device 203 via the interface device 202, and write the control instruction into a control cache on the computing device 201. Alternatively or optionally, the interface device 202 may also read data from a storage device of the computing device 201 and transmit the data to the processing device 203.

The processing device 203, as a general-purpose processing device, performs basic control including, but not limited to, data transfer and starting and/or stopping the computing device 201. Depending on the implementation, the processing device 203 may be one or more of a central processing unit (CPU), a graphics processing unit (GPU), or another general-purpose and/or special-purpose processor, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and their number may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure, considered on its own, may be viewed as having a single-core structure or a homogeneous multi-core structure. However, considered together, the computing device 201 and the processing device 203 form a heterogeneous multi-core structure. According to aspects of the present disclosure, when implemented as a general-purpose processor, the processing device 203 may perform a compilation operation for optimizing the neural network model, compiling the model into a sequence of binary instructions executable by the computing device.

The DRAM 204 is used to store data to be processed. It is a DDR memory, typically 16 GB or larger in capacity, and stores data for the computing device 201 and/or the processing device 203.

Fig. 3 shows an internal structure diagram of the computing apparatus 201 as a single core. The single-core computing device 301 is used for processing input data such as computer vision, voice, natural language, data mining, and the like, and the single-core computing device 301 includes three modules: a control module 31, an operation module 32 and a storage module 33.

The control module 31 is used for coordinating and controlling the operations of the operation module 32 and the storage module 33 to complete the task of deep learning, and includes an Instruction Fetch Unit (IFU) 311 and an Instruction Decode Unit (IDU) 312. The instruction fetch unit 311 is used for obtaining an instruction from the processing device 203, and the instruction decode unit 312 decodes the obtained instruction and sends the decoded result to the operation module 32 and the storage module 33 as control information.

The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 performs vector operations and can support complex operations such as vector multiplication, addition, and nonlinear transformation; the matrix operation unit 322 is responsible for the core computations of deep learning algorithms, i.e., matrix multiplication and convolution. The storage module 33 is used to store or transport related data, and includes a neuron storage unit (Neuron RAM, NRAM) 331, a parameter storage unit (Weight RAM, WRAM) 332, and a direct memory access module (DMA) 333. The NRAM 331 stores input neurons, output neurons, and intermediate results after computation; the WRAM 332 stores the convolution kernels, i.e., the weights, of the deep learning network; and the DMA 333 is connected to the DRAM 204 via the bus 34 and is responsible for data transfer between the single-core computing device 301 and the DRAM 204.
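The fetch-decode-dispatch loop just described can be mimicked in a few lines; the following is purely an illustrative model (the opcode names, instruction layout, and data structures are invented, not the device's actual instruction set):

```python
def vector_add(a, b):                    # stands in for the vector operation unit (321)
    return [x + y for x, y in zip(a, b)]

def matmul(A, B):                        # stands in for the matrix operation unit (322)
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

OPS = {"VADD": vector_add, "MATMUL": matmul}

def run(program, nram):
    """Fetch (IFU) -> decode (IDU) -> dispatch control info to the operation module."""
    for instr in program:                          # fetch an instruction
        opcode, src0, src1, dst = instr            # decode it
        nram[dst] = OPS[opcode](nram[src0], nram[src1])  # execute; result back to NRAM

nram = {"n0": [1, 2], "n1": [3, 4]}                # NRAM modeled as a dict of buffers
run([("VADD", "n0", "n1", "n2")], nram)
print(nram["n2"])  # [4, 6]
```

In the real device the decoded control information also drives the storage module (DMA transfers), which this toy loop omits.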

Fig. 4 shows a schematic diagram of the internal structure of the computing apparatus 201 with multiple cores. The multi-core computing device 41 is designed in a hierarchical structure, with the multi-core computing device 41 being a system on a chip that includes at least one cluster (cluster) according to the present disclosure, each cluster in turn including a plurality of processor cores. In other words, the multi-core computing device 41 is constructed in a system-on-chip-cluster-processor core hierarchy. In a system-on-chip hierarchy, as shown in FIG. 4, the multi-core computing device 41 includes an external storage controller 401, a peripheral communication module 402, an on-chip interconnect module 403, a synchronization module 404, and a plurality of clusters 405.

There may be multiple external memory controllers 401 (two are shown in the figure as an example), which respond to access requests issued by the processor cores and access the external storage device, i.e., the off-chip memory in the context of this disclosure (e.g., the DRAM 204 in fig. 2), to read data from or write data to off-chip memory. The peripheral communication module 402 receives control signals from the processing device 203 through the interface device 202 and starts the computing device 201 to execute tasks. The on-chip interconnect module 403 connects the external memory controllers 401, the peripheral communication module 402, and the plurality of clusters 405, and transmits data and control signals between the modules. The synchronization module 404 is a global barrier controller (GBC) that coordinates the operation progress of each cluster and ensures synchronization of information. The plurality of clusters 405 are the computing cores of the multi-core computing device 41. Although 4 clusters are shown in fig. 4 by way of example, as hardware evolves the multi-core computing device 41 of the present disclosure may include 8, 16, 64, or even more clusters 405. In one application scenario, the clusters 405 may be used to efficiently execute deep learning algorithms.

Looking at the cluster hierarchy, as shown in fig. 4, each cluster 405 may include a plurality of processor cores (IPU core)406 and a memory core (MEM core)407, which may include, for example, a cache memory (e.g., LLC) as described in the context of the present disclosure.

Four processor cores 406 are shown in the figure by way of example; the present disclosure does not limit their number. The internal architecture of a processor core is shown in fig. 5. Each processor core 406 is similar to the single-core computing device 301 of fig. 3 and likewise may include three modules: a control module 51 (including an instruction fetch unit 511 and an instruction decode unit 512), an operation module 52 (including a vector operation unit 521 and a matrix operation unit 522), and a storage module 53 (including NRAM 531, WRAM 532, IODMA 533, and MVDMA 534). The functions and structures of the control module 51, operation module 52, and storage module 53 are substantially the same as those of the control module 31, operation module 32, and storage module 33, and are not described again here. It should be particularly noted that the storage module 53 may include an input/output direct memory access (IODMA) module 533 and a move direct memory access (MVDMA) module 534. The IODMA 533 controls access between the NRAM 531/WRAM 532 and the DRAM 204 through the broadcast bus 409; the MVDMA 534 controls access between the NRAM 531/WRAM 532 and the storage unit (SRAM) 408.

Returning to FIG. 4, the storage core 407 is primarily used for storage and communication, i.e., storing data shared among the processor cores 406 and intermediate results, and carrying out communication between the cluster 405 and the DRAM 204, between clusters 405, and between processor cores 406. In other embodiments, the storage core 407 may also have scalar operation capability for performing scalar operations.

The Memory core 407 may include a Static Random-Access Memory (SRAM)408, a broadcast bus 409, a Cluster Direct Memory Access (CDMA) 410, and a Global Direct Memory Access (GDMA) 411. In one implementation scenario, SRAM 408 may assume the role of a high performance data transfer station. Thus, data multiplexed between different processor cores 406 within the same cluster 405 need not be individually obtained by the processor cores 406 to the DRAM 204, but rather is relayed between the processor cores 406 via the SRAM 408. Further, the memory core 407 only needs to quickly distribute multiplexed data from the SRAM 408 to the plurality of processor cores 406, so that it is possible to improve inter-core communication efficiency and significantly reduce off-chip input/output access.

Broadcast bus 409, CDMA 410, and GDMA 411 are used to perform communication among processor cores 406, communication among cluster 405, and data transfer between cluster 405 and DRAM 204, respectively. As will be described separately below.

The broadcast bus 409 is used to complete high-speed communication among the processor cores 406 in the cluster 405, and the broadcast bus 409 of this embodiment supports inter-core communication modes including unicast, multicast and broadcast. Unicast refers to point-to-point (e.g., from a single processor core to a single processor core) data transfer, multicast is a communication that transfers a copy of data from SRAM 408 to a particular number of processor cores 406, and broadcast is a communication that transfers a copy of data from SRAM 408 to all processor cores 406, which is a special case of multicast.
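The three inter-core transfer modes on the broadcast bus can be modeled as one operation parameterized by its target set, with broadcast simply being multicast to all cores. A hypothetical sketch (names and data structures are assumptions, not hardware interfaces):

```python
def transfer(sram_data, cores, targets):
    """Copy data from SRAM into each target core's local buffer."""
    for t in targets:
        cores[t].append(sram_data)

cores = {i: [] for i in range(4)}                # four processor cores, empty buffers
transfer("pkt", cores, targets=[2])              # unicast: one core
transfer("pkt", cores, targets=[0, 1])           # multicast: a chosen subset
transfer("pkt", cores, targets=list(cores))      # broadcast: multicast to all cores
print([len(buf) for buf in cores.values()])  # [2, 2, 2, 1]
```

Core 3 receives only the broadcast copy, while cores 0-2 each receive two copies, matching the definitions above.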

CDMA 410 is used to control access to the SRAM 408 of another cluster 405 within the same computing device 201. GDMA 411 cooperates with the external memory controller 401 to control access from the SRAM 408 of the cluster 405 to the DRAM 204, or to read data from the DRAM 204 into the SRAM 408. As can be seen from the foregoing, communication between the DRAM 204 and the NRAM 531 or WRAM 532 can be achieved in two ways. The first is to communicate between the DRAM 204 and the NRAM 531 or WRAM 532 directly through the IODMA 533; the second is to transfer data between the DRAM 204 and the SRAM 408 through the GDMA 411, and between the SRAM 408 and the NRAM 531 or WRAM 532 through the MVDMA 534. Although the second way requires more components and a longer data path, in some embodiments its bandwidth is substantially greater than that of the first, so it may be more efficient for communication between the DRAM 204 and the NRAM 531 or WRAM 532. It should be understood that the data transfer schemes described here are merely exemplary; those skilled in the art can flexibly select and adapt various schemes according to the specific hardware arrangement in light of the teachings of the present disclosure.
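The trade-off between the two transfer paths can be illustrated with a toy latency model. The bandwidth figures below are entirely assumed for illustration, not taken from the disclosure; the point is only that a two-hop path can still win when each hop's bandwidth is high enough.

```python
IODMA_BW = 10e9    # assumed bandwidth of the direct DRAM <-> NRAM/WRAM path, bytes/s
GDMA_BW = 40e9     # assumed bandwidth of DRAM <-> SRAM, bytes/s
MVDMA_BW = 40e9    # assumed bandwidth of SRAM <-> NRAM/WRAM, bytes/s

def transfer_time(nbytes, path):
    if path == "iodma":
        return nbytes / IODMA_BW               # single direct hop
    # "via_sram": two sequential hops, GDMA then MVDMA
    return nbytes / GDMA_BW + nbytes / MVDMA_BW

size = 64 * 1024 * 1024                        # a 64 MiB transfer
best = min(("iodma", "via_sram"), key=lambda p: transfer_time(size, p))
print(best)  # via_sram
```

Under these assumed numbers the two-hop path takes half the time of the direct path, which mirrors the text's observation that the second way can be the more efficient one despite the longer data flow.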

In other embodiments, the functionality of the GDMA 411 and that of the IODMA 533 may be integrated in the same component. Although the present disclosure treats the GDMA 411 and the IODMA 533 as different components for convenience of description, implementations achieving similar functions and technical effects fall within the protection scope of the present disclosure. Further, the functions of the GDMA 411, IODMA 533, CDMA 410, and MVDMA 534 may be implemented by the same component.

From the above description in connection with the embodiments, those skilled in the art will understand that the present disclosure also discloses an electronic device or apparatus, which may include one or more of the above boards, one or more of the above chips and/or one or more of the above combined processing devices.

According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a PC device, an internet of things terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical, and the like.

Further, the electronic device or apparatus of the present disclosure may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as the cloud, the edge, and the terminal. In one or more embodiments, a high-power electronic device or apparatus according to the present disclosure may be deployed as a cloud device (e.g., a cloud server), while a low-power electronic device or apparatus may be deployed as a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and that of the terminal and/or edge device are compatible, so that, according to the hardware information of the terminal and/or edge device, appropriate hardware resources can be matched from the cloud device's resources to simulate those of the terminal and/or edge device, enabling unified management, scheduling, and cooperative work of end-cloud or cloud-edge-end integration.

The hardware architecture and its internal structure of the present disclosure are described in detail above in conjunction with fig. 1-5. It is to be understood that the above description is intended to be illustrative, and not restrictive. According to different application scenarios and hardware specifications, those skilled in the art may also change the board card and the internal structure of the present disclosure, and these changes still fall into the protection scope of the present disclosure. The scheme of the present disclosure will be described in detail below.

Fig. 6 is a schematic diagram illustrating a structure of an integrated circuit device 600 according to an embodiment of the disclosure. To illustrate the relationship between the integrated circuit device 600 and the external memory, an external memory 606 is also shown in FIG. 6. As shown in fig. 6, the integrated circuit device 600 may include a data interface 601, a storage unit 602, and a plurality of computing units. The plurality of calculation units may be, for example, the calculation unit 603, the calculation unit 604 …, and the calculation unit 605 shown in fig. 6. The data interface 601 may be configured to transfer data between the integrated circuit device 600 and the external memory 606. In an application scenario, a Direct Memory Access ("DMA") interface may be used as the aforementioned data interface 601, so as to transmit data of the external Memory 606 to the plurality of computing units through the storage unit 602.

In one embodiment, the storage unit 602 may be configured to obtain the first data from the external memory 606 via the data interface 601 and send the first data to the plurality of computing units, and it may further store a plurality of computing results received from the plurality of computing units.
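The fetch-broadcast-collect behavior just described can be sketched roughly as follows. This is an illustrative model only; all names (`StorageUnit`, `ComputingUnit`, `broadcast_and_collect`) and the placeholder multiply-accumulate task are assumptions for exposition, not part of the disclosed device.

```python
class StorageUnit:
    def __init__(self, external_memory):
        self.external_memory = external_memory  # dict standing in for the external memory
        self.results = {}                       # calculation results sent back by the units

    def fetch(self, key):
        # One crossing of the data interface (e.g. a DMA transfer) per input.
        return self.external_memory[key]

    def broadcast_and_collect(self, first_data, units):
        # The same first data is forwarded to every computing unit, so the
        # external bus is crossed once regardless of how many units there are.
        for unit in units:
            self.results[unit.name] = unit.compute(first_data)
        return self.results


class ComputingUnit:
    def __init__(self, name, second_data):
        self.name = name
        self.second_data = second_data  # pre-stored second data (e.g. weights)

    def compute(self, first_data):
        # Placeholder task: an elementwise multiply-accumulate.
        return sum(a * b for a, b in zip(first_data, self.second_data))


mem = {"ina": [1.0, 2.0, 3.0]}
storage = StorageUnit(mem)
units = [ComputingUnit("u1", [1.0, 0.0, 0.0]),
         ComputingUnit("u2", [0.0, 1.0, 0.0]),
         ComputingUnit("u3", [1.0, 1.0, 1.0])]
results = storage.broadcast_and_collect(storage.fetch("ina"), units)
```

Note that the external memory is read exactly once per input, while the results of all units accumulate in the storage unit, mirroring the I/O-saving intent described above.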

Based on different application scenarios, the number of the computing units may be any positive integer greater than or equal to 2, such as 2, 3, or 5, so as to meet the operation requirements of different application scenarios. For example, when 3 nodes in the system need to obtain corresponding calculation results, the number of the calculation units may be set to 3, so that the 3 calculation units respectively calculate the corresponding calculation results of the 3 nodes. Similarly, when 5 nodes in the system need to obtain corresponding calculation results, 5 calculation units can be set.

In one embodiment, each of the plurality of computing units may be configured to receive, from the storage unit 602, the first data that originates from the external memory 606. Further, each computing unit may be configured to perform a computation according to the first data and second data stored in advance to obtain a computation result, and to send the computation result to the storage unit 602.

The first data may, for example, comprise neuron data in a neural network, and the second data may, for example, comprise weights in the neural network. Further, the first data and the second data may be vector, matrix, or multidimensional (three-dimensional, four-dimensional, or higher) data. The first data and/or the second data of the present disclosure may each include one or more data blocks according to different data structures (e.g., different data placement rules). In one operational scenario, when the integrated circuit device of the present disclosure is used for matrix computation, the first data and/or the second data may also be a block of a certain size in the matrix, for block-wise parallel computation of the matrix.
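The block-wise partitioning mentioned above can be illustrated with a minimal sketch; the function name `split_row_blocks` and the row-block scheme are assumptions for illustration, since the source does not prescribe a particular blocking rule.

```python
def split_row_blocks(matrix, block_rows):
    """Split a matrix (a list of rows) into row blocks of at most block_rows
    rows each, so that each block can be dispatched to a different computing
    unit for block-wise parallel computation."""
    return [matrix[i:i + block_rows] for i in range(0, len(matrix), block_rows)]


m = [[1, 2], [3, 4], [5, 6], [7, 8]]
blocks = split_row_blocks(m, 2)  # two blocks of two rows each
```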

In an implementation scenario, each computing unit may perform a corresponding computing task according to the first data and the pre-stored second data, based on the requirements of different application scenarios, to obtain a computing result. The computing task may be, for example, one or any combination of the convolution operation, matrix-by-matrix operation, matrix-by-vector operation, bias operation, fully-connected operation, GEMM operation, GEMV operation, and activation operation involved in the artificial intelligence field (e.g., neural networks).

To better understand how the integrated circuit device of the present disclosure performs operations, the data processing method of the embodiment of the present disclosure is described below by taking the integrated circuit device 700 shown in fig. 7 as an example. For simplicity of description, fig. 7 illustrates, by way of example only, a case where the integrated circuit device 700 includes three computing units. As shown in fig. 7, the three calculation units are a calculation unit 703, a calculation unit 704, and a calculation unit 705, respectively.

Assume that the integrated circuit device 700 performs computations related to a neural network, and that the input data of the neural network is ina, which is stored in an external memory such as the external memory 706. The computing unit 703 has second data b1 stored therein in advance, the computing unit 704 has second data b2 stored therein in advance, and the computing unit 705 has second data b3 stored therein in advance. In the initial stage of the operation, the external memory 706 may transmit the input data ina, as first data, to the storage unit 702 via the data interface 701, for transmission to the plurality of computing units described above.

After obtaining the first data ina, the computing unit 703 may perform a corresponding computing task (e.g., a convolution operation) according to the first data ina and the second data b1 stored in the computing unit 703, so as to obtain a computing result c1. After obtaining the first data ina, the computing unit 704 may perform a corresponding computing task (e.g., a matrix-by-matrix operation) according to the first data ina and the second data b2 stored in the computing unit 704, so as to obtain a computing result c2. Like the preceding two computing units, the computing unit 705 may perform a corresponding computing task (e.g., a GEMM operation) according to the first data ina and the second data b3 pre-stored therein, and obtain a computing result c3. It will be understood by those skilled in the art that the computing tasks performed in each of the computing units described above are only exemplary and not limiting, and other computing tasks listed above may also be performed based on different application scenarios, which will not be described herein again.
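The c1/c2/c3 example above — three units running different tasks on the same broadcast input — can be sketched as follows. The concrete operations (`conv1d`, `matvec`) and the numeric values are illustrative stand-ins for the convolution, matrix, and GEMM tasks named in the text, not taken from the source.

```python
def conv1d(x, w):
    # Valid 1-D sliding-window product, a stand-in for the convolution task.
    n = len(w)
    return [sum(x[i + j] * w[j] for j in range(n)) for i in range(len(x) - n + 1)]

def matvec(m, x):
    # Matrix-by-vector product, a stand-in for the matrix / GEMM tasks.
    return [sum(a * b for a, b in zip(row, x)) for row in m]

ina = [1.0, 2.0, 3.0]                 # first data, broadcast to all three units
c1 = conv1d(ina, [1.0, 1.0])          # unit 703 with its pre-stored second data b1
c2 = matvec([[1.0, 0.0, 0.0]], ina)   # unit 704 with its pre-stored second data b2
c3 = matvec([[0.0, 0.0, 2.0]], ina)   # unit 705 with its pre-stored second data b3
```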

As can be seen from the above description, the integrated circuit device of the present disclosure may transmit the first data to the storage unit and then forward it to each of the plurality of computing units for computation. Thus, the data throughput of the integrated circuit device for data interaction with the external memory via the I/O bus may be reduced, thereby mitigating the bottleneck of the I/O bus bandwidth. In addition, the respective computing units in the disclosed solution can perform data reception and data computation in parallel, so the computation speed of each computing unit can be guaranteed, thereby guaranteeing the data processing efficiency of the integrated circuit device.

After the computing units 703, 704, and 705 compute the computing results c1, c2, and c3, they may further send the computing results to the storage unit 702 for storage, so as to facilitate subsequent operations (e.g., further computation) performed using these results. In this way, only the storage unit needs to be accessed when a calculation result is called, which reduces the data throughput of the integrated circuit device for data interaction with the external memory through the I/O bus, and thus further alleviates the bandwidth bottleneck of the I/O bus.

As can be seen from the above description, in the integrated circuit device of the present disclosure, both the received data (e.g., the above first data) and the calculated result (e.g., the above calculated result) may be stored in the storage unit, so that only the storage unit needs to be interacted with when the above data and result are called for corresponding operations, and no external memory needs to be interacted with. Accordingly, the disclosed solution may reduce the data throughput of the integrated circuit device for data interaction with the external memory through the I/O bus, thereby overcoming the problem of I/O bus bandwidth bottleneck.

In one embodiment, each of the computing units described above may include a master computing unit and a plurality of slave computing units. Based on this, the master computing unit may be configured to receive the first data from the storage unit, divide the first data into a plurality of pieces of first sub-data, and send the first sub-data to the corresponding plurality of slave computing units. In one implementation scenario, the number of pieces of first sub-data may be associated with the number of slave computing units included in the computing unit; for example, the number of pieces of first sub-data may be equal to the number of slave computing units included in the computing unit, so that each slave computing unit is assigned one piece of first sub-data for subsequent computation.

In addition, the master computing unit may be further configured to receive intermediate results from the corresponding plurality of slave computing units, calculate the computing result according to the intermediate results, and send the computing result to the storage unit for storage. Correspondingly, each slave computing unit may be configured to perform a computation according to its first sub-data and pre-stored second sub-data to obtain an intermediate result, and send the intermediate result to the master computing unit. The second sub-data is one of a plurality of pieces of sub-data formed by dividing the second data pre-stored in the computing unit, and the number of such pieces may be associated with the number of slave computing units included in the computing unit. For example, the number of pieces of sub-data formed by dividing the second data may be equal to the number of slave computing units included in the computing unit, so that each slave computing unit is assigned one piece of second sub-data and performs the corresponding computation according to it.

To further illustrate the architecture and functionality of the computing unit, the computing unit 800 shown in FIG. 8 is described in detail below. As can be seen from fig. 8, the computing unit 800 may include a master computing unit 801 and a plurality of slave computing units. The plurality of slave computing units may be, for example, the slave computing unit 802, the slave computing unit 803 … slave computing unit 804 shown in fig. 8. The number of slave computing units may be set to different numbers according to requirements of different application scenarios, for example, the number may be 2, 4, or 5, and the like, which is not limited by the embodiment of the present disclosure.

Based on this, the master computing unit 801 may be configured to receive the first data from the aforementioned storage unit. For example, if the computing unit 800 is the computing unit 703 in fig. 7, it may receive the first data ina from the storage unit 702. Upon receiving the first data ina, the master computing unit 801 may divide it into a plurality of pieces of first sub-data a11, a12 … a13 equal in number to the slave computing units. Next, the master computing unit 801 may transmit the first sub-data a11, a12 … a13 to the slave computing unit 802, the slave computing unit 803 …, and the slave computing unit 804, respectively.

Assume that the second data stored in advance in the computing unit 800 is b1. Matching the division of the first data ina, the second data b1 may likewise be divided into a plurality of pieces of second sub-data b11, b12 … b13 equal in number to the slave computing units, and stored in the corresponding slave computing units, respectively. In the present embodiment, assume that the second sub-data b11 is stored in the slave computing unit 802, the second sub-data b12 is stored in the slave computing unit 803 …, and the second sub-data b13 is stored in the slave computing unit 804.

After the above operations of data division and transmission are performed, the slave computing unit 802 may perform its corresponding computing subtask according to the first sub-data a11 and the second sub-data b11, and obtain an intermediate result c11. The slave computing unit 803 may perform its corresponding computing subtask according to the first sub-data a12 and the second sub-data b12, resulting in an intermediate result c12. By analogy, the slave computing unit 804 may perform its corresponding computing subtask according to the first sub-data a13 and the second sub-data b13, and obtain an intermediate result c13. After obtaining these intermediate results, the slave computing unit 802, the slave computing unit 803 …, and the slave computing unit 804 may send the corresponding intermediate results c11, c12 … c13 to the master computing unit 801. Then, the master computing unit 801 may compute (e.g., sum) the computing result c1 from the intermediate results c11, c12 … c13, and may further send the computing result c1 to a storage unit (e.g., the storage unit 702 in fig. 7) for storage, so as to be used in subsequent operations.
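The master/slave split-compute-reduce flow above can be sketched with a dot product as the placeholder task; the function name and the choice of summation as the master's reduction are illustrative assumptions (the text gives summation only as an example).

```python
def master_slave_compute(first_data, slave_weights):
    """Split first_data into as many contiguous sub-blocks as there are
    slaves; each slave computes a partial dot product with its pre-stored
    second sub-data; the master sums the intermediate results into the
    final computing result."""
    n = len(slave_weights)               # number of slave computing units
    chunk = len(first_data) // n
    intermediates = []
    for i, w in enumerate(slave_weights):            # one iteration per slave
        sub = first_data[i * chunk:(i + 1) * chunk]  # first sub-data a1i
        intermediates.append(sum(a * b for a, b in zip(sub, w)))  # intermediate c1i
    return sum(intermediates)            # master reduces c11..c1n into c1


ina = [1.0, 2.0, 3.0, 4.0]
b1_split = [[1.0, 1.0], [1.0, 1.0]]      # second data b1 divided into b11, b12
c1 = master_slave_compute(ina, b1_split)
```

With all-ones weights the split computation reproduces the plain sum of the input, showing that the partitioned result matches the unpartitioned one.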

As can be seen from the above description, since the intermediate result of each computing unit can also be stored in the computing unit without being stored in the external memory, data exchange between the integrated circuit device and the external memory can be reduced, and thus the I/O bottleneck due to the external bandwidth limitation can be reduced. In addition, the computing tasks of the computing unit are divided into a plurality of computing subtasks and are respectively executed by the main computing unit and the plurality of slave computing units, so that the data processing efficiency is improved.

Further, in order to facilitate the storage of data such as the intermediate results, in an embodiment, each of the computing units may further include a storage unit. This storage unit may store the intermediate results received by the master computing unit. In addition to the intermediate results, it may further store the first data received by the master computing unit, the second data, the plurality of pieces of first sub-data formed by dividing the first data, the plurality of pieces of second sub-data formed by dividing the second data, the computing result, and the like.

The architecture and data processing flow of the computing unit are described above with reference to the drawings. The present disclosure may also build the internal architecture of the slave computing unit in a similar way to the architecture of the computing unit described above. For example, each of the plurality of slave computing units described above may in turn include a master computing subunit and a plurality of slave computing subunits. Based on this, the master computing subunit may be configured to receive the first sub-data from the corresponding master computing unit, divide the first sub-data into a plurality of pieces of first atomic data, and send the first atomic data to the corresponding plurality of slave computing subunits. In one implementation scenario, the number of pieces of first atomic data may be associated with the number of slave computing subunits contained in the slave computing unit; for example, the number of pieces of first atomic data may be equal to the number of slave computing subunits contained in the slave computing unit, so that each slave computing subunit is assigned one piece of first atomic data for subsequent computation.

In addition, the master computing subunit may be further configured to receive intermediate sub-results from the corresponding plurality of slave computing subunits, calculate the aforementioned intermediate result according to the plurality of intermediate sub-results, and send the intermediate result to the corresponding master computing unit, so that the master computing unit calculates a final computing result according to the intermediate results. The corresponding master computing unit here is the master computing unit that transmits the first sub-data to the master computing subunit. Correspondingly, each slave computing subunit may be configured to perform a computation based on its first atomic data and pre-stored second atomic data to obtain an intermediate sub-result, and send the intermediate sub-result to the master computing subunit. The second atomic data here is one of a plurality of pieces of atomic data formed by dividing the second sub-data stored in advance in the slave computing unit, and the number of such pieces may be associated with the number of slave computing subunits included in the slave computing unit. For example, the number of pieces of atomic data formed by dividing the second sub-data may be equal to the number of slave computing subunits included in the slave computing unit, so that each slave computing subunit is assigned one piece of second atomic data and performs the corresponding computation based on it.

To further illustrate the architecture and functionality of the slave computing unit, a detailed description will be given below by taking the slave computing unit 900 shown in fig. 9 as an example. As can be seen from fig. 9, the slave computing unit 900 may include a master computing subunit 901 and a plurality of slave computing subunits. The plurality of slave computing subunits may be, for example, the slave computing subunit 902, the slave computing subunit 903 …, and the slave computing subunit 904 shown in fig. 9. The number of slave computing subunits may be set differently according to the requirements of different application scenarios, for example, 2, 4, or 6, which is not limited by the embodiments of the present disclosure.

Based on this, the master computing subunit 901 may be configured to receive the first sub-data from its corresponding master computing unit. For example, if the slave computing unit 900 is the slave computing unit 802 in fig. 8, it may receive the first sub-data a11 from the master computing unit 801. Upon receiving the first sub-data a11, the master computing subunit 901 may divide it, in a manner similar to the computing unit described above, into a plurality of pieces of first atomic data a111, a112 … a113 equal in number to the slave computing subunits. Next, the master computing subunit 901 may send the first atomic data a111, a112 … a113 to the slave computing subunit 902, the slave computing subunit 903 …, and the slave computing subunit 904, respectively.

Assume that the second sub-data pre-stored in the slave computing unit 900 is b11. Matching the division of the first sub-data a11, the second sub-data b11 may be divided into a plurality of pieces of second atomic data b111, b112 … b113 equal in number to the slave computing subunits, and stored in the corresponding slave computing subunits, respectively. In the present embodiment, assume that the second atomic data b111 is stored in the slave computing subunit 902, the second atomic data b112 is stored in the slave computing subunit 903 …, and the second atomic data b113 is stored in the slave computing subunit 904.

After the above operations of data division and transmission are performed, the slave computing subunit 902 may perform its corresponding computing subtask according to the first atomic data a111 and the second atomic data b111 to obtain an intermediate sub-result c111. The slave computing subunit 903 may perform its corresponding computing subtask based on the first atomic data a112 and the second atomic data b112, resulting in an intermediate sub-result c112. By analogy, the slave computing subunit 904 may perform its corresponding computing subtask according to the first atomic data a113 and the second atomic data b113, and obtain an intermediate sub-result c113. After obtaining these intermediate sub-results, the slave computing subunit 902, the slave computing subunit 903 …, and the slave computing subunit 904 may respectively send the corresponding intermediate sub-results c111, c112 … c113 to the master computing subunit 901. Then, the master computing subunit 901 may calculate an intermediate result c11 from the plurality of intermediate sub-results c111, c112 … c113, and may further send the intermediate result c11 to the corresponding master computing unit (e.g., the master computing unit 801 described above), so that a final computing result (e.g., c1) can be calculated from intermediate results such as c11.
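The two-level master/slave hierarchy just described is naturally recursive: a master splits its input, its slaves split again, and the leaves do the actual arithmetic. The following sketch, with a dot product as the placeholder task, checks that the hierarchical result matches a flat computation; the function name, the `fanout`/`levels` parameters, and the even-split assumption are all illustrative, not from the source.

```python
def hierarchical_dot(x, w, fanout, levels):
    """Dot product computed through `levels` levels of master/slave
    splitting with `fanout` slaves per master; at the leaves the
    (sub)slaves perform the multiply-accumulate. Assumes len(x) is
    divisible by fanout at each level."""
    if levels == 0 or len(x) < fanout:
        # Leaf: a slave computing subunit's multiply-accumulate.
        return sum(a * b for a, b in zip(x, w))
    chunk = len(x) // fanout
    # Scatter to `fanout` slaves, recurse, then reduce the intermediate results.
    return sum(hierarchical_dot(x[i * chunk:(i + 1) * chunk],
                                w[i * chunk:(i + 1) * chunk],
                                fanout, levels - 1)
               for i in range(fanout))


x = [float(i) for i in range(8)]                  # 0.0 .. 7.0
w = [1.0] * 8
flat = sum(a * b for a, b in zip(x, w))           # single-unit reference
two_level = hierarchical_dot(x, w, fanout=2, levels=2)
```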

As can be seen from the above description, since the intermediate sub-result of each slave computing unit of the present disclosure can also be stored in the slave computing unit without being stored in the external memory, data exchange between the integrated circuit device and the external memory can be reduced, and thus the I/O bottleneck due to the external bandwidth limitation can be reduced. In addition, the computing subtasks of the slave computing units are further divided into a plurality of secondary computing subtasks and are respectively executed by the main computing subunit and the plurality of slave computing subunits, so that the data processing efficiency is improved.

Further, in order to facilitate the storage of data such as the intermediate sub-results, in an embodiment, each of the slave computing units may further include a storage unit. This storage unit may store the intermediate sub-results received by the master computing subunit. In addition to the intermediate sub-results, it may store the first sub-data received by the master computing subunit, the second sub-data, the plurality of pieces of first atomic data formed by dividing the first sub-data, the plurality of pieces of second atomic data formed by dividing the second sub-data, the intermediate result, and the like.

Various data processing by the computing unit are described above in connection with various embodiments. In one embodiment, to enable reliable data processing, the integrated circuit device may further comprise a control unit. In one embodiment, the control unit may be configured to control information interaction between the master computing unit and the plurality of slave computing units in each computing unit and data computation of the master computing unit and/or the plurality of slave computing units. Fig. 10 is a schematic diagram illustrating another integrated circuit device 1000 according to an embodiment of the disclosure. As can be seen from fig. 10, the integrated circuit device 1000 in this embodiment may include a storage unit 1001, a plurality of calculation units, and a control unit 1005. The plurality of calculation units may be, for example, the calculation unit 1002, the calculation unit 1003 …, and the calculation unit 1004 shown in fig. 10. The architecture, functions, and the like of the storage unit 1001 and the computing unit in this embodiment may refer to the foregoing embodiments, and are not described herein again.

In one embodiment, the control unit 1005 may be configured to control information interaction between the master computing unit and the plurality of slave computing units in the computing units 1002, 1003 … and 1004 and data computation of the master computing unit and/or the plurality of slave computing units in these computing units. For simplicity of description, the present embodiment will be described by taking only the control of the calculation unit 1002 as an example. As for the calculation unit 1002, the control unit 1005 may control the master calculation unit 1010 in the calculation unit 1002 to divide the received first data into a plurality of first sub-data and transmit them to the slave calculation unit 1011, the slave calculation unit 1012 …, and the slave calculation unit 1013, respectively. In addition, the control unit 1005 may also control the master computing unit 1010 to receive intermediate results from the plurality of slave computing units, and to compute a computing result according to the plurality of intermediate results.

Further, the control unit 1005 may also control the slave computing unit 1011, the slave computing unit 1012 …, and the slave computing unit 1013 to perform computation according to the received first sub-data and the pre-stored second sub-data to obtain intermediate results, and control them to transmit the intermediate results to the master computing unit 1010. Therefore, through the control of the control unit, reliable information interaction and data computation can be carried out within the computing unit, thereby guaranteeing reliable data processing by the computing unit.
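The control unit's role — sequencing the scatter, compute, and gather phases of one computing unit — can be sketched as below. The `ControlUnit` class, its phase names, and the log it keeps are all illustrative assumptions used only to make the ordering explicit.

```python
class ControlUnit:
    """Sketch of a control unit driving one computing unit through its
    scatter / compute / gather phases in order; `log` records the sequence."""

    def __init__(self):
        self.log = []

    def run(self, first_data, slave_weights):
        n = len(slave_weights)
        chunk = len(first_data) // n
        subs = [first_data[i * chunk:(i + 1) * chunk] for i in range(n)]
        self.log.append("scatter")      # master sends first sub-data to slaves
        intermediates = [sum(a * b for a, b in zip(s, w))
                         for s, w in zip(subs, slave_weights)]
        self.log.append("compute")      # slaves compute their intermediate results
        result = sum(intermediates)
        self.log.append("gather")       # master reduces and forwards the result
        return result


ctrl = ControlUnit()
res = ctrl.run([1.0, 2.0, 3.0, 4.0], [[1.0, 1.0], [1.0, 1.0]])
```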

Further, in order to ensure information interaction between the computing unit and the storage unit to achieve reliable data reading and storage, in one embodiment, the control unit may be further configured to control the master computing unit in each computing unit to receive the first data from the storage unit and to send the computing result to the storage unit. Still taking the computing unit 1002 in fig. 10 as an example, the control unit 1005 may be further configured to control the main computing unit 1010 in the computing unit 1002 to receive the first data from the storage unit 1001 and to transmit the computing result to the storage unit 1001, so as to ensure reliable data reading and storage between the computing unit 1002 and the storage unit 1001.

In the above embodiments, only part of the control functions of the control unit are exemplarily listed, and in other embodiments, the control unit may further control other information interaction between the master computing unit and the plurality of slave computing units, data calculation, and data reading and storing with other devices (e.g., a storage unit), so as to further ensure reliable data processing of the computing unit and reliable data reading and storing with external devices.

To compute the computation results at different nodes (e.g., on different first data output channels), in one embodiment, each of the plurality of computation units may be located on a different first data output channel, so that the plurality of computation units do not multiplex the second data between them. Based on this, the disclosed solution may perform a calculation using the first data and the second data on each data output channel (first data output channel) to obtain a calculation result on a different data output channel. For example, assume that neuron output data on different neuron output channels of a neural network are to be calculated, with the first data being the neuron data and the second data being the weights corresponding to each neuron output channel. In this case, the plurality of calculation units may each perform a calculation according to the received neuron data and the weights of its corresponding neuron output channel, so as to obtain the neuron output data on the different neuron output channels. It can be seen that aspects of the present disclosure can be used to compute multiple output data in a system having multiple data output channels.
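The per-channel arrangement above — shared neuron data, one non-multiplexed weight set per output channel — can be sketched in a few lines. The function name and the dot-product form of each channel's computation are illustrative assumptions.

```python
def channel_outputs(neurons, channel_weights):
    """One computing unit per output channel: each unit combines the shared
    neuron data (first data) with its own channel weights (second data),
    which are not multiplexed between units."""
    return [sum(n * w for n, w in zip(neurons, ws)) for ws in channel_weights]


neurons = [1.0, 2.0]
weights = [[1.0, 0.0],   # weights for output channel 0
           [0.0, 1.0],   # weights for output channel 1
           [1.0, 1.0]]   # weights for output channel 2
outs = channel_outputs(neurons, weights)  # one output per channel
```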

As mentioned in the above embodiments, the first data in the present disclosure may be neuron data of a neural network, and the second data may be weights of the neural network. Based on this, each of the plurality of calculation units may be further configured to perform a convolution calculation according to the neuron data and the pre-stored weights to obtain the calculation result, so as to meet the requirements of the corresponding application scenario. For different application scenarios, each of the plurality of computing units may instead be configured to perform other computations according to the neuron data and the pre-stored weights to obtain the computation result; such other computations may be, for example, one or any combination of the matrix-by-matrix operation, matrix-by-vector operation, bias operation, fully-connected operation, GEMM operation, GEMV operation, and activation operation involved in the aforementioned neural network, which will not be described in detail herein.

FIG. 11 is a simplified flow diagram illustrating a method of performing a calculation using an integrated circuit device according to an embodiment of the present disclosure. From the foregoing, it will be appreciated that the integrated circuit device here may be the integrated circuit device described in connection with the foregoing embodiments, having the architecture and interconnections described above and supporting the various classes of operations discussed in those embodiments.

As shown in fig. 11, the method 1100 may include, at step S1101, receiving from the aforementioned storage unit the first data that originates from the external memory. Next, the method 1100 performs step S1102, performing a calculation according to the first data and the second data stored in advance to obtain a calculation result. Finally, the method 1100 performs step S1103, sending the calculation result to the storage unit for storage.
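The three steps S1101-S1103 can be condensed into a single function as one computing unit would execute them; the dict standing in for the storage unit, the key names, and the dot-product computation are all illustrative assumptions.

```python
def compute_step(storage_unit, second_data):
    """Receive first data (S1101), compute with the pre-stored second
    data (S1102), and send the result back to the storage unit (S1103)."""
    first_data = storage_unit["first_data"]                       # S1101: receive
    result = sum(a * b for a, b in zip(first_data, second_data))  # S1102: compute
    storage_unit["result"] = result                               # S1103: store
    return result


storage = {"first_data": [1.0, 2.0, 3.0]}
r = compute_step(storage, [1.0, 1.0, 1.0])
```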

The calculation method of the present disclosure is described above only in conjunction with fig. 11 for the sake of simplicity. Those skilled in the art can also appreciate that the method may include more steps according to the disclosure of the present disclosure, and the execution of the steps may implement various operations of the present disclosure described in conjunction with the embodiments, and thus, the detailed description is omitted here.

It is noted that for the sake of brevity, this disclosure presents some methods and embodiments thereof as a series of acts or combinations thereof, but those skilled in the art will appreciate that the aspects of the disclosure are not limited by the order of the acts described. Accordingly, one of ordinary skill in the art will appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in this disclosure are capable of being practiced in other than the specifically disclosed embodiments, and that the acts or modules illustrated herein are not necessarily required to practice one or more aspects of the disclosure. In addition, the present disclosure may focus on the description of some embodiments, depending on the solution. In view of the above, those skilled in the art will understand that portions of the disclosure that are not described in detail in one embodiment may also be referred to in the related description of other embodiments.

In particular implementation, based on the disclosure and teachings of the present disclosure, one skilled in the art will appreciate that several embodiments disclosed in the present disclosure may be implemented in other ways not disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are divided based on the logic functions, and there may be other dividing manners in actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of connectivity between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.

In this disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, some or all of the units may be selected to achieve the purposes of the solutions described in the embodiments of this disclosure. Also, in some scenarios, multiple units in the embodiments of this disclosure may be integrated into one unit, or each unit may exist physically separately.

In some implementation scenarios, the integrated units may be implemented in the form of software program modules. If implemented in the form of software program modules and sold or used as a stand-alone product, the integrated units may be stored in a computer-readable memory. On this basis, when aspects of this disclosure are embodied in the form of a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include instructions for causing a computer device (e.g., a personal computer, a server, or a network device) to perform some or all of the steps of the methods described in the embodiments of this disclosure. The memory may include, but is not limited to, a USB flash drive, a flash memory, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

In other implementation scenarios, the integrated units may also be implemented in hardware, that is, as specific hardware circuits, which may include digital circuits and/or analog circuits. The physical realization of the hardware structure of a circuit may include, but is not limited to, physical devices, which in turn may include, but are not limited to, devices such as transistors or memristors. In view of this, the various devices described herein (e.g., the computing device or the processing device) may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), and may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, or a RAM.

The foregoing may be better understood in light of the following clauses:

Clause 1, an integrated circuit device, comprising:

a data interface configured to transfer data between the integrated circuit device and an external memory;

a storage unit configured to:

obtaining first data from the external memory via the data interface and sending the first data to a plurality of computing units; and

storing a plurality of calculation results received from the plurality of computing units;

a plurality of computing units, wherein each computing unit is configured to:

receiving, from the storage unit, the first data obtained from the external memory; calculating according to the first data and pre-stored second data to obtain a calculation result; and sending the calculation result to the storage unit.
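Purely for illustration, the data flow recited in clause 1 might be sketched as follows. All class and function names here are hypothetical and not part of the claimed device, and an inner product merely stands in for whatever operation the hardware actually performs; the point is that the second data never crosses the external-memory interface.

```python
# Hypothetical sketch of the clause-1 data flow: a storage unit sends
# "first data" to several computing units, each of which combines it with
# its own pre-stored "second data" and returns a result for storage.
from typing import List


class ComputeUnit:
    def __init__(self, second_data: List[float]):
        # Second data (e.g. weights) is stored inside the unit in advance,
        # so it never travels over the external-memory interface.
        self.second_data = second_data

    def compute(self, first_data: List[float]) -> float:
        # Illustrative operation: an inner product of first and second data.
        return sum(a * b for a, b in zip(first_data, self.second_data))


class StorageUnit:
    def __init__(self, units: List[ComputeUnit]):
        self.units = units

    def dispatch(self, first_data: List[float]) -> List[float]:
        # Send the same first data to every computing unit and collect
        # the per-unit calculation results.
        return [u.compute(first_data) for u in self.units]


units = [ComputeUnit([1.0, 0.0]), ComputeUnit([0.0, 2.0])]
storage = StorageUnit(units)
print(storage.dispatch([3.0, 4.0]))  # [3.0, 8.0]
```

Only the first data and the (typically smaller) results move between the storage unit and the computing units, which is the bandwidth saving the disclosure aims at.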

Clause 2, the integrated circuit device of clause 1, wherein each computing unit comprises a master computing unit and a plurality of slave computing units, wherein:

the master computing unit is configured to:

receiving the first data from the storage unit;

dividing the first data into a plurality of pieces of first sub-data and respectively sending the pieces of first sub-data to a corresponding plurality of slave computing units, wherein the number of pieces into which the first data is divided is related to the number of slave computing units contained in the computing unit;

receiving intermediate results from a corresponding plurality of slave computing units; and

calculating according to the intermediate results to obtain the calculation result and sending the calculation result to the storage unit;

each slave computing unit is configured to:

calculating according to the first sub-data and pre-stored second sub-data to obtain the intermediate result, wherein the second sub-data is one of a plurality of pieces of sub-data formed by dividing the second data pre-stored in the computing unit, and the number of pieces into which the second data is divided is associated with the number of slave computing units contained in the computing unit; and

sending the intermediate result to the master computing unit.
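The master/slave arrangement of clause 2 can be sketched, with hypothetical names and a plain sum standing in for the master's final aggregation step (the claim does not fix either):

```python
# Hedged sketch of clause 2: the master splits the first data into as many
# pieces as there are slave units, each slave combines its piece with a
# pre-stored slice of the second data, and the master reduces the
# intermediate results into the calculation result.
from typing import List


def split(data: List[float], parts: int) -> List[List[float]]:
    # Divide data into `parts` contiguous chunks (assumes even divisibility
    # for simplicity; real hardware may pad or split unevenly).
    step = len(data) // parts
    return [data[i * step:(i + 1) * step] for i in range(parts)]


class SlaveUnit:
    def __init__(self, second_sub: List[float]):
        self.second_sub = second_sub  # pre-stored slice of the second data

    def compute(self, first_sub: List[float]) -> float:
        return sum(a * b for a, b in zip(first_sub, self.second_sub))


class MasterUnit:
    def __init__(self, slaves: List[SlaveUnit]):
        self.slaves = slaves

    def compute(self, first_data: List[float]) -> float:
        subs = split(first_data, len(self.slaves))
        intermediates = [s.compute(sub) for s, sub in zip(self.slaves, subs)]
        # Aggregate the intermediate results into the final result.
        return sum(intermediates)


# Second data [1, 2, 3, 4] split across two slaves; first data [5, 6, 7, 8].
master = MasterUnit([SlaveUnit([1.0, 2.0]), SlaveUnit([3.0, 4.0])])
print(master.compute([5.0, 6.0, 7.0, 8.0]))  # 17.0 + 53.0 = 70.0
```

Note how both divisions track the slave count: the first data is split at run time, while the second data was split into the same number of pieces when it was pre-stored.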

Clause 3, the integrated circuit device of clause 2, wherein each of the plurality of slave computing units comprises a master computing subunit and a plurality of slave computing subunits, wherein:

the master computing subunit is configured to:

receiving the first sub-data from the corresponding master computing unit;

dividing the first sub-data into a plurality of pieces of first atomic data and respectively sending the pieces of first atomic data to a corresponding plurality of slave computing subunits, wherein the number of pieces into which the first sub-data is divided is related to the number of slave computing subunits contained in the slave computing unit;

receiving intermediate sub-results from a corresponding plurality of slave computing sub-units; and

calculating according to the plurality of intermediate sub-results to obtain the intermediate result and sending the intermediate result to the corresponding master computing unit, wherein the corresponding master computing unit is the master computing unit that sent the first sub-data to the master computing subunit;

each slave computing subunit is configured to:

calculating according to the first atomic data and pre-stored second atomic data to obtain the intermediate sub-result, wherein the second atomic data is one of a plurality of pieces of atomic data formed by dividing the second sub-data pre-stored in the slave computing unit, and the number of pieces into which the second sub-data is divided is associated with the number of slave computing subunits contained in the slave computing unit; and

sending the intermediate sub-result to the master computing subunit.
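Clause 3 repeats the clause-2 pattern one level down: a slave computing unit itself contains a master computing subunit and slave computing subunits. As an illustrative assumption (the recursive class below is not the hardware structure itself), the two-level hierarchy can be modeled as a tree in which every interior node splits its input and aggregates its children's results:

```python
# Hedged sketch of the clause-3 two-level hierarchy: the same
# split/compute/aggregate pattern applies at both levels.
from typing import List, Union


class Leaf:
    """A slave computing subunit holding pre-stored second atomic data."""
    def __init__(self, second_atom: List[float]):
        self.second_atom = second_atom

    def compute(self, first_atom: List[float]) -> float:
        return sum(a * b for a, b in zip(first_atom, self.second_atom))


class Node:
    """A master (sub)unit: splits its input across its children and
    aggregates their intermediate (sub-)results."""
    def __init__(self, children: List[Union["Node", Leaf]]):
        self.children = children

    def compute(self, data: List[float]) -> float:
        step = len(data) // len(self.children)  # assumes even divisibility
        parts = [data[i * step:(i + 1) * step]
                 for i in range(len(self.children))]
        return sum(c.compute(p) for c, p in zip(self.children, parts))


# Second data [1..8] distributed over 2 slave units x 2 subunits each.
tree = Node([
    Node([Leaf([1.0, 2.0]), Leaf([3.0, 4.0])]),
    Node([Leaf([5.0, 6.0]), Leaf([7.0, 8.0])]),
])
print(tree.compute([1.0] * 8))  # 1 + 2 + ... + 8 = 36.0
```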

Clause 4, the integrated circuit device according to clause 2, further comprising a control unit configured to control information interaction between the master computing unit and the plurality of slave computing units in each of the computing units and data computation of the master computing unit and/or the plurality of slave computing units.

Clause 5, the integrated circuit device according to clause 4, wherein the control unit is further configured to control the master computing unit in each of the computing units to receive the first data from the storage unit and to send the calculation result to the storage unit.

Clause 6, the integrated circuit device of any of clauses 1-5, wherein each of the plurality of computing units is located on a different first-data output channel such that the second data is not multiplexed between the plurality of computing units.

Clause 7, the integrated circuit device according to any one of clauses 1-5, wherein the first data is neuron data of a neural network, the second data is a weight of the neural network, and each of the plurality of computing units is further configured to perform a convolution calculation according to the neuron data and a pre-stored weight to obtain the calculation result.
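In the neural-network case of clause 7, the first data are activations and the second data are weights held inside each unit. Purely as an illustration (a 1-D valid convolution stands in for whatever kernel shape and stride the hardware actually implements, and all names are hypothetical):

```python
# Illustrative sketch of clause 7: neuron data convolved with weights that
# are pre-stored inside the computing unit, so only activations and results
# cross the external-memory interface.
from typing import List


def conv1d_valid(neurons: List[float], weights: List[float]) -> List[float]:
    # Slide the kernel over the neuron data; "valid" mode, stride 1.
    k = len(weights)
    return [
        sum(neurons[i + j] * weights[j] for j in range(k))
        for i in range(len(neurons) - k + 1)
    ]


class ConvUnit:
    def __init__(self, weights: List[float]):
        self.weights = weights  # the unit's pre-stored weight

    def compute(self, neurons: List[float]) -> List[float]:
        return conv1d_valid(neurons, self.weights)


unit = ConvUnit([1.0, -1.0])
print(unit.compute([1.0, 2.0, 4.0, 7.0]))  # [-1.0, -2.0, -3.0]
```

Because weights are typically reused across many activations, keeping them resident in the unit removes the largest repeated transfer from the external-memory path.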

Clause 8, an electronic device, comprising the integrated circuit apparatus of any one of clauses 1-7.

Clause 9, a board card comprising the integrated circuit device of any one of clauses 1-7.

Clause 10, a method of performing a computation using an integrated circuit device, wherein the integrated circuit device comprises a data interface for transferring data between the integrated circuit device and an external memory, a storage unit for obtaining first data from the external memory via the data interface and sending the first data to a plurality of computing units, and the plurality of computing units; the method comprises performing the following operations at each computing unit:

receiving, from the storage unit, the first data obtained from the external memory;

calculating according to the first data and second data stored in advance to obtain a calculation result; and

sending the calculation result to the storage unit for storage.

It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present disclosure are used to distinguish between different objects and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.

As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon", "in response to a determination", or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [described condition or event]", or "in response to detecting [described condition or event]".

While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions will occur to those skilled in the art without departing from the spirit and scope of the present disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure. It is intended that the following claims define the scope of the disclosure and that equivalents or alternatives within the scope of these claims be covered thereby.
