Chip cascade and parallel computing system

Document No.: 1921450  Publication date: 2021-12-03

Note: This technology, "Chip cascade and parallel computing system", was designed and created by Liu Yuan on 2020-05-27. Abstract: The invention provides a chip cascade parallel computing system, which comprises a computation control module, a computing array, a ring data path, and a star data path. The computation control module configures a working mode for each computing unit of the computing array from its control interface through the star data path; it receives data to be computed from the PCIE data interface; it sends data from its ring data interface through the ring data path to the first computing unit of the computing array, and receives computation result data through the ring data path from the ring data interface of the last computing unit of the computing array; and it outputs result data and feedback data through the PCIE interface.

1. A chip cascade parallel computing system, the system comprising: a computation control module, a computing array, a ring data path, and a star data path; wherein the computation control module configures a working mode for each computing unit of the computing array from its control interface through the star data path; the computation control module receives data to be computed from the PCIE data interface; the computation control module sends data from its ring data interface through the ring data path to the first computing unit of the computing array, and receives computation result data through the ring data path from the ring data interface of the last computing unit of the computing array; and the computation control module outputs result data and feedback data through the PCIE interface.

2. The chip cascade and parallel computing system according to claim 1, wherein the computing control module is implemented by an FPGA or ASIC chip.

3. The chip cascade parallel computing system according to claim 1, wherein the computation control module supports a PCIE interface, a MIPI/LVDS interface, and an SPI/I2C/UART interface.

4. The chip cascade parallel computing system of claim 1, wherein the computing array is a collection of N computing units; each computing unit is connected to the next by a high-speed serial differential interface, this unified interface carries the data, and input data and output data are both transmitted over the interface bus.

5. The chip cascade parallel computing system of claim 1, wherein each computing unit is an independent SOC/ASIC chip with a built-in computing unit, supporting MIPI/LVDS interfaces and SPI/I2C/UART interfaces.

6. The system of claim 1, wherein the ring data path is a high-speed data interface connecting the computation control module with each computing unit; within the computing array, the interface connections between the computing units form part of the ring data path; the interface connection between the computation control module and the first computing unit forms part of the ring data path; and the interface between the computation control module and the last computing unit is also part of the ring data path.

7. The chip cascade parallel computing system of claim 6, wherein the ring data path is implemented over a MIPI or LVDS differential high-speed interface.

8. The chip cascade parallel computing system of claim 1, wherein the star data path applies a differentiated configuration to each computing unit and, unlike the ring data path, is a low-speed, differentiated point-to-point communication link.

9. The system of claim 8, wherein the differentiated configuration comprises setting identity information of each computing unit, configuring a working mode of the bus arbitration module, configuring a computing task, and starting and stopping computing.

10. The system of claim 1, wherein the computation control module sends only computation data packets to the ring data path; a computing unit may send either computation data packets or result data packets to the ring data path.

Technical Field

The invention relates to the technical field of parallel computing, and in particular to a chip cascade parallel computing system.

Background

Today's society is highly digitized, and mobile communication technology continues to develop and evolve. Interfaces such as MIPI/DVP/BT, which carry general video streams, can also serve as high-speed data ports. Common terms in the prior art include:

PCI-Express (peripheral component interconnect express) is a high-speed serial computer expansion bus standard. PCIE uses high-speed serial point-to-point dual-channel high-bandwidth transmission; connected devices are allocated dedicated channel bandwidth rather than sharing bus bandwidth. It mainly supports functions such as active power management, error reporting, end-to-end reliable transmission, hot plugging, and quality of service (QoS).

The Mobile Industry Processor Interface (MIPI) is an open standard initiated by the MIPI Alliance and established for mobile application processors. MIPI is tailored for power-sensitive applications, using a low-amplitude signal swing in high-speed (data transfer) mode. The MIPI Alliance defines a set of interface standards that standardize the interfaces inside a mobile device, such as cameras, display screens, baseband, and radio-frequency interfaces, thereby increasing design flexibility while reducing cost, design complexity, power consumption, and EMI. Because MIPI uses differential signal transmission, designs must strictly follow the general rules of differential design; the key is to match the differential impedance, and the MIPI protocol specifies that the differential impedance of the transmission line be 80-125 ohms.

How to effectively improve efficiency and exploit differential high-speed data transmission to move data in chip cascade parallel computation has become an urgent problem to solve.

Disclosure of Invention

To solve the problems in the prior art, the present invention provides a computing system that can be used to accelerate deep neural network computation, intelligent speech algorithms, mathematical computation, and blockchain computation.

To achieve the above object, the present application provides a chip cascade parallel computing system, the system comprising: a computation control module, a computing array, a ring data path, and a star data path. The computation control module configures a working mode for each computing unit of the computing array from its control interface through the star data path; receives data to be computed from the PCIE data interface; sends data from its ring data interface through the ring data path to the first computing unit of the computing array, and receives computation result data through the ring data path from the ring data interface of the last computing unit of the computing array; and outputs result data and feedback data through the PCIE interface.

The calculation control module is realized by an FPGA or an ASIC chip.

The calculation control module supports a PCIE interface, an MIPI/LVDS interface and an SPI/I2C/UART interface.

The computing array is a set of N computing units. The computing units are connected by a high-speed serial differential interface; this unified interface carries the data, and input data and output data are transmitted over the interface bus.

Each computing unit is an independent SOC/ASIC chip, and the chip is internally provided with the computing unit and supports an MIPI/LVDS interface and an SPI/I2C/UART interface.

The ring data path is a high-speed data interface connecting the calculation control module with each calculation unit. Within the computing array, the interface connections between the calculation units form part of the ring data path; the interface connection between the calculation control module and the first calculation unit forms part of the ring data path; and the interface between the calculation control module and the last calculation unit is also part of the ring data path.

The ring data path is realized by a MIPI or LVDS differential high-speed interface.

The star data path applies a differentiated configuration to each computing unit; unlike the ring data path, it is a low-speed, differentiated point-to-point communication link.

The differentiated configuration comprises setting the identity information of each computing unit, configuring the working mode of the bus arbitration module, configuring the computing task, and starting and stopping computation.

The calculation control module sends only calculation data packets to the ring data path; a calculation unit may send either calculation data packets or result data packets to the ring data path.

The invention has the following advantages: taking deep neural network calculation as an example, with a CNN acceleration engine built into each calculation unit and each unit providing 8 Tops of computing power, a 16-unit cascade reaches a total board computing power of 128 Tops. In this way, the system offers better flexibility, higher cost performance, and a better energy-consumption ratio than a GPGPU or an FPGA.
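As a worked check of the figures quoted above (8 Tops per unit, 16 cascaded units), the total board computing power is simply the per-unit figure times the unit count; the helper name below is illustrative:

```python
# Worked sizing check using the figures quoted above: 8 Tops per computing
# unit, 16 units cascaded on one board.

def board_compute_tops(tops_per_unit: float, num_units: int) -> float:
    """Total computing power of a cascade of identical computing units."""
    return tops_per_unit * num_units

print(board_compute_tops(8, 16))  # -> 128
```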

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention.

Fig. 1 is a schematic diagram of the structure of the system of the present invention.

FIG. 2 is a diagrammatic representation of a ring data path in the system of the present invention.

Detailed Description

The system structure of the invention is shown in figure 1. The system consists of the following parts: the system comprises a computation control module, a computation array, a ring data path and a star data path.

Specifically, the chip cascade parallel computing system comprises: a computation control module, a computing array, a ring data path, and a star data path. The computation control module configures a working mode for each computing unit of the computing array from its control interface through the star data path; receives data to be computed from the PCIE data interface; sends data from its ring data interface through the ring data path to the first computing unit of the computing array, and receives computation result data through the ring data path from the ring data interface of the last computing unit of the computing array; and outputs result data and feedback data through the PCIE interface.

The calculation control module: this module is implemented by an FPGA or an ASIC chip, and supports PCIE, MIPI/LVDS, SPI/I2C/UART, and similar interfaces. The module works as follows:

1. Configure a working mode for each computing unit of the computing array through the star data path.
2. Receive data to be computed from PCIE.
3. Send the data through the ring data interface to the first computing unit of the computing array.
4. Receive computation result data from the ring data interface of the last computing unit of the computing array.
5. Output the result data and feedback data to the outside of the system through PCIE.
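The control module's configure-then-run flow can be sketched as a minimal, runnable model. All class, method, and variable names here are assumptions for illustration, not from the patent; PCIE transfer and the star/ring links are modeled as plain Python values, and each unit's "computation" is a stand-in doubling step:

```python
# Illustrative model of the computation control module: configure the array
# over the star path, inject data into the ring, collect the result.

class ComputeControlModule:
    def __init__(self, unit_ids):
        self.unit_ids = unit_ids   # computing array, in ring order
        self.modes = {}            # per-unit state written over the star path

    def configure(self, mode):
        # Star path: configure a working mode for each computing unit
        for uid in self.unit_ids:
            self.modes[uid] = mode

    def run(self, data):
        # Ring path: data enters at the first unit and passes through every
        # unit in order; each unit transforms it (here, doubling each value)
        result = data
        for _ in self.unit_ids:
            result = [x * 2 for x in result]
        # The result returns from the last unit's ring interface and is
        # output to the host (stand-in for the PCIE output)
        return result

ccm = ComputeControlModule([0, 1, 2, 3])
ccm.configure("cnn")
print(ccm.run([1, 2]))  # each of the 4 units doubles the data -> [16, 32]
```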

The computing array: a computing array is a collection of several computing units. Each computing unit is an independent SOC/ASIC chip with a built-in high-performance computing unit, supporting interfaces such as MIPI/LVDS and SPI/I2C/UART. The computing units are connected by a high-speed serial differential interface; this unified interface carries the data, and input data and output data are transmitted over the interface bus. The number of computing units in a computing array can be chosen flexibly according to the computing power required by the application, for example: 4 computing units form a computing array, 16 computing units form a computing array, 32 computing units form a computing array, and so on. However, in a board implementation, the PCB area limits how many computing units the same computing array can contain; it cannot grow indefinitely.

The ring data path: the ring data path is a high-speed data interface connecting the computation control module with each computation unit. Within the computing array, the interface connections between the computation units form part of the ring data path; the interface connection between the computation control module and the first computation unit forms part of the ring data path; and the interface between the computation control module and the last computation unit is also part of the ring data path. The ring data path has a high data throughput rate; some implementations use differential high-speed interfaces such as MIPI or LVDS. Taking the MIPI interface as an example, the technical details of the ring data path are shown in fig. 2.
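The link structure just described (control module to first unit, unit to unit, last unit back to the control module) can be enumerated with a small helper; the node names are illustrative:

```python
# Enumerate the point-to-point links that make up the ring data path for an
# array of num_units computing units. "ctrl" stands for the control module.

def ring_links(num_units):
    nodes = ["ctrl"] + [f"unit{i}" for i in range(num_units)]
    return [(nodes[i], nodes[(i + 1) % len(nodes)]) for i in range(len(nodes))]

print(ring_links(3))
# -> [('ctrl', 'unit0'), ('unit0', 'unit1'), ('unit1', 'unit2'), ('unit2', 'ctrl')]
```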

In this structure, the high-speed differential bus data arriving at the MIPI input is fed into the arbitration module, which decides whether part or all of the data stream enters this unit's memory; the data stream that does not enter the unit is forwarded out of the MIPI output and flows on to the next calculation unit. Data that enters the unit is handed to the calculation engine, which computes a result; the result data is written back to memory and then sent from the MIPI output to the next calculation unit. While the calculation engine is computing and has not yet reached the point of outputting a result, the calculation unit is in the busy state; otherwise it is in the idle state.

In this example, the data on the MIPI high-speed differential bus may include any one or more of the following packet types at the same time:

1. The packet is calculation data that explicitly identifies, by ID, the calculation unit that should receive it. The packet is received only by the bus arbitration module of the calculation unit with that ID; all other calculation units simply pass the packet through. If no calculation unit qualifies, the packet travels back to the calculation control module.

2. The packet is calculation data but carries no ID identifying a receiving calculation unit. The packet is received by the first calculation unit in the idle state; calculation units in the busy state simply pass the packet through. If no calculation unit qualifies, the packet travels back to the calculation control module.

3. The packet is result data. All calculation units simply pass the packet through.

This control logic is realized mainly by the coordinated operation of the bus arbitration module in each calculation unit; in other words, the logic above is the main working mode of the bus arbitration module in a calculation unit.
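The three routing rules above can be sketched as a small decision function. The packet fields, function names, and the tuple representation of units are assumptions for illustration:

```python
# Model of the bus arbitration rules on the ring data path: an addressed
# calculation packet goes to the unit with the matching ID, an unaddressed
# one to the first idle unit, and a result packet passes through every unit
# (i.e. it travels back to the calculation control module).

def arbitrate(packet, unit_id, busy):
    """Return 'receive' if this unit should take the packet, else 'forward'."""
    if packet["kind"] == "result":
        return "forward"                      # type 3: pass-through only
    if packet.get("dst") is not None:         # type 1: explicitly addressed
        return "receive" if packet["dst"] == unit_id else "forward"
    return "forward" if busy else "receive"   # type 2: first idle unit takes it

def route(packet, units):
    """Walk the packet around the ring of (unit_id, busy) pairs; return the
    receiving unit's ID, or None if it returns to the control module."""
    for uid, busy in units:
        if arbitrate(packet, uid, busy) == "receive":
            return uid
    return None

print(route({"kind": "calc", "dst": 2}, [(0, False), (1, True), (2, True)]))  # -> 2
print(route({"kind": "calc", "dst": None}, [(0, True), (1, False)]))          # -> 1
print(route({"kind": "result", "dst": None}, [(0, False), (1, False)]))       # -> None
```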

In general, the calculation control module sends only calculation data packets to the ring data path; a calculation unit may send either calculation data packets or result data packets to the ring data path, thereby enabling relayed computation.

The above is only an example of the ring data path implemented with the MIPI bus protocol; other high-speed buses such as LVDS/BT1120 may be used instead of the MIPI bus. The defining characteristic of the ring data path is that a single high-speed bus protocol carries both calculation data and result data uniformly, so two separate data paths are not needed.

The star data path: the main function of the star data path is to apply a differentiated configuration to each computing unit, such as setting the identity information of each computing unit, configuring the working mode of the bus arbitration module, configuring the computing task, and starting and stopping computation. Unlike the ring data path, the star data path is a low-speed, differentiated point-to-point communication link; therefore, common interfaces such as SPI/I2C/UART suffice to implement it.
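A sketch of this differentiated per-unit configuration follows. The StarLink class and the command keys are assumptions for illustration; a real system would issue these writes over SPI/I2C/UART:

```python
# Each computing unit has its own low-speed point-to-point link to the
# control module, so configuration is differentiated per unit.

class StarLink:
    """One star-path link from the control module to a single computing unit."""
    def __init__(self, unit_id):
        self.unit_id = unit_id
        self.config = {}

    def write(self, key, value):
        self.config[key] = value  # stand-in for an SPI/I2C/UART register write

def configure_array(num_units, task):
    links = [StarLink(i) for i in range(num_units)]
    for link in links:
        link.write("identity", link.unit_id)   # per-unit identity information
        link.write("arb_mode", "ring")         # bus arbitration working mode
        link.write("task", task)               # the computing task to run
        link.write("running", True)            # start computing
    return links

links = configure_array(4, "cnn")
print(links[2].config["identity"])  # -> 2
```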

The modules are organically combined to form a chip cascade parallel computing system.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
