Bridge, acceleration equipment interconnection system and data acceleration processing method

Document No.: 1904476  Publication date: 2021-11-30

Reading note: This invention, Bridge, Acceleration Equipment Interconnection System and Data Acceleration Processing Method, was designed and created by Wang Yun, Zhang Guanxing, Guo Wei, Huang Kangying and Zhang Tieliang on 2021-11-02. Its main content is as follows: The invention provides a bridge, an acceleration device interconnection system and a data acceleration processing method, applied in the technical field of artificial intelligence. The bridge comprises a first multiplexer, a second multiplexer, multi-stage switch interconnection networks, input ports and output ports, where the input ports and output ports are paired and connected with acceleration devices. The first multiplexer is arranged between an input port and the multi-stage switch interconnection networks and, under the control of an input control signal, connects the input port to a multi-stage switch interconnection network; the second multiplexer is arranged between the multi-stage switch interconnection networks and an output port and, under the control of an output control signal, connects a multi-stage switch interconnection network to the output port. Through the bridge, not only can the bridge's own ports be interconnected, but point-to-point, non-blocking data exchange between paired acceleration devices can also be realized, flexibly supporting the deployment of neural networks onto acceleration devices such as acceleration chips and acceleration chipsets.

1. A bridge, comprising: a first multiplexer, a second multiplexer, a multi-stage switch interconnection network, an input port, and an output port;

wherein the input port and the output port are paired to form a transceiving port for connection with an acceleration device having a transceiving communication port;

the first multiplexer is a single-input, multi-output multiplexer arranged between the input port and the multi-stage switch interconnection network, each output of the first multiplexer is connected to a different multi-stage switch interconnection network, and the first multiplexer is configured to, under the control of an input control signal, connect the input port to the multi-stage switch interconnection network corresponding to the input control signal;

the second multiplexer is a multi-input, single-output multiplexer arranged between the multi-stage switch interconnection network and the output port, each input of the second multiplexer is connected to a different multi-stage switch interconnection network, and the second multiplexer is configured to, under the control of an output control signal, connect the multi-stage switch interconnection network corresponding to the output control signal to the output port.

2. The bridge of claim 1, wherein the multi-stage switch interconnection network employs any one of the following control strategies: a stage control strategy, a unit control strategy, and a partial stage control strategy.

3. The bridge of claim 1, wherein the input control signal comprises a first time-division multiplexing signal;

and/or the output control signal comprises a second time-division multiplexing signal.

4. The bridge of claim 1, further comprising: a first output buffer queue arranged between the multi-stage switch interconnection network and the second multiplexer, wherein the first output buffer queue is configured to buffer data transmitted from a plurality of input ports to the output port.

5. The bridge of claim 1, further comprising: a first input buffer queue arranged between the first multiplexer and the multi-stage switch interconnection network, wherein the first input buffer queue is configured to buffer data transmitted from the input port to a plurality of output ports.

6. The bridge of claim 1, further comprising: a routing control unit configured to receive data from the input port, parse the received data to obtain a destination address, and generate configuration instruction information for the multi-stage switch interconnection network according to a source address and the destination address, wherein the configuration instruction information is used to configure the multi-stage switch interconnection network.

7. The bridge of claim 6, wherein the routing control unit comprises a routing table, and the routing table is configured to store a plurality of routing table entries from which the configuration instruction information is generated.

8. The bridge of claim 1, wherein the transceiving ports are identified and arranged according to a preset numbering policy, so as to be interconnected with the corresponding acceleration devices according to the arranged identifiers.

9. An acceleration device interconnection system, comprising a plurality of acceleration devices and the bridge of any one of claims 1 to 8, wherein a data sending end of each acceleration device is connected to a corresponding input port of the bridge, a data receiving end of each acceleration device is connected to a corresponding output port of the bridge, and the acceleration devices are devices onto which a neural network model is deployed.

10. A data acceleration processing method applied to the acceleration device interconnection system of claim 9, the data acceleration processing method comprising:

generating a computation graph corresponding to a neural network model to be accelerated, and splitting the computation graph according to a preset splitting strategy to form a plurality of network units;

and loading the split network units into the acceleration devices of the acceleration device interconnection system according to a preset deployment strategy, so as to perform accelerated inference on the neural network model.

Technical Field

The invention relates to the technical field of artificial intelligence, and in particular to a bridge, an acceleration device interconnection system and a data acceleration processing method.

Background

With the development of AI (Artificial Intelligence) technology in various fields (such as algorithms and acceleration hardware), convolutional neural network models have gradually been applied across industries, for example in face recognition, security monitoring, autonomous driving and speech recognition.

At present, when a neural network model is deployed across a multi-chip interconnection system, data must be communicated between the chips. For example, when the input layer of the model is deployed on chip A and a hidden layer on chip B, chip B needs to acquire data from chip A; that is, chip A and chip B must be interconnected for data communication. Moreover, in practical deployments, different deployment schemes require different interconnection relationships among the chips. Existing schemes are constrained by their bridge designs, so when a neural network model is deployed onto a multi-chip interconnection system, practical deployment cannot be flexibly supported.

Disclosure of Invention

In view of this, embodiments of the present disclosure provide a bridge for interconnection in neural network model deployment, together with an acceleration device interconnection system and a data acceleration processing method based on the bridge. The bridge enables interconnection between acceleration devices, improves the flexibility of neural network model deployment, and facilitates the adoption of neural networks in devices such as terminals and edge computing equipment.

Embodiments of the present specification provide the following technical solutions:

Embodiments of the present specification provide a bridge, which may include: a first multiplexer, a second multiplexer, a multi-stage switch interconnection network, an input port, and an output port;

wherein the input port and the output port are paired to form a transceiving port for connection with an acceleration device having a transceiving communication port;

the first multiplexer is a single-input, multi-output multiplexer arranged between the input port and the multi-stage switch interconnection network, each output of the first multiplexer is connected to a different multi-stage switch interconnection network, and the first multiplexer is configured to, under the control of an input control signal, connect the input port to the multi-stage switch interconnection network corresponding to the input control signal;

the second multiplexer is a multi-input, single-output multiplexer arranged between the multi-stage switch interconnection network and the output port, each input of the second multiplexer is connected to a different multi-stage switch interconnection network, and the second multiplexer is configured to, under the control of an output control signal, connect the multi-stage switch interconnection network corresponding to the output control signal to the output port.

In one embodiment, the multi-stage switch interconnection network employs any one of the following control strategies: a stage control strategy, a unit control strategy, and a partial stage control strategy.

In one embodiment, the input control signal comprises a first time-division multiplexing signal; and/or the output control signal comprises a second time-division multiplexing signal.

In one embodiment, the bridge further comprises: a first output buffer queue arranged between the multi-stage switch interconnection network and the second multiplexer, wherein the first output buffer queue is configured to buffer data transmitted from a plurality of input ports to the output port.

In one embodiment, the bridge further comprises: a first input buffer queue arranged between the first multiplexer and the multi-stage switch interconnection network, wherein the first input buffer queue is configured to buffer data transmitted from the input port to a plurality of output ports.

In one embodiment, the bridge further comprises: a routing control unit configured to receive data from the input port, parse the received data to obtain a destination address, and generate configuration instruction information for the multi-stage switch interconnection network according to a source address and the destination address, wherein the configuration instruction information is used to configure the multi-stage switch interconnection network.

In one embodiment, the routing control unit includes a routing table, and the routing table is configured to store a plurality of routing table entries from which the configuration instruction information is generated.
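As a hedged illustration of the routing control unit just described, the sketch below parses a source/destination address pair from incoming data and looks up a routing table entry to produce configuration instruction information for the multi-stage switch interconnection network; the packet layout, field names and entry contents are assumptions for illustration only, not taken from the patent.

```python
def parse_packet(packet: dict) -> tuple:
    """Extract the source and destination addresses from an incoming packet.
    The dict layout is an illustrative assumption."""
    return packet["src"], packet["dst"]


def make_config(routing_table: dict, packet: dict) -> dict:
    """Generate configuration instruction information from a routing table
    entry keyed by (source, destination)."""
    src, dst = parse_packet(packet)
    entry = routing_table[(src, dst)]          # one routing table entry
    return {"network": entry["network"],       # which internal network to use
            "stage_signals": entry["stages"]}  # per-stage switch control


routing_table = {(1, 2): {"network": 0, "stages": (1, 0)}}
cfg = make_config(routing_table, {"src": 1, "dst": 2, "payload": b"..."})
print(cfg)  # {'network': 0, 'stage_signals': (1, 0)}
```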

In one embodiment, the transceiving ports are identified and arranged according to a preset numbering strategy, and are interconnected with the corresponding acceleration devices according to the arranged identifiers.

An embodiment of the present specification further provides an acceleration device interconnection system, which may include a plurality of acceleration devices and any bridge described above, wherein a data sending end of each acceleration device is connected to a corresponding input port of the bridge, a data receiving end of each acceleration device is connected to a corresponding output port of the bridge, and the acceleration devices are devices onto which a neural network model is deployed.

An embodiment of the present specification further provides a data acceleration processing method applicable to the acceleration device interconnection system described above, and the data acceleration processing method may include:

generating a computation graph corresponding to a neural network model to be accelerated, and splitting the computation graph according to a preset splitting strategy to form a plurality of network units;

and loading the split network units into the acceleration devices of the acceleration device interconnection system according to a preset deployment strategy, so as to perform accelerated inference on the neural network model.
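As a hedged illustration of the two method steps above, the sketch below splits an ordered computation graph into network units and assigns the units to acceleration devices. The fixed-size layer-wise split and the round-robin assignment are merely example strategies chosen for illustration; the patent's preset splitting and deployment strategies are not specified at this level of detail.

```python
def split_graph(layers, unit_size):
    """Split an ordered computation graph (a list of layers) into
    consecutive network units of at most unit_size layers each."""
    return [layers[i:i + unit_size] for i in range(0, len(layers), unit_size)]


def deploy(units, devices):
    """Assign each network unit to an acceleration device (round-robin)."""
    return {i: devices[i % len(devices)] for i in range(len(units))}


graph = ["conv1", "relu1", "conv2", "relu2", "fc", "softmax"]
units = split_graph(graph, unit_size=2)
print(units)  # [['conv1', 'relu1'], ['conv2', 'relu2'], ['fc', 'softmax']]
print(deploy(units, ["devA", "devB"]))  # {0: 'devA', 1: 'devB', 2: 'devA'}
```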

Compared with the prior art, the at least one technical solution adopted by the embodiments of the present specification can achieve at least the following beneficial effects:

interconnection of a plurality of acceleration devices (such as acceleration chips and chipsets) is realized through the bridge. That is, by means of the multiplexers and multi-stage switch interconnection networks included in the bridge, arbitrary interconnection between acceleration devices is achieved under the control of control signals according to data communication requirements, enabling point-to-point, non-blocking data exchange between paired devices.

Drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required by the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.

FIG. 1 is a schematic diagram of two network topologies with multiple chips interconnected based on a bus;

fig. 2 is a schematic structural diagram of a bridge-based interconnection scheme provided in an embodiment of the present specification;

fig. 3 is a schematic structural diagram of a bridge according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of two-stage switch interconnections in a bridge according to an embodiment of the present disclosure;

fig. 5 is a schematic diagram of an internal connection state of a switch interconnection network in a bridge according to an embodiment of the present disclosure;

fig. 6 is a schematic structural diagram of a bridge according to an embodiment of the present disclosure;

fig. 7 is a schematic structural diagram of a bridge according to an embodiment of the present disclosure;

fig. 8 is a schematic structural diagram of an acceleration device interconnection system provided in an embodiment of the present specification;

fig. 9 is a flowchart of a data acceleration processing method provided in an embodiment of the present specification.

Detailed Description

The embodiments of the present application will be described in detail below with reference to the accompanying drawings.

The embodiments of the present application are described below by way of specific examples; those skilled in the art can readily understand other advantages and effects of the present application from the contents disclosed herein. It is to be understood that the described embodiments are only some, not all, of the embodiments of the present application. The present application may also be implemented through other, different embodiments, and various details herein may be modified or changed without departing from the spirit of the present application. It should be noted that, in the absence of conflict, the following embodiments and the features in them may be combined with each other. All other embodiments obtained by those of ordinary skill in the art based on the embodiments herein without creative effort fall within the protection scope of the present application.

It is noted that various aspects of the embodiments are described below within the scope of the appended claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms, and that any specific structure and/or function described herein is merely illustrative. Based on the present application, those skilled in the art should appreciate that an aspect described herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method may be practiced using any number of the aspects set forth herein. Additionally, such an apparatus may be implemented and/or such a method may be practiced using structure and/or functionality other than, or in addition to, one or more of the aspects set forth herein.

It should be noted that the drawings provided in the following embodiments merely illustrate the basic idea of the present application in a schematic way. The drawings show only the components related to the present application rather than the number, shape and size of the components in actual implementation; in actual implementation, the type, quantity and proportion of the components may be changed arbitrarily, and the component layout may be more complicated.

In addition, in the following description, specific details are provided to facilitate a thorough understanding of the examples. However, those skilled in the art will understand that the aspects may be practiced without these specific details. The terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of such features. In the description of the present invention, "a plurality" means two or more unless otherwise specified.

At present, convolutional neural networks used in artificial intelligence application scenarios have a multilayer network topology. Inter-layer data connection is realized through a plurality of convolution kernels (weight parameters) in each layer; each convolution kernel shares the output data of multiple output channels of the previous layer (that is, each convolution kernel must perform a convolution-and-sum operation with the feature data of every input channel). After convolution, summation, pooling and activation, the output data serves as the input data of the next layer's convolution operation, and the number of convolution kernels may equal the number of output feature data channels. This operation is performed cyclically until data such as feature vectors or classification results are output, completing the inference process of the whole neural network.
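The layer computation just described (each kernel convolved with all input channels, with the per-channel results summed to form one output channel) can be sketched minimally in plain Python. Pooling and activation are omitted, and the "valid" padding choice is an illustrative assumption rather than anything stated in the text.

```python
def conv2d_multi(inputs, kernels):
    """inputs: list of C channels, each an H x W grid (list of lists).
    kernels: list of K kernels, each a list of C (kh x kw) filters.
    Returns K output channels ("valid" convolution, no padding)."""
    C, H, W = len(inputs), len(inputs[0]), len(inputs[0][0])
    kh, kw = len(kernels[0][0]), len(kernels[0][0][0])
    outs = []
    for kern in kernels:  # one kernel produces one output channel
        out = [[sum(inputs[c][i + di][j + dj] * kern[c][di][dj]
                    for c in range(C) for di in range(kh) for dj in range(kw))
                for j in range(W - kw + 1)]
               for i in range(H - kh + 1)]
        outs.append(out)
    return outs


x = [[[1, 2], [3, 4]]]            # 1 input channel, 2x2
k = [[[[1, 0], [0, 1]]]]          # 1 kernel with one 2x2 filter
print(conv2d_multi(x, k))  # [[[5]]]
```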

For example, conventional convolutional neural networks designed for GPUs, TPUs and the like can be very large: super-large models may reach hundreds of megabytes and large models tens of megabytes, while the data communicated inside such a model is larger still (e.g., more than a gigabyte).

As shown in fig. 1, a chipset scheme or a multi-core architecture scheme may be formed by interconnecting a plurality of small-computing-power acceleration processors through a bus. For example, a chipset, or a multi-core single chip for neural network acceleration processing, is formed on the bus by a CPU (Central Processing Unit) and NPUs (Neural-network Processing Units) through interconnect 1 and/or interconnect 2 in the figure; the bus used for communication interconnection may be PCI/PCIe/SRIO/SPI/I2C/UART/GPIO, and the like.

However, in practical applications, such schemes are often limited by constraints of the bus protocol (such as transmission speed and the connection relationships between acceleration chips) and of the interconnection topology (such as the two topological interconnection structures shown in the figure). Not only is the implementation cost high, but the communication bandwidth between the acceleration processors (i.e., acceleration chips) is limited, making it difficult to meet the requirements of deploying neural network models onto computation-intensive acceleration processors.

Therefore, existing application schemes cannot deploy large-scale or even ultra-large-scale convolutional neural network models at low cost. Such models either cannot be deployed at all on terminals, edge computing devices and other equipment based on small-computing-power acceleration processors, or, even if deployed, cannot meet the requirements of real-time accelerated neural network inference.

In addition, some scenarios, such as those with extremely high safety requirements (e.g., autonomous driving), also require multiple-redundancy design, which further limits the deployment of large-scale and even ultra-large-scale neural network models.

In view of the above, after intensive research on neural network models and interconnection technologies, a bridge scheme for interconnecting multiple acceleration chips is proposed. As shown in fig. 2, the bridge includes 2k communication link ports (k is a positive integer; in the structure shown in the figure, k is 4) for interconnecting with k acceleration devices (such as acceleration chips or acceleration chipsets); that is, the bridge provides k input ports and k output ports. One input port and one output port of the bridge may be paired to form a group of transceiving ports responsible for interconnecting with an acceleration device having transceiving communication ports. In other words, each group of transceiving ports of the bridge interfaces with one acceleration device (or with one group of the acceleration device's transceiving communication ports), so that interconnection between the input ports and output ports inside the bridge realizes interconnection between the acceleration devices.

For example, the first group of transceiving ports of the bridge (e.g., the group formed by the input port and output port numbered 1 in the bridge) is interconnected with port a of acceleration device A (e.g., port a formed by the input port and output port numbered 1 among the communication link ports of acceleration device A); the second group of transceiving ports (formed by the ports numbered 2 in the bridge) is interconnected with port b of acceleration device B (formed by the ports numbered 2 among the communication link ports of acceleration device B); the third group of transceiving ports (formed by the ports numbered 3 in the bridge) is interconnected with port c of acceleration device C (formed by the ports numbered 3 among the communication link ports of acceleration device C); and the fourth group of transceiving ports (formed by the ports numbered 4 in the bridge) is interconnected with port d of acceleration device D (formed by the ports numbered 4 among the communication link ports of acceleration device D).

In an implementation, as shown in fig. 3, the bridge may include multiplexers and multi-stage switch interconnection networks, where the multiplexers at the input end are interconnected with the multiplexers at the output end through the multi-stage switch interconnection networks. When complex data exchange is required, the state of each switch can be controlled dynamically by a preset algorithm; that is, by controlling the input control signals and output control signals, arbitrary interconnection between multiple acceleration devices, and thus data exchange, can be realized.

It should be noted that the numbers of multiplexers and multi-stage switch interconnection networks may be determined according to the acceleration devices to be interconnected, and are not limited herein.

In an implementation, the multiplexer at an input port of the bridge may be a single-input, multi-output multiplexer arranged between the input port and the multi-stage switch interconnection networks, configured to connect the input port, under the control of an input control signal, to the multi-stage switch interconnection network corresponding to that signal. That is, it receives the data output by one acceleration device (or by one group of the acceleration device's transceiving communication ports), i.e., the input data at the bridge's input end, and transmits it to the corresponding multi-stage switch interconnection network under the control of the input control signal; the data is then transmitted to a multiplexer at an output port of the bridge and finally, under the control of an output control signal, to the corresponding acceleration device. The multiplexer at an output port may be a multi-input, single-output multiplexer arranged between the multi-stage switch interconnection networks and the output port, with each of its inputs connected to a different multi-stage switch interconnection network; it is configured to connect the multi-stage switch interconnection network corresponding to the output control signal to the output port under the control of that signal.

For example, suppose the 4 acceleration devices in the foregoing example (acceleration device A to acceleration device D) exchange data after being interconnected by the bridge shown in fig. 3, with the output port of acceleration device A connected to input port 1 of the bridge. The multiplexer at input port 1 of the bridge is connected to the inputs of the three multi-stage switch interconnection networks inside the bridge, so that, under the control of an input control signal, the data of acceleration device A may be transmitted to any one of the three networks. Likewise, the multiplexers at the output ports inside the bridge are also connected to the three multi-stage switch interconnection networks, so the output of acceleration device A can be transmitted to any of the other three acceleration devices under the control of the output control signals.
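As a hedged toy model of this example (not from the patent; the class, method names and data representation are all invented for illustration), a send operation lets an input port's demultiplexer place data onto one of the internal networks, and a receive operation lets an output port's multiplexer select which network drives it:

```python
class Bridge:
    """Toy model of the bridge: per-input-port demultiplexers select one of
    several internal switch interconnection networks, and per-output-port
    multiplexers select which network drives each output port."""

    def __init__(self, num_networks=3):
        # networks[n] maps an output port number to data routed via network n
        self.networks = [dict() for _ in range(num_networks)]

    def send(self, in_port, out_port, data, input_sel):
        # first multiplexer: under input control signal `input_sel`,
        # place the data arriving at in_port onto the selected network
        self.networks[input_sel][out_port] = data

    def recv(self, out_port, output_sel):
        # second multiplexer: under output control signal `output_sel`,
        # connect the selected network to out_port and take its data
        return self.networks[output_sel].pop(out_port, None)


bridge = Bridge()
# acceleration device A (port 1) sends to device D (port 4) via network 0
bridge.send(in_port=1, out_port=4, data="feature-map", input_sel=0)
print(bridge.recv(out_port=4, output_sel=0))  # feature-map
```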

In an implementation, each switch interconnection network in the multi-stage switch interconnection network may be an N-dimensional switch interconnection network (N is a positive integer), and each switch interconnection network may further realize internal switch-state connection under the control of the control signals provided by a control strategy.

In an implementation, the multi-stage switch interconnection network may adopt the two-stage switch interconnection network shown in fig. 4, which is a bridge based on a two-stage regular-tetrahedron interconnection network. Under the control of control signals, it can interconnect each of the 4 input ports with an output port: for example, input port 1 is connected to output port 1 through the first switch and the second switch, input port 2 to output port 2 through the first switch and the fourth switch, input port 3 to output port 3 through the third switch and the second switch, and input port 4 to output port 4 through the third switch and the fourth switch.

It should be noted that the form of the multi-stage switch interconnection network may be selected according to actual interconnection requirements and is not limited herein; likewise, the number of multi-stage switch interconnection networks and the number of stages of each network may be determined according to application requirements and are not limited herein.

In an implementation, a switch interconnection network as shown in fig. 5 may be used as the basic unit of a multi-stage switch interconnection network. The unit has 2 input ports and 2 output ports, and under the control of a control signal its internal state may be one of the following 4 states: (a) straight-through, in which each input port is connected to its corresponding output port, e.g., input port 0 to output port 0 and input port 1 to output port 1; (b) cross, in which each input port is connected to the non-corresponding output port, e.g., input port 0 to output port 1 and input port 1 to output port 0; (c) upper broadcast, in which input port 0 is connected to a plurality of output ports, e.g., input port 0 to both output port 0 and output port 1; (d) lower broadcast, in which input port 1 is connected to a plurality of output ports, e.g., input port 1 to both output port 0 and output port 1. In the figure, a dotted line represents a connection, and the absence of a line represents a floating (unconnected) state.
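The four states of such a 2x2 switching element can be modeled minimally in Python. This is a sketch; the state names are invented labels for the four states described above.

```python
def switch_2x2(in0, in1, state):
    """Return (out0, out1) for a 2x2 switching element.

    state: 'straight'  -> in0->out0, in1->out1
           'cross'     -> in0->out1, in1->out0
           'upper_bc'  -> in0 broadcast to both outputs
           'lower_bc'  -> in1 broadcast to both outputs
    """
    if state == "straight":
        return in0, in1
    if state == "cross":
        return in1, in0
    if state == "upper_bc":
        return in0, in0
    if state == "lower_bc":
        return in1, in1
    raise ValueError(f"unknown state: {state}")


print(switch_2x2("A", "B", "cross"))  # ('B', 'A')
```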

It should be noted that the description in this embodiment generally proceeds by way of example, and such examples should not be construed as specific limitations.

Through the multiplexers and the multi-stage switch interconnection networks, the internal connection state of each switch interconnection network realizes, under the control of the control signals, interconnection between the input ports and output ports of the bridge. The multi-stage switch interconnection network can therefore, under the control signals provided by a preset control strategy, interconnect the bridge's input and output ports and thereby interconnect the acceleration devices connected to those ports, so that the acceleration devices can carry out data communication over these interconnections.

In this way, each stage of the switch interconnection network realizes internal state connection under the control of the control signals and provides a bridge path for point-to-point, non-blocking data communication between paired devices. Based on actual application requirements, that is, according to the interconnection requirements between devices and the internal connection-state relationships of the bridge, arbitrary interconnection between acceleration devices can be realized through the bridge, meeting the data communication requirements of flexibly deploying a neural network model onto the acceleration devices.

In an implementation, the multi-stage switch interconnection network adopts any one of the following control strategies: a stage control strategy, a unit control strategy, and a partial stage control strategy.

In some embodiments, the signal control may be performed on each stage of the switch interconnection network, that is, a stage control strategy is adopted to implement internal state interconnection of the multi-stage switch network.

Taking the cross configuration shown in fig. 4 as an example, the first and second switches form the first stage and the third and fourth switches form the second stage, constituting a basic interconnection network circuit with 4 inputs and 4 outputs. With stage control signals applied, the relationship between the input ports and output ports can be as shown in table 1 below.

TABLE 1 interconnect Signal control representation

Table 1 above shows the interconnection of the input and output ports of the switch under the stage control signal.

Each switch can receive at least four state control signals. For simplicity, only the cross control signal is considered here: a "1" signal denotes data exchange (cross) and a "0" signal denotes straight-through, forming the stage control signals of table 1 above. For example, when the first-stage signal is "1" and the second-stage signal is "0", the signal at input port 1 is output at output port 2, i.e., data is sent from device 1 to device 2, and likewise the signal from device 2 is sent to device 1; the remaining combinations are not expanded one by one here.
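Stage control can be sketched as follows: every 2x2 switch in a stage shares one cross signal. Since the wiring of fig. 4 and the contents of table 1 are not reproduced here, the interstage connection below is assumed to be a perfect shuffle, so the resulting port mapping is illustrative and may differ from table 1.

```python
def stage(ports, cross):
    """Apply one stage of 2x2 switches, all sharing a single cross signal."""
    out = []
    for i in range(0, len(ports), 2):
        a, b = ports[i], ports[i + 1]
        out += [b, a] if cross else [a, b]   # 1 = cross, 0 = straight-through
    return out

def shuffle(ports):
    """Assumed interstage wiring (perfect shuffle): [0,1,2,3] -> [0,2,1,3]."""
    half = len(ports) // 2
    out = []
    for i in range(half):
        out += [ports[i], ports[i + half]]
    return out

def network(inputs, stage_signals):
    """Drive a multi-stage network with one cross signal per stage."""
    ports = list(inputs)
    for k, cross in enumerate(stage_signals):
        if k > 0:
            ports = shuffle(ports)
        ports = stage(ports, cross)
    return ports
```

With stage signals `[1, 0]` (first stage cross, second stage straight), `network(["d1", "d2", "d3", "d4"], [1, 0])` yields `["d2", "d4", "d1", "d3"]` under this assumed wiring.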

In some embodiments, other interconnection patterns can be flexibly realized by controlling switches in the same stage but different rows individually, that is, by adopting a unit control strategy in which different switch units receive individual control signals.

In some embodiments, different units in different stages, or within the same stage, are controlled separately, that is, a partial stage control strategy is adopted to realize the internal-state interconnection of each switch unit, thereby further improving interconnection flexibility.

In some embodiments, for each switch unit, synchronous interconnection between a single device and multiple devices, that is, a data broadcasting function, may be realized using the control signals corresponding to the upper-broadcast and lower-broadcast states; for example, device A may broadcast data to devices B to D synchronously.

In some embodiments, data communication among multiple devices may be realized in a time-division-multiplexed manner, for example, device A sending data to device B in one time slot while receiving data from device C in another, so that the data exchange is not limited to point-to-point exchange.

In some embodiments, the time-division multiplexed interconnection may be by a time-division multiplexed control signal, wherein the input control signal comprises a first time-division multiplexed signal; and/or the output control signal comprises a second time division multiplexing signal.

That is, interconnection of the input ports and output ports under time division multiplexing is realized through the time-division-multiplexed control signals.

In some embodiments, an output data buffer queue unit may be disposed in front of the multiplexer at the output port, so that data from multiple input ports is queue-buffered before being transmitted to the output port, improving non-blocking data forwarding performance.

As shown in fig. 6, a buffer queue unit may be disposed between the multilevel switch interconnection network and the multiplexer at the output port; this output buffer queue can buffer the data transmitted when a plurality of input ports are interconnected with that output port.

For example, with the 4 acceleration devices A to D of the foregoing embodiment, when acceleration devices B to D forward data to acceleration device A through the combined switch interconnection network, the forwarded data may be buffered in the buffer queue units of the corresponding groups and then read by device A in turn, thereby avoiding the information blocking and communication delay caused by changing the control signals.

It should be noted that the structure and control of the buffer queue are not specifically limited, and those skilled in the art can set the buffer queue, read and write data of the buffer queue, and the like according to actual needs.
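As the text notes, the queue structure is not limited; one minimal sketch is a FIFO per output port, where the interconnect pushes data tagged with its source port and the receiving device drains the queue. The class and method names are assumptions for illustration.

```python
from collections import deque

class OutputBufferQueue:
    """FIFO buffer placed before an output port, holding data that
    arrives from several input ports until the device reads it."""

    def __init__(self):
        self.queue = deque()

    def push(self, src_port, data):
        """Called by the interconnection network when data arrives."""
        self.queue.append((src_port, data))

    def pop(self):
        """Called by the receiving device; returns None when empty."""
        return self.queue.popleft() if self.queue else None
```

In the example above, devices B to D would each push into device A's queue, and device A would pop entries in arrival order without requiring the control signals to change between reads.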

It should be noted that for simplicity of illustration, multiplexers at the input ports, multilevel switch interconnect networks, etc. are not shown, and the dotted lines may represent internal interconnect paths of the switch interconnect networks.

In some embodiments, an input data buffer queue unit may be arranged after the multiplexer at the input port, so that data at the input port is queue-buffered before being transmitted to a plurality of output ports, improving non-blocking data forwarding performance.

As shown in fig. 7, a data buffer queue unit may be provided after the multiplexer at the input port, that is, between that multiplexer and the multilevel switch interconnection network, and used to buffer the data transmitted when the input port is interconnected with a plurality of output ports. In this way, data from the same input port destined for multiple output ports is queue-buffered, avoiding the congestion that arises when the input bandwidth exceeds the communication bandwidth of the interconnection network, and further improving the utilization of the input-port links. (For example, if the bandwidth of output port 1 of input device A is 1 Gbps while the bandwidth of each input port of the switch interconnection network is 0.5 Gbps, the switch interconnection network needs twice device A's communication time for forwarding, so device A would otherwise have to wait or reduce its transmission bandwidth.)

Therefore, by arranging the input buffer unit behind the multiplexer at the input end, device A can exchange data with other devices while the combined switch interconnection network is still forwarding the currently buffered input data, greatly improving the communication utilization of the device port.

In some embodiments, the bridge may further include a routing control unit, and the switch state interconnection control may be performed by the routing control unit.

In implementation, the routing control unit may receive data at the input end (i.e., the sending end, also called the source end), parse out the destination address, generate switch interconnection configuration instruction information from the source and destination address information, and decode it into interconnection circuit control signals, thereby establishing a direct communication path in the switch interconnection network.

In some embodiments, the routing control unit may include a routing table, such as a dynamically maintainable routing table, and may store corresponding routing control information through each entry in the routing table, so as to support deployment and application of the neural network model more flexibly.

In implementation, the routing table stores the data of the various transmission paths for use in routing. Each entry records the bridge's internal interconnection information, such as the input port, the output port, the control signals of the multilevel switch interconnection network, and the interconnection-state relation information; configuration instruction information can then be generated quickly from the entry contents so as to interconnect the input ports, output ports and multilevel switch interconnection network inside the bridge.
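A minimal sketch of such a lookup follows. The entry layout is an assumption: the text only states that entries hold the input port, output port, switch control signals and interconnection-state information, so the keying by (source device, destination device) and the field names are illustrative.

```python
# Hypothetical routing table: (src_device_id, dst_device_id) ->
# (input_port, output_port, per-stage control signals).
ROUTING_TABLE = {
    (1, 2): (1, 2, (1, 0)),
    (2, 1): (2, 1, (1, 0)),
}

def route(src, dst):
    """Resolve a source/destination pair into bridge configuration info."""
    entry = ROUTING_TABLE.get((src, dst))
    if entry is None:
        raise KeyError(f"no route from device {src} to device {dst}")
    in_port, out_port, signals = entry
    return {"input_port": in_port,
            "output_port": out_port,
            "control_signals": signals}
```

The returned dictionary corresponds to the configuration instruction information from which the interconnection circuit control signals would be decoded; entries can be preset or dynamically maintained, as described below.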

It should be noted that each entry in the routing table may be preset according to actual deployment of the neural network model, or may be dynamically adjusted according to a deployment change requirement after deployment application, which is not limited herein.

In some embodiments, multiple items of routing information may be added to the routing table to facilitate rapid generation of switch interconnect configuration command information from the routing information to produce interconnect control signals.

In implementation, the bridge may determine the source address and the destination address according to a preset communication protocol and the current instruction token information (e.g., sending-device ID + receiving-device ID + data). The routing information may include several pieces of information: bridge ID + bridge port ID, serving as a unique bridge identifier; chip ID + link port ID, serving as a unique device identifier. The communication instruction token may include: instruction format type, transaction type (e.g., read, write), sender NPU1 + ports 1:3, receiver NPU2 + ports 1:3, data, and so on.

For example, the interconnection configuration instruction information defines that the sending device NPU1 uses its transmit link ports 1:3 to route data through port 1 of bridges 1:3 to the receive link ports 1:3 of the destination device NPU2.
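Parsing such a token into source and destination addresses can be sketched as below. The token format ("chip ID:port ID" strings in a dictionary) is an assumption for illustration; the text only fixes the conceptual content of the token, not its encoding.

```python
def parse_token(token):
    """Split an instruction token into (chip, port) addresses and payload.

    Assumed token shape:
        {"src": "NPU1:1", "dst": "NPU2:1", "data": b"..."}
    """
    src_chip, src_port = token["src"].split(":")
    dst_chip, dst_port = token["dst"].split(":")
    return (src_chip, src_port), (dst_chip, dst_port), token["data"]
```

The (chip ID, link port ID) pairs returned here are the unique device identifiers from which the routing control unit would generate the switch interconnection configuration instruction.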

In some embodiments, each group of transceiver ports of the bridge may be labeled according to a preset numbering policy, so that the bridge can be interconnected with the corresponding acceleration device according to the assigned labels, where the acceleration devices (or their transceiver communication ports) are numbered correspondingly.

In implementation, each acceleration device has at least one data sending end and at least one data receiving end; the sending end and receiving end can be paired to form a transceiver communication port and then numbered, so that the labeled ports (input ports and output ports) of the bridge can be interconnected according to the assigned labels.

For example, with the 4 acceleration devices of the foregoing example (acceleration device A to acceleration device D), the input and output ports of each acceleration device are numbered, and the input and output ports of the bridge are numbered as well: say, the port number of acceleration device A is "1", that of acceleration device B is "2", and so on, while the bridge's input ports are numbered "1" to "4" and its output ports "1" to "4". Input port "1" and output port "1" of the bridge may then constitute the first port group, used to connect with acceleration device A, and so on for the rest, which are not listed one by one.

Such corresponding numbering then facilitates device connection, usage management, and the like.

In some embodiments, the ports of the device and the ports of the bridge can be equivalently interconnected, and the structure of the interconnection switch network can be simplified through coding.

In implementation, a structure of a multilevel n-cube interconnection network is adopted as a core unit of a multilevel switch interconnection network inside a bridge.

In implementation, each vertex of a single-level cube network can represent one device, with vertices labeled by binary codes; two vertices are interconnected when their codes differ by inverting a single bit. By adopting a multi-level structure, pairwise interconnection between vertices can be realized without the limitation of diagonal vertices, making it convenient to expand the interconnection among acceleration devices.

For example, the 0-dimensional space has only one point; the 1-dimensional space is an edge with two vertices 0 and 1; the 2-dimensional space is a plane with 4 vertices (00, 01, 10, 11) and 4 edges; the 3-dimensional space is a cube with 8 vertices (000, 001, 010, 011, 100, 101, 110, 111), 12 edges and 6 faces; the 4-dimensional space has 16 vertices (0000 to 1111), 32 edges, 24 faces, 8 cubes and 1 four-dimensional cube. In general, the vertices of the N-dimensional space can be represented by distinct N-bit codes of 0s and 1s, and the total number of vertices is 2^N.
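The single-bit-inversion rule for n-cube adjacency can be sketched directly on the integer codes; the function names are illustrative.

```python
def neighbors(vertex, n):
    """All vertices reachable from `vertex` in an n-cube network,
    obtained by flipping each of its n code bits in turn."""
    return [vertex ^ (1 << bit) for bit in range(n)]

def connected(u, v):
    """True iff codes u and v differ in exactly one bit,
    i.e. the two vertices share an n-cube edge."""
    diff = u ^ v
    return diff != 0 and (diff & (diff - 1)) == 0
```

For example, in the 3-cube, vertex 000 is connected to 001, 010 and 100, while diagonal vertices such as 000 and 011 (which differ in two bits) are not directly linked, which is the limitation the multi-level structure removes.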

For example, for the basic unit in the foregoing example (e.g., the network shown in fig. 5, which has four vertices), the vertices may be labeled (i.e., coded); when deriving Qn from Qn-1, the number of vertices doubles each time: a new copy of Qn-1 is nested outside the original Qn-1, and corresponding vertices are then connected to realize the expansion.

Based on the same inventive concept, the embodiment of the present specification further provides an acceleration device interconnection system to support neural network model deployment application.

As shown in fig. 8, the acceleration device interconnection system may include a plurality of acceleration devices and the bridge of any of the foregoing embodiments. The transceiver ports of each acceleration device are interconnected with a paired group of transceiver ports of the bridge, that is, the data sending end of the acceleration device is connected with the corresponding input port of the bridge and the data receiving end with the corresponding output port, so that the acceleration devices are interconnected through the bridge. After a neural network model is deployed in the interconnection system, the data transmission requirements of accelerated neural network inference can be satisfied on the basis of these interconnections.

In implementation, the acceleration device may be a processing chip, terminal, edge computing device or the like with small computing power; through the acceleration device interconnection system provided in the embodiments of this specification, neural network models can be flexibly deployed on such devices to accelerate inference operations, supporting the popularization and application of neural network models.

It should be noted that the acceleration device may be a device for deploying a neural network model, for example, the acceleration device may be an acceleration chip (e.g., NPU), or an acceleration chipset (e.g., a chipset formed by a plurality of NPUs), and is not limited herein. The acceleration device interconnection system may further include a central controller, a memory, and other processing components, which are not limited herein.

Based on the same inventive concept, the embodiment of the present specification further provides a data acceleration processing method, so as to perform accelerated reasoning operation on a neural network based on the acceleration device interconnection system described in any one of the foregoing embodiments.

As shown in fig. 9, the data acceleration processing method includes:

step S902, generating a calculation graph corresponding to the neural network according to a neural network model to be accelerated, and segmenting the calculation graph according to a preset splitting strategy to form a plurality of network units;

in implementation, the splitting policy may include any one of the following policies:

(a) different chips run different layers of the network, that is, the model is split longitudinally along its depth direction;

(b) different chips run the partial networks corresponding to at least one convolution kernel within the same layer (if several chips form a convolution kernel group, each acceleration chip is responsible for the convolution operation of its current convolution kernel group), that is, the model is split transversely within a layer;

(c) different chips run different layer groups and different convolution-kernel partial networks, that is, the model is split both longitudinally and transversely along its depth direction.
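Strategy (a) can be sketched as partitioning an ordered layer sequence (the computation graph simplified to a chain) into contiguous groups, one per chip; the even-sized grouping is an assumption, since the text does not fix how the split points are chosen.

```python
def split_by_depth(layers, num_chips):
    """Partition an ordered list of layers into contiguous groups
    along the model's depth direction (strategy (a))."""
    per_chip = -(-len(layers) // num_chips)   # ceiling division
    return [layers[i:i + per_chip] for i in range(0, len(layers), per_chip)]
```

For example, splitting a 7-layer chain across 3 chips yields groups of 3, 3 and 1 layers; each group then becomes one network unit to be loaded onto one acceleration device.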

It should be noted that the computation graph may be a graph that graphically represents a computation process of the neural network model, and is not limited herein; the partition of the computation graph may adopt an existing mature graph partition method, and is not limited herein.

It should be noted that the number of the neural network models to be accelerated may be one, or may be multiple (i.e., two or more), and thus the splitting policy is a policy for splitting one or more neural network models.

Step S904, loading the divided network unit into each acceleration device in the acceleration device interconnection system according to a preset deployment strategy, so as to perform accelerated inference operation on the neural network model.

In implementation, the acceleration chipset may be a chipset formed by the interconnection of acceleration chips according to any of the foregoing embodiments of the interconnection system, and there may be one or more such chipsets, which is not limited herein.

In implementation, the number of acceleration devices may be the same as or different from the number of network units obtained by splitting, and the deployment policy may be preset and adjusted according to actual deployment requirements; for example, one acceleration device (e.g., an acceleration chip) may host a single network unit, while another (e.g., an acceleration chipset) may host two or more network units, which is not limited herein.

In implementation, the deployment may be completed by issuing a configuration instruction to the acceleration device to cause the acceleration device to load the corresponding network element.
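The deployment step can be sketched as building a plan that assigns network units to devices and would then be issued as configuration instructions; the round-robin assignment and the function names are assumptions for illustration.

```python
def deploy(units, devices):
    """Assign network units to acceleration devices in round-robin order.

    Returns a deployment plan mapping each device to the list of
    network units it should load; a device may receive zero, one,
    or several units, consistent with the strategies described above.
    """
    plan = {dev: [] for dev in devices}
    for i, unit in enumerate(units):
        plan[devices[i % len(devices)]].append(unit)
    return plan
```

For example, deploying three network units onto two chips gives one chip two units and the other one unit; in a real system, each entry of the plan would be sent to its device as a configuration instruction.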

It should be noted that the acceleration processing method may further include other steps according to actual application needs, for example, a step of acquiring input data, for example, a step of loading the input data, for example, a step of outputting data after acceleration processing, and so on, which are not described herein again.

Through steps S902-S904, after a neural network is deployed on the bridge-based acceleration device interconnection system, accelerated processing (i.e., accelerated inference operation) can be supported not only for large-scale or even ultra-large-scale neural networks, but also for one or more neural networks without limitation. Thus, even if the acceleration devices are based on small-computing-power acceleration processors, a good technical basis is provided for the inference operation of large-scale neural network models, interconnection and data communication among the devices during accelerated processing are well supported, and the ability of existing processors, terminals, edge computing and other devices to deploy and apply neural network models is improved.

It should be noted that, for other complex neural network models, a plurality of NPU topology connection relationships and inter-chip communication modes can be implemented by configuring a bridge, which is not described herein again.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the embodiments described later, since they correspond to the previous embodiments, the description is simple, and the relevant points can be referred to the partial description of the previous embodiments.

The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
