Message transmission method and message transmission device

Document No. 1956587, published 2021-12-10

Note: this technology, "Message transmission method and message transmission device," was designed and created by 林钦亮 (Lin Qinliang) and 王巧灵 (Wang Qiaoling) on 2020-06-09. Its main content is as follows: the application provides a message transmission method and a message transmission apparatus, applied to an in-network computing network that includes N computing nodes and a switch; they help reduce the switch's buffer requirements and reduce interruptions in message sending, thereby improving in-network computing performance. The method includes: a first computing node among the N computing nodes sends a first message to a second computing node among the N computing nodes through the switch, where the identifier of the first message is a first identifier; the first computing node receives a second message from the switch, where the identifier of the second message is the first identifier and the second message is the aggregation result of the messages identified by the first identifier that were sent by the N computing nodes; and the first computing node sends, based on the second message, a third message to the second computing node through the switch, where the third message is the next message to be sent by the first computing node.

1. A message transmission method applied to an in-network computing network including N computing nodes and a switch, the method comprising:

a first computing node in the N computing nodes sends a first message to a second computing node in the N computing nodes through the switch, wherein the identifier of the first message is a first identifier;

the first computing node receives a second message from the switch, wherein the identifier of the second message is the first identifier, and the second message is an aggregation result of the messages which are sent by the N computing nodes and are identified as the first identifier;

the first computing node sends a third message to the second computing node through the switch based on the second message, wherein the third message is the next message to be sent by the first computing node.

2. The method of claim 1, wherein before a first computing node of the N computing nodes sends a first message to a second computing node of the N computing nodes through the switch, the method further comprises:

the first computing node sets a sliding send window, wherein the sliding send window is used to identify the messages to be sent by the first computing node;

after the first computing node receives the second message from the switch, the method further comprises:

the first computing node moves the sliding send window forward by a first length equal to the length of the first message.

3. The method of claim 2, wherein the length of the sliding send window is less than or equal to a buffer size of the switch.

4. The method of any one of claims 1 to 3, wherein the N computing nodes form a ring network.

5. A message transmission apparatus applied to an in-network computing network including N computing nodes and a switch, the apparatus comprising:

a sending unit, configured to send a first message to a second computing node in the N computing nodes through the switch, where an identifier of the first message is a first identifier;

a receiving unit, configured to receive a second message from the switch, where an identifier of the second message is the first identifier, and the second message is an aggregation result of the messages that are sent by the N computing nodes and are identified as the first identifier;

the sending unit is further configured to:

send, based on the second message, a third message to the second computing node through the switch, wherein the third message is the next message to be sent by the first computing node.

6. The apparatus of claim 5, further comprising:

a processing unit, configured to set a sliding send window before the first message is sent to the second computing node of the N computing nodes through the switch, wherein the sliding send window is used to identify the messages to be sent by the apparatus;

the processing unit is further configured to:

after receiving a second message from the switch, moving the sliding send window forward by a first length, the first length being equal to the length of the first message.

7. The apparatus of claim 6, wherein a length of the sliding send window is less than or equal to a buffer size of the switch.

8. The apparatus of any of claims 5 to 7, wherein the N computing nodes form a ring network.

9. A message transmission apparatus, comprising: a processor coupled to a memory, the memory storing a program that, when executed by the processor, causes the apparatus to perform the method of any of claims 1 to 4.

10. A computer-readable storage medium for storing a computer program comprising instructions for implementing the method of any one of claims 1 to 4.

11. A chip, comprising: a processor and an interface, the processor being configured to invoke and execute a computer program stored in a memory to perform the method of any one of claims 1 to 4.

Technical Field

The present application relates to the field of distributed computing, and in particular, to a message transmission method and a message transmission apparatus.

Background

More and more network applications rely on large-scale computing, such as artificial intelligence, the internet of things, and cloud computing. Large-scale computation is infeasible on a single node and can only be achieved through distributed computing: the cooperative processing of multiple nodes saves overall computation time and improves computing efficiency, thereby enabling high-performance computing.

Since distributed computing involves multiple computing nodes, network transmission becomes a performance bottleneck when distributed applications involve large data transfers. Taking artificial intelligence (AI) distributed training as an example, training a model requires a computing node to perform a large number of repeated training computations on input data (millions of iterations or even more), and each training computation can involve up to 500 MB of data transfer (the amount varies by model). As a result, the network communication time during model training far exceeds the actual training time. In this regard, compressing network transmission time and offloading part of the distributed computation to a network device (e.g., a router or a switch) can improve the performance of distributed computing. This approach is referred to as in-network computing.

In current in-network computing schemes, message acknowledgement takes too long, which raises the switch's buffer requirements and makes message sending prone to interruption.

Disclosure of Invention

The application provides a message transmission method and a message transmission apparatus, which help reduce the switch's buffer requirements and reduce interruptions in message sending, thereby improving in-network computing performance.

In a first aspect, a message transmission method is provided, applied to an in-network computing network including N computing nodes and a switch. The method includes: a first computing node among the N computing nodes sends a first message to a second computing node among the N computing nodes through the switch, where the identifier of the first message is a first identifier; the first computing node receives a second message from the switch, where the identifier of the second message is the first identifier and the second message is the aggregation result of the messages identified by the first identifier that were sent by the N computing nodes; and the first computing node sends a third message to the second computing node through the switch based on the second message, where the third message is the next message to be sent by the first computing node.

"the first computing node sends a third message to the second computing node through the switch based on the second message" may be replaced with: "the first computing node sends a third message to the second computing node through the switch after receiving a second message from the switch. In other words, the first computing node, upon receiving the second message, may send a third message, the second message being used to trigger sending of the third message.

With this message transmission method, the second message returned by the switch serves as the ACK of the first message sent by the first computing node; the second computing node does not need to return an ACK to the first computing node for confirmation. This shortens the message acknowledgement delay, which reduces the switch's buffer requirements, makes message sending smoother, and reduces sending interruptions. In addition, the send-and-acknowledge mechanism is simple, efficient, and easy to deploy.

It should be understood that the first computing node is a sender (also called a sending node) and the second computing node is a receiver (also called a receiving node). The switch receives the messages sent by each of the N computing nodes and places them in a buffer. Once it has received the messages identified by the first identifier from all N computing nodes, the switch performs the aggregation computation on those messages and sends the aggregation result (i.e., the second message) to every computing node. After the switch has processed all messages identified by the first identifier, it clears the corresponding buffer entries, and the first computing node sends its next message. If the first computing node does not receive an ACK for the first message identified by the first identifier, it does not send its next message and the switch does not clear its buffer: the switch must receive the messages with the same identifier from all N computing nodes before it can aggregate, and otherwise keeps waiting.
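As a rough illustration of this flow, the following Python sketch shows the node-side loop with a send window of one message for simplicity; `send_to_switch` and `recv_from_switch` are hypothetical transport helpers, not part of the claimed method.

```python
# Minimal sketch, assuming a send window of one message: the switch's
# aggregation result (same identifier) doubles as the ACK that triggers
# the next send. Transport helpers are hypothetical.

def node_send_loop(payloads, send_to_switch, recv_from_switch):
    for ident, payload in enumerate(payloads):
        # Send the message identified by `ident` (the "first message").
        send_to_switch({"id": ident, "data": payload})
        # Wait for the aggregation result with the same identifier
        # (the "second message"); no receiver-side ACK is needed.
        agg = recv_from_switch()
        assert agg["id"] == ident
        # The next loop iteration sends the next message (the "third
        # message"), triggered by the aggregation result just received.
```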

The embodiments of the present application may be applied to an in-network computing network based on message communication, for example, one based on a mainstream communication library such as the Message Passing Interface (MPI) or the NVIDIA Collective Communications Library (NCCL).

With reference to the first aspect, in certain implementations of the first aspect, before the first computing node of the N computing nodes sends the first message to the second computing node of the N computing nodes through the switch, the method further includes: the first computing node sets a sliding send window, where the sliding send window is used to identify the messages to be sent by the first computing node. After the first computing node receives the second message from the switch, the method further includes: the first computing node moves the sliding send window forward by a first length, the first length being equal to the length of the first message.

It should be understood that, for example, if the first identifier is an integer i, the identifier of the third message is i + j, where j equals the length of the sliding send window and j is a positive integer.

With reference to the first aspect, in certain implementations of the first aspect, the length of the sliding send window is less than or equal to the buffer size of the switch. This avoids overflow of the switch's buffer.

With reference to the first aspect, in certain implementations of the first aspect, the N computing nodes form a ring network.

In a second aspect, a message transmission apparatus is provided for performing the method in any one of the possible implementation manners of the first aspect. In particular, the apparatus comprises means for performing the method of any one of the possible implementations of the first aspect described above.

In a third aspect, there is provided another message transmission apparatus, including a processor, coupled to a memory, and configured to execute instructions in the memory to implement the method in any one of the possible implementations of the first aspect. Optionally, the apparatus further comprises a memory. Optionally, the apparatus further comprises a communication interface, the processor being coupled to the communication interface.

In one implementation, the message transmitting device is a computing node. When the messaging device is a computing node, the communication interface may be a transceiver, or an input/output interface.

In another implementation, the message transmission device is a chip configured in the computing node. When the message transmission device is a chip configured in a computing node, the communication interface may be an input/output interface.

In a fourth aspect, a processor is provided, comprising an input circuit, an output circuit, and a processing circuit. The processing circuit is configured to receive a signal via the input circuit and transmit a signal via the output circuit, so that the processor performs the method in any one of the possible implementations of the first aspect.

In a specific implementation, the processor may be a chip, the input circuit may be an input pin, the output circuit may be an output pin, and the processing circuit may be a transistor, a gate circuit, a flip-flop, various logic circuits, or the like. The input signal received by the input circuit may be received and input by, for example and without limitation, a receiver; the signal output by the output circuit may be output to and transmitted by, for example and without limitation, a transmitter; and the input circuit and the output circuit may be the same circuit, acting as the input circuit and the output circuit at different times. The embodiments of the present application do not limit the specific implementations of the processor and the various circuits.

In a fifth aspect, a processing apparatus is provided that includes a processor and a memory. The processor is configured to read instructions stored in the memory, and may receive signals via the receiver and transmit signals via the transmitter to perform the method of any one of the possible implementations of the first aspect.

Optionally, there are one or more processors and one or more memories.

Alternatively, the memory may be integrated with the processor, or provided separately from the processor.

In a specific implementation, the memory may be a non-transitory memory, such as a read-only memory (ROM); the memory and the processor may be integrated on the same chip or disposed separately on different chips.

It will be appreciated that the associated data interaction processes, for example sending indication information, may be a process of outputting the indication information from the processor, and receiving capability information may be a process of the processor receiving the input capability information. Specifically, data output by the processor may be output to a transmitter, and input data received by the processor may come from a receiver. The transmitter and the receiver may be collectively referred to as a transceiver.

The processing apparatus in the fifth aspect may be a chip. The processor may be implemented by hardware or software: when implemented by hardware, the processor may be a logic circuit, an integrated circuit, or the like; when implemented by software, the processor may be a general-purpose processor that reads software code stored in a memory, and the memory may be integrated with the processor or located outside the processor as a stand-alone component.

In a sixth aspect, a computer program product is provided, comprising a computer program (which may also be referred to as code or instructions) that, when executed, causes a computer to perform the method in any one of the possible implementations of the first aspect.

In a seventh aspect, a computer-readable storage medium is provided, which stores a computer program (which may also be referred to as code or instructions) that, when executed on a computer, causes the computer to perform the method in any of the possible implementations of the first aspect.

In an eighth aspect, a message transmission system is provided, which includes the aforementioned computing node.

Drawings

Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present application.

Fig. 2 is a schematic diagram of another system architecture according to an embodiment of the present application.

Fig. 3 is a schematic flow chart of a message transmission method according to an embodiment of the present application.

Fig. 4 is a schematic diagram of a sliding send window according to an embodiment of the present application.

Fig. 5 is a schematic block diagram of a message transmission apparatus according to an embodiment of the present application.

Fig. 6 is a schematic block diagram of another message transmission apparatus according to an embodiment of the present application.

Detailed Description

The technical solution in the present application will be described below with reference to the accompanying drawings.

More and more network applications rely on large-scale computing, such as artificial intelligence, the internet of things, and cloud computing. Large-scale computation is infeasible on a single node and can only be achieved through distributed computing: the cooperative processing of multiple nodes saves overall computation time and improves computing efficiency, thereby enabling high-performance computing. Distributed computing refers to the use of multiple computing nodes (e.g., servers or workstations) to provide coordinated computing for an application.

Since distributed computing involves multiple computing nodes, network transmission becomes a performance bottleneck when distributed applications involve large data transfers. Taking artificial intelligence (AI) distributed training as an example, training a model requires a computing node to perform a large number of repeated training computations on input data (millions of iterations or even more), and each training computation can involve up to 500 MB of data transfer (the amount varies by model). As a result, the network communication time during model training far exceeds the actual training time; for some models, network communication accounts for more than 90% of the total training time. In this regard, compressing network transmission time and offloading part of the distributed computation to a network device (e.g., a router or a switch) can improve the performance of distributed computing. The offloaded computation may be, for example, addition, subtraction, minimum, or maximum, which is not limited here.

This type of computation is referred to as in-network computing. In-network computing makes full use of network computing resources to accelerate distributed computing. Specifically, it can take over part of the key computation from the distributed computing nodes, and it can provide aggregation computation for them, combining multiple pieces of data into one, thereby reducing network bandwidth usage and accelerating network transmission.

A switch in an in-network computing scenario may be referred to as an in-network computing switch. In addition to traditional switch functions, an in-network computing switch has a degree of programmability and caching capability: it can cache data and perform computations (such as addition and subtraction). For convenience, the in-network computing switch is hereinafter referred to simply as a "switch".

Current in-network computing scenarios mainly include AI distributed training, MapReduce, distributed databases, and the like. In-network computing oriented toward aggregation computation (data reduction) is an important application scenario; it performs aggregation computation on data streams, thereby compressing data and reducing the amount of network transmission. The principle of aggregation computation is to aggregate multiple pieces of data sent by multiple computing nodes and output only a single aggregated result.

Fig. 1 shows a schematic diagram of a system architecture 100 of an embodiment of the present application. The system architecture 100 includes a parameter server (PS) 110, a switch 120, a computing node 130, a computing node 140, and a computing node 150.

In a non-in-network-computing scenario, the computing node 130, the computing node 140, and the computing node 150 each send training results to the parameter server 110 through the switch 120; the parameter server 110 stores the training results, aggregates the results trained by the computing nodes in each iteration, and sends the aggregated results back to the computing nodes. Specifically, each training iteration (an AI training task typically involves on the order of hundreds of thousands or even millions of training iterations) includes the following steps:

1. each computing node sends its trained model results (hereinafter referred to as training results, including, for example, parameters and gradients) to the parameter server 110;

2. the parameter server 110 receives the training results sent by each computing node, performs aggregation computation on them (for example, summing and averaging the parameters and gradients), and updates the AI model accordingly;

3. the parameter server sends the updated model parameters to all computing nodes.

It can be seen that after each training iteration, the computing nodes 130, 140, and 150 in fig. 1 each send their training results to the parameter server; assuming the data amount of each computing node's training result is n, the total data amount received by the parameter server is 3n.

In the in-network computing scenario, the switch 120 is an in-network computing switch. While the three computing nodes send their training results to the parameter server 110, the switch 120 performs aggregation computation on the data as it passes through, so the data received by the parameter server 110 is the result already aggregated by the switch 120. Illustratively, in fig. 1, the parameters sent by the computing node 130 are 1, 2, 3 … in sequence, the parameters sent by the computing node 140 are 4, 5, 6 … in sequence, and the parameters sent by the computing node 150 are 7, 8, 9 … in sequence, so the switch 120 sends the averages of these results, 4, 5, 6 …, to the parameter server 110. The switch 120 thus compresses three pieces of data into one, which can greatly reduce the data transmission amount in a large-scale AI training scenario (which may involve hundreds or even thousands of computing nodes), thereby improving network transmission efficiency and overall training performance.
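As a quick check of the arithmetic in this example, a minimal sketch:

```python
# Reproducing the averaging example above: the switch aggregates the
# parameters from the three computing nodes element by element.
node_130 = [1, 2, 3]
node_140 = [4, 5, 6]
node_150 = [7, 8, 9]

aggregated = [sum(col) / 3 for col in zip(node_130, node_140, node_150)]
print(aggregated)  # [4.0, 5.0, 6.0] -- one result stream instead of three
```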

Fig. 2 shows a schematic diagram of another system architecture 200 of an embodiment of the present application. The system architecture 200 includes a switch 210, a computing node 220, a computing node 230, and a computing node 240. In the system architecture 200, information is transmitted in a ring, as indicated by the arrows in fig. 2: computing node 220 -> computing node 230 -> computing node 240 -> computing node 220. When one computing node sends a message to another computing node through the switch 210, the switch 210 intercepts the messages of all computing nodes, performs aggregation computation, and returns the aggregation result to all computing nodes; the computing nodes still maintain ring-based one-to-one communication.

The basic flow of aggregation-oriented in-network computing is described in detail below with reference to fig. 1, and mainly includes the following steps:

1. the computing nodes 130, 140, and 150 synchronously send messages carrying the data to be computed (hereinafter referred to as data messages) to the parameter server 110 through the switch 120, and the messages are numbered with index values;

2. the switch 120 receives the data messages, stores them in its buffer, and performs aggregation computation on the parameters in data messages with the same index value;

3. after all data packets with a given index value sent by all computing nodes have been computed, the switch 120 broadcasts the aggregation result for that index value to all computing nodes in packet form.

It should be noted that by numbering the data packets sent by each computing node with an index, the switch 120 can distinguish the packets sent by different computing nodes during aggregation and aggregate packets that come from different nodes but carry the same index value. In addition, the switch 120 needs a buffer to cache data: before aggregation, it caches the data packets already received from some computing nodes and waits for the remaining computing nodes to send packets with the same index value. The switch 120 cannot perform the aggregation computation until it has received the packets with that index value from all computing nodes.
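The switch-side behavior just described can be sketched as follows. The function and variable names are hypothetical, payloads are assumed to be equal-length lists of numbers, and averaging is assumed as the aggregation operation, as in the earlier example.

```python
# Hypothetical sketch of the switch-side logic: packets are buffered per
# index value, and aggregation runs only once packets with that index
# have arrived from all N computing nodes.

N = 3                                # number of computing nodes
buffers = {}                         # index value -> list of buffered payloads

def on_data_packet(index, payload, broadcast):
    buffers.setdefault(index, []).append(payload)
    if len(buffers[index]) < N:
        return                       # keep waiting for the other nodes
    payloads = buffers.pop(index)    # all N arrived: aggregate, clear entry
    result = [sum(col) / N for col in zip(*payloads)]
    broadcast({"index": index, "data": result})   # one aggregated packet out
```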

The above process is also applicable to the system architecture 200 shown in fig. 2, and is not described herein again.

Because the buffer's capacity is limited, the buffer must be reused: after a computation is completed, the switch clears the computed data messages from the buffer. The switch typically cleans the cache in a first-in-first-out manner.

In summary, in-network computing has the following characteristics:

1. the switch performs aggregation computation on the data messages with the same index value sent by the computing nodes, and it sends out an aggregation result only after the data messages with the same index value from all computing nodes have been computed; otherwise packet loss occurs;

2. the switch buffers received data messages in its buffer in a first-in-first-out manner; when an aggregation is complete and a new data message arrives, the switch evicts the oldest buffered message to make room for the newly received one;

3. each computing node maintains its own send window indicating the data messages to be sent, and the size of the send window is limited by the size of the switch's buffer; that is, the send window of each computing node must not exceed the switch's buffer size, or the switch's buffer will overflow.

For example, taking fig. 1 above, suppose the "1" sent by the computing node 130 and the "4" sent by the computing node 140 have reached the switch 120, and the switch 120 is waiting for the computing node 150 to send "7"; the switch 120 cannot perform the aggregation computation and send the aggregation result "4" to the parameter server 110 until it receives "7". However, because the buffering of the switch 120 is limited, if the computing node 150 never sends its data, the computing node 130 and the computing node 140 must stop sending other data before the switch's buffer overflows; for example, the computing node 130 cannot continue to send "2" and "3", and the computing node 140 cannot continue to send "5" and "6". Therefore, each computing node should adopt a send window smaller than or equal to the switch's buffer size: each time it receives an acknowledgement message (ACK) from the parameter server 110 acting as the receiving end, the computing node slides its send window forward, and it stops sending when the window has nothing left to send.
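The conventional window behavior described here can be sketched as follows; `send` and `recv_ack` are hypothetical helpers, and the slot counts are illustrative only.

```python
# Sketch of the conventional scheme: the send window must not exceed the
# switch's buffer size, and it slides forward only when an ACK from the
# receiving end arrives (one RTT per message, as discussed below).

SWITCH_BUFFER_SLOTS = 8
WINDOW = 8                        # must satisfy WINDOW <= SWITCH_BUFFER_SLOTS

def conventional_sender(messages, send, recv_ack):
    base = 0                      # index of the oldest unacknowledged message
    next_idx = 0                  # index of the next message to send
    while base < len(messages):
        # Send while the window still has room.
        while next_idx < min(base + WINDOW, len(messages)):
            send(next_idx, messages[next_idx])
            next_idx += 1
        # Window exhausted: block until the receiver acknowledges a
        # message (recv_ack returns its index, costing one RTT).
        acked = recv_ack()
        base = max(base, acked + 1)   # slide the window forward
```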

In an in-network computing network, from the perspective of the host-side communication protocol, two categories can be distinguished: communication based on a reliable connection protocol and communication based on an unreliable connection protocol.

In the communication method based on an unreliable connection protocol, the host side achieves one-to-many communication using a connectionless protocol (for example, UDP); the host side is responsible only for simple flow control such as packet-loss retransmission and does not perform other, more complex flow-control actions such as congestion control. This communication method is therefore not mainstream and is generally used only in research.

In the communication method based on a reliable connection protocol, the host side uses a one-to-one connection-oriented communication protocol, such as the Transmission Control Protocol (TCP) or reliable Remote Direct Memory Access (RDMA). Illustratively, in fig. 1, the computing nodes 130, 140, and 150 each establish one-to-one communication with the parameter server 110. The switch 120 intercepts the data in transit to perform aggregation computation and, after the computation is complete, puts the aggregation result into a message and sends it to the parameter server 110 (i.e., the destination host). This makes host-side deployment easy, and simple modifications can be made directly to native application scenarios (e.g., TensorFlow in AI distributed training scenarios). In current and future in-network computing scenarios, high-performance RDMA network cards are mainstream and can process data-stream messages at high speed in hardware, so communication based on a reliable connection protocol has become the main communication mode for in-network computing.

Information transmission in current high-performance networks is basically message-based, i.e., the content to be sent or received is encapsulated in a message. If an RDMA network card is used, it automatically packetizes and sends the message in hardware, and the receiving network card automatically reassembles the packets into a message in hardware and delivers it to host memory. The message-based transmission method is therefore the mainstream in in-network computing networks.

Current in-network computing lacks a message acknowledgement method suited to the transmission characteristics of in-network computing networks. Existing acknowledgement methods fall into two types: a send-and-acknowledge scheme based on User Datagram Protocol (UDP) packets, and a send-and-acknowledge scheme based on RDMA messages.

The UDP-based send-and-acknowledge scheme cannot support the RDMA protocol (in which the message is the unit of transmission). In the RDMA message-based scheme, as described above, the computing node sends messages through a sliding send window, and the window slides forward when an ACK is received (i.e., when the receiving end replies with an acknowledgement for a received message). Briefly, after a node A sends a message, the message travels from node A to a node B over the network, and node B then replies with an ACK that travels from node B back to node A; the whole process takes one round-trip time (RTT). Thus, node A needs one RTT to have a packet acknowledged.

However, in in-network computing, as noted in the characteristics above, the switch clears its buffer only when a message has been fully processed (i.e., the data packets with the same index value have been aggregated). If a computing node does not send a new message, the switch does not clear its buffer, because the switch must receive the packets with the same index from all computing nodes before it can aggregate, and otherwise keeps waiting.

In the existing scheme, a data message sent by a computing node takes one RTT to be acknowledged, which raises the switch's buffer requirements. The switch's buffer is limited, and the long acknowledgement time prevents buffered messages from being cleared promptly, so the send window of the sending computing node cannot slide; sending is therefore easily interrupted and proceeds unevenly.

In view of this, embodiments of the present application provide a new message transmission method and a new message transmission apparatus that shorten the message acknowledgement time, helping to reduce the switch's buffer requirements and reduce message-sending interruptions, thereby improving in-network computing performance.

Before describing the method provided by the embodiments of the present application, the following description is made.

First, in the embodiments of the present application, "pre-obtaining" may be implemented by saving in advance, in a device (e.g., a computing node), a corresponding code, table, or other means usable to indicate related information; the present application does not limit the specific implementation.

Second, in the embodiments shown below, terms and abbreviations such as "sliding send window" and "aggregation computation" are examples given for convenience of description and should not limit the present application in any way. This application does not exclude the possibility that other terms performing the same or similar functions may be defined in existing or future protocols.

Third, "first," "second," and the various numerical designations in the embodiments shown below are merely for convenience of description, e.g., to distinguish different messages or different lengths, and are not intended to limit the scope of the embodiments of the present application.

Fourth, "at least one" means one or more, "at least two" and "a plurality" means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone, wherein A and B can be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of the singular or plural items. For example, at least one (one) of a, b, and c, may represent: a, or b, or c, or a and b, or a and c, or b and c, or a, b and c, wherein a, b and c can be single or multiple.

The message transmission method provided by the present application is described in detail below with reference to the accompanying drawings. Fig. 3 shows a schematic flow chart of a message transmission method 300 provided by an embodiment of the present application. The method may be applied to an in-network computing network including N computing nodes and a switch, such as the system architecture 100 shown in fig. 1 or the system architecture 200 shown in fig. 2, which is not limited in the embodiments of the present application. The method 300 may be performed by any one of the N computing nodes, for example by any computing node in the system architecture 100 or 200 described above, hereinafter referred to as the first computing node. The method 300 specifically includes the following steps:

s310, a first computing node sends a first message to a second computing node in the N computing nodes through a switch, where an identifier of the first message is a first identifier, and the first identifier may be a numeric value, an alphabet, or other characters, which is not limited in this embodiment of the present application. Illustratively, the first identifier may be i, which is an integer.

It should be understood that the first computing node is a sender (alternatively referred to as a sender node) and the second computing node is a receiver (alternatively referred to as a receiver node). The first identifier is i, that is, the index value of the message is i.

Illustratively, in the system architecture 100 shown in fig. 1, the first computing node may be any one of the computing nodes 130, 140, 150, and the second computing node may be the parameter server 110.

Illustratively, in the system architecture 200 shown in FIG. 2, the first computing node may be computing node 220, the second computing node may be computing node 230; alternatively, the first computing node may be computing node 230 and the second computing node may be computing node 240; alternatively, the first computing node may be computing node 240 and the second computing node may be computing node 220.

It should be understood that the other computing nodes among the N computing nodes perform the same steps as the first computing node, i.e., they also send messages identified by the first identifier through the switch; these are not enumerated again below.

The switch receives the messages sent by each of the N computing nodes and places them in a buffer. Once it has received the messages identified by the first identifier from all N computing nodes, the switch performs aggregation computation on them and sends the aggregation result (i.e., the second message below) to each computing node.

S320, the first computing node receives a second message from the switch, where the identifier of the second message is the first identifier, and the second message is an aggregation result of the messages sent by the N computing nodes and identified as the first identifier.

Illustratively, in the system architecture 100 shown in fig. 1, the parameters carried by the messages sent by the computing node 130 are, in sequence, 1 (identified as i), 2 (identified as i+1), and 3 (identified as i+2); those sent by the computing node 140 are 4 (identified as i), 5 (identified as i+1), and 6 (identified as i+2); and those sent by the computing node 150 are 7 (identified as i), 8 (identified as i+1), and 9 (identified as i+2). The second messages returned by the switch 120 are then, in sequence: 4 (identified as i), 5 (identified as i+1), and 6 (identified as i+2).

It should be understood that after the switch has computed all messages identified by the first identifier, it clears the cache, and the first computing node sends its next message. If the first computing node does not receive an ACK for the first message identified by the first identifier, it does not send its next message and the switch does not clear the cache: the switch must receive the messages with the same identifier from all N computing nodes before it can aggregate, and otherwise keeps waiting.

Optionally, the second message is sent by the switch to all computing nodes (i.e., the N computing nodes). Illustratively, the switch may broadcast the second message to the N computing nodes; alternatively, the switch may send a separate second message to each of the N computing nodes, where the header of each second message may carry the Internet Protocol (IP) address of the corresponding computing node, and the message body sent to each computing node is the same.

S330, the first computing node sends a third message to the second computing node through the switch based on the second message, where the third message is a next message to be sent by the first computing node. In other words, the first computing node, upon receiving the second message, may send a third message, the second message being used to trigger sending of the third message.

The embodiments of the present application use the second message from the switch, identified by the first identifier, as the ACK of the first message identified by the first identifier. The first computing node's receipt of the second message indicates that the switch has completed the aggregation computation and no longer needs to cache the messages identified by the first identifier from the computing nodes; the first computing node can therefore send its next message after receiving the second message.

With this message transmission method, the second message returned by the switch serves as the ACK of the first message sent by the first computing node; the second computing node does not need to return an ACK to the first computing node for confirmation. This shortens the message acknowledgement delay, which reduces the switch's buffer requirements, makes message sending smoother, and reduces sending interruptions. In addition, the send-and-acknowledge mechanism is simple, efficient, and easy to deploy.

The embodiments of the present application may be applied to an in-network computing network based on message communication, for example, one based on a mainstream communication library such as the Message Passing Interface (MPI) or the NVIDIA Collective Communications Library (NCCL).

As an optional embodiment, before the first computing node of the N computing nodes sends the first message to the second computing node of the N computing nodes through the switch, the method further includes: the first computing node sets a sliding send window, where the sliding send window is used to identify the messages to be sent by the first computing node. After the first computing node receives the second message from the switch, the method further includes: the first computing node moves the sliding send window forward by a first length, the first length being equal to the length of the first message.

It should be understood that the length of the sliding send window is less than or equal to the buffer size of the switch. Illustratively, if the first identifier is i, the identifier of the third message is i + j, where j equals the length of the sliding send window and j is a positive integer.

Fig. 4 shows a schematic diagram of a sliding send window in an embodiment of the present application. As shown in fig. 4, the length of the sliding send window equals 3, and the first computing node sends message i (the first message), message i+1, and message i+2 in sequence through the switch; the switch buffers message i in store 0 of its buffer, message i+1 in store 1, and message i+2 in store 2. If the first computing node has not received the aggregation result returned by the switch (the second message, i.e., the ACK of message i), the window cannot slide, and the first computing node may suspend sending. If the first computing node receives the second message, it slides the window forward, and the new window covers message i+1, message i+2, and message i+3. Since message i+1 and message i+2 have already been sent, the first computing node's next message to send is message i+3, and it may continue by sending message i+3 (the third message above). At this point the switch has cleared store 0 and can buffer the received message i+3 there.
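The fig. 4 walkthrough can be sketched as follows, assuming messages of equal length so the window slides one slot per aggregation result; the class and the `send` callback are hypothetical, not part of the embodiment.

```python
# Sketch of the fig. 4 scenario under the proposed scheme: window length
# 3, with the switch's aggregation result for message i acting as the
# ACK that slides the window and releases message i+3.

from collections import deque

class SlidingSender:
    WINDOW = 3

    def __init__(self, first_id, send):
        self.send = send
        self.next_id = first_id
        self.in_flight = deque()   # identifiers awaiting aggregation results
        self._fill()               # sends messages i, i+1, i+2

    def _fill(self):
        while len(self.in_flight) < self.WINDOW:
            self.send(self.next_id)
            self.in_flight.append(self.next_id)
            self.next_id += 1      # sending pauses once the window is full

    def on_aggregation_result(self, agg_id):
        assert self.in_flight[0] == agg_id   # aggregation result = ACK for i
        self.in_flight.popleft()             # switch has cleared store 0
        self._fill()                         # window slides: i+3 goes out
```

For example, constructing `SlidingSender(first_id=0, send=print)` immediately prints 0, 1, and 2, and calling `on_aggregation_result(0)` then prints 3, mirroring the fig. 4 sequence.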

As an alternative embodiment, the N computing nodes form a ring network. It should be understood that the ring network may be the system architecture 200 shown in fig. 2, and will not be described herein.

It should be understood that the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.

The message transmission method according to the embodiment of the present application is described in detail above with reference to fig. 1 to 4, and the message transmission apparatus according to the embodiment of the present application is described in detail below with reference to fig. 5 to 6.

Fig. 5 shows a message transmission apparatus 500 provided in an embodiment of the present application. The apparatus 500 includes a sending unit 510 and a receiving unit 520.

The sending unit 510 is configured to send a first message to a second computing node among the N computing nodes through the switch, where the identifier of the first message is a first identifier. The receiving unit 520 is configured to receive a second message from the switch, where the identifier of the second message is the first identifier, and the second message is the aggregation result of the messages identified by the first identifier sent by the N computing nodes. The sending unit 510 is further configured to send, based on the second message, a third message to the second computing node through the switch, where the third message is the next message to be sent by the first computing node.

Optionally, the apparatus further includes a processing unit, configured to set a sliding send window before the first message is sent to the second computing node among the N computing nodes through the switch, where the sliding send window is used to identify the messages to be sent by the apparatus. The processing unit is further configured to: after the second message is received from the switch, move the sliding send window forward by a first length, the first length being equal to the length of the first message.

Optionally, the length of the sliding send window is smaller than or equal to the buffer size of the switch.

Optionally, the N computing nodes form a ring network.

It should be appreciated that the apparatus 500 here is embodied in the form of functional units. The term "unit" herein may refer to an application-specific integrated circuit (ASIC), an electronic circuit, a processor (e.g., a shared, dedicated, or group processor) and memory that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that support the described functionality. In an alternative example, those skilled in the art will understand that the apparatus 500 may be the computing node in the foregoing embodiments, or the functions of that computing node may be integrated in the apparatus 500; the apparatus 500 may be configured to execute the processes and/or steps corresponding to the computing node in the foregoing method embodiments, which are not repeated here to avoid repetition.

The apparatus 500 has the functions of implementing the corresponding steps executed by the computing node in the above method. The functions may be implemented by hardware, or by hardware executing corresponding software; the hardware or software includes one or more modules corresponding to the functions described above. For example, the sending unit 510 may be a communication interface, such as a transceiver interface.

In an embodiment of the present application, the apparatus 500 in fig. 5 may also be a chip or a chip system, for example a system on chip (SoC). Correspondingly, the sending unit 510 may be a transceiver circuit of the chip, which is not limited here.

Fig. 6 shows another message transmission apparatus 600 provided in the embodiment of the present application. The apparatus 600 includes a processor 610, a transceiver 620, and a memory 630. Wherein the processor 610, the transceiver 620 and the memory 630 are in communication with each other through an internal connection path, the memory 630 is used for storing instructions, and the processor 610 is used for executing the instructions stored in the memory 630 to control the transceiver 620 to transmit and/or receive signals.

Wherein the transceiver 620 is configured to: sending a first message to a second computing node in the N computing nodes through the switch, wherein the identifier of the first message is a first identifier; receiving a second message from the switch, wherein the identifier of the second message is a first identifier, and the second message is an aggregation result of the messages which are sent by the N computing nodes and are identified as the first identifier; and sending a third message to the second computing node through the switch, wherein the third message is a message to be sent next by the first computing node.

Optionally, the processor 610 is configured to: before sending a first message to a second computing node of the N computing nodes through the switch, setting a sliding sending window, wherein the sliding sending window is used for identifying a message to be sent of the device; after receiving a second message from the switch, moving the sliding send window forward by a first length, the first length being equal to the length of the first message.

Optionally, the length of the sliding send window is smaller than or equal to the buffer size of the switch.

Optionally, the N computing nodes form a ring network.

It should be understood that the apparatus 600 may be the computing node in the foregoing embodiments, or the functions of that computing node may be integrated in the apparatus 600, and the apparatus 600 may be configured to execute the steps and/or flows corresponding to the computing node in the foregoing method embodiments. The memory 630 may include read-only memory and random access memory, and provides instructions and data to the processor. A portion of the memory may also include non-volatile random access memory; for example, the memory may also store device-type information. The processor 610 may be configured to execute the instructions stored in the memory, and when the processor 610 executes the instructions, it may perform the steps and/or flows corresponding to the computing node in the above method embodiments.

It should be understood that, in the embodiments of the present application, the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or any conventional processor.

In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The steps of the method disclosed in the embodiments of the present application may be directly performed by a hardware processor, or performed by a combination of hardware and software modules in the processor. The software module may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor executes the instructions in the memory and completes the steps of the above method in combination with its hardware. To avoid repetition, details are not described here.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the part of the technical solution of the present application that substantially contributes to the prior art, or portions of the technical solution, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
