Method and system for accelerating collective communication with RDMA communication

Document No.: 153079 Publication date: 2021-10-26

Reading note: this technology, "A method and system for accelerating collective communication with RDMA communication", was designed by 谭光明, 马潇潇, 朱泓睿, 王展, 元国军 and 安学军 on 2021-07-30. The invention relates to the technical field of communication, and in particular to a method and system for accelerating collective communication with RDMA communication. The method comprises: S3, a network device obtains the communication metadata of a WR or network configuration packet and determines whether the WR or network configuration packet is a collective communication offload communication; if so, step S4 is executed; S4, the network device obtains network data and processes the collective communication offload task on the network device using a fine-grained buffer management mechanism; S6, if the current task completes, the application issues a query request, or the task encounters an exception, the network device feeds the CQE corresponding to the WR or network configuration packet back to the host or application. The scheme accelerates collective communication with RDMA communication efficiently and in an easy-to-use way, addressing the low efficiency of existing collective communication offload.

1. A method for accelerating collective communication with RDMA communication, characterized in that the method comprises the following steps:

S3, a network device obtains the communication metadata of a WR or network configuration packet and determines whether the WR or network configuration packet is a collective communication offload communication; if so, step S4 is executed;

S4, the network device obtains network data and processes the collective communication offload task on the network device using a fine-grained buffer management mechanism; after obtaining the WR and before obtaining the network data, the network device configures the collective communication offload;

S6, if the current task completes, the application issues a query request, or the task encounters an exception, the network device feeds the CQE corresponding to the WR or network configuration packet back to the host or application.

2. The method for accelerating collective communication with RDMA communication according to claim 1, characterized in that S3 further comprises: if not, executing S5;

S5, the network device obtains the network data and performs ordinary network communication processing on it.

Before S3, the method further comprises:

S1, an RDMA communication operation is performed in the application, generating a data communication requirement;

S2, the application issues a WR or network configuration packet containing communication metadata to the network device, wherein the communication metadata contains the operation type, doorbell information, and the virtual address and length of the request queue;

after S6, the method further comprises:

S7, checking whether all communication operations contained in this communication have completed; if so, executing S8; if not, executing S2;

S8, execution of the communication task ends.

3. The method for accelerating collective communication with RDMA communication according to claim 2, characterized in that S4 specifically comprises:

S401, parsing the configuration information of the collective communication offload;

S402, initiating a local data read request or waiting to receive external data;

S403, obtaining a data packet containing network data, parsing it and determining its type; if the packet carries the final calculation result of the collective communication offload, executing S404; if it carries data still to be calculated by the collective communication offload, executing S405;

S404, broadcasting the final calculation result of the collective communication offload to each Rank in the Node that is involved in the collective communication;

S405, performing block calculation on the data to be calculated;

S406, determining whether the data of all Ranks in the Node involved in the collective communication has been calculated; if so, executing S407; if not, executing S402;

S407, if all Ranks involved in the collective communication are within the Node, broadcasting the calculation result directly within the Node; if not, encapsulating the data for which the collective communication calculation within the Node is complete and sending it to the next Node and other Nodes for further calculation;

S408, checking the completion state of the collective communication offload; if it is completed or abnormal, executing S6; if not, executing S402.

4. The method for accelerating collective communication with RDMA communication according to claim 3, characterized in that the configuration information includes important control information, and the important control information includes: an Op field, a JID field, an MID field, an SID field, a Node field, a Rank field, a Seg field, a data block length field, a collective communication execution state field, an address linked list field, and a routing information field.

5. The method for accelerating collective communication with RDMA communication according to claim 3, characterized in that the block calculation is performed per Seg i cache unit and comprises:

S4051, starting the cache calculation;

S4052, receiving Seg i' data;

S4053, detecting whether the Seg i cache unit used for calculation is empty; if so, the received Seg i' data is the first block of data on which the Seg i cache unit performs the collective communication calculation, and S4054 is executed; if not, the Seg i cache unit already holds part of the collective communication data to be calculated, and S4057 is executed;

S4054, obtaining the Op field, Node field and Rank field corresponding to the Seg i cache unit;

S4056, filling the obtained Op field, Node field and Rank field corresponding to Seg i, together with the received Seg i data, into the Seg i cache unit;

S4057, reading the data of the Seg i cache unit and extracting the Op field to determine the type of calculation operation to perform;

S4058, performing the calculation operation specified by the Op field on the Seg i data that was read and the received Seg i' data, and updating the Node collective communication completion state field and the Rank collective communication completion state field;

S4059, filling the data, the Node collective communication completion state field and the Rank collective communication completion state field into the corresponding Seg i cache unit;

S40510, checking the Node collective communication completion state field and the Rank collective communication completion state field to determine whether the Seg i cache unit of the network device has completed all collective communication data calculation that involves this network device; if so, executing S40511; if not, executing S4052 and waiting for further data related to Seg i to arrive;

S40511, encapsulating the Seg i data;

S40512, releasing the Seg i cache unit;

S40513, the network device has completed the collective communication cache calculation for Seg i.

6. A system for accelerating collective communication with RDMA communication, characterized in that the system comprises an execution control module, a data processing module group and a state updating module;

the execution control module is used for obtaining a WR or network configuration packet, parsing and storing it, and issuing the parsed collective communication offload configuration information to the data processing module group and the state updating module;

the data processing module group is used for processing the collective communication offload task on the network device using a fine-grained buffer management mechanism;

the state updating module is used for writing completion state information back to the host through the completion event queue.

7. The system for accelerating collective communication with RDMA communication according to claim 6, characterized in that the data processing module group comprises: a data cache calculation module, a data parsing and encapsulation module, and a data transmission module;

the data cache calculation module is used for performing block calculation on the data involved in the collective communication;

the data parsing and encapsulation module is used for parsing and splitting data obtained from a network port or local memory according to the configuration information issued by the execution control module, and sending it to the data cache calculation module for processing; it is also used for encapsulating the data once all collective communication calculations are complete and sending it to the data transmission module;

the data transmission module is used for reading data from the Node's local memory, writing data into the Node's local memory, sending data to an external port and receiving data from the external port; it is also used for sending the completion status of the data communication to the state updating module after each transmission finishes.

8. The system for accelerating collective communication with RDMA communication according to claim 6, characterized in that the configuration information includes important control information, and the important control information includes: an Op field, a JID field, an MID field, an SID field, a Node field, a Rank field, a Seg field, a data block length field, a collective communication execution state field, an address linked list field, and a routing information field.

9. The system for accelerating collective communication with RDMA communication according to claim 7, characterized in that the data cache calculation module comprises a calculation control submodule and a cache submodule;

the cache submodule comprises a plurality of Seg cache units, partitioned in units of Seg; each Seg cache unit caches data for the calculation control submodule, and data is stored and calculated in units of Seg;

the calculation control submodule is used for obtaining configuration information from the execution control module, reading Seg data from the cache submodule, and using the vector calculation component in the module to calculate the input data of a given Seg together with the data already held in the Seg cache unit; after the calculation it updates the corresponding Seg data bits and stores the calculated data; once all Ranks in the node that are involved in Seg j have completed the collective communication calculation, it releases the Seg j unit and passes the completed Seg j to the data parsing and encapsulation module for subsequent encapsulation.

10. The system for accelerating collective communication with RDMA communication according to claim 9, characterized in that each Seg contains a plurality of data fields, including: an Op field, a Node collective communication completion state field, a Rank collective communication completion state field and a data payload field;

the data cache calculation module performing block calculation on the data involved in the collective communication comprises performing cache calculation per Seg i cache unit, specifically:

S1, the data cache calculation module starts the cache calculation;

S2, receiving the Seg i' data sent by the data parsing and encapsulation module;

S3, the calculation control submodule detects whether the Seg i cache unit is empty; if so, the received Seg i' data is the first block of data on which the Seg i cache unit performs the collective communication calculation, and S4 is executed; if not, the Seg i cache unit already holds part of the collective communication data to be calculated, and S6 is executed;

S4, the calculation control submodule obtains from the execution control module the Op field, Node field and Rank field corresponding to the Seg i cache unit;

S5, the calculation control submodule fills the Op field, Node field and Rank field corresponding to Seg i, together with the received Seg i data sent by the data parsing and encapsulation module, into the Seg i cache unit;

S6, the calculation control submodule reads the data of the Seg i cache unit from the cache submodule and extracts the Op field to determine the type of calculation operation to perform;

S7, the calculation control submodule performs the calculation operation specified by the Op field on the Seg i data read from the cache unit and the received Seg i' data sent by the data parsing and encapsulation module, and updates the Node collective communication completion state field and the Rank collective communication completion state field;

S8, the calculation control submodule fills the data, the Node collective communication completion state field and the Rank collective communication completion state field into the corresponding Seg i cache unit;

S9, the calculation control submodule checks the Node collective communication completion state field and the Rank collective communication completion state field to determine whether the Seg i cache unit of the network device has completed all collective communication data calculation that involves this device; if so, S10 is executed; if not, S2 is executed to wait for further data related to Seg i to arrive;

S10, the calculation control submodule passes the Seg i data to the data parsing and encapsulation module for encapsulation;

S12, the calculation control submodule releases the Seg i cache unit;

S13, the network device has completed the collective communication cache calculation for Seg i.

Technical Field

The invention relates to the technical field of communication, and in particular to a method and system for accelerating collective communication with RDMA communication.

Background

Remote Direct Memory Access (RDMA) is an important data access mode for high-performance network communication. Communicating over RDMA bypasses the host operating system and reduces data copies, achieving high-bandwidth, low-latency network communication. RDMA is widely used in commercial data centers and high-performance computing clusters, and collective communication is a common communication pattern in distributed clusters.

Collective communication involves sending and receiving data among multiple communication nodes, and often involves gathering data from several computing nodes onto one computing node, as in the reduction (Reduce), global reduction (Allreduce), gather (Gather), and global gather (Allgather) operations. During this gathering the volume of data in the network rises sharply, so the intermediate gathering nodes and the root node become communication hotspots and bottlenecks. The host also spends part of its valuable computing resources performing the reduction computation on the data.

Offloading collective communication onto the RDMA communication device, and computing on the data as it passes through that device, therefore has two benefits. First, the host operating system can be fully bypassed, reducing the host's communication and computation overhead. Second, data copies and the volume of data transmitted in the network are reduced, which improves collective communication performance. Network device vendors currently expose part of the data buffers and computing units on network cards or switches for collective communication acceleration, used for data caching and computation, and provide supporting software and protocols such as Core-Direct and SHArP.

However, current collective communication acceleration schemes for RDMA network devices manage the data cache with a simple buffer, report data transmission completion at message granularity, and manage collective communication offload through a single upper-layer middleware layer. These schemes cannot meet real collective communication requirements in practice, and the following problems may occur:

(1) when messages are large, feedback on data transmission is given at the message level, so buffers are occupied for too long and buffer utilization suffers severely;

(2) coarse-grained buffer management is inefficient for data stream processing and cannot accelerate collective communication efficiently;

(3) collective communication offload is managed and invoked through middle-layer software, so the user cannot directly perceive the communication device hardware or make full use of its offload features; the hardware needs to provide the user with communication primitives and calling interfaces dedicated to collective communication offload.

Collective communication operations are especially numerous in distributed deep learning and high-performance computing applications. Distributed deep learning training requires a large number of parameter updates, which at the communication layer means many Allreduce operations (a typical, relatively complex collective) on large volumes of data. High-performance computing makes heavy use of the Message Passing Interface (MPI) and its collective communication interfaces. These collectives can significantly affect overall application performance, so a method and system for accelerating collective communication with RDMA communication is urgently needed to solve the problems of existing collective communication.

Disclosure of Invention

The first objective of the present invention is to provide an efficient and easy-to-use method for accelerating collective communication with RDMA communication, so as to solve the low efficiency of existing collective communication offload.

The invention provides basic scheme one: a method for accelerating collective communication with RDMA communication, comprising the following steps:

S3, a network device obtains the communication metadata of a WR or network configuration packet and determines whether the WR or network configuration packet is a collective communication offload communication; if so, step S4 is executed;

S4, the network device obtains network data and processes the collective communication offload task on the network device using a fine-grained buffer management mechanism; after obtaining the WR and before obtaining the network data, the network device configures the collective communication offload;

S6, if the current task completes, the application issues a query request, or the task encounters an exception, the network device feeds the CQE corresponding to the WR or network configuration packet back to the host or application.

Notes: Work Request (WR): a work request;

Completion Queue Element (CQE): a completion queue element.

The beneficial effects of basic scheme one are as follows. The network device obtains the communication metadata of the WR (work request) and determines whether the WR or network configuration packet is a collective communication offload communication. If it is, the device obtains the network data and processes the collective communication offload task using a fine-grained buffer management mechanism; after obtaining the WR and before obtaining the network data, it configures the collective communication offload. Because the network device also carries out ordinary network communication tasks besides collective communication offload, the WR or network configuration packet is checked first, and collective communication processing is triggered only when it is a collective communication offload communication, which avoids wasting network device resources.

Compared with the coarse-grained buffer management of the prior art, the fine-grained buffer management mechanism in this scheme divides the buffer into multiple small segments, each no larger than the Maximum Transmission Unit (MTU), and manages them individually. This improves stream processing capability, raises buffer utilization, and thus accelerates collective communication more efficiently.
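As a rough illustration of this segmentation, the sketch below splits a message into consecutive segments no larger than a given MTU. The function name `split_into_segments` and the MTU value of 4096 bytes are assumptions made for the example and do not come from the patent.

```python
from typing import List


def split_into_segments(message: bytes, mtu: int) -> List[bytes]:
    """Split a message into consecutive segments of at most `mtu` bytes."""
    if mtu <= 0:
        raise ValueError("mtu must be positive")
    return [message[i:i + mtu] for i in range(0, len(message), mtu)]


# A 10,000-byte message with an assumed 4096-byte MTU yields three segments.
segment_lengths = [len(s) for s in split_into_segments(bytes(10_000), mtu=4096)]
print(segment_lengths)  # [4096, 4096, 1808]
```

Each segment can then be cached, calculated and released independently, rather than holding a buffer for the whole message.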

When the current task completes, the application issues a query request, or the task encounters an exception, the network device feeds the CQE corresponding to the WR or network configuration packet back to the host or application. In the prior art, when messages are large, feedback on data transmission is given at the message level, which keeps buffers occupied for too long and severely reduces buffer utilization. This scheme instead returns the execution status of the collective communication offload to the host or application through the CQE, so the user can learn through the host or application how the offload task on the network device is progressing. When multiple tasks run concurrently, the user can accurately obtain the execution state of existing tasks and schedule tasks more reasonably, keeping the overall workload efficient. Management and invocation of collective communication offload therefore no longer need to go through middle-layer software. In the prior art, the intermediate software layer prevents the user from directly perceiving the network device hardware, so the offload features of hardware such as the network device cannot be fully exploited, and the hardware needs to provide the user with communication primitives and calling interfaces dedicated to collective communication offload. In addition, the method is effective not only in tree networks but can also be used to accelerate collective communication in direct-connect networks.

Further, S3 further comprises: if not, executing S5;

S5, the network device obtains the network data and performs ordinary network communication processing on it.

Before S3, the method further comprises:

S1, an RDMA communication operation is performed in the application, generating a data communication requirement;

S2, the application issues a WR or network configuration packet containing communication metadata to the network device, wherein the communication metadata contains the operation type, doorbell information, and the virtual address and length of the request queue;

after S6, the method further comprises:

S7, checking whether all communication operations contained in this communication have completed; if so, executing S8; if not, executing S2;

S8, execution of the communication task ends.

Beneficial effects: the network device carries out ordinary network communication tasks besides collective communication offload. The WR is checked, and collective communication offload processing is triggered only if the WR is a collective communication offload communication; otherwise the device obtains the network data and performs ordinary network communication processing on it. The network device can therefore perform collective communication offload tasks without affecting its original network communication function. If execution is not complete, step 120 is executed again, that is, the application again issues the WR containing the communication metadata to the network device (S2), ensuring that the data communication completes.
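The control flow of S2 to S8 can be sketched as follows. This is only a minimal model under stated assumptions: the classes `WorkRequest` and `CQE`, the `op_type` strings, and the function names are hypothetical stand-ins, not the ib_verbs structures or any interface defined by the patent, and the real offload and transmission work is replaced by trace entries.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WorkRequest:
    op_type: str   # hypothetical values: "collective_offload" or "ordinary"
    payload: bytes


@dataclass
class CQE:
    wr: WorkRequest
    status: str    # hypothetical values: "done" or "error"


def device_handle(wr: WorkRequest, trace: List[str]) -> CQE:
    """S3 to S6 on the network device for a single WR."""
    trace.append("S3")       # inspect the communication metadata
    if wr.op_type == "collective_offload":
        trace.append("S4")   # offload processing with fine-grained buffers
    else:
        trace.append("S5")   # ordinary network communication processing
    trace.append("S6")       # feed the CQE back to the host or application
    return CQE(wr, "done")


def run_task(wrs: List[WorkRequest]) -> Tuple[List[str], List[CQE]]:
    """S2, S7 and S8 on the host side: issue WRs until all operations finish."""
    trace: List[str] = []
    cqes: List[CQE] = []
    for wr in wrs:           # S7 loops back to S2 while operations remain
        trace.append("S2")
        cqes.append(device_handle(wr, trace))
    trace.append("S8")       # all communication operations completed
    return trace, cqes


trace, cqes = run_task([
    WorkRequest("collective_offload", b"grad"),
    WorkRequest("ordinary", b"ping"),
])
print(trace)
```

The trace shows the first WR taking the S4 branch and the second taking the S5 branch, with both producing a CQE in S6.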

Further, S4 specifically comprises:

S401, parsing the configuration information of the collective communication offload;

S402, initiating a local data read request or waiting to receive external data;

S403, obtaining a data packet containing network data, parsing it and determining its type; if the packet carries the final calculation result of the collective communication offload, executing S404; if it carries data still to be calculated by the collective communication offload, executing S405;

S404, broadcasting the final calculation result of the collective communication offload to each Rank in the Node that is involved in the collective communication;

S405, performing block calculation on the data to be calculated;

S406, determining whether the data of all Ranks in the Node involved in the collective communication has been calculated; if so, executing S407; if not, executing S402;

S407, if all Ranks involved in the collective communication are within the Node, broadcasting the calculation result directly within the Node; if not, encapsulating the data for which the collective communication calculation within the Node is complete and sending it to the next Node and other Nodes for further calculation;

S408, checking the completion state of the collective communication offload; if it is completed or abnormal, executing S6; if not, executing S402.

Notes: Node: a host node containing one or more computing and communication entities;

Rank: one of the computing and communication entities contained within a host node.

Beneficial effects: the type of each data packet obtained during collective communication offload is determined, which prevents redundant calculation. The Node information is also checked, and if some of the Ranks are not in the Node, the data is encapsulated and sent to other Nodes, ensuring that all Ranks complete the collective communication offload.
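The two decisions in S403 and S407 can be sketched as small pure functions. The packet type strings, the returned action names, and the use of Python sets for Rank membership are all assumptions chosen for readability; the patent does not specify these encodings.

```python
from typing import Set


def route_packet(kind: str) -> str:
    """S403: choose the next step from the parsed packet type."""
    if kind == "final_result":     # final result of the offloaded collective
        return "S404"              # broadcast to the involved Ranks in the Node
    if kind == "to_compute":       # data still to be reduced
        return "S405"              # block calculation
    raise ValueError(f"unknown packet type: {kind}")


def after_local_calculation(involved_ranks: Set[int], local_ranks: Set[int]) -> str:
    """S407: decide what to do once all local Rank data has been calculated."""
    if involved_ranks <= local_ranks:
        return "broadcast_in_node"         # every involved Rank is in this Node
    return "encapsulate_and_forward"       # other Nodes must continue the calculation


print(route_packet("to_compute"))
print(after_local_calculation({0, 1}, {0, 1, 2}))
print(after_local_calculation({0, 1, 5}, {0, 1, 2}))
```

In the third call Rank 5 lives on another Node, so the partial result has to be encapsulated and forwarded instead of being broadcast locally.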

Further, the configuration information includes important control information, and the important control information includes: an Op field, a JID field, an MID field, an SID field, a Node field, a Rank field, a Seg field, a data block length field, a collective communication execution state field, an address linked list field, and a routing information field.

Notes: the Op field indicates the type of calculation operation used in the collective communication, including summation and extremum (maximum or minimum);

the JID field marks the task number;

the MID field marks the message number;

the SID field marks the sequence number of data within the same message;

the Node field marks the numbers of the Nodes involved in one collective communication: the bit position corresponding to a Node involved in the collective communication is set to 1, and the position of a Node not involved is set to 0;

the Rank field marks the numbers of the Ranks involved in one collective communication: the bit position corresponding to a Rank in the node that is involved in the collective communication is set to 1, and the position of a Rank not involved is set to 0;

the Seg field marks the number of the Seg cache unit designated in one collective communication;

the data block length field indicates the size of the data blocks into which a message's data is divided for pipelined processing;

the collective communication execution state field indicates the execution state of a collective communication;

the address linked list field stores the linked list of addresses from which the network device obtains Rank data;

the routing information field stores the routing information the network device uses for data forwarding.

Opcode (Op): the operation type; in collective communication it marks the type of calculation applied to the data, such as summation or extremum;

Job_ID (JID): the task number; when there are multiple collective communication offload tasks, each is assigned an independent JID;

Message_ID (MID): the message number, that is, the number of each message within the same collective communication offload task;

Sequence_ID (SID): the sequence number; a large message is split into multiple sequences for processing.

Beneficial effects: the important control information is the key information carried in the collective communication offload primitives, which are supported as an extension based on ib_verbs. Through these extended primitives the user can directly instruct the network device to perform collective communication offload on a message, which improves programmability for the user and makes fuller use of the offload features of the network device.
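The control information listed above can be pictured as a single record. The sketch below is an assumed in-memory layout, not the patent's wire format: field widths, the state strings, and the names `OffloadConfig` and `bitmap` are hypothetical. It does follow the description of the Node and Rank fields as bitmaps, where bit i is 1 when Node or Rank i takes part.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


def bitmap(indices: Iterable[int], width: int) -> int:
    """Set bit i for every participating index i; all other bits stay 0."""
    bits = 0
    for i in indices:
        if not 0 <= i < width:
            raise ValueError(f"index {i} outside bitmap of width {width}")
        bits |= 1 << i
    return bits


@dataclass
class OffloadConfig:
    op: str                # calculation type, e.g. "sum" or "max" (assumed encoding)
    jid: int               # task number
    mid: int               # message number within the task
    sid: int               # sequence number within the message
    node: int              # bitmap of Nodes involved in this collective
    rank: int              # bitmap of Ranks involved in this collective
    seg: int               # number of the designated Seg cache unit
    block_len: int         # data block length used for pipelined processing
    state: str = "pending"                                 # execution state (assumed values)
    addr_list: List[int] = field(default_factory=list)     # addresses of Rank data
    route: List[int] = field(default_factory=list)         # forwarding information


cfg = OffloadConfig(op="sum", jid=1, mid=0, sid=0,
                    node=bitmap([0, 2], width=8),
                    rank=bitmap([0, 2, 3], width=8),
                    seg=5, block_len=4096)
print(format(cfg.node, "08b"), format(cfg.rank, "08b"))  # 00000101 00001101
```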

Further, the block calculation is performed per Seg i cache unit and comprises:

S4051, starting the cache calculation;

S4052, receiving Seg i' data;

S4053, detecting whether the Seg i cache unit used for calculation is empty; if so, the received Seg i' data is the first block of data on which the Seg i cache unit performs the collective communication calculation, and S4054 is executed; if not, the Seg i cache unit already holds part of the collective communication data to be calculated, and S4057 is executed;

S4054, obtaining the Op field, Node field and Rank field corresponding to the Seg i cache unit;

S4056, filling the obtained Op field, Node field and Rank field corresponding to Seg i, together with the received Seg i data, into the Seg i cache unit;

S4057, reading the data of the Seg i cache unit and extracting the Op field to determine the type of calculation operation to perform;

S4058, performing the calculation operation specified by the Op field on the Seg i data that was read and the received Seg i' data, and updating the Node collective communication completion state field and the Rank collective communication completion state field;

S4059, filling the data, the Node collective communication completion state field and the Rank collective communication completion state field into the corresponding Seg i cache unit;

S40510, checking the Node collective communication completion state field and the Rank collective communication completion state field to determine whether the Seg i cache unit of the network device has completed all collective communication data calculation that involves this network device; if so, executing S40511; if not, executing S4052 and waiting for further data related to Seg i to arrive;

S40511, encapsulating the Seg i data;

S40512, releasing the Seg i cache unit;

S40513, the network device has completed the collective communication cache calculation for Seg i.

Notes: Segment (Seg): one of the multiple independent fine-grained units into which the buffer is divided.

Beneficial effects: cache calculation, that is, collective communication offload, is performed per Seg i cache unit, which accelerates collective communication and improves its efficiency. The data is checked during the calculation so that a more reasonable calculation path is taken, and the Seg i cache unit is released immediately after its calculation finishes, making it available for the next calculation. This improves buffer utilization and collective communication offload efficiency, frees a large share of the host resources otherwise spent on collective communication, and improves overall communication performance.
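A minimal software model of the per-segment cache calculation is sketched below. It simplifies the steps in several assumed ways: only a Rank completion bitmap is tracked (the Node completion state field is omitted), the first block completes the unit immediately if it is the only expected contribution, data is a list of integers, and the names `SegUnit`, `SegCache` and `receive` are hypothetical. The step numbers in the comments point at the corresponding steps of the block calculation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Calculation operations selectable through the Op field (assumed encoding).
OPS: Dict[str, Callable[[int, int], int]] = {
    "sum": lambda a, b: a + b,
    "max": max,
    "min": min,
}


@dataclass
class SegUnit:
    op: str             # Op field stored with the segment
    expected: int       # bitmap of Ranks that must contribute
    done: int           # Rank completion state bitmap
    data: List[int]     # partially calculated payload


class SegCache:
    def __init__(self) -> None:
        self.units: Dict[int, SegUnit] = {}

    def receive(self, seg: int, rank: int, values: List[int],
                op: str, expected: int) -> Optional[List[int]]:
        """Handle one arriving block; return the result once the segment is complete."""
        unit = self.units.get(seg)
        if unit is None:
            # S4053 (empty) -> S4054/S4056: store the fields and the first block.
            unit = SegUnit(op, expected, 1 << rank, list(values))
            self.units[seg] = unit
        else:
            # S4057-S4059: apply the stored Op and update the completion state.
            combine = OPS[unit.op]
            unit.data = [combine(a, b) for a, b in zip(unit.data, values)]
            unit.done |= 1 << rank
        if unit.done == unit.expected:
            # S40510 -> S40511/S40512: hand off the result and release the unit.
            del self.units[seg]
            return unit.data
        return None  # back to S4052: wait for more data for this segment


cache = SegCache()
first = cache.receive(0, rank=0, values=[1, 2], op="sum", expected=0b111)
second = cache.receive(0, rank=1, values=[10, 20], op="sum", expected=0b111)
third = cache.receive(0, rank=2, values=[100, 200], op="sum", expected=0b111)
print(first, second, third)  # None None [111, 222]
```

Because the unit is deleted as soon as its last contribution arrives, the slot for segment 0 is free again before any later segment of the same message finishes.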

The second purpose of the present invention is to provide an efficient and easy-to-use system for RDMA communication acceleration aggregate communication, so as to solve the problem of low offload efficiency of the existing aggregate communication.

The invention provides a second basic scheme: a system for accelerating collective communication with RDMA communication, comprising an execution control module, a data processing module group and a state updating module;

the execution control module is used for acquiring a WR or network configuration packet, parsing and storing it, and issuing the parsed configuration information of the collective communication offload to the data processing module group and the state updating module;

the data processing module group is used for carrying out unloading task processing of collective communication on the network equipment by adopting a fine-grained buffer area management mechanism;

and the state updating module is used for writing the completion state information back to the host end through the completion event queue.

The second basic scheme has the following beneficial effects: in the system, the execution control module acquires the WR or network configuration packet, parses and stores it, and issues the parsed configuration information of the collective communication offload to the data processing module group and the state updating module. If the WR or network configuration packet is not a collective communication offload, no such configuration information is issued and the ordinary network communication processing of the prior art is performed instead, which is not described further in this scheme. According to the configuration information issued by the execution control module, the data processing module group processes the collective communication offload task on the network device using a fine-grained buffer management mechanism. The coarse-grained buffer management of the prior art is inefficient at processing data flows and cannot accelerate collective communication efficiently; the fine-grained buffer management mechanism of this system avoids that shortcoming.

The system returns the execution status of the collective communication offload to the host end through the state updating module, so a user can obtain the execution status of the offload task on the network device through the host or an application, and no middle-layer software is needed to manage and invoke the offload. In the prior art, an intermediate software layer cannot fully exploit the offload features of hardware such as network devices, the user cannot directly perceive the network device hardware, and communication primitives and calling interfaces dedicated to collective communication offload would have to be provided to the user by the hardware. In this system, no upper-layer or middle-layer software has to manage the offload: the offload task runs in the network device, the state updating module in the network device feeds the CQE corresponding to the WR back to the host or application, and the user can manage the offload task on the network device directly through the host, which makes the system convenient to use. In addition, the system is effective not only in tree networks but can also accelerate collective communication in direct-connection networks.

Further, the data processing module group includes: a data cache calculation module, a data analysis and encapsulation module and a data transmission module;

the data cache calculation module is used for carrying out block calculation on data related to the set communication;

the data analysis and encapsulation module is used for parsing and splitting data acquired from a network port or local memory according to the configuration information issued by the execution control module, and sending the data to the data cache calculation module for processing; it is also used for encapsulating the data for which all collective communication calculation has been completed and sending it to the data transmission module;

the data transmission module is used for reading data from the Node local memory, writing data into the Node local memory, sending data to an external port and receiving data from the external port; it is also used for sending the completion status of the data communication to the state updating module after each communication transfer finishes.

Beneficial effects: the data processing module group comprises the data cache calculation module, the data analysis and encapsulation module and the data transmission module. As the core modules for executing collective communication offload, they cooperate to complete the offload and ensure that data is calculated and transmitted.

Further, the configuration information comprises important control information; the important control information includes: op field, JID field, MID field, SID field, Node field, Rank field, Seg field, data block length field, aggregate communication execution status field, address linked list field, and routing information field.

Note: the Op field indicates the type of calculation operation used in the collective communication, including summation, extremum and mean, and is used by the data cache calculation module and the data analysis and encapsulation module;

the JID field marks the task number and is used by the data analysis and encapsulation module;

the MID field marks the message number and is used by the data analysis and encapsulation module;

the SID field marks the data sequence number within the same message and is used by the data analysis and encapsulation module;

the Node field marks the numbers of the Nodes involved in one collective communication: the bit position corresponding to an involved Node is 1 and the bit position corresponding to an uninvolved Node is 0; it is used by the data cache calculation module and the data analysis and encapsulation module;

the Rank field marks the numbers of the Ranks involved in one collective communication: within the node, the bit position corresponding to an involved Rank is 1 and the bit position corresponding to an uninvolved Rank is 0; it is used by the data cache calculation module and the data analysis and encapsulation module;

the Seg field marks the number of the Seg cache unit designated for one collective communication; the resource usage of Seg is managed and controlled by the upper-layer driver, and the field is used by the data cache calculation module and the data analysis and encapsulation module;

the data block length field indicates the data block size by which the data of a message is divided for pipeline processing, and is used by the data cache calculation module and the data analysis and encapsulation module;

the collective communication execution status field indicates the execution state of one collective communication, including successful completion and buffer error, and is used by the data transmission module and the state updating module;

the address linked list field stores the address linked list used to fetch the data of each Rank on the network device, and is used by the data transmission module;

and the routing information field stores the routing information used by the network device for data forwarding, and is used by the data transmission module.
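The field list above can be pictured as a single record. The following is a minimal Python sketch under stated assumptions: the class name `OffloadControlInfo`, the helper `members_to_bitmap`, the attribute names, the Python types, and the initial `"pending"` status are all hypothetical, since the source names the fields but does not give their widths or encodings.

```python
from dataclasses import dataclass, field
from typing import Dict, List


def members_to_bitmap(members: List[int]) -> int:
    """Set bit i to 1 for every involved member i; uninvolved bits stay 0."""
    bitmap = 0
    for member in members:
        bitmap |= 1 << member
    return bitmap


@dataclass
class OffloadControlInfo:
    """Illustrative container for the important control information."""
    op: str                 # Op field: "sum", "extremum" or "mean"
    jid: int                # JID field: task number
    mid: int                # MID field: message number
    sid: int                # SID field: data sequence number within a message
    node_bitmap: int        # Node field: 1 = involved Node, 0 = not involved
    rank_bitmap: int        # Rank field: 1 = involved Rank, 0 = not involved
    seg: int                # Seg field: designated Seg cache unit number
    block_length: int       # data block length used for pipelining
    exec_status: str = "pending"  # assumed initial value, not from the source
    address_list: List[int] = field(default_factory=list)
    routing_info: Dict[str, int] = field(default_factory=dict)


info = OffloadControlInfo(
    op="sum", jid=1, mid=0, sid=0,
    node_bitmap=members_to_bitmap([0, 2, 3]),
    rank_bitmap=members_to_bitmap([0, 1]),
    seg=5, block_length=256,
)
```

Storing the Node and Rank fields as integers used as bitmaps mirrors the "position is 1 / position is 0" wording of the text; a hardware implementation would use fixed-width registers instead.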

Beneficial effects: the important control information is the key information carried in the collective communication offload primitives, which are an extension based on ib_verbs. Through the extended offload primitives the user can directly control the network device to offload collective communication for a message, which improves programmability for the user and makes fuller use of the offload features of the network device.

Further, the data cache calculation module includes: a calculation control submodule and a cache submodule;

the cache submodule comprises a plurality of Seg cache units, with Seg as the segmentation unit; a Seg cache unit caches data for the calculation control submodule, and data is stored and calculated in units of Seg;

and the calculation control submodule is used for acquiring configuration information from the execution control module and reading Seg data from the cache submodule; using the vector calculation component in the module, it performs the calculation on the input data of a Seg and the data already held in that Seg cache unit, then updates the corresponding Seg data bits and stores the calculated data. After all Ranks in the node that are related to Seg j have completed the collective communication calculation, it releases the Seg j unit and passes the calculated Seg j data to the data analysis and encapsulation module for subsequent encapsulation.

Beneficial effects: the data cache calculation module comprises a cache submodule and a calculation control submodule. The cache submodule comprises a plurality of Seg cache units, so that the buffer is divided into a plurality of independent fine-grained units and cache calculation is performed per Seg i cache unit. A vector calculation component matched to the block size of the buffer offloads the simple calculations involved in collective communication to the network device, which further frees the computing resources that collective communication would otherwise occupy on the host side.
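The role of the vector calculation component can be illustrated with a small element-wise combiner. This is a sketch only: the names `ELEMENTWISE` and `combine` are hypothetical, the extremum is shown as separate `max` and `min` variants, and the mean is left out because the source does not say how it is finalized.

```python
import operator

# Hypothetical mapping from an Op value to an element-wise binary operation.
ELEMENTWISE = {"sum": operator.add, "max": max, "min": min}


def combine(op, cached, incoming):
    """Combine a cached block with an incoming block of the same length."""
    if len(cached) != len(incoming):
        raise ValueError("blocks must have the same length")
    fn = ELEMENTWISE[op]
    return [fn(a, b) for a, b in zip(cached, incoming)]


summed = combine("sum", [1, 2, 3], [10, 20, 30])
largest = combine("max", [1, 5], [4, 2])
```

Because every block has the same fixed length, one combiner call corresponds to one pass of the vector component over a Seg payload.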

Further, each Seg includes a plurality of data fields, including: op field, Node set communication completion state field, Rank set communication completion state field and data load field;

the data cache calculation module is used for performing block calculation on data involved in collective communication, in which cache calculation is performed per Seg i cache unit, specifically:

s1, the data cache calculation module starts to perform cache calculation;

s2, receiving Seg i' data sent by the data analysis and encapsulation module;

S3, the calculation control sub-module detects whether the Seg i cache unit is empty; if so, the received Seg i' data is the first block of data for collective communication calculation in the Seg i cache unit, and S4 is executed; if not, the Seg i cache unit already holds part of the collective communication data under calculation, and S6 is executed;

S4, the calculation control sub-module obtains the Op field, Node field and Rank field corresponding to the Seg i cache unit from the execution control module;

S5, the calculation control sub-module fills the Op field, the Node field and the Rank field corresponding to Seg i, together with the received Seg i' data sent by the data analysis and encapsulation module, into the Seg i cache unit;

s6, the calculation control sub-module reads the data of the Seg i cache unit from the cache sub-module, and extracts an Op field for judging the type of the calculation operation to be performed;

s7, the calculation control sub-module performs calculation operation appointed by Op field on the Seg i data read from the buffer unit and the received Seg i' data sent by the data analysis encapsulation module, and updates the Node set communication completion state field and the Rank set communication completion state field;

s8, the computation control sub-module fills the data, the Node set communication completion status field and the Rank set communication completion status field into the corresponding Seg i cache unit;

s9, the calculation control sub-module checks Node set communication completion status field and Rank set communication completion status field, determines whether the Seg i cache unit of the network equipment completes all set communication data calculation related to the equipment, if yes, executes S10; if not, executing S2 to wait for other data related to Seg i to arrive;

s10, the calculation control submodule transmits the Seg i data to the data analysis and encapsulation module for data encapsulation;

s12, the calculation control submodule releases the Seg i cache unit;

s13, the network device completes the aggregate communication cache calculation operation with respect to Seg i.

Note: the Op field indicates the type of calculation operation used in the current collective communication of the Seg cache unit, including summation, extremum and mean;

the Node collective communication completion state field indicates the communication completion state of the data to be calculated from other Nodes;

the Rank collective communication completion state field indicates the completion state of the data to be calculated from each Rank in the Node;

and the data load field stores the fixed-length data payload obtained after the message is divided into blocks; calculation and communication are pipelined one block of this length at a time.

Beneficial effects: cache calculation, that is, collective communication offload, is performed per Seg i cache unit, which accelerates collective communication and improves its efficiency. The modules cooperate during the calculation so that each step is taken only when needed, and the Seg i cache unit is released as soon as its calculation finishes so that it can be reused for the next calculation. This improves buffer utilization and offload efficiency, frees the host-side resources otherwise occupied by collective communication, and improves overall communication performance.

Drawings

FIG. 1 is a flowchart of a first embodiment of a method for accelerated set communication for RDMA communication according to the present invention;

FIG. 2 is a diagram illustrating important control information in an embodiment of a method for accelerating aggregate RDMA communication according to the present invention;

FIG. 3 is a diagram illustrating data fields included in each Seg according to an embodiment of the method for accelerated aggregated RDMA communication;

FIG. 4 is a logic block diagram of a second embodiment of a system for RDMA accelerated aggregation communication in accordance with the present invention;

FIG. 5 is a logic diagram of a data cache computing module in a second embodiment of the system for RDMA accelerated aggregation communication of the present invention;

FIG. 6 is a flowchart illustrating the execution of the set communication offload by the data cache calculation module with respect to Seg i according to a second embodiment of the system for RDMA accelerated set communication;

FIG. 7 is a flow diagram of packet processing by a single network device during Allreduce aggregate traffic offload;

FIG. 8 is a schematic diagram of collective communication offload in a 4-node network topology.

Detailed Description

The following is further detailed by way of specific embodiments:

Description of related concepts and terms:

Allreduce: a typical, relatively complex collective communication operation; the execution flow of collective communication offload is described below using this operation as an example.

Example one

In this embodiment, as shown in fig. 1, a method for accelerating aggregate RDMA communication includes the following steps:

s1, RDMA communication operation is carried out in the application, and data communication requirements are generated;

S2, the application sends a WR or network configuration packet containing communication metadata to the network device, wherein the communication metadata includes the operation type, doorbell information, and the virtual address and length of the request queue; the operation types include collective communication offload and ordinary network communication;

s3, the network device acquires the communication metadata of WR or network configuration packet, judges whether the WR or network configuration packet is the collective communication unloading communication, if yes, executes step S4; if not, go to S5;

S4, the network device acquires the network data and processes the collective communication offload task on the network device using a fine-grained buffer management mechanism; after acquiring the WR and before acquiring the network data, the network device configures the collective communication offload; S4 specifically includes:

s401, analyzing configuration information of set communication unloading; the configuration information comprises important control information; as shown in fig. 2, the important control information includes:

an Op field for indicating the type of calculation operation used in the collective communication, including summation and extremum; the Op field is extracted from the WR;

a JID field for marking the task number;

an MID field for marking the message number;

an SID field for marking the data sequence number within the same message;

a Node field for marking the numbers of the Nodes involved in one collective communication; the Node field is extracted from the WR or network configuration packet, the bit position corresponding to an involved Node is 1, and the bit position corresponding to an uninvolved Node is 0;

a Rank field for marking the numbers of the Ranks involved in one collective communication; the Rank field is extracted from the WR, and within the node the bit position corresponding to an involved Rank is 1 and the bit position corresponding to an uninvolved Rank is 0;

a Seg field for marking the number of the Seg cache unit designated for one collective communication; the Seg field is extracted from the WR, and the resource usage of Seg is managed and controlled by the upper-layer driver;

a data block length field for indicating the data block size by which the data of a message is divided for pipeline processing; the data block length field is obtained from the WR or configuration packet;

a collective communication execution status field for indicating the execution state of one collective communication, including successful completion and buffer error;

an address linked list field for storing the address linked list used to fetch the data of each Rank on the network device; the address linked list field is extracted from the WR;

a routing information field for storing the routing information used by the network device for data forwarding; the routing information field is extracted from the WR or obtained from the data analysis and encapsulation module.
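To make the interplay of the JID, MID, SID and data block length fields concrete, the following Python sketch divides one message into blocks for pipeline processing. The function name `split_message` and the dictionary layout are hypothetical, and the sketch assumes that a final short block is sent unpadded, which the source does not specify.

```python
def split_message(payload: bytes, block_length: int, jid: int, mid: int):
    """Divide one message into blocks of block_length bytes.

    Each block carries the task number (JID), message number (MID) and its
    own sequence number within the message (SID).
    """
    if block_length <= 0:
        raise ValueError("block_length must be positive")
    blocks = []
    for sid, offset in enumerate(range(0, len(payload), block_length)):
        blocks.append({
            "jid": jid,
            "mid": mid,
            "sid": sid,
            "data": payload[offset:offset + block_length],
        })
    return blocks


blocks = split_message(b"abcdefghij", block_length=4, jid=7, mid=1)
```

Each resulting block can then be calculated and forwarded independently, which is what allows calculation and communication to overlap.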

S402, initiating a local data reading request or waiting for receiving external data;

S403, acquiring a data packet containing network data, parsing the data packet and determining its type; if the data packet carries the final calculation result of the collective communication offload, executing S404; if it carries data still to be calculated for the collective communication offload, executing S405;

s404, broadcasting the final calculation result of the set communication unloading to each Rank related to the set communication in the Node;

s405, performing block calculation on data to be calculated; wherein the block calculation is performed for each Seg i cache unit, and comprises the following steps:

s4051, starting to perform cache calculation;

s4052, receiving Seg i' data;

S4053, detecting whether the Seg i cache unit used for calculation is empty; if so, the received Seg i' data is the first block of data for collective communication calculation in the Seg i cache unit, and S4054 is executed; if not, the Seg i cache unit already holds part of the collective communication data under calculation, and S4056 is executed;

S4054, obtaining the Op field, Node field and Rank field corresponding to the Seg i cache unit; each Seg contains a plurality of data fields, as shown in FIG. 3, including:

an Op field for indicating the type of calculation operation used in the current collective communication of the Seg cache unit, including summation, extremum and mean; this Op field is the same as the Op field in the important control information.

Node set communication completion state field, used to express communication completion state of data to be calculated from other nodes;

a Rank set communication completion state field used for representing the completion state of the data to be calculated from each Rank in the Node;

a data load field for storing a fixed-length data load after message blocking, and performing pipeline processing of calculation and communication with the data length each time;

S4055, filling the Op field, the Node field and the Rank field corresponding to Seg i, together with the received Seg i' data, into the Seg i cache unit;

s4056, reading the data of the Seg i cache unit, and extracting an Op field for judging the type of the calculation operation to be performed;

s4057, performing Op field specified calculation operation on the read Seg i data and the received Seg i' data, and updating Node set communication completion status fields and Rank set communication completion status fields;

s4058, filling the data, the Node set communication completion status field and the Rank set communication completion status field into the corresponding Seg i cache unit;

s4059, checking Node set communication completion status field and Rank set communication completion status field, judging whether Seg i cache unit of the network equipment completes all set communication data calculation related to the network equipment, if yes, executing S40510; if not, executing S4052, and waiting for other data related to Seg i to arrive;

s40510, encapsulating the Seg i data;

s40511, releasing the Seg i cache unit;

s40512, the network device completes the aggregate communication cache calculation operation with respect to Seg i.
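Steps S4051 to S40512 can be sketched as a small state machine over one Seg cache unit. This is an illustrative Python sketch, not the claimed hardware logic: `SegCacheUnit`, `process_block`, `OPS` and the `pending_nodes`/`pending_ranks` configuration keys are hypothetical names, the completion fields are modeled as bitmaps whose bits are cleared as data arrives, and the first block is assumed to be stored as-is rather than combined with anything.

```python
# Hypothetical element-wise operations selected by the Op field.
OPS = {"sum": lambda a, b: a + b, "max": max, "min": min}


class SegCacheUnit:
    """One fine-grained cache unit; all attribute names are illustrative."""

    def __init__(self):
        self.release()

    def release(self):
        # S40511: an empty unit can be reused for the next calculation.
        self.empty = True
        self.op = None
        self.pending_nodes = 0
        self.pending_ranks = 0
        self.payload = None


def process_block(unit, config, source_kind, source_id, data):
    """Handle one received Seg i' block; return the result once complete."""
    if unit.empty:
        # S4053 to S4055: first block, so load the fields and store the data.
        unit.op = config["op"]
        unit.pending_nodes = config["pending_nodes"]
        unit.pending_ranks = config["pending_ranks"]
        unit.payload = list(data)
        unit.empty = False
    else:
        # S4056 and S4057: read the cached data and apply the Op operation.
        fn = OPS[unit.op]
        unit.payload = [fn(a, b) for a, b in zip(unit.payload, data)]
    # S4057 and S4058: clear the completion bit of the contributing source.
    if source_kind == "rank":
        unit.pending_ranks &= ~(1 << source_id)
    else:
        unit.pending_nodes &= ~(1 << source_id)
    # S4059: all contributions received?
    if unit.pending_nodes == 0 and unit.pending_ranks == 0:
        result = unit.payload   # S40510: hand over for encapsulation
        unit.release()          # S40511
        return result           # S40512
    return None                 # back to S4052, wait for more data


unit = SegCacheUnit()
cfg = {"op": "sum", "pending_nodes": 0, "pending_ranks": 0b11}
first = process_block(unit, cfg, "rank", 0, [1, 2])
second = process_block(unit, cfg, "rank", 1, [3, 4])
```

In this usage example two Ranks contribute to the same unit; the first call returns nothing because one Rank is still pending, and the second call returns the reduced block and leaves the unit empty for reuse.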

S406, judging whether all the Rank data related to the set communication in the Node are calculated, and if so, executing S407; if the calculation is not completed, executing S402;

S407, if all the Ranks involved in the collective communication are within the Node, broadcasting the calculation result directly within the Node; if not, encapsulating the data for which the collective communication calculation within the Node has been completed, and sending it to the next Node and other Nodes for further calculation;

s408, checking the unloading completion state of the set communication, and if the unloading completion state is completed or abnormal, executing S6; if not, go to S402.

S5, the network device acquires the network data and carries out common network communication processing on the data;

S6, if the current task has completed, the application has issued a query request, or the task has encountered an exception, the network device feeds the CQE corresponding to the WR or network configuration packet back to the host or application;

s7, checking whether all communication operations contained in the communication are completed, if so, executing S8; if the execution is not complete, go to S2;

s8, the communication task execution is ended.
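The outer flow S2 to S8 amounts to classifying each WR, dispatching it, and returning a completion entry. The Python sketch below illustrates that control flow only. `run_communication`, `handle_work_request`, the dictionary keys and the handler callbacks are hypothetical and are not the ib_verbs API; the sketch also posts a CQE after every WR, whereas the text additionally allows CQEs on query requests and exceptions.

```python
from collections import deque


def handle_work_request(wr, offload_handler, normal_handler, completion_queue):
    """Classify one WR (S3), process it (S4 or S5) and post a CQE (S6)."""
    if wr["op_type"] == "collective_offload":
        status = offload_handler(wr)    # S4: offload path on the device
    else:
        status = normal_handler(wr)     # S5: ordinary network communication
    completion_queue.append({"wr_id": wr["wr_id"], "status": status})


def run_communication(work_requests, offload_handler, normal_handler):
    """Process every WR of one communication (S2 to S7) and finish (S8)."""
    completion_queue = deque()
    for wr in work_requests:
        handle_work_request(wr, offload_handler, normal_handler,
                            completion_queue)
    return list(completion_queue)


cqes = run_communication(
    [{"wr_id": 1, "op_type": "collective_offload"},
     {"wr_id": 2, "op_type": "normal"}],
    offload_handler=lambda wr: "offload_success",
    normal_handler=lambda wr: "normal_success",
)
```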

Example two

This embodiment is substantially as shown in FIG. 4: a system for accelerating collective communication with RDMA communication, comprising an execution control module, a data processing module group and a state updating module; wherein the data processing module group comprises a data cache calculation module, a data analysis and encapsulation module and a data transmission module. These five modules (the execution control module, the data analysis and encapsulation module, the data cache calculation module, the data transmission module and the state updating module) are the core modules for executing collective communication offload. Completing a full collective communication offload in RDMA communication also requires the support of other modules in the communication device, including but not limited to a communication context management module, a virtual-to-physical address translation module and a host communication interface module; these are conventional communication function modules and are not described here.

The execution control module is used for acquiring WR or network configuration packets, analyzing and storing the WR or network configuration packets, and sending the analyzed configuration information of the set communication unloading to the data analysis packaging module, the data cache calculation module, the data transmission module and the state updating module; the configuration information includes: important control information; as shown in fig. 2, the important control information includes the following fields:

an Op field for indicating the type of calculation operation used in the collective communication, including summation and extremum; the Op field is extracted from the WR and is used by the data cache calculation module and the data analysis and encapsulation module;

a JID field for marking the task number, used by the data analysis and encapsulation module;

an MID field for marking the message number, used by the data analysis and encapsulation module;

an SID field for marking the data sequence number within the same message, used by the data analysis and encapsulation module;

a Node field for marking the numbers of the Nodes involved in one collective communication; the Node field is extracted from the WR or network configuration packet, the bit position corresponding to an involved Node is 1 and the bit position corresponding to an uninvolved Node is 0, and the field is used by the data cache calculation module and the data analysis and encapsulation module;

a Rank field for marking the numbers of the Ranks involved in one collective communication; the Rank field is extracted from the WR, within the node the bit position corresponding to an involved Rank is 1 and the bit position corresponding to an uninvolved Rank is 0, and the field is used by the data cache calculation module and the data analysis and encapsulation module;

a Seg field for marking the number of the Seg cache unit designated for one collective communication; the Seg field is extracted from the WR, the resource usage of Seg is managed and controlled by the upper-layer driver, and the field is used by the data cache calculation module and the data analysis and encapsulation module;

a data block length field for indicating the data block size by which the data of a message is divided for pipeline processing; the data block length field is obtained from the WR or configuration packet and is used by the data cache calculation module and the data analysis and encapsulation module;

a collective communication execution status field for indicating the execution state of one collective communication, including successful completion and buffer error, used by the data transmission module and the state updating module;

an address linked list field for storing the address linked list used to fetch the data of each Rank on the network device; the field is extracted from the WR and is used by the data transmission module;

and a routing information field for storing the routing information used by the network device for data forwarding; the field is extracted from the WR or obtained from the data analysis and encapsulation module and is used by the data transmission module.

The data processing module group is used for processing the collective communication offload task on the network device using a fine-grained buffer management mechanism; wherein the data processing module group comprises: a data cache calculation module, a data analysis and encapsulation module and a data transmission module;

the data cache calculation module is used for performing block calculation on the data involved in collective communication and for managing the buffer on the network device efficiently; taking the Allreduce operation as an example, the data cache calculation module selects the corresponding buffer for the data from each Rank in each Node according to its Seg, and performs the reduction operation according to the Op field acquired from the execution control module. The data cache calculation module, as shown in FIG. 5, includes: a calculation control submodule and a cache submodule;

the cache submodule comprises a plurality of Seg cache units, with Seg as the segmentation unit; a Seg cache unit caches data for the calculation control submodule, and data is stored and calculated in units of Seg; each Seg contains a plurality of data fields, as shown in FIG. 3, including: an Op field, a Node collective communication completion state field, a Rank collective communication completion state field and a data load field;

the Op field indicates the type of calculation operation used in the current collective communication of the Seg cache unit, including summation, extremum and mean; when new data arrives, the Op field determines the type of calculation operation performed between the data in the Seg cache unit and the new data;

the Node collective communication completion state field indicates the communication completion state of the data to be calculated from other Nodes. Specifically, the initial state of the field is acquired from the execution control module when the Seg cache unit receives its first data; when data to be calculated is received from another Node and its calculation is completed in the network device, the status flag bit corresponding to that Node is set to 0, and otherwise the existing state is kept. When all Node status flag bits in the field are 0, the calculation of the collective communication data of all Nodes related to the Seg cache unit is complete, and the data can be distributed to each Node;

the Rank collective communication completion state field indicates the completion state of the data to be calculated from each Rank in the Node. Specifically, the initial state of the field is acquired from the execution control module when the Seg cache unit receives its first data; when data to be calculated is received from a Rank in the Node and its calculation is completed in the network device, the status flag bit corresponding to that Rank is set to 0, and otherwise the existing state is kept. When all Rank status flag bits in the field are 0, the calculation of the collective communication data of all Ranks in the Node related to the Seg cache unit is complete, and the data can be sent to the next-level network device to continue the collective communication calculation;

and the data load field is used for storing the data load with the fixed length after the message is divided into blocks, and the calculation and the pipeline processing of communication are performed according to the data length.
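The two completion state fields behave like bitmaps that are cleared bit by bit, with different follow-up actions once each reaches zero. The Python sketch below illustrates this; the names `mark_received` and `next_action` and the returned action strings are hypothetical, and the rule that pending Ranks are checked before pending Nodes is an assumption about ordering that the source does not state.

```python
def mark_received(state_bitmap: int, index: int) -> int:
    """Set the status flag bit of one Node or Rank to 0 after its data
    has been received and calculated; all other bits keep their state."""
    return state_bitmap & ~(1 << index)


def next_action(node_state: int, rank_state: int) -> str:
    """Decide what may happen to the Seg data given both state fields."""
    if rank_state != 0:
        return "wait_for_ranks"            # local Ranks still outstanding
    if node_state != 0:
        return "forward_to_next_device"    # local part done, Nodes pending
    return "distribute_result"             # all Nodes done


rank_state = 0b11                          # two Ranks expected
rank_state = mark_received(rank_state, 0)
rank_state = mark_received(rank_state, 1)
action = next_action(node_state=0b10, rank_state=rank_state)
```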

The calculation control submodule is used for acquiring configuration information from the execution control module and reading Seg data from the cache submodule; using the vector calculation component in the module, it performs the calculation on the input data of a Seg and the data already held in that Seg cache unit, then updates the corresponding Seg data bits and stores the calculated data. After all Ranks in the node that are related to Seg j have completed the collective communication calculation, it releases the Seg j unit and passes the calculated Seg j data to the data analysis and encapsulation module for subsequent encapsulation.

As shown in fig. 6, the execution flow in which the data cache calculation module performs collective communication offload on Seg i is specifically as follows:

S1, the data cache calculation module starts the cache calculation;

S2, the Seg i' data sent by the data analysis and encapsulation module is received;

S3, the calculation control submodule detects whether the Seg i cache unit is empty; if so, the received Seg i' data is the first block of data for collective communication calculation in the Seg i cache unit, and S4 is executed; if not, the Seg i cache unit already holds partial collective communication data on which calculation has been performed, and S6 is executed;

S4, the calculation control submodule acquires the Op field, Node field and Rank field corresponding to the Seg i cache unit from the execution control module;

S5, the calculation control submodule fills the Op field, Node field and Rank field corresponding to Seg i, together with the received Seg i' data sent by the data analysis and encapsulation module, into the Seg i cache unit;

S6, the calculation control submodule reads the data of the Seg i cache unit from the cache submodule and extracts the Op field to determine the type of calculation operation to be performed;

S7, the calculation control submodule performs the calculation operation specified by the Op field on the Seg i data read from the cache unit and the received Seg i' data sent by the data analysis and encapsulation module, and updates the Node collective communication completion status field and the Rank collective communication completion status field;

S8, the calculation control submodule fills the calculated data, the Node collective communication completion status field and the Rank collective communication completion status field into the corresponding Seg i cache unit;

S9, the calculation control submodule checks the Node collective communication completion status field and the Rank collective communication completion status field to determine whether the Seg i cache unit of the network device has completed all collective communication data calculation related to this device; if so, S10 is executed; if not, the flow returns to S2 to wait for other data related to Seg i to arrive;

S10, the calculation control submodule transmits the Seg i data to the data analysis and encapsulation module for data encapsulation;

S12, the calculation control submodule releases the Seg i cache unit;

S13, the network device completes the collective communication cache calculation operation for Seg i.
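The flow above can be sketched in software as follows. This is a simplified illustration rather than the claimed hardware implementation. It tracks only the Rank completion bits of a single device, it assumes that the first block to arrive is stored unchanged as the initial content of the cache unit, and the names `SegBuffer`, `receive` and `OPS` are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical mapping from the Op field to an element-wise operation.
OPS = {"sum": lambda a, b: a + b, "max": max, "min": min}


class SegBuffer:
    """Simplified model of the cache submodule plus calculation control."""

    def __init__(self, op: str, rank_mask: int) -> None:
        self.op = op                # Op field supplied by execution control
        self.rank_mask = rank_mask  # initial Rank status bits (1 = pending)
        self.units: Dict[int, dict] = {}  # Seg index -> cache unit

    def receive(self, seg: int, rank: int,
                data: List[float]) -> Optional[List[float]]:
        unit = self.units.get(seg)
        if unit is None:
            # Cache unit empty: fetch the fields and fill in the first block.
            unit = {"op": self.op, "rank_pending": self.rank_mask,
                    "data": list(data)}
            self.units[seg] = unit
        else:
            # Read the unit, apply the operation named by the Op field.
            fn = OPS[unit["op"]]
            unit["data"] = [fn(a, b) for a, b in zip(unit["data"], data)]
        # Update the Rank completion status field and write it back.
        unit["rank_pending"] &= ~(1 << rank)
        if unit["rank_pending"] == 0:
            # All Ranks done: hand the block on and release the cache unit.
            return self.units.pop(seg)["data"]
        return None  # keep waiting for other data related to this Seg
```

In this sketch, `receive` returns `None` until the last pending Rank has contributed, at which point it returns the reduced block and frees the unit.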

The data analysis and encapsulation module is used for parsing and splitting the data acquired from a network port or from local memory according to the configuration information issued by the execution control module, and for sending the data to the data cache calculation module for processing; it is also used for encapsulating the data after all collective communication calculations are completed and sending it to the data transmission module. Compared with ordinary RDMA communication, the data analysis and encapsulation module in this scheme adds an Op field, a JID field, an MID field, an SID field, a Node field, a Rank field and a Seg field to the information that is parsed and encapsulated. The module can also perform the parsing and encapsulation required for ordinary network communication.
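As an illustration of the additional fields, the sketch below packs and parses a header that carries them in front of the payload. The field order, the field widths in `_FMT`, the use of integer codes for every field and the name `OffloadHeader` are all assumptions made for this example; the specification does not define a wire format here.

```python
import struct
from dataclasses import astuple, dataclass

# Hypothetical layout, network byte order:
# op:1 byte, jid:2, mid:2, sid:2, node:1, rank:1, seg:2  (11 bytes total)
_FMT = "!BHHHBBH"
HEADER_SIZE = struct.calcsize(_FMT)


@dataclass(frozen=True)
class OffloadHeader:
    op: int
    jid: int
    mid: int
    sid: int
    node: int
    rank: int
    seg: int

    def pack(self) -> bytes:
        # Encapsulation: serialize the fields in declaration order.
        return struct.pack(_FMT, *astuple(self))

    @classmethod
    def unpack(cls, raw: bytes) -> "OffloadHeader":
        # Parsing: read the header from the front of a received packet.
        return cls(*struct.unpack(_FMT, raw[:HEADER_SIZE]))
```

A receiver in this sketch would call `OffloadHeader.unpack(packet)` and treat `packet[HEADER_SIZE:]` as the data payload of the Seg.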

The data transmission module is used for reading data from the Node local memory, writing data into the Node local memory, sending data to an external port and receiving data from the external port; it is also used for sending the completion status of the data communication to the state updating module each time a communication transmission finishes.

The state updating module is used for writing completion status information back to the host through the completion event queue. After the data transmission module completes one WR, it sends the completion status information of that WR to the state updating module. Compared with the status returned by ordinary RDMA communication, a completion status field for collective communication offload is added, and the state updating module writes the completion status information back to the host through a host communication interface, for example by DMA.

The specific implementation process is as follows. Fig. 7 shows the data packet processing flow when Allreduce collective communication is offloaded to a single network device, where the network device is a network card device:

Step 701, the collective communication starts;

Step 702, the application issues a collective communication offload WR to the network card;

Step 703, the network card acquires the collective communication offload WR through the host communication interface, and the execution control module parses the configuration information of the collective communication offload;

Step 704, the data transmission module initiates a local data read request through the host communication interface, or waits to receive external data;

Step 705, after a data packet containing network information is acquired, the data analysis and encapsulation module parses the data packet and determines its type: if it is ordinary network communication data, step 706 is executed; if it is the final calculation result of the Allreduce offload, step 707 is executed; if it is Allreduce collective communication data to be calculated, step 708 is executed;

Step 706, the data is encapsulated according to the processing flow for ordinary RDMA communication data, and is sent out through the network port or written into local memory;

Step 707, the data transmission module broadcasts the final calculation result of the Allreduce offload, through the host communication interface, to each Rank in the node that is involved in the Allreduce collective communication;

Step 708, the data cache calculation module performs block-wise calculation on the communication data to be calculated;

Step 709, it is determined whether the data of all Ranks in the node that are involved in the Allreduce collective communication has been calculated; if so, step 710 is executed; if not, the flow returns to step 704;

Step 710, if all Ranks involved in the collective communication are in this node, the calculation result is broadcast directly within the node; if not, the data for which the intra-node collective communication calculation is complete is encapsulated and sent to the next node or to other nodes for further calculation;

Step 711, the offload completion status of the whole collective communication is checked; if it is completed or abnormal, step 712 is executed; if not, the flow returns to step 704 to wait for data to arrive or to initiate a data read request;

Step 712, the status update module returns the communication completion status to the host memory through the host communication interface;

Step 713, this Allreduce collective communication is completed.
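The branching among the steps above can be summarized as a small decision table. The sketch below only encodes which step number follows which condition; the enum `PacketKind` and the function names are hypothetical and do not correspond to any interface in the specification.

```python
from enum import Enum


class PacketKind(Enum):
    ORDINARY = 1          # ordinary network communication data
    ALLREDUCE_RESULT = 2  # final calculation result of the Allreduce offload
    ALLREDUCE_DATA = 3    # Allreduce data still to be calculated


def step_after_parse(kind: PacketKind) -> int:
    """Step 705: choose the next step from the parsed packet type."""
    return {
        PacketKind.ORDINARY: 706,
        PacketKind.ALLREDUCE_RESULT: 707,
        PacketKind.ALLREDUCE_DATA: 708,
    }[kind]


def step_after_block_calc(all_local_ranks_done: bool) -> int:
    """Step 709: continue to 710 only when every local Rank is calculated."""
    return 710 if all_local_ranks_done else 704


def step_after_status_check(completed_or_abnormal: bool) -> int:
    """Step 711: report to the host, or go back and wait for more data."""
    return 712 if completed_or_abnormal else 704
```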

As shown in fig. 8, Allreduce collective communication offload is described taking a 4-node network topology as an example; there are 4 nodes in fig. 8, and each node has 4 Ranks participating in the collective communication.

Assume that the switch and the 4 network cards shown in fig. 8 are all network devices that support collective communication offload, that a solid one-way line represents a data flow for data aggregation calculation, and that a dashed one-way line represents a data flow for returning the aggregation result. The topology shown in fig. 8 then goes through three main stages when performing Allreduce collective communication: first, the collective communication offload tasks are issued to the network cards and the collective communication offload task of the switch is configured; second, the 4 network cards each perform block-wise calculation and pipelined processing on the data of the Ranks in their own node and transmit the intermediate calculation results to the switch, and the switch performs streaming calculation on the data packets from the 4 network cards to obtain the final calculation result; finally, the switch broadcasts the final result to each network card, each network card determines after acquiring the data packet that it is an offload broadcast packet and distributes the final data to the address space corresponding to each Rank, and the Allreduce collective communication is completed.
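The three stages of this switch-assisted case can be illustrated with a small software simulation. It is only a sketch: it assumes a sum reduction, models each Rank buffer as a Python list, and ignores blocking, pipelining and packet handling. The function names are hypothetical.

```python
from typing import List

Vector = List[float]


def elementwise_sum(vectors: List[Vector]) -> Vector:
    # Reduce several equal-length vectors position by position.
    return [sum(column) for column in zip(*vectors)]


def allreduce_two_level(nodes: List[List[Vector]]) -> List[List[Vector]]:
    """nodes[n][r] is the input buffer of Rank r on node n."""
    # Stage 2a: each network card reduces the data of its own Ranks.
    partials = [elementwise_sum(rank_buffers) for rank_buffers in nodes]
    # Stage 2b: the switch reduces the intermediate results of all cards.
    final = elementwise_sum(partials)
    # Stage 3: the switch broadcasts the result and each card copies it
    # into the address space of every local Rank.
    return [[list(final) for _ in rank_buffers] for rank_buffers in nodes]
```

With 4 nodes of 4 Ranks each, every Rank ends up holding the same vector, namely the element-wise sum over all 16 input buffers.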

Alternatively, assume that the switch shown in fig. 8 does not support collective communication offload while the 4 network cards all support it. The 4 network cards can then form a ring, as indicated by the dashed arrows, through the switch or through a direct connection network, and perform collective communication using the Ring algorithm. In this case the topology shown in fig. 8 also goes through three main stages when performing Allreduce collective communication: first, the collective communication offload tasks are issued to the network cards and the collective communication offload is configured; second, the 4 network cards each perform block-wise calculation and pipelined processing on the data of the Ranks in their own node, and transmit the intermediate calculation results to other nodes through the switch or the direct connection network for further calculation; finally, each network card obtains the final data result of the collective communication and distributes the final data to the address space corresponding to each Rank, and the Allreduce collective communication is completed.
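The specification names the Ring algorithm without detailing it. The sketch below shows the commonly used ring Allreduce (a reduce-scatter phase followed by an allgather phase) as one possible reading, assuming a sum reduction and assuming that each network card has already reduced the data of its local Ranks into one buffer split into as many chunks as there are cards. The function name and data layout are hypothetical.

```python
from typing import List

Chunk = List[float]


def ring_allreduce(bufs: List[List[Chunk]]) -> List[List[Chunk]]:
    """bufs[i][k] is chunk k held by card i; there are n cards and n chunks."""
    n = len(bufs)
    chunks = [[list(c) for c in node] for node in bufs]  # work on a copy

    # Reduce-scatter: in step s, card i sends chunk (i - s) mod n to its
    # ring successor, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [(i, (i - s) % n) for i in range(n)]
        data = [list(chunks[i][k]) for i, k in sends]  # snapshot first
        for (i, k), d in zip(sends, data):
            dst = (i + 1) % n
            chunks[dst][k] = [a + b for a, b in zip(chunks[dst][k], d)]

    # Allgather: card i now owns the fully reduced chunk (i + 1) mod n;
    # completed chunks are passed around the ring and overwrite old data.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n) for i in range(n)]
        data = [list(chunks[i][k]) for i, k in sends]
        for (i, k), d in zip(sends, data):
            chunks[(i + 1) % n][k] = d

    return chunks
```

Each card sends and receives one chunk per step, so the 4-card ring completes in 3 reduce-scatter steps and 3 allgather steps.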

The foregoing is merely an embodiment of the present invention. Common general knowledge, such as well-known specific structures and characteristics, is not described here in detail: a person of ordinary skill in the art knows all of the common technical knowledge in the technical field to which the invention belongs before the filing date or the priority date, has access to all of the prior art in that field, and is able to apply the routine experimental means available before that date. Such a person can, in light of the teaching given in this application, complete and implement the present scheme in combination with their own abilities, and certain typical known structures or known methods should not become obstacles to their implementation of this application. It should be noted that a person skilled in the art can make several changes and improvements without departing from the structure of the present invention; these should also be regarded as falling within the protection scope of the present invention, and they will not affect the effect of implementing the invention or the practicability of the patent. The scope of protection claimed by this application shall be determined by the contents of the claims, and the description of the embodiments and the like in the specification may be used to interpret the contents of the claims.
