Graph calculation optimization method based on heterogeneous FPGA data flow

Document No.: 1543980    Publication date: 2020-01-17    Views: 21    Language: Chinese

Reading note: This technology, "Graph calculation optimization method based on heterogeneous FPGA data flow", was designed and created by Liao Xiaofei, Chen Qingxiang, Zheng Long, Jin Hai, and Yao Pengcheng on 2019-09-29. Its main content is as follows: The invention relates to a graph calculation optimization method based on heterogeneous FPGA (field programmable gate array) data streams. An FPGA processing module reads graph data (Graph Data) stored in a shared storage module and, before graph calculation (Graph Processing) is performed on the graph data, traverses the graph data by means of the editability of the FPGA processing module, based on the power-law distribution characteristics of the graph data, to obtain at least one irregularity parameter of the irregular data stream present in the graph data. A scheduling module, by matching the at least one irregularity parameter against a preset access rule it fits, allocates the portion of the graph data that suits the CPU processing module to the CPU processing module for graph calculation to obtain a CPU graph calculation result, and allocates the remaining portion to the FPGA processing module, which has a parallel processing capability, for graph calculation to obtain an FPGA graph calculation result. The method balances, as far as possible, the processing load of the CPU processing module and the FPGA processing module while accelerating the processing of the graph data.

1. A graph calculation optimization method based on heterogeneous FPGA data flow, in particular a data-flow graph calculation optimization method based on CPU+FPGA heterogeneity, which, based on the power-law distribution characteristics of a graph, can send graph data in a scheduled manner to a CPU processing module (100) and an FPGA processing module (200) respectively for dynamically accelerated processing, so that the FPGA processing module (200) can cooperate with the CPU processing module (100) to process the irregular data flow in the graph data and obtain a graph calculation result;

the method comprises the following steps:

the shared storage module (300) stores graph data to be subjected to graph calculation processing;

the CPU processing module (100) and the FPGA processing module (200) are in communication connection with the shared storage module (300) in a parallel mode respectively, so that the CPU processing module (100) and the FPGA processing module (200) can read at least one part of the graph data in real time respectively, and the CPU processing module (100) and/or the FPGA processing module (200) perform graph calculation on at least one part of the graph data read respectively so as to obtain a CPU graph calculation result and/or an FPGA graph calculation result respectively;

an integration module (400) integrates the CPU graph calculation results and/or the FPGA graph calculation results to obtain the graph calculation results,

characterized in that:

the FPGA processing module (200) reads the graph data stored in the shared storage module (300) and, before graph calculation is performed on the graph data, traverses the graph data by means of the editability of the FPGA processing module (200) based on the power-law distribution characteristics of the graph data, so as to obtain at least one irregularity parameter of the irregular data flow present in the graph data,

the scheduling module (500), in a manner of matching the at least one irregularity parameter against a preset access rule fitting that parameter, allocates the part of the graph data fitting the CPU processing module (100) to the CPU processing module (100) for graph calculation so as to obtain a CPU graph calculation result, and allocates the other remaining part to the FPGA processing module (200) with a parallel processing function for graph calculation so as to obtain an FPGA graph calculation result.

2. The optimization method according to claim 1, characterized in that the scheduling module (500) is communicatively connected, in a manner permitting local access, with the FPGA processing module (200) having at least two threads that can be processed in parallel, so that, in the process of traversing the graph data, the FPGA processing module (200) writes the part of the graph data that does not fit the CPU processing module (100) in a parallel processing mode to the FPGA internal buffers (200a) corresponding to the at least two threads in the FPGA processing module (200); the FPGA processing module (200), acting as a heterogeneous graph computation accelerator for the CPU processing module (100), can therefore perform parallel graph calculation on the part of the graph data that the CPU processing module (100) is not fit to process, in a manner that saves communication time and writing time.
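The parallel staging described in this claim can be illustrated in software. The sketch below distributes the graph-data items that do not fit the CPU round-robin into one buffer per FPGA thread; the function and all names are illustrative stand-ins for the hardware buffers, not part of the patent.

```python
def stage_to_thread_buffers(irregular_items, num_threads):
    """Distribute graph-data items that do not fit the CPU into one
    internal buffer per FPGA thread (round-robin), so each thread can
    later run graph calculation on its own buffer in parallel.
    Names here are illustrative."""
    buffers = [[] for _ in range(num_threads)]
    for i, item in enumerate(irregular_items):
        buffers[i % num_threads].append(item)
    return buffers
```

In hardware the writes to the per-thread buffers would happen concurrently during traversal; the round-robin policy shown here is only one plausible balancing choice.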

3. The optimization method according to claim 1 or 2, characterized in that at least one thread in the FPGA processing module (200) is configured, based on the editability of the FPGA processing module (200), to traverse the graph data according to a traversal rule, and the at least one thread is communicatively connected with the scheduling module (500) via a local buffer (200b),

the step of traversing the at least one thread comprises at least:

selecting at least one sub-point from the graph data as a starting point to start traversing;

reading the sub-point data of the at least one sub-point;

reading the adjacent target sub-points and the edges of the two adjacent sub-points according to the structure in the graph data;

reading sub-point data of the target sub-point and edge data of the edge;

and loading the sub-point data of the at least one sub-point and of the target sub-points, the edges between the two adjacent sub-points, and the edge data of those edges into the local buffer (200b) of the FPGA processing module (200), so that the scheduling module (500) can calculate the at least one irregularity parameter of the graph data locally in a manner that saves communication time and writing time.
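The traversal steps above can be sketched as a breadth-first walk that stages sub-point (vertex) data and edge data into a local buffer. This is a hypothetical software model, not the hardware implementation; `graph`, `vertex_data`, and `edge_data` are assumed structures.

```python
from collections import deque

def traverse_and_stage(graph, vertex_data, edge_data, start, local_buffer):
    """Breadth-first traversal mirroring the claimed steps: start from
    one sub-point, read its data, follow each edge to the adjacent
    target sub-point, and stage the vertex and edge data into a local
    buffer so irregularity parameters can later be computed locally
    without further remote reads.  `graph` is an adjacency dict."""
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        local_buffer[("vertex", v)] = vertex_data[v]  # sub-point data
        for target in graph[v]:
            edge_key = (min(v, target), max(v, target))
            local_buffer[("edge", edge_key)] = edge_data[edge_key]  # edge data
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return local_buffer
```

After the walk, the buffer holds every reachable sub-point's data and every traversed edge's data, which is the working set the scheduling module would inspect.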

4. The optimization method according to one of the preceding claims, characterized in that the scheduling module (500) compares the reading of the at least one irregularity parameter with its corresponding threshold value, the threshold value being a characterization parameter fitting the processing capability of the CPU processing module (100):

if at least one of the irregularity parameters exceeds its corresponding threshold, the scheduling module (500) transmits its corresponding graph data to the FPGA internal buffer (200a) and/or on-chip memory (200c) for the FPGA processing module (200) to be able to perform parallel graph computation on the graph data;

if all the irregularity parameters do not exceed the corresponding threshold values, the scheduling module (500) remotely transmits the active points of the graph data to a CPU internal buffer on the CPU processing module (100) for the CPU processing module (100) to read the graph data from the shared memory module (300) based on the active points and perform graph calculation.
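The threshold test in this claim amounts to a simple dispatch rule: any exceeded threshold routes the data to the FPGA, otherwise it goes to the CPU. A minimal sketch, with illustrative parameter names and threshold values:

```python
def dispatch(partitions, thresholds):
    """Split graph-data partitions between the CPU and FPGA paths.
    A partition whose irregularity parameters all stay within the
    CPU-friendly thresholds goes to the CPU; if any parameter exceeds
    its threshold, the partition goes to the FPGA data-flow
    accelerator.  `partitions` maps a partition id to a dict of named
    irregularity parameters (names here are illustrative)."""
    cpu_side, fpga_side = [], []
    for pid, params in partitions.items():
        exceeds = any(params[name] > limit for name, limit in thresholds.items())
        (fpga_side if exceeds else cpu_side).append(pid)
    return cpu_side, fpga_side
```

For example, with thresholds `{"degree_skew": 4.0, "access_irregularity": 0.5}`, a partition with `degree_skew` 9.0 is routed to the FPGA side even if its other parameters are small.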

5. The optimization method according to one of the preceding claims, wherein the shared storage module (300) is communicatively connected to the at least one thread via the on-chip memory (200c), so that while the FPGA processing module (200) is performing graph traversal and/or graph calculation on the graph data of a previous iteration, the on-chip memory (200c) can already read the next layer of graph data to be processed. Graph traversal of the next layer of graph data can therefore begin immediately after the FPGA processing module (200) finishes the calculation of the previous iteration, and/or while the CPU processing module (100) is still performing graph calculation on the graph data of the previous iteration, without waiting for the shared storage module (300) to transmit the next layer of graph data to the FPGA processing module (200).
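The overlap of next-layer prefetch with current-layer computation described above is classic double buffering. The following is a software model of the control flow only, with `compute` and `prefetch` as stand-in callables for the hardware stages:

```python
def process_layers(layers, compute, prefetch):
    """Double-buffering sketch of the on-chip memory behavior: while
    the current layer is being computed, the next layer has already
    been fetched, so computation never waits on the shared storage
    module.  In hardware the prefetch and compute would run
    concurrently; here they are sequential for clarity."""
    if not layers:
        return []
    results = []
    current = prefetch(layers[0])      # initial fill
    for nxt in layers[1:]:
        staged = prefetch(nxt)         # overlaps compute in hardware
        results.append(compute(current))
        current = staged
    results.append(compute(current))
    return results
```

The key property is that every `compute` call operates on data that was staged during the previous step, which is exactly the claimed hiding of transmission latency.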

6. The optimization method according to one of the preceding claims, wherein at least one thread in the FPGA processing module (200) is configured to count the number of adjacent sub-points of each sub-point and generate a number set, and the thread is configured to determine whether the structure of the graph data is regular based on the number set according to at least one of the following manners:

calculating the ratio of the maximum number to the minimum number;

calculating the ratio of the maximum number to the average number;

calculating the ratio of the minimum number to the average number;

calculating the ratio of the minimum number to the median of the number;

the ratio between the maximum number and the median of the number is calculated.
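The five ratios over the neighbor-count set can be computed directly. A minimal sketch; the regularity threshold of 4.0 is an illustrative assumption, as the claim does not fix one:

```python
from statistics import mean, median

def regularity_ratios(degree_counts):
    """Compute the degree-ratio metrics named in the claim from the
    per-sub-point neighbor counts.  Large ratios suggest a power-law
    (irregular) structure; ratios near 1 suggest a regular graph."""
    lo, hi = min(degree_counts), max(degree_counts)
    avg, med = mean(degree_counts), median(degree_counts)
    return {
        "max_over_min": hi / lo,
        "max_over_mean": hi / avg,
        "min_over_mean": lo / avg,
        "min_over_median": lo / med,
        "max_over_median": hi / med,
    }

def is_regular(degree_counts, threshold=4.0):
    """Judge the structure regular when every ratio stays below a
    (hypothetical) threshold; any one ratio may also be used alone,
    per the claim's 'at least one of the following manners'."""
    return all(r <= threshold for r in regularity_ratios(degree_counts).values())
```

A near-uniform degree set such as `[3, 3, 4, 4]` passes, while a power-law-like set such as `[1, 1, 1, 100]` fails on the max/min ratio.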

7. The optimization method according to one of the preceding claims, wherein the FPGA processing module (200) calculates the average data-access time of each node from the number of data accesses and the total access time of the previous iteration, so as to determine how the access regularity of the graph application changes during the calculation and to estimate the access regularity of the next iteration, based on which the scheduling module (500) decides whether to schedule the graph application onto the CPU or to leave it on the local FPGA data-flow accelerator for execution.
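The per-node average-access-time estimate described here might be modeled as follows. The overall score (max over mean of the per-node averages) is an illustrative choice, since the claim does not fix a formula:

```python
def predict_access_regularity(access_counts, total_times):
    """Estimate the per-node average access time from the previous
    iteration's access counts and total access times, and derive an
    overall irregularity score (max over mean of the averages) that a
    scheduler could compare against a threshold.  All names and the
    score definition are illustrative assumptions."""
    averages = {n: total_times[n] / c for n, c in access_counts.items() if c}
    vals = list(averages.values())
    score = max(vals) * len(vals) / sum(vals)  # max / mean
    return averages, score
```

A score near 1.0 would indicate that accesses cost roughly the same everywhere (CPU-friendly); a large score would indicate a few expensive hot nodes (FPGA-friendly).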

8. A graph calculation optimization system based on heterogeneous FPGA data flow, in particular a CPU+FPGA heterogeneous data-flow graph calculation optimization system, which, based on the power-law distribution characteristics of a graph, can send graph data in a scheduled manner to a CPU processing module (100) and an FPGA processing module (200) respectively for dynamically accelerated processing, so that the FPGA processing module (200) can cooperate with the CPU processing module (100) to process the irregular data flow in the graph data and obtain a graph calculation result, while balancing the processing load between the CPU processing module (100) and the FPGA processing module (200) caused by the irregularity of the graph data;

the system comprises:

shared memory module (300): the graph data is used for storing graph data to be subjected to graph calculation processing;

the CPU processing module (100);

the FPGA processing module (200); the CPU processing module (100) and the FPGA processing module (200) are in communication connection with the shared storage module (300) in a parallel mode respectively, so that the CPU processing module (100) and the FPGA processing module (200) can read at least one part of the graph data in real time respectively, and the CPU processing module (100) and/or the FPGA processing module (200) perform graph calculation on at least one part of the graph data read respectively so as to obtain a CPU graph calculation result and/or an FPGA graph calculation result respectively; and

an integration module (400) for integrating the CPU graph calculation result and/or the FPGA graph calculation result to obtain a graph calculation result,

characterized in that:

the FPGA processing module (200) reads the graph data stored in the shared storage module (300) and, before graph calculation is performed on the graph data, traverses the graph data by means of the editability of the FPGA processing module (200) based on the power-law distribution characteristics of the graph data, so as to obtain at least one irregularity parameter of the irregular data flow present in the graph data,

the scheduling module (500), in a manner of matching the at least one irregularity parameter against a preset access rule fitting that parameter, allocates the part of the graph data fitting the CPU processing module (100) to the CPU processing module (100) for graph calculation so as to obtain a CPU graph calculation result, and allocates the other remaining part to the FPGA processing module (200) with a parallel processing function for graph calculation so as to obtain an FPGA graph calculation result.

9. The optimization system according to claim 8, characterized in that the scheduling module (500) is communicatively connected, in a manner permitting local access, with the FPGA processing module (200) having at least two threads that can be processed in parallel, so that, in the process of traversing the graph data, the FPGA processing module (200) can write the part of the graph data that does not fit the CPU processing module (100) in a parallel processing mode to the FPGA internal buffers (200a) corresponding to the at least two threads in the FPGA processing module (200); the FPGA processing module (200), acting as a heterogeneous graph computation accelerator for the CPU processing module (100), can therefore perform parallel graph calculation on the part of the graph data that the CPU processing module (100) is not fit to process, in a manner that saves communication time and writing time.

10. A graph data scheduling method based on a CPU+FPGA heterogeneous system, which can divert the irregular data flow of graph data by means of the editability of an FPGA, according to at least one parameter of the irregular data flow, so as to exploit the respective architectural advantages of an FPGA processing module (200) and a CPU processing module (100) to cooperatively accelerate the processing of the graph data,

characterized in that:

the FPGA processing module (200) reads the graph data stored in a shared storage module (300) and, before graph calculation is performed on the graph data, traverses the graph data by means of the editability of the FPGA processing module (200) based on the power-law distribution characteristics of the graph data, so as to obtain at least one irregularity parameter of the irregular data stream present in the graph data,

the scheduling module (500), in a manner of matching the at least one irregularity parameter against a preset access rule fitting that parameter, allocates the part of the graph data fitting the CPU processing module (100) to the CPU processing module (100) for graph calculation so as to obtain the CPU graph calculation result, and allocates the other remaining part to the FPGA processing module (200) with the parallel processing function for graph calculation so as to obtain the FPGA graph calculation result.

Technical Field

The invention relates to the technical field of graph computation, in particular to a graph computation optimization method based on heterogeneous FPGA (field programmable gate array) data streams.

Background

Graph computing (Graph Processing) is gaining increasing attention in academia and industry due to its wide applicability and excellent utility. For example, graph computing is applied in social media, molecular structure relationships in science, advertisement recommendation for e-commerce platforms, web page information, and similar fields. A Graph is an abstract data structure for expressing association relationships between objects; it can convert all relationships between people, products, ideas, facts and interests into information streams or data streams for storage and calculation. Graphs can also be used for data mining and machine learning, such as identifying influential people and information, community discovery, finding product and advertisement users, and modeling complex dependent data. In the graph structure, vertices (Vertex) represent objects and edges (Edge) represent the relationships between objects. Data that can be abstracted into a graph description is graph data (Graph Data). Graph calculation is the process of expressing and solving a problem with the graph as the data model. Representative graph computing systems include Google's Pregel, the standalone GraphChi, Cucha for the CPU, and GraphGen for the FPGA. Compared with the first three processing environments, FPGA-based graph computation research has developed slowly. The FPGA, as a semi-custom circuit, solves the problem that an Application Specific Integrated Circuit (ASIC) is difficult to customize. However, precisely because it adds customizable features, it consumes considerable resources on that customizability and requires the clocks of the various modules to be coordinated, resulting in an overly long critical path; the overall frequency of an FPGA is therefore a step lower than that of a CPU or an ASIC. Second, although FPGAs have a high degree of parallelism compared with CPUs (because they can instantiate multiple custom pipelines), today's graph computation does not take good advantage of this highly parallel nature. For these two reasons, FPGA graph computation has not yet achieved high efficiency.

The existing FPGA-based graph computing system designs mainly have two implementation modes: a CPU-FPGA-based heterogeneous implementation and an FPGA-based graph-architecture implementation.

The CPU-FPGA-based heterogeneous implementation rests on the different characteristics of the two processors. The CPU is a general-purpose processor: it offers only low parallelism with a small number of threads, but it has a very high operating frequency, a well-designed instruction set, and very efficient program execution. The FPGA has rich logic resources and can realize fine-grained pipeline parallelism, achieving high parallelism with thousands of threads. Therefore, for workloads requiring low parallelism the CPU is superior to the FPGA, while for workloads requiring high parallelism the FPGA is superior to the CPU; accordingly, in the design of a CPU-FPGA heterogeneous computing system, the start/end stages are generally placed on the CPU for execution, and the middle stages with high parallelism are placed on the FPGA. The heterogeneous CPU-FPGA graph calculation architecture thus has the following advantages. (1) The CPU and FPGA excel at different workloads: the FPGA is good at processing massively parallel applications, while the parallelism of the CPU is insufficient. In the graph calculation process, due to the power-law characteristic of the graph, a few points have extremely large degrees and can be processed efficiently only with extremely high parallelism, whereas most points have only small degrees and are handled to great advantage on a CPU. A heterogeneous CPU-FPGA framework combining the advantages of the two is a good scheme for solving the graph calculation problem. (2) The FPGA has large bandwidth, which can make up for the CPU's weakness in data reading. In graph computation, the time spent accessing memory is much longer than the computation time, and a traditional CPU processor does not have enough bandwidth to support graph applications. The FPGA usually provides multi-port, large-bandwidth DRAM for reading data, so it has a great advantage over the CPU in reading graph data; combining the CPU and the FPGA is therefore a sound solution to the problem of insufficient bandwidth in graph applications.

For example, Chinese patent publication No. CN107844433A discloses a heterogeneous hybrid memory server architecture. It includes a CPU computing board and an NVM board. The CPU computing board carries a CPU chip connected with a DRAM chip; the NVM board carries a main FPGA chip connected with a DRAM chip and an NVM memory bank; the CPU chip is connected with the main FPGA chip through a QPI bus; and the main FPGA chip maintains the global cache consistency of the non-volatile memory and realizes global memory sharing. The architecture constructs a high-capacity, low-power heterogeneous hybrid memory system by using low-power, high-capacity NVM as far-end memory and low-capacity, high-speed DRAM as near-end memory; it addresses the memories uniformly, solving the coupling and speed-matching problems of a heterogeneous memory system and maintaining global data consistency.

For example, chinese patent publication No. CN109471732A discloses a data allocation method for a CPU-FPGA oriented heterogeneous multi-core system. Compiling the source code into intermediate code of a low-level virtual machine LLVM through a Clang front end; executing the intermediate code of the low-level virtual machine LLVM by using the low-level virtual machine LLVM, and receiving input data so as to obtain a track of data access and an instruction track; generating a dynamic data dependency graph DDDG through an instruction track to represent a control flow and a data flow of an FPGA kernel; sending the obtained data access track into a cache simulator CacheSimulator, and acquiring a cache conflict graph CCG; and constructing an integer linear programming formula, and solving the integer linear programming formula according to the dynamic data dependency graph DDDG and the cache conflict graph CCG to obtain an optimal data distribution scheme.

For example, chinese patent publication No. CN107273331A discloses a heterogeneous computing system and method based on architecture of CPU + GPU + FPGA. The computing system comprises a CPU host module, one or more GPU heterogeneous acceleration modules and one or more FPGA heterogeneous acceleration modules, wherein the CPU host module is respectively in communication connection with the GPU heterogeneous acceleration modules and the FPGA heterogeneous acceleration modules; the CPU host module is used for resource management and allocating processing tasks to the GPU heterogeneous acceleration module and/or the FPGA heterogeneous acceleration module. The GPU heterogeneous acceleration module carries out parallel processing on tasks from the CPU host module; and the FPGA heterogeneous acceleration module performs serial or parallel processing on tasks from the CPU host module. The heterogeneous computing system provided by the patent can exert the management and control advantages of a CPU, the parallel processing advantages of a GPU and the performance-power consumption ratio of an FPGA, has the advantage of flexible configuration, can adapt to different application scenes, and meets different types of task requirements.

For example, Chinese patent publication No. CN108596885A discloses a fast SAR image change detection method based on a CPU+FPGA. The method uses CPU+FPGA heterogeneous parallel acceleration to realize change detection in SAR images; the parallel computing capability of the FPGA is far higher than the serial computing capability of the CPU. The FPGA parallelizes the data processing in the whole algorithm, and the CPU at the host end controls task scheduling and parameter sizes. The SAR image change detection algorithm is written in parallel form using the OpenCL parallel programming standard, so the code is highly portable and can run on different heterogeneous platforms.

Chinese patent publication No. CN101165652 discloses a method for loading multiple FPGAs simultaneously with one CPU, for a system comprising a CPU and one or more FPGA chips. The method comprises: step one, generating an FPGA loading file by merging the loading bit-stream files of the multiple FPGAs; step two, performing logic expansion on the address bus of the CPU to obtain the register and latch required for loading the multiple FPGAs; and step three, the CPU reads the FPGA loading file into its extended memory and loads the multiple FPGAs, or one FPGA among them, by controlling the register and the latch to generate the loading timing sequence. Compared with the prior art, the method greatly accelerates the speed at which the CPU loads the multiple FPGAs, improves the reliability and flexibility of the loading circuit, and effectively saves external FLASH capacity.

For example, the CPU+FPGA-based heterogeneous computing system disclosed in Chinese patent publication No. CN108776649A relates to the technical field of heterogeneous computing, in which multiple FPGA acceleration modules cooperate with a CPU main control module to complete the same computing task. The CPU main control module is responsible for logic judgment, management and control, and distributes computing tasks to the FPGA acceleration modules; the FPGA acceleration modules are responsible for accelerating the computing tasks. The interior of each FPGA acceleration module is divided into a static area, responsible for PCIe DMA communication, SRIO communication and DDR control, and a dynamically reconfigurable area, which executes the kernel functions issued by the CPU main control module to accelerate the computing task. The invention realizes parallel accelerated processing and pipelined accelerated processing of computing tasks for different task types, which can greatly improve the throughput of task processing, shorten task execution time, and greatly improve the computing performance of the computer.

However, in these designs the whole graph calculation process is executed across two architectures, a CPU and an FPGA. Although this combines the advantages of both compared with a graph calculation architecture based on a single FPGA or a single CPU, it also has the following disadvantages: (1) when execution switches between the CPU and the FPGA, the program execution of the current processor must be stopped, the current running state and processing data must be saved and then transmitted over a PCIe channel to the other processor, and the execution state must be rebuilt there; this process consumes a large amount of time, often more than 20% of the total program execution time; (2) when the CPU hands program execution over to the FPGA, the CPU must stop working, and when the FPGA hands program execution back to the CPU, the FPGA must stop working.

Furthermore, on the one hand, there are differences in understanding among those skilled in the art; on the other hand, although the inventors studied a large number of documents and patents when making the present invention, space does not permit listing all of their details and contents. This by no means implies that the present invention lacks these prior-art features; on the contrary, the present invention may be provided with all of the features of the prior art, and the applicant reserves the right to add related prior art to the background section.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a graph calculation optimization method based on heterogeneous FPGA (field programmable gate array) data flow, in particular a data-flow graph calculation optimization method based on CPU+FPGA heterogeneity. Based on the power-law distribution characteristics of a graph, the method can send graph data in a scheduled manner to a CPU processing module and an FPGA processing module respectively for dynamically accelerated processing, so that the FPGA processing module can cooperate with the CPU processing module to process the irregular data flow in the graph data and obtain a graph calculation result, while balancing as far as possible the processing load between the CPU processing module and the FPGA processing module caused by the irregularity of the graph data. The method comprises the following steps: the shared storage module stores graph data to be subjected to graph calculation processing; the CPU processing module and the FPGA processing module are each communicatively connected with the shared storage module in parallel, so that each can read at least a part of the graph data in real time, and the CPU processing module and/or the FPGA processing module perform graph calculation on the part of the graph data each has read, so as to obtain a CPU graph calculation result and/or an FPGA graph calculation result respectively; and the integration module integrates the CPU graph calculation result and/or the FPGA graph calculation result to obtain the graph calculation result. Before graph calculation is performed on the graph data, the FPGA processing module reads the graph data stored in the shared storage module and traverses it by means of the editability of the FPGA processing module, based on the power-law distribution characteristics of the graph data, to obtain at least one irregularity parameter of the irregular data stream present in the graph data. The scheduling module, by matching the at least one irregularity parameter against a preset access rule fitting that parameter, allocates the part of the graph data that fits the CPU processing module to the CPU processing module for graph calculation to obtain the CPU graph calculation result, and allocates the other remaining part to the FPGA processing module with its parallel processing function for graph calculation to obtain the FPGA graph calculation result.

The invention realizes an efficient heterogeneous data-flow graph calculation execution framework based on instruction regularity in graph data, which specifically comprises the following. The execution model is divided into two parts to realize efficient graph calculation. In the first part, the FPGA processing module measures the instruction regularity of the graph-data calculation process: it calculates the control regularity and the memory-access regularity and quantizes them as the basis of the subsequent scheduling algorithm. The second part is a graph scheduling algorithm, which, according to the control regularity and access regularity calculated by the first part, dynamically schedules the graph data suited to a traditional processor and the graph data suited to the data-flow accelerator according to their differences, thereby accelerating graph calculation. Compared with the traditional graph calculation method, the invention introduces a data-stream processing model to handle the part that the traditional CPU is not good at processing, thereby solving the problem of inefficient instruction-level parallelism caused by the irregular accesses of the graph and increasing calculation efficiency.
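The two-part execution model above can be summarized as a pipeline: quantify regularity, then schedule and merge. The following is a minimal sketch with every stage as a stand-in callable; none of the function names come from the patent.

```python
def heterogeneous_execute(partitions, quantify, schedule, run_cpu, run_fpga, merge):
    """End-to-end sketch of the two-part execution model: phase 1
    quantifies control/access regularity per partition (on the FPGA in
    the real system); phase 2 schedules each partition to the CPU or
    to the FPGA data-flow accelerator and merges the results.  All
    callables are illustrative stand-ins for the hardware stages."""
    metrics = {pid: quantify(data) for pid, data in partitions.items()}
    cpu_ids, fpga_ids = schedule(metrics)
    cpu_results = {pid: run_cpu(partitions[pid]) for pid in cpu_ids}
    fpga_results = {pid: run_fpga(partitions[pid]) for pid in fpga_ids}
    return merge(cpu_results, fpga_results)
```

In the real system the two result sets are produced concurrently on the two processors; the merge step corresponds to the integration module.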

According to a preferred embodiment, the scheduling module is communicatively connected, in a manner permitting local access, with the FPGA processing module having at least two threads that can be processed in parallel, so that, in the process of traversing the graph data, the FPGA processing module writes the part of the graph data that does not fit the CPU processing module in a parallel processing mode to the FPGA internal buffers corresponding to the at least two threads. The FPGA processing module, acting as the heterogeneous graph computation accelerator of the CPU processing module, can thus perform parallel graph calculation on the part of the graph data that the CPU processing module is not fit to process, in a manner that saves communication time and writing time, so as to balance as much as possible the processing load of the CPU processing module and the FPGA processing module caused by the irregularity of the graph data.

According to a preferred embodiment, at least one thread in the FPGA processing module is configured to traverse the graph data according to the traversal rule, relying on the editability of the FPGA processing module, and the at least one thread is communicatively connected to the scheduling module through a local cache. The traversal of the at least one thread at least comprises: selecting at least one vertex from the graph data as a starting point of the traversal; reading the vertex data of the at least one vertex; reading the adjacent target vertices and the edges between the two adjacent vertices according to the structure of the graph data; reading the vertex data of the target vertices and the edge data of the edges; and loading the vertex data of the at least one vertex, the target vertices, the edges between adjacent vertices, and the associated edge data into the local buffer of the FPGA processing module, so that the scheduling module can compute the irregularity parameters of the graph data locally while saving communication and write time.
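The traversal steps above can be sketched in software as follows. This is an illustrative sketch only: the CSR adjacency layout (`row_ptr`/`col_idx`) and the dictionary-based local buffer are assumptions, standing in for the FPGA's on-chip traversal pipeline and local buffer.

```python
# Minimal traversal sketch following the steps above: pick start vertices,
# read their data, follow edges to the adjacent target vertices, and stage
# everything in a local buffer so the irregularity parameters can later be
# computed locally without further memory round-trips.
def traverse_to_local_buffer(row_ptr, col_idx, vertex_data, edge_data, start_vertices):
    """Stage vertex, neighbour and edge records of the traversed region."""
    local_buffer = []
    for v in start_vertices:
        for e in range(row_ptr[v], row_ptr[v + 1]):   # edges of v in CSR form
            u = col_idx[e]                            # adjacent target vertex
            local_buffer.append({
                "src": v, "src_data": vertex_data[v],
                "dst": u, "dst_data": vertex_data[u],
                "edge_data": edge_data[e],
            })
    return local_buffer
```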

According to a preferred embodiment, the scheduling module compares the reading of the at least one irregularity parameter with its corresponding threshold, where each threshold is a characterization parameter matched to the processing capability of the CPU processing module. If at least one irregularity parameter exceeds its corresponding threshold, the scheduling module transmits the corresponding graph data to the FPGA internal buffer and/or on-chip memory, so that the FPGA processing module can perform parallel graph computation on it. If none of the irregularity parameters exceeds its corresponding threshold, the scheduling module remotely transmits the active vertices of the graph data to the CPU internal buffer on the CPU processing module, so that the CPU processing module reads the graph data from the shared storage module based on the active vertices and performs the graph computation.
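The threshold rule above reduces to a simple comparison, sketched below under the assumption that the parameters and thresholds arrive as parallel lists; the function name and return values are illustrative, not part of the patent.

```python
# Hypothetical sketch of the scheduling rule: a block of graph data goes to
# the FPGA if ANY irregularity parameter exceeds its threshold, and to the
# CPU only if all parameters stay within their CPU-friendly bounds.
def schedule(irregularity_params, thresholds):
    """Route one block of graph data using the threshold rule above."""
    if any(p > t for p, t in zip(irregularity_params, thresholds)):
        return "FPGA"   # copy to FPGA internal buffer / on-chip memory
    return "CPU"        # send the active vertices to the CPU internal buffer
```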

According to a preferred embodiment, the shared storage module is communicatively connected to the on-chip memory, so that the on-chip memory can read the next layer of graph data to be processed while the FPGA processing module is still traversing and/or computing the graph data of the previous iteration layer. After the FPGA processing module completes the graph computation of the previous layer, and/or while the CPU processing module is computing the previous layer, the traversal of the next layer can therefore start directly from on-chip memory without waiting for the shared storage module to transmit the next layer of graph data to the FPGA processing module.
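The overlap of computation and data transfer described above is a double-buffering pattern. The sketch below is an assumption-laden software analogue: Python threads stand in for the hardware transfer engine, and copying a list simulates the shared-memory to on-chip-memory transfer.

```python
# Double-buffering sketch of the prefetch described above: while the current
# iteration layer is being processed, the next layer is loaded concurrently,
# so processing never waits on the shared storage module.
import queue
import threading


def run_layers(layers, process):
    """Process each layer while prefetching the next one concurrently."""
    prefetched = queue.Queue(maxsize=1)

    def prefetch(layer):
        prefetched.put(list(layer))   # simulate shared memory -> on-chip copy

    results = []
    t = threading.Thread(target=prefetch, args=(layers[0],))
    t.start()
    for i in range(len(layers)):
        data = prefetched.get()
        t.join()
        if i + 1 < len(layers):       # start loading the next layer early
            t = threading.Thread(target=prefetch, args=(layers[i + 1],))
            t.start()
        results.append(process(data))  # graph computation on the current layer
    return results
```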

According to a preferred embodiment, at least one thread in the FPGA processing module is configured to count the number of adjacent vertices (the degree) of each vertex and generate a degree set, and the thread can judge whether the structure of the graph data is regular based on the degree set in at least one of the following ways: calculating the ratio of the maximum degree to the minimum degree; calculating the ratio of the maximum degree to the average degree; calculating the ratio of the minimum degree to the average degree; calculating the ratio of the minimum degree to the median degree; and calculating the ratio of the maximum degree to the median degree.
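The degree-ratio checks above can be sketched as below. The cut-off value `max_ratio` and the choice of which ratios to test are illustrative assumptions; the patent leaves the concrete decision rule open ("at least one of the following ways").

```python
# Sketch of the degree-based structure check: reduce the per-vertex degree
# set to the listed ratios, then judge regularity. A power-law degree set
# makes max/min and max/avg blow up, which flags the structure as irregular.
from statistics import mean, median


def degree_ratios(degrees):
    """Compute the five ratios listed in the embodiment above."""
    mx, mn = max(degrees), min(degrees)
    avg, med = mean(degrees), median(degrees)
    return {
        "max/min": mx / max(mn, 1),
        "max/avg": mx / avg,
        "min/avg": mn / avg,
        "min/median": mn / max(med, 1),
        "max/median": mx / max(med, 1),
    }


def structure_is_regular(degrees, max_ratio=4.0):
    """Illustrative rule: regular if the spread-indicating ratios stay small."""
    r = degree_ratios(degrees)
    return r["max/min"] <= max_ratio and r["max/avg"] <= max_ratio
```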

According to a preferred embodiment, the FPGA processing module calculates the average data-access time per node from the number of data accesses and the total access time of the previous iteration, in order to track how the access regularity of the graph application changes during the computation. The access regularity estimated for the next iteration is then used by the scheduling module to decide whether to schedule the graph application onto the CPU or to keep it running in the local FPGA data-flow accelerator.
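A software sketch of this per-iteration update is given below. The exponential-smoothing factor `alpha` and the `cpu_friendly_limit` threshold are illustrative assumptions; the patent only specifies that the previous iteration's access count and total access time feed the next iteration's regularity estimate.

```python
# Sketch of the access-regularity update: the average access time of the
# previous iteration is folded into a running estimate, which the scheduling
# module then uses to place the next iteration on the CPU or the FPGA.
def update_access_regularity(total_access_time, num_accesses,
                             prev_estimate, alpha=0.5):
    """Estimate next-iteration access cost from last-iteration timing."""
    avg_time = total_access_time / num_accesses   # per-node average access time
    # exponential smoothing tracks how regularity drifts across iterations
    return alpha * avg_time + (1 - alpha) * prev_estimate


def place_next_iteration(estimate, cpu_friendly_limit=100.0):
    """Scheduling decision: CPU if accesses stay cheap and regular."""
    return "CPU" if estimate <= cpu_friendly_limit else "FPGA"
```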

According to a preferred embodiment, the invention further provides a graph computation optimization system based on heterogeneous FPGA data flow, in particular a data-flow graph computation optimization system based on CPU + FPGA heterogeneity, which, based on the power-law distribution characteristics of a graph, can send graph data to a CPU processing module and an FPGA processing module respectively in a scheduled manner for dynamic accelerated processing, so that the FPGA processing module can cooperate with the CPU processing module to process the irregular data flow in the graph data while balancing the processing load between the two modules caused by the irregularity of the graph data, thereby obtaining a graph computation result. The system comprises: a shared storage module for storing the graph data to be processed; the CPU processing module; the FPGA processing module; and an integration module. The CPU processing module and the FPGA processing module are each communicatively connected to the shared storage module in parallel, so that each can read at least a part of the graph data in real time and perform graph computation on the part it reads, yielding a CPU graph computation result and/or an FPGA graph computation result, which the integration module integrates to obtain the overall graph computation result. The FPGA processing module reads the graph data stored in the shared storage module and, before graph computation, traverses the graph data by means of its editability, based on the power-law distribution characteristics of the graph data, to obtain at least one irregularity parameter of the irregular data stream present in the graph data. The scheduling module, by matching the at least one irregularity parameter against a preset access rule fitted to it, allocates the part of the graph data that fits the CPU processing module to the CPU processing module for graph computation, obtaining the CPU graph computation result, and allocates the remaining part to the FPGA processing module, which has a parallel processing function, for graph computation, obtaining the FPGA graph computation result.

According to a preferred embodiment, in the system, the scheduling module is communicatively connected, in a way that permits local access, to the FPGA processing module, which has at least two threads that can run in parallel, so that while traversing the graph data the FPGA processing module can write the part of the graph data that does not fit the CPU processing module into the FPGA internal buffers corresponding to the at least two threads in parallel. The FPGA processing module, as the heterogeneous graph-computation accelerator of the CPU processing module, can thus perform parallel graph computation on that part while saving communication and write time, balancing as far as possible the processing load between the CPU processing module and the FPGA processing module caused by the irregularity of the graph data.

According to a preferred embodiment, the invention further provides a graph data scheduling method based on a CPU + FPGA heterogeneous system, which can split the irregular data stream of the graph data according to at least one parameter of the irregular data stream by means of the editability of the FPGA, so as to exploit the respective architectural advantages of the FPGA processing module and the CPU processing module and cooperatively accelerate the processing of the graph data. The FPGA processing module reads the graph data stored in a shared storage module and, before graph computation, traverses the graph data by means of its editability, based on the power-law distribution characteristic of the graph data, to obtain at least one irregularity parameter of the irregular data stream existing in the graph data. The scheduling module, by matching the at least one irregularity parameter against a preset access rule fitted to it, allocates the part of the graph data that fits the CPU processing module to the CPU processing module for graph computation to obtain the CPU graph computation result, and allocates the remaining part to the FPGA processing module with the parallel processing function for graph computation to obtain the FPGA graph computation result.

The invention provides a heterogeneous implementation based on CPU-FPGA (Central Processing Unit - Field Programmable Gate Array), motivated by the different characteristics of the two processors: the CPU is a general-purpose processor that achieves only low parallelism with a small number of threads but has a very high clock frequency, a well-designed instruction set, and very efficient instruction execution, while the FPGA has abundant logic resources and supports fine-grained pipeline parallelism, so that it can achieve high parallelism with thousands of threads. The heterogeneous CPU-FPGA graph computation architecture therefore has the following advantages. (1) The CPU and the FPGA excel in different domains: the FPGA is good at highly parallel workloads, while the parallelism of the CPU is limited. In graph computation, owing to the power-law characteristic of the graph, a few vertices have extremely large degrees and can be processed efficiently only with very high parallelism, while most vertices have only small degrees and are handled with great advantage on a CPU. A heterogeneous CPU-FPGA framework that combines the strengths of both is thus a good solution to the graph computation problem. (2) The FPGA has large memory bandwidth and can compensate for the CPU's weakness in data reading. In graph computation, the time spent accessing memory far exceeds the computation time, and a traditional CPU does not have enough bandwidth to support the graph application. The FPGA usually provides multi-port, high-bandwidth DRAM for reading data, so it has a great advantage over the CPU in reading graph data; combining the CPU and the FPGA is therefore a strong solution to the bandwidth shortage in graph applications.

Drawings

FIG. 1 is a schematic diagram of a logical structure of an optimization system provided by the present invention; and

FIG. 2 is a schematic diagram of a preferred logic structure of an FPGA module provided by the present invention.

List of reference numerals

100: CPU processing module

200: FPGA processing module

200a: FPGA internal buffer

200b: local buffer

200c: on-chip memory

300: shared storage module

400: integration module

500: scheduling module

Detailed Description

The invention is described in detail below with reference to FIGS. 1-2.

Interpretation of terms

A CPU: central Processing Unit-Central Processing Unit;

FPGA: field programmable Gate Array-Field programmable Gate Array;

DRAM: dynamic random access memory

BRAM: block random access memory

Graph-Graph

Graph Data-Graph Data

Graph Processing-Graph computation
