Data caching implementation method for the SPM storage hierarchy

Document No.: 923744    Publication date: 2021-03-02

Note: This invention, "Data caching implementation method for the SPM storage hierarchy", was created by Fang Yanfei, Wen Yanhua, Li Yanbing, Yang Xiaochuan, Wang Jingyu, and He Wangquan on 2019-08-26. Its main content is as follows: The invention discloses a data caching implementation method for the SPM storage hierarchy, comprising the following steps: S1, establish the programming description of the data cache; S2, the compiler analyzes the compiler directive and determines the loop blocking parameter LoopBlockKey; S3, using the loop blocking parameter obtained in S2, the compiler computes the ratio of each loop block's computation and memory-access overhead to its data-transfer overhead, yielding the parallel hiding parameter Hkey; S4, based on the parallel hiding parameter obtained in S3, apply feedback correction to obtain the final loop blocking parameter; S5, select a loop reconstruction strategy and reconstruct the loop according to the loop blocking parameter and the parallel hiding parameter. The invention effectively improves the accuracy of the compiler's context analysis and the memory-access efficiency of applications, allows computation to proceed in parallel with data transfers between different storage levels, makes effective use of each storage level of the processor, and further improves program performance.

1. A data caching implementation method for the SPM storage hierarchy, characterized in that the method comprises the following steps:

S1, establish the programming description of the data cache: before the loop, add a compiler directive that declares the data set in the loop requiring memory-access optimization and the information of the SPM buffer to be used for that optimization;

S2, the compiler analyzes the directive from S1 and determines the loop blocking parameter LoopBlockKey: from the total size of the SPM buffer and the data set to be optimized, it determines the amount of data transferred in each batch, derives from that amount the number of loop iterations whose data accesses can be kept within the SPM buffer, and takes this iteration count as the loop blocking parameter LoopBlockKey;

S3, using the loop blocking parameter LoopBlockKey obtained in S2, the compiler computes the ratio of each loop block's computation and memory-access overhead to its data-transfer overhead, yielding the parallel hiding parameter Hkey; the specific steps are as follows:

S31, obtain the iteration count It contained in each loop block after blocking from the loop blocking parameter LoopBlockKey, where It = LoopBlockKey;

S32, compute the computation overhead Tc of one iteration: scan all statements in the loop, determine the type of each operation in each statement, look up each operation's overhead value by its type, and accumulate all operation overheads in the loop to obtain the per-iteration computation overhead Tc;

S33, compute the memory-access overhead Tm of one iteration: scan all memory accesses in the loop and, using the data set identified in S2, decide for each access whether it can be optimized; if so, its per-iteration overhead is the cost of accessing the SPM buffer, otherwise it is the cost of accessing the corresponding storage level; accumulate the overheads of all accesses in the loop to obtain the per-iteration memory-access overhead Tm;

S34, compute the data-transfer overhead Tt of one loop block: obtain the number of transfers n from the data set named in the directive of S1, and obtain the volume of the i-th transfer from the loop blocking parameter as Tsize_i = LoopBlockKey × TypeSize_i, where TypeSize_i is the element size of the array variable in the i-th transfer; the data-transfer overhead is then Tt = ∑(Tstart + Tsize_i/Tbandwidth), i = 1, …, n, where Tstart is the startup overhead of one transfer and Tbandwidth is the transfer bandwidth;

S35, compute the parallel hiding parameter Hkey of each loop block: Hkey = (It × Tc + It × Tm)/Tt;

S4, determine the final loop blocking parameter of the blocked loop from the parallel hiding parameter Hkey computed in S3: if Hkey is greater than or equal to 1, the computation and memory-access overhead is at least the data-transfer overhead, so transfers between storage levels can be hidden behind computation and memory accesses; that is, while the current loop block executes, the batch transfer of the data needed by the next loop block can proceed in parallel. In this case the SPM buffer must be split into two parts, one serving the current loop block's computation and the other receiving the next loop block's data in advance, so the loop blocking parameter LoopBlockKey is halved. If Hkey is less than 1, computation and memory access are not overlapped with data transfer, and the loop blocking parameter LoopBlockKey is left unchanged;

S5, select a loop reconstruction strategy and reconstruct the loop according to the loop blocking parameter LoopBlockKey obtained in S4: when the parallel hiding parameter computed in S3 is less than 1, the computation and memory-access overhead cannot hide the data-transfer overhead, so loop reconstruction in sequential-execution mode is used, in which computation and the data transfers between storage levels proceed in sequence; in this process the compiler converts many consecutive main-memory accesses into batch transfers, completed by the transfer engine between the SPM and main memory, plus repeated accesses to the SPM buffer, thereby optimizing memory access. When the parallel hiding parameter is greater than or equal to 1, loop reconstruction in parallel-execution mode is used: two buffers are used alternately and data is transferred asynchronously, hiding data transfer behind computation.

2. The data caching implementation method for the SPM storage hierarchy according to claim 1, characterized in that: if multiple loops require optimization, a compiler directive is added before each loop to be optimized.

3. The data caching implementation method for the SPM storage hierarchy according to claim 1, characterized in that: the information of the SPM buffer comprises the start address of the buffer on the SPM and the size of the buffer.

Technical Field

The invention relates to a data caching implementation method for the SPM storage hierarchy, belonging to the technical field of computers.

Background

With the development of microprocessor architectures, multi-core and many-core designs have become the main direction of microprocessor design. Owing to area and power limits, many-core processors tend to organize on-chip SRAM into software-managed, non-cache structures such as SPMs. The need to manage these storage levels in software places a heavy burden on programmers: when writing code, users must not only program for multiple levels and granularities of parallelism but also plan the layout, buffering, and transfer of data across the storage levels.

Many-core processors support multiple levels and granularities of parallelism, and their storage architectures are complex, which makes them hard to program. To address this, academia and industry have carried out extensive research and proposed various solutions, mainly the following. 1) Parallel languages supporting explicit heterogeneous programming, represented by OpenCL. OpenCL provides rich language extensions, supports data placement across multiple storage levels, and supports explicit data transfer between them. This rich language support is a good tool for writing flexible, efficient applications, but it is also the main factor limiting OpenCL's user base: as a language that exposes explicit parallelism through a rich API, OpenCL obtains application performance mainly from data layouts, data transfers, and parallelism that the user specifies explicitly, which requires OpenCL users to have not only deep application knowledge but also a thorough understanding of the many-core architecture. 2) Programming languages supporting implicit parallelism, represented by OpenMP and OpenACC. With such languages the user performs no explicit storage management and only marks parallel regions and parallelization modes with compiler directives. The performance of such applications depends largely on the optimizations of the compilation system, and the precondition for those optimizations is preserving application correctness. The compilation system therefore can only fall back to conservative handling whenever the context is unclear, and cannot apply aggressive optimizations, which inevitably hurts application efficiency.

In November 2012, Wuhan University of Science and Technology applied for a patent (CN 102890658 A, an SPM-based input/output buffer implementation method for embedded systems), aimed mainly at raising the input/output processing speed of embedded systems and lowering storage-access power consumption during input/output processing. Its approach divides the SPM space into several input/output request regions used as I/O buffers, and moves data from the SPM into memory space when a buffer fills. That patent exploits the SPM's low access latency and high bandwidth by buffering memory-space input and output in the SPM, improving memory-access performance. However, it only solves the buffering of input and output; it does not address how to buffer the data accesses inside a loop, nor how to run computation and data transfer in parallel, which is what is needed to further improve memory-access performance.

Many research institutions have also attempted programming and compiler-optimization designs for SPM-oriented many-core processors. Current approaches fall into two categories. The first, aiming to extract maximum performance, provides a detailed API and leaves the application's parallel design and the layout, import, and export of data entirely to the user, maximizing application performance; its drawback is that it demands a great deal from the user. The second, aiming to reduce the user's burden, asks the user only to choose a parallelization strategy while the compiler automatically handles data layout, import, and export; its drawback is that application performance depends heavily on the compiler's optimization design, and performance suffers considerably whenever the compiler cannot optimize precisely. How to strike a balance between the two and find a way to guide the compiler in data layout and data import/export has therefore become a hard problem in the programming and compiler-optimization design of SPM-oriented many-core processors.

Disclosure of Invention

The invention aims to provide a data caching implementation method for the SPM storage hierarchy that effectively improves the accuracy of the compiler's context analysis and the memory-access efficiency of applications, allows computation to run in parallel with data transfers between storage levels, makes effective use of each storage level of the processor, and further improves program performance.

To achieve this purpose, the invention adopts the following technical scheme: a data caching implementation method for the SPM storage hierarchy, comprising the following steps:

S1, establish the programming description of the data cache: before the loop, add a compiler directive that declares the data set in the loop requiring memory-access optimization and the information of the SPM buffer to be used for that optimization;

S2, the compiler analyzes the directive from S1 and determines the loop blocking parameter LoopBlockKey: from the total size of the SPM buffer and the data set to be optimized, it determines the amount of data transferred in each batch, derives from that amount the number of loop iterations whose data accesses can be kept within the SPM buffer, and takes this iteration count as the loop blocking parameter LoopBlockKey;

S3, using the loop blocking parameter LoopBlockKey obtained in S2, the compiler computes the ratio of each loop block's computation and memory-access overhead to its data-transfer overhead, yielding the parallel hiding parameter Hkey; the specific steps are as follows:

S31, obtain the iteration count It contained in each loop block after blocking from the loop blocking parameter LoopBlockKey, where It = LoopBlockKey;

S32, compute the computation overhead Tc of one iteration: scan all statements in the loop, determine the type of each operation in each statement, look up each operation's overhead value by its type, and accumulate all operation overheads in the loop to obtain the per-iteration computation overhead Tc;

S33, compute the memory-access overhead Tm of one iteration: scan all memory accesses in the loop and, using the data set identified in S2, decide for each access whether it can be optimized; if so, its per-iteration overhead is the cost of accessing the SPM buffer, otherwise it is the cost of accessing the corresponding storage level; accumulate the overheads of all accesses in the loop to obtain the per-iteration memory-access overhead Tm;

S34, compute the data-transfer overhead Tt of one loop block: obtain the number of transfers n from the data set named in the directive of S1, and obtain the volume of the i-th transfer from the loop blocking parameter as Tsize_i = LoopBlockKey × TypeSize_i, where TypeSize_i is the element size of the array variable in the i-th transfer; the data-transfer overhead is then Tt = ∑(Tstart + Tsize_i/Tbandwidth), i = 1, …, n, where Tstart is the startup overhead of one transfer and Tbandwidth is the transfer bandwidth;

S35, compute the parallel hiding parameter Hkey of each loop block: Hkey = (It × Tc + It × Tm)/Tt;

S4, determine the final loop blocking parameter of the blocked loop from the parallel hiding parameter Hkey computed in S3: if Hkey is greater than or equal to 1, the computation and memory-access overhead is at least the data-transfer overhead, so transfers between storage levels can be hidden behind computation and memory accesses; that is, while the current loop block executes, the batch transfer of the data needed by the next loop block can proceed in parallel. In this case the SPM buffer must be split into two parts, one serving the current loop block's computation and the other receiving the next loop block's data in advance, so the loop blocking parameter LoopBlockKey is halved. If Hkey is less than 1, computation and memory access are not overlapped with data transfer, and the loop blocking parameter LoopBlockKey is left unchanged;

S5, select a loop reconstruction strategy and reconstruct the loop according to the loop blocking parameter LoopBlockKey obtained in S4: when the parallel hiding parameter computed in S3 is less than 1, the computation and memory-access overhead cannot hide the data-transfer overhead, so loop reconstruction in sequential-execution mode is used, in which computation and the data transfers between storage levels proceed in sequence; in this process the compiler converts many consecutive main-memory accesses into batch transfers, completed by the transfer engine between the SPM and main memory, plus repeated accesses to the SPM buffer, thereby optimizing memory access. When the parallel hiding parameter is greater than or equal to 1, loop reconstruction in parallel-execution mode is used: two buffers are used alternately and data is transferred asynchronously, hiding data transfer behind computation.

Further improvements within the above technical scheme are as follows:

1. In the above scheme, if multiple loops require optimization, a compiler directive is added before each loop to be optimized.

2. In the above scheme, the information of the SPM buffer comprises the start address of the buffer on the SPM and the size of the buffer.

Owing to the application of the above technical scheme, the invention has the following advantages over the prior art:

The data caching implementation method for the SPM storage hierarchy of the invention is applied mainly in the compilation system of a parallel language. Compiler directives guide the compiler to complete the import and export of data across the SPM storage hierarchy, easing the user's burden, while the directive information lets the compiler perform targeted, efficient memory-access optimization: instead of suffering a sharp performance drop from conservative compilation whenever the context cannot be analyzed precisely, the compiler can optimize finely and efficiently according to the directives, which effectively improves both the accuracy of the compiler's context analysis and the memory-access efficiency of the application. In addition, having analyzed the loop's computation, memory-access, and data-transfer overheads, the compiler can hide part of the data-transfer time where conditions allow, so that computation runs in parallel with transfers between storage levels, each storage level of the processor is used effectively, and program performance improves further.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a flowchart of loop execution in sequential-execution mode;

FIG. 3 is a flowchart of loop execution in parallel-execution mode.

Detailed Description

Embodiment: a data caching implementation method for the SPM storage hierarchy, comprising the following steps:

S1, establish the programming description of the data cache: before the loop, add a compiler directive that declares the data set in the loop requiring memory-access optimization and the information of the SPM buffer to be used for that optimization;

S2, the compiler analyzes the directive from S1 and determines the loop blocking parameter LoopBlockKey: from the total size of the SPM buffer and the data set to be optimized, it determines the amount of data transferred in each batch, derives from that amount the number of loop iterations whose data accesses can be kept within the SPM buffer, and takes this iteration count as the loop blocking parameter LoopBlockKey;

S3, using the loop blocking parameter LoopBlockKey obtained in S2, the compiler computes the ratio of each loop block's computation and memory-access overhead to its data-transfer overhead, yielding the parallel hiding parameter Hkey; the specific steps are as follows:

S31, obtain the iteration count It contained in each loop block after blocking from the loop blocking parameter LoopBlockKey, where It = LoopBlockKey;

S32, compute the computation overhead Tc of one iteration: scan all statements in the loop, determine the type of each operation in each statement, look up each operation's overhead value by its type, and accumulate all operation overheads in the loop to obtain the per-iteration computation overhead Tc;

S33, compute the memory-access overhead Tm of one iteration: scan all memory accesses in the loop and, using the data set identified in S2, decide for each access whether it can be optimized; if so, its per-iteration overhead is the cost of accessing the SPM buffer, otherwise it is the cost of accessing the corresponding storage level; accumulate the overheads of all accesses in the loop to obtain the per-iteration memory-access overhead Tm;

S34, compute the data-transfer overhead Tt of one loop block: obtain the number of transfers n from the data set named in the directive of S1, and obtain the volume of the i-th transfer from the loop blocking parameter as Tsize_i = LoopBlockKey × TypeSize_i, where TypeSize_i is the element size of the array variable in the i-th transfer; the data-transfer overhead is then Tt = ∑(Tstart + Tsize_i/Tbandwidth), i = 1, …, n, where Tstart is the startup overhead of one transfer and Tbandwidth is the transfer bandwidth;

S35, compute the parallel hiding parameter Hkey of each loop block: Hkey = (It × Tc + It × Tm)/Tt;

S4, determine the final loop blocking parameter of the blocked loop from the parallel hiding parameter Hkey computed in S3: if Hkey is greater than or equal to 1, the computation and memory-access overhead is at least the data-transfer overhead, so transfers between storage levels can be hidden behind computation and memory accesses; that is, while the current loop block executes, the batch transfer of the data needed by the next loop block can proceed in parallel. In this case the SPM buffer must be split into two parts, one serving the current loop block's computation and the other receiving the next loop block's data in advance; since loop blocking exists because of the data buffering in the SPM space and each data buffer is now half as large, the loop blocking parameter LoopBlockKey is halved. If Hkey is less than 1, computation and memory access are not overlapped with data transfer, and the loop blocking parameter LoopBlockKey is left unchanged;

S5, select a loop reconstruction strategy and reconstruct the loop according to the loop blocking parameter LoopBlockKey obtained in S4: when the parallel hiding parameter computed in S3 is less than 1, the computation and memory-access overhead cannot hide the data-transfer overhead, so loop reconstruction in sequential-execution mode is used, in which computation and the data transfers between storage levels proceed in sequence; in this process the compiler converts many consecutive main-memory accesses into batch transfers, completed by the transfer engine between the SPM and main memory, plus repeated accesses to the SPM buffer, thereby optimizing memory access. When the parallel hiding parameter is greater than or equal to 1, loop reconstruction in parallel-execution mode is used: two buffers are used alternately and data is transferred asynchronously, hiding data transfer behind computation.

If multiple loops require optimization, a compiler directive is added before each loop to be optimized.

The information of the SPM buffer includes the start address of the buffer on the SPM and the size of the buffer.

The examples are further explained below:

The invention seeks to overcome the shortcomings of the prior art by providing a simplified programming method for many-core processors oriented to the SPM storage hierarchy, and, on top of that programming method, a technique for improving the memory-access performance of many-core applications.

To achieve this purpose, the technical scheme of the invention is as follows:

1) Provide a data cache programming description oriented to the SPM storage hierarchy;

2) Provide a memory-access optimization technique performed automatically by the compiler under that programming description.

As will be explained in detail below.

1) Data cache programming description oriented to SPM storage hierarchy

The most basic storage architecture of a many-core processor comprises a main memory and an SPM (scratch pad memory). The main memory is large but has high access latency and low bandwidth; the SPM is small but has low access latency and high bandwidth. A batch-transfer engine sits between the main memory and the SPM and can move data in bulk in a single operation. The main memory is usually GB-scale while the SPM is usually only KB-scale, so most of an application's data can only live in main memory; yet main-memory access latency is dozens of times that of the SPM. To improve application performance, the data accesses of the core program segment should be concentrated in the SPM. This requires memory-access optimization: set up a buffer in the SPM, use the batch-transfer engine to move the main-memory data needed by the core segment into the SPM buffer in advance, run the core segment against the SPM, and, after the computation completes, transfer the updated data back to main memory in bulk.
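To make this hierarchy concrete, the following C sketch models the two levels and the batch-transfer engine. The names dma_get/dma_put and the memcpy-based emulation are assumptions made for illustration; a real platform would expose its own SPM placement and DMA primitives.

    /* Sketch of the two-level hierarchy described above: a large, slow
       main memory and a small, fast SPM with a batch-transfer engine
       between them.  dma_get/dma_put are assumed names, emulated here
       with memcpy; a real platform would drive its DMA engine instead. */
    #include <string.h>

    #define SPM_BYTES (64 * 1024)          /* SPM space is KB-scale */
    static char spm[SPM_BYTES];            /* stands in for the SPM space */

    static void dma_get(void *spm_dst, const void *mem_src, size_t n)
    {
        memcpy(spm_dst, mem_src, n);       /* batch import: main memory -> SPM */
    }

    static void dma_put(void *mem_dst, const void *spm_src, size_t n)
    {
        memcpy(mem_dst, spm_src, n);       /* batch export: SPM -> main memory */
    }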

The core program segment of an application typically takes the form of a loop. Optimizing memory access for a loop means choosing the granularity of batch data exchange between main memory and the SPM buffer and the insertion points of the batch-transfer calls, then performing loop blocking, subscript conversion, and the corresponding program transformations, and sometimes even more complex operations such as loop splitting. If the user had to handle all of this, the burden would be heavy and programming efficiency very low. To lighten the user's load, the invention provides a data cache programming description oriented to the SPM storage hierarchy.

The programming description targets a loop. In the form of a compiler directive placed before the loop structure that needs memory-access optimization, it states brief information, and the complex work (the granularity and form of batch data transfer, the program transformation, and so on) is completed automatically by the compiler, which effectively reduces the user's burden. The brief information comprises the data set to be optimized, the start address of the SPM buffer used for the optimization, and the size of that buffer.
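A hypothetical example of such a directive is sketched below. The #pragma spm_cache syntax, the clause names data()/buffer(), and the symbol spm_buf are all invented for illustration; the patent specifies the information the directive conveys, not a concrete syntax.

    /* Hypothetical programming description: the pragma names below are
       assumptions.  The directive names the data set to optimize (array
       b) and the SPM buffer (start address and size) to use for it. */
    #define N 1000000
    static double b[N];                  /* resides in main memory */
    extern char spm_buf[32768];          /* buffer reserved on the SPM */

    double sum_b(void)
    {
        double s = 0.0;
        /* data set to optimize: b; SPM buffer: spm_buf, 32 KB */
        #pragma spm_cache data(b) buffer(spm_buf, 32768)
        for (int i = 0; i < N; i++)
            s += b[i];                   /* stride-1 accesses to b */
        return s;
    }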

The data set for memory-access optimization consists of variables accessed contiguously or within a bounded region inside the loop. Contiguous access covers both stride-1 accesses and strided accesses with a stride other than 1, and a region access must touch a region smaller than the SPM buffer. Only data with these two access patterns can have its accesses concentrated in the SPM buffer after a batch transfer, which reduces the program's memory-access cost and achieves the optimization.

The data chosen for memory-access optimization may or may not be inter-dependent, but correctness in the presence of data dependences is guaranteed by the compiler directive itself. Since this patent has the directive guide the compiler, the user, who understands the relationships among variables in detail when writing the directive, takes that responsibility, and the compiler performs the optimization directly according to the directive without additional dependence analysis. This is also the main point of the directive: when the compiler cannot determine the relationships among variables precisely, it need not fall back to conservative handling for correctness's sake, which would cause a sharp drop in application performance.

The directive describes the SPM buffer by its total size, which is the size of the buffer. If the data set contains multiple variables, the compiler automatically partitions the buffer among them.

A directive applies only to the loop immediately following it, so its scope is a single loop. If multiple loops need optimization, a directive can be added before each loop to be optimized.

2) Memory-access optimization performed automatically by the compiler under the programming description

The compiler performs the memory-access optimization automatically according to the programming description; the specific method is shown in FIG. 1.

(1) Analyze the compiler directive and determine the loop blocking parameter

The technical idea: determine the amount of data transferred per batch from the total size of the SPM buffer and the total number of variables to be optimized, derive from that amount the number of loop iterations whose data accesses can be kept within the SPM buffer, and use this iteration count as the loop blocking parameter LoopBlockKey to block the original loop.

Determining the loop blocking parameter means deciding at what granularity the loop will be blocked. Blocking is done for the sake of memory-access optimization: because the buffer space on the SPM is limited, the original loop must execute in segments sized to that buffer.
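The following sketch shows one plausible way a compiler could derive LoopBlockKey, under the simplifying assumption that each optimized array contributes one element per loop iteration and the whole buffer serves one block; the function name and interface are illustrative.

    /* Plausible derivation of LoopBlockKey (step (1)): the blocking
       parameter is the largest iteration count whose combined
       per-iteration data footprint still fits in the SPM buffer. */
    #include <stddef.h>

    size_t loop_block_key(size_t spm_buf_bytes,
                          const size_t *type_size, /* TypeSize_i per array */
                          size_t n_arrays)
    {
        size_t bytes_per_iter = 0;
        for (size_t i = 0; i < n_arrays; i++)
            bytes_per_iter += type_size[i];  /* one element of each array
                                                is touched per iteration */
        return spm_buf_bytes / bytes_per_iter;
    }

For example, with a 32 KB buffer and two arrays of double (8 bytes each), this gives LoopBlockKey = 32768 / 16 = 2048 iterations per block.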

(2) Compute the parallel hiding parameter

Once the loop blocking parameter is determined, the SPM buffer can be used for memory-access optimization. But batch data transfer between main memory and the SPM is not the end goal: to improve program performance further, part of the data-transfer time can be hidden by running computation in parallel with transfers between storage levels.

After the loop is blocked, the ratio of each loop block's computation and memory-access overhead to its data-transfer overhead is called the parallel hiding parameter. It is introduced to decide whether a block's computation and memory-access overhead can hide its data-transfer overhead: if that overhead at least matches the transfer overhead, the data transfer of the next loop block can proceed while the current block computes. Because there is only one SPM buffer space, achieving this overlap requires splitting the buffer into two parts, one holding the already-transferred data being computed on, the other receiving the data used by the next loop block. On this principle, the parallel hiding parameter is determined as follows:

a) The iteration count It of each loop block after blocking follows from the loop blocking parameter: It = LoopBlockKey. The computation of the blocking parameter did not account for overlapping computation with data transfer, but such overlap requires splitting the buffer into two halves, i.e. halving the number of elements processed per block, so the iteration count It per block must also be halved.

b) Compute the computation overhead Tc of one iteration. To obtain it accurately, the overhead values of the various operation types can be measured in advance on each platform. In this step, all statements in the loop are scanned, the type of each operation is identified, the measured overhead value is looked up by type, and all operation overheads in the loop are accumulated to give the per-iteration computation overhead Tc.

c) Compute the memory-access overhead Tm of one iteration. This overhead is computed per iteration of the loop block, with respect to the memory-access optimization: an access that can be optimized is costed as an access to the SPM buffer; an access that cannot is costed according to the storage level implied by the variable's space attribute.

d) Compute the data-transfer overhead Tt of one loop block. This requires the number of transfers and the volume of each: the transfer count n is found by examining the variable set in the directive, and the volume of the i-th transfer follows from the loop blocking parameter as Tsize_i = LoopBlockKey × TypeSize_i, where TypeSize_i is the element size of the array variable in the i-th transfer. The data-transfer overhead Tt is then given by the formula below, where Tstart is the startup overhead of one transfer and Tbandwidth is the transfer bandwidth.

Tt = ∑(Tstart + Tsize_i/Tbandwidth), i = 1, …, n

e) Compute the parallel hiding parameter Hkey of each loop block as Hkey = (It × Tc + It × Tm)/Tt. If Hkey is greater than or equal to 1, the computation and memory-access overhead is at least the data-transfer overhead, i.e. it can hide the transfer of data between storage levels, so the batch transfer of the next loop block's data can run in parallel with the current block's computation, improving program performance; otherwise computation and memory access cannot fully hide the data transfer, so no overlap of computation, memory access, and data transfer is attempted. A sketch of this cost model follows.
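A minimal sketch of the cost model of steps b) through e), assuming all overheads are expressed in one consistent time unit (e.g. cycles); the parameter names mirror the symbols above, but the function itself is illustrative.

    /* Cost-model sketch for steps b)-e).  Tt accumulates one startup
       cost plus the bandwidth-limited time per transfer, and Hkey is
       the ratio of per-block compute+memory cost to transfer cost. */
    double parallel_hiding_param(double It,   /* iterations per block       */
                                 double Tc,   /* computation cost/iteration */
                                 double Tm,   /* memory cost/iteration      */
                                 const double *Tsize, int n, /* bytes/transfer */
                                 double Tstart,     /* transfer startup cost */
                                 double Tbandwidth) /* bytes per time unit   */
    {
        double Tt = 0.0;        /* Tt = sum_i (Tstart + Tsize_i/Tbandwidth) */
        for (int i = 0; i < n; i++)
            Tt += Tstart + Tsize[i] / Tbandwidth;
        return (It * Tc + It * Tm) / Tt;      /* Hkey */
    }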

(3) Feedback correction of the loop blocking parameter

Once the parallel hiding parameter Hkey is determined, the final parallel mode of the blocked loop is determined, and so is the final size of the data buffer for each variable. Since loop blocking is driven by the data buffering in the SPM space, the loop blocking parameter must be corrected using the finally determined buffer sizes as feedback; only then can the loop be blocked correctly.

(4) Loop reconstruction

Once the loop blocking parameter is determined, a program transformation is required. Because the directive targets a loop, the core of the transformation is loop reconstruction: one loop becomes two, where the outer loop iterates over the blocks and the inner loop performs the work of one block. During reconstruction, for each variable that can be optimized, data transfers between main memory and the SPM buffer must be inserted and the corresponding address (subscript) conversion performed.

Depending on the parallel hiding parameter, loop reconstruction follows one of two strategies: reconstruction in sequential-execution mode, or reconstruction in parallel-execution mode.

When the parallel hiding parameter is less than 1, computation and memory access cannot hide the data-transfer overhead, so only sequential-mode reconstruction is performed: computation and the transfers between storage levels proceed in sequence. The execution flow after reconstruction is shown in FIG. 2. First the segment of the b array needed by the first loop block is imported, then the block computes against the local store lb; each of the second, third, fourth, ... loop blocks likewise first imports its segment of b and then computes against lb; the blocks are processed one after another. In this reconstruction the compiler's optimization is chiefly the memory-access optimization: many consecutive main-memory accesses are converted into batch transfers, carried out by the transfer engine between the SPM and main memory, plus repeated accesses to the SPM buffer. A sketch of the reconstructed loop follows.
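A minimal sketch of sequential-mode reconstruction, assuming a stride-1 reduction over b, a trip count divisible by LoopBlockKey, and a memcpy stand-in for the batch-transfer engine:

    /* Sequential-mode reconstruction (cf. FIG. 2): the outer loop walks
       the blocks, each block first imports its segment of b into the SPM
       buffer lb in one batch, then the inner loop computes out of lb. */
    #include <stddef.h>
    #include <string.h>

    static void dma_get(void *dst, const void *src, size_t n)
    {
        memcpy(dst, src, n);   /* stand-in for the batch-transfer engine */
    }

    double block_sum(const double *b, size_t N,
                     double *lb /* SPM buffer */, size_t LoopBlockKey)
    {
        double s = 0.0;
        for (size_t blk = 0; blk < N / LoopBlockKey; blk++) {  /* outer loop */
            dma_get(lb, b + blk * LoopBlockKey,                /* batch import */
                    LoopBlockKey * sizeof(double));
            for (size_t i = 0; i < LoopBlockKey; i++)          /* inner loop */
                s += lb[i];    /* main-memory access rewritten as SPM access;
                                  subscript converted from global to local */
        }
        return s;
    }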

When the parallel hiding parameter is greater than or equal to 1, parallel-mode reconstruction is performed. It uses two buffers to hide data transfer, and the alternation of the buffers together with the introduction of asynchronous transfers makes the reconstruction more involved. Assuming the core loop reads the b array and writes the a array, FIG. 3 shows the execution flow. From the second to the (n-1)-th loop block, every block overlaps its computation with both an import (main memory to SPM) and an export (SPM to main memory) and can be handled uniformly; the first and last blocks differ slightly, overlapping computation with only an import or only an export. Loop splitting is therefore introduced during reconstruction, dividing the original loop into three sections: the first covers the data preparation and computation of the first block, with the second block's data prepared in parallel with that computation; the second covers blocks two through n-1, each block's computation running in parallel with the export of the previous block's results and the preparation of the next block's data; the third covers the last block's computation and result export, with the export of the next-to-last block's results completing in parallel with that computation. A double-buffered sketch follows.
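A minimal sketch of parallel-mode reconstruction for a loop computing a[i] = 2*b[i], with two SPM half-buffers used alternately. The asynchronous-transfer calls are assumed names emulated synchronously with memcpy, so the overlap here is structural rather than real; K is the halved LoopBlockKey and N is assumed divisible by K.

    /* Parallel-mode reconstruction (cf. FIG. 3): prefetch the next
       block's input into the other half-buffer while computing the
       current block, then export the current block's results. */
    #include <stddef.h>
    #include <string.h>

    static void dma_get_async(void *d, const void *s, size_t n) { memcpy(d, s, n); }
    static void dma_put_async(void *d, const void *s, size_t n) { memcpy(d, s, n); }
    static void dma_wait(void) { /* would block until pending DMA completes */ }

    void block_scale(double *a, const double *b, size_t N,
                     double *lb0, double *lb1,   /* two input half-buffers  */
                     double *la0, double *la1,   /* two output half-buffers */
                     size_t K /* halved LoopBlockKey */)
    {
        double *lb[2] = { lb0, lb1 }, *la[2] = { la0, la1 };
        size_t nblk = N / K;
        dma_get_async(lb[0], b, K * sizeof(double)); /* section 1: prepare  */
        dma_wait();                                  /* first block's data  */
        for (size_t blk = 0; blk < nblk; blk++) {
            size_t cur = blk & 1, nxt = cur ^ 1;
            if (blk + 1 < nblk)                      /* prefetch next block */
                dma_get_async(lb[nxt], b + (blk + 1) * K, K * sizeof(double));
            for (size_t i = 0; i < K; i++)           /* compute current block */
                la[cur][i] = 2.0 * lb[cur][i];
            dma_put_async(a + blk * K, la[cur],      /* export results; a real */
                          K * sizeof(double));       /* system would defer the */
            dma_wait();                              /* wait to keep overlap   */
        }
    }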

When this data caching implementation method for the SPM storage hierarchy is adopted, it is applied mainly in the compilation system of a parallel language. Compiler directives guide the compiler to complete the import and export of data across the SPM storage hierarchy, easing the user's burden, while the directive information lets the compiler perform targeted, efficient memory-access optimization: instead of suffering a sharp performance drop from conservative compilation whenever the context cannot be analyzed precisely, the compiler can optimize finely and efficiently according to the directives, which effectively improves both the accuracy of the compiler's context analysis and the memory-access efficiency of the application. In addition, having analyzed the loop's computation, memory-access, and data-transfer overheads, the compiler can hide part of the data-transfer time where conditions allow, so that computation runs in parallel with transfers between storage levels, each storage level of the processor is used effectively, and program performance improves further.

To facilitate a better understanding of the invention, the terms used herein will be briefly explained as follows:

SRAM (Static Random Access Memory): static random-access memory.

SPM (Scratch Pad Memory): scratchpad memory.

Loop blocking parameter: the block size used when the loop is blocked.

The above embodiment merely illustrates the technical ideas and features of the invention; its purpose is to enable those skilled in the art to understand and implement the invention, not to limit its scope of protection. All equivalent changes and modifications made according to the spirit of the invention shall fall within its scope of protection.
