Processor operation method and device and related product

Document No.: 947607    Publication date: 2020-10-30

Reading note: this technology, "Processor operation method and device and related product", was designed and created on 2019-04-25 by an inventor who requested non-publication. Abstract: The present disclosure relates to a processor operating method and apparatus, and related products. The product includes a control module comprising an instruction cache unit, an instruction processing unit, and a storage queue unit. The instruction cache unit is used to store the calculation instructions associated with an artificial neural network operation; the instruction processing unit is used to parse a calculation instruction into a plurality of operation instructions; and the storage queue unit is used to store an instruction queue, which includes a plurality of operation instructions or calculation instructions to be executed in queue order. Through this method, the present disclosure can improve the operation efficiency of the related product when performing operations of a neural network model.

1. A method of processor operation, the method comprising:

when a first operation is an operation on a descriptor, acquiring a target storage area of tensor data indicated by the descriptor;

determining whether there is an ongoing second operation directed at the target storage area;

when the second operation exists, determining whether a first fine-grained region in the target storage area currently targeted by the first operation overlaps a second fine-grained region in the target storage area currently targeted by the second operation; and

executing the first operation when the first fine-grained region and the second fine-grained region do not overlap.

2. The method of claim 1, further comprising:

blocking the first operation when the first fine-grained region overlaps the second fine-grained region.

3. The method of claim 1, wherein at least one of the first operation and the second operation is a write operation.

4. The method of claim 1, wherein the size and/or number of the fine-grained regions is determined according to at least one of: a region in which data of a set length is located, and a region in which data of a set dimension is located.

5. The method of claim 1, wherein the size and/or number of fine-grained regions is determined according to at least one of hardware computing power and hardware bandwidth.

6. The method of claim 1, wherein the first operation is an operation in a first processing instruction, wherein the second operation is an operation in a second processing instruction, and wherein the second processing instruction is a processing instruction in an instruction queue prior to the first processing instruction.

7. A processor operating apparatus, the apparatus comprising:

a region determining module, configured to acquire a target storage area of tensor data indicated by a descriptor when a first operation is an operation on the descriptor;

an operation judging module, configured to determine whether there is an ongoing second operation directed at the target storage area;

an overlap determining module, configured to determine, when the second operation exists, whether the first fine-grained region in the target storage area currently targeted by the first operation overlaps the second fine-grained region in the target storage area currently targeted by the second operation; and

an execution module, configured to execute the first operation when the first fine-grained region and the second fine-grained region do not overlap.

8. An artificial intelligence chip, wherein the chip comprises the processor operating apparatus of claim 7.

9. An electronic device, characterized in that the electronic device comprises an artificial intelligence chip according to claim 8.

10. A board card, characterized in that the board card comprises: a storage device, an interface device, a control device, and an artificial intelligence chip according to claim 8;

wherein the artificial intelligence chip is connected to the storage device, the control device, and the interface device, respectively;

the storage device is used for storing data;

the interface device is used for realizing data transmission between the artificial intelligence chip and external equipment;

and the control device is used for monitoring the state of the artificial intelligence chip.

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for operating a processor, and a related product.

Background

With the continuous development of artificial intelligence technology, both the amount of data to be processed and the dimensionality of that data keep increasing. In the related art, when a plurality of operations act on the same storage area, a subsequent operation can be executed only after the previous operation is completed, which reduces the processing efficiency of the processor.

Disclosure of Invention

In view of the above, the present disclosure provides a processor operating solution.

According to an aspect of the present disclosure, there is provided a processor operating method including: when a first operation is an operation on a descriptor, acquiring a target storage area of tensor data indicated by the descriptor; determining whether there is an ongoing second operation directed at the target storage area; when the second operation exists, determining whether a first fine-grained region in the target storage area currently targeted by the first operation overlaps a second fine-grained region in the target storage area currently targeted by the second operation; and executing the first operation when the first fine-grained region and the second fine-grained region do not overlap.

According to another aspect of the present disclosure, there is provided a processor operating apparatus including: a region determining module, configured to acquire a target storage area of tensor data indicated by a descriptor when a first operation is an operation on the descriptor; an operation judging module, configured to determine whether there is an ongoing second operation directed at the target storage area; an overlap determining module, configured to determine, when the second operation exists, whether the first fine-grained region in the target storage area currently targeted by the first operation overlaps the second fine-grained region in the target storage area currently targeted by the second operation; and an execution module, configured to execute the first operation when the first fine-grained region and the second fine-grained region do not overlap.

According to another aspect of the present disclosure, there is provided an artificial intelligence chip comprising the processor operating apparatus as described above.

According to another aspect of the present disclosure, there is provided an electronic device including the artificial intelligence chip as described above.

According to another aspect of the present disclosure, there is provided a board card, which includes: a storage device, an interface device, a control device, and an artificial intelligence chip as described above; wherein the artificial intelligence chip is connected to the storage device, the control device, and the interface device, respectively; the storage device is used for storing data; the interface device is used for implementing data transmission between the artificial intelligence chip and an external device; and the control device is used for monitoring the state of the artificial intelligence chip.

According to the embodiments of the present disclosure, when a first operation is an operation on a descriptor, the target storage area of the tensor data indicated by the descriptor is acquired; when there is an ongoing second operation directed at the target storage area, it is determined whether the first fine-grained region in the target storage area currently targeted by the first operation overlaps the second fine-grained region currently targeted by the second operation; and when the two do not overlap, the first operation is executed. In this way, the first operation can be executed as long as the fine-grained regions currently targeted by the two operations do not overlap, so that the first and second operations can operate on the target storage area simultaneously, improving the processing efficiency of the processor.

Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.

Fig. 1 shows a schematic diagram of an application scenario of a method of processor operation according to an embodiment of the present disclosure.

FIG. 2 shows a flow diagram of a method of operation of a processor in accordance with an embodiment of the present disclosure.

Fig. 3a and 3b show schematic diagrams of application scenarios of a method of operation of a processor according to an embodiment of the present disclosure.

Fig. 4a and 4b show schematic diagrams of application scenarios of a method of operation of a processor according to an embodiment of the present disclosure.

Fig. 5a and 5b show schematic diagrams of application scenarios of a method of operation of a processor according to an embodiment of the present disclosure.

FIG. 6 shows a schematic diagram of a circular buffer memory region of a method of operation of a processor according to an embodiment of the present disclosure.

FIG. 7 shows a schematic diagram of an overall memory area of a method of operation of a processor according to an embodiment of the present disclosure.

FIG. 8 shows a block diagram of a processor operating device, according to an embodiment of the present disclosure.

Fig. 9 shows a block diagram of a board card according to an embodiment of the present disclosure.

Detailed Description

Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.

Referring to fig. 1, fig. 1 is a schematic diagram illustrating an application scenario of a processor operating method according to an embodiment of the present disclosure. The processor 100 may be a general-purpose processor (e.g., a central processing unit (CPU) or a graphics processing unit (GPU)) or a special-purpose processor (e.g., an artificial intelligence processor, a scientific computing processor, or a digital signal processor); the present disclosure does not limit the type of the processor. The storage device 200 comprises at least one target storage area 210, where a target storage area 210 may be the storage area of tensor data indicated by a descriptor. The processor may perform a read operation or a write operation to access a target storage area 210: a read operation on a target storage area 210 may refer to the processor acquiring the tensor data in that area through a descriptor, and a write operation on a target storage area 210 may refer to the processor writing the tensor data indicated by a descriptor into that area. In the related art, since the processor can execute a plurality of operations in parallel, in order to avoid conflicts, when several operations executed in parallel are all directed at the same target storage area 210, the processor executes only one of them while blocking the others, which reduces the efficiency of the processor.
The processor operation method provided by the present disclosure further divides the target storage area 210 into a plurality of fine-grained regions 211. When a plurality of operations executed in parallel by the processor are all directed at the same target storage area 210, the processor may determine whether the fine-grained regions 211 targeted by these operations overlap; if they do not overlap, the operations can be executed in parallel, greatly improving the efficiency of the processor. The storage device 200 may be disposed inside the processor (e.g., an on-chip cache, a register, or the like) or outside the processor and in data communication with it (e.g., an off-chip memory or the like); the present disclosure does not limit the type of the storage device. An operation according to the present disclosure may be a basic operation supported by the processor hardware, or a microinstruction (for example, a request signal) obtained by decoding a basic operation; the present disclosure does not limit the specific type of operation. The processor of the present disclosure may execute two operations in parallel, or more than two; the number of operations executed in parallel is not limited in the present disclosure.

FIG. 2 shows a flow diagram of a method of operation of a processor in accordance with an embodiment of the present disclosure. The processor operation method is applicable to a processor. As shown in fig. 2, the processor operating method includes:

In step S11, when the first operation is an operation on a descriptor, a target storage area of tensor data indicated by the descriptor is acquired.

The first operation may be a read operation or a write operation on the descriptor; the present disclosure does not limit the specific type of the first operation. A descriptor can be used to indicate the shape of a tensor. A tensor can organize data in various forms and can have different dimensions: a scalar can be regarded as a 0-dimensional tensor, a vector as a 1-dimensional tensor, and a matrix as a 2-dimensional tensor; tensors of more than two dimensions are also possible. The shape of a tensor includes information such as the number of its dimensions and the size of each dimension. For example, for a tensor:

[Image: an example two-dimensional tensor, shown in the original document as a matrix]

the shape of the tensor can be described by a descriptor as (2, 4); that is, two parameters represent the tensor as a two-dimensional tensor, with the size of the first dimension (column) being 2 and the size of the second dimension (row) being 4. It should be noted that the present disclosure does not limit the way a descriptor indicates the tensor shape.
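As a concrete illustration of how a descriptor can record a tensor's shape, the following Python sketch is one possible representation. The class name and fields are illustrative assumptions for this document only, not the data layout actually used by the disclosure:

```python
from dataclasses import dataclass
from functools import reduce

@dataclass
class Descriptor:
    """Illustrative descriptor: records the per-dimension sizes of a tensor.
    (Hypothetical structure; a real descriptor may also carry addresses, strides, etc.)"""
    dims: tuple  # e.g. (2, 4) for a two-dimensional tensor

    @property
    def ndim(self):
        # Number of dimensions is the length of the shape tuple
        return len(self.dims)

    def num_elements(self):
        # Total element count is the product of all dimension sizes
        return reduce(lambda a, b: a * b, self.dims, 1)

# A descriptor for the (2, 4) tensor discussed above
d = Descriptor(dims=(2, 4))
```

Here `d.ndim` is 2 and `d.num_elements()` is 8, matching the shape description above.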

In one possible implementation, the operand of the first operation may include one or more descriptors. The target storage area may be the storage area of the tensor data indicated by one or more descriptors in the operand of the first operation, or the storage space of the tensor data indicated by any one descriptor in the operand. The overall storage area in which the target storage area is located may be an internal memory of the processor (e.g., an on-chip cache or a register) or an external memory connected to the processor (e.g., an off-chip memory).

In one possible implementation, the target storage area may include at least one fine-grained region. The size and/or number of the fine-grained regions may be determined in one or any combination of the following ways: according to the hardware design, according to an operation related to the tensor data indicated by the descriptor, or according to parameters carried in the operation itself. Determining the size by hardware design means, for example, treating one or more rows of the target storage area as one fine-grained region. Determining it according to an operation related to the tensor data means considering the processing requirements, storage method, or transmission method of that tensor data. For example, suppose the tensor data indicated by the descriptor is two-dimensional matrix data of size M × Q (M and Q being positive integers), stored as Q rows of M bytes each; then each row of M bytes can be taken as one fine-grained region, and the target storage area corresponding to the tensor data comprises Q fine-grained regions. Alternatively, the target storage area can be divided into a plurality of fine-grained regions according to the fine-grained size and/or number carried in the operation. The fine-grained regions may be of the same or different sizes; for example, the number of data bits in each fine-grained region may be 64 bits, 256 bits, 512 bits, and so on. The size and/or number of the fine-grained regions may be determined as needed, and the present disclosure is not limited in this respect.
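The row-wise division in the M × Q example above can be sketched as follows. The function name and the byte-offset representation are assumptions for illustration only:

```python
def split_fine_grained(base_addr, total_bytes, grain_bytes):
    """Divide a target storage area into fine-grained regions of grain_bytes each.
    Returns a list of (start, end) byte ranges, end exclusive."""
    regions = []
    addr = base_addr
    while addr < base_addr + total_bytes:
        end = min(addr + grain_bytes, base_addr + total_bytes)
        regions.append((addr, end))
        addr = end
    return regions

# A 2-D tensor stored as Q rows of M bytes: one row per fine-grained region
M, Q = 64, 8
regions = split_fine_grained(base_addr=0, total_bytes=M * Q, grain_bytes=M)
```

With M = 64 and Q = 8 this yields Q = 8 fine-grained regions, the first covering bytes [0, 64) and the last covering bytes [448, 512).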

In one possible implementation manner, when the first operation is an operation on a descriptor, the target storage area of tensor data indicated by the descriptor may be obtained through the descriptor targeted by the first operation, so as to determine the target storage area targeted by the first operation.

In step S12, it is determined whether there is an ongoing second operation directed at the target storage area.

The second operation may be a read operation or a write operation; the present disclosure does not limit the specific type of the second operation. The second operation may include one or more operands, and these operands may be various kinds of data such as descriptors, immediate values, and tensor data; the specific number and types of the operands of the second operation are not limited by the present disclosure.

In one possible implementation, whether there is an ongoing second operation directed at the target storage area can be determined according to the occupation state of the target storage area. For example, the processor may determine whether the target storage area is occupied by querying an occupation-state list; if the target storage area is occupied, it is determined that there is a second operation being performed on it. The occupation-state list may be preset and stored in memory, or may be generated before the processor starts to execute a task and discarded after the task is completed. Whenever the occupation state of a storage area changes, the processor updates the list to record the new state.
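A minimal sketch of such an occupation-state list might look like the following; all names here are illustrative assumptions, not part of the disclosure:

```python
class OccupancyList:
    """Illustrative occupation-state list: maps a storage-area id to whether
    an operation is currently directed at it."""

    def __init__(self):
        self._occupied = {}

    def acquire(self, area_id):
        # Record that an operation has started on this storage area
        self._occupied[area_id] = True

    def release(self, area_id):
        # Record that the operation on this storage area has completed
        self._occupied[area_id] = False

    def has_ongoing_operation(self, area_id):
        # Unknown areas are treated as unoccupied
        return self._occupied.get(area_id, False)

occ = OccupancyList()
occ.acquire("target_area_0")   # a second operation starts on the area
```

After `acquire`, a query for `"target_area_0"` reports an ongoing operation, while any other area reports none; `release` clears the state, mirroring how the processor would update the list when an operation completes.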

In one possible implementation, whether there is an ongoing second operation directed at the target storage area may be determined by querying the execution status of each operation. For example, the storage area corresponding to each operation's operand may be recorded, together with the execution status of each operation. If an operation directed at the target storage area has an execution status of "not finished", it is determined that there is an ongoing second operation directed at the target storage area. Whether the target storage area corresponding to an operand is occupied can also be determined by checking the occupation state of the operand, and thereby whether a second operation directed at the target storage area is in progress. The present disclosure does not limit the criterion for determining whether there is an ongoing second operation directed at the target storage area.

In one possible implementation, the second operation may also be an operation on a descriptor. The descriptor targeted by the second operation may be the same as the descriptor targeted by the first operation, in which case the storage area of that descriptor coincides with the target storage area, and as long as the second operation is not completed, there is an ongoing second operation directed at the target storage area. Alternatively, the storage area of the descriptor targeted by the second operation may merely overlap the target storage area; when the second operation operates on the overlapping part, it likewise counts as an ongoing second operation directed at the target storage area.

In one possible implementation, before a first operation performs an operation on a target storage area, it may be determined whether there is an ongoing second operation on the target storage area.

In a possible implementation manner, during the execution of the first operation on the target storage area, it may also be determined whether there is an ongoing second operation on the target storage area.

In step S13, when the second operation exists, it is determined whether the first fine-grained region in the target storage area currently targeted by the first operation overlaps the second fine-grained region in the target storage area currently targeted by the second operation.

The first fine-grained region and the second fine-grained region may each be any one of the plurality of fine-grained regions in the target storage area. The overall storage area in which the target storage area is located may be divided into fine-grained regions in advance, in which case all operations on the overall storage area use fine-grained regions of the same size.

Alternatively, each operation may divide the storage area targeted by its operand according to fine-grained division information carried in the operation itself, so that different operations may apply different granularities to the same storage area. In that case, the first fine-grained region may be any one of the fine-grained regions into which the first operation divides the target storage area, and the second fine-grained region may be any one of the fine-grained regions into which the second operation divides the storage area of its operand. The first fine-grained region and the second fine-grained region may differ in size.

For example, a first operation may carry a first fine-grained size (e.g., the number of data bits in each fine-grained region), set to 64 bits, while a second operation carries a second fine-grained size, set to 256 bits. That is, every 64 bits is treated as one fine-grained region when the first operation is performed, and every 256 bits when the second operation is performed. Alternatively, both operations may carry the same fine-grained size, for example 512 bits. Similarly, a first operation may carry a first fine-grained number (e.g., 4) while a second operation carries a second fine-grained number (e.g., 8): when the first operation is performed, the target storage area is divided into 4 fine-grained regions, and when the second operation is performed, it is divided into 8. An operation may also carry both the size and the number. The size and/or number of the fine-grained regions may be determined as needed and is not limited by the present disclosure.
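The per-operation granularity described above can be sketched as follows, assuming byte-addressed regions and that an operation carries either a fine-grained size in bits or a fine-grained count. The function and parameter names are illustrative assumptions:

```python
def regions_for_operation(area_bytes, grain_bits=None, grain_count=None):
    """Derive the fine-grained layout an operation applies to one storage area,
    from either a carried granularity size (in bits) or a carried region count.
    Returns a list of (start, end) byte ranges, end exclusive."""
    if grain_bits is not None:
        grain = grain_bits // 8           # bits -> bytes
    elif grain_count is not None:
        grain = area_bytes // grain_count # equal split into grain_count regions
    else:
        raise ValueError("operation must carry a fine-grained size or count")
    return [(o, min(o + grain, area_bytes)) for o in range(0, area_bytes, grain)]

# First operation carries 64-bit granules, second carries 256-bit granules,
# both over the same 256-byte target storage area:
first = regions_for_operation(256, grain_bits=64)
second = regions_for_operation(256, grain_bits=256)
```

The same 256-byte area is seen as 32 regions of 8 bytes by the first operation and as 8 regions of 32 bytes by the second, showing how two operations can apply different granularities to one storage area.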

It is understood that an operation on the target storage area is an operation on each fine-grained region in the target storage area in turn. For example, suppose target storage area A consists of rows 1 to 10, with each row being one fine-grained region, so that A includes 10 fine-grained regions. A write operation to A can then be regarded as a write operation to these 10 fine-grained regions, executed in order: write the 1st fine-grained region (row 1); after it is written, write the 2nd (row 2); then the 3rd (row 3); and so on, until the 10th fine-grained region (row 10) is written and the write operation on A is complete.
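The row-by-row write just described can be sketched as a toy model in Python; representing the area as a list of rows is an assumption of this illustration:

```python
def write_area(area, grain_rows=1):
    """Write a target storage area one fine-grained region at a time,
    where `area` is a list of rows and each region spans grain_rows rows.
    Returns the order in which region start-rows were written."""
    order = []
    for i in range(0, len(area), grain_rows):
        # Write every row belonging to the current fine-grained region
        for r in range(i, min(i + grain_rows, len(area))):
            area[r] = f"written-{r}"
        order.append(i)
    return order

area = [None] * 10          # rows 1..10 of target storage area A
order = write_area(area)    # one row per fine-grained region
```

The regions are completed strictly in sequence, which is what lets another operation safely trail behind in already-finished regions, as the following steps describe.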

When there is an operation directed at the target storage area, each fine-grained region in that area is, as the operation proceeds, in one of three states: already operated on, currently being operated on, or not yet operated on. The fine-grained region currently targeted by the operation is the one being operated on. Thus, an operation on the target storage area can be viewed, at any moment, as an operation on a single fine-grained region, namely the one it currently targets.

In one possible implementation, the first fine-grained region currently targeted by the first operation is the fine-grained region on which the first operation is about to operate; since the first operation has not yet started on the target storage area, this is typically the first fine-grained region in its operation order. The second fine-grained region currently targeted by the second operation is the fine-grained region on which the ongoing second operation is operating, which may be any one of the fine-grained regions in the target storage area.

In one possible implementation, whether the first fine-grained region currently targeted by the first operation overlaps the second fine-grained region currently targeted by the second operation may be determined according to physical addresses, pointer positions, fine-grained region identifiers, and so on. For example, the current physical address of each operation may be recorded; from the current physical address of the first operation, the current physical address of the second operation, and the correspondence between physical addresses and fine-grained regions, the first and second fine-grained regions can be determined and compared for overlap. The physical address may include one or any combination of the start address, the end address, the address of a set location, or the real-time operation address of the fine-grained region. As another example, a pointer may be maintained for each operation, pointing to the fine-grained region it currently targets; from the pointer positions of the first and second operations, the two fine-grained regions are determined and compared for overlap.
For yet another example, an identifier may be assigned to each fine-grained region, and overlap judged by recording the identifier of the fine-grained region each operation currently targets; an identifier may be any combination of letters, numbers, or symbols. Overlap between the first and second fine-grained regions may also be judged in other ways; the present disclosure does not limit the basis for this judgment.
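The pointer-based judgment can be sketched as follows, assuming byte-valued pointers and equal-sized fine-grained regions (a simplification of the schemes above; the function names are illustrative):

```python
def fine_grained_index(pointer, base_addr, grain_bytes):
    """Map an operation's current pointer (a byte address) to the index of
    the fine-grained region that address falls in."""
    return (pointer - base_addr) // grain_bytes

def currently_overlap(ptr_a, ptr_b, base_addr, grain_bytes):
    """True when two operations' current pointers land in the same
    fine-grained region of the target storage area."""
    return (fine_grained_index(ptr_a, base_addr, grain_bytes)
            == fine_grained_index(ptr_b, base_addr, grain_bytes))

# First operation starting at byte 0, second already at byte 128,
# with 64-byte fine-grained regions: regions 0 and 2, no overlap.
overlap = currently_overlap(0, 128, base_addr=0, grain_bytes=64)
```

Pointers at bytes 0 and 128 fall in regions 0 and 2, so there is no overlap and the first operation could proceed; pointers at bytes 70 and 100 both fall in region 1 and would overlap.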

In step S14, when the first fine-grained region and the second fine-grained region do not overlap, the first operation is executed.

In one possible implementation, if the first fine-grained region currently targeted by the first operation does not overlap the second fine-grained region currently targeted by the second operation, the first fine-grained region is either a region the second operation has already finished operating on or a region the second operation does not need to operate on at all. In that case, executing the first operation affects neither the process nor the result of the second operation, so the first operation may be executed.

According to the present embodiment, when the first operation is an operation on a descriptor, the target storage area of the tensor data indicated by the descriptor is acquired; when there is an ongoing second operation directed at the target storage area, it is determined whether the first fine-grained region currently targeted by the first operation overlaps the second fine-grained region currently targeted by the second operation; and when they do not overlap, the first operation is executed. In this way, the first operation can proceed as long as the fine-grained regions currently targeted by the two operations do not overlap, so that the first and second operations can operate on the target storage area simultaneously, improving the processing efficiency of the processor.

In one possible implementation, the method may further include: blocking the first operation when the first fine-grained region overlaps the second fine-grained region.

In one possible implementation, the first fine-grained region overlapping the second fine-grained region includes the first fine-grained region completely or partially overlapping the second fine-grained region. When the two regions overlap, executing the first operation would operate on the overlapping portion, which may disturb the execution of the second operation and make its operation result inaccurate, and may likewise make the operation result of the first operation itself inaccurate. In this case the first operation may be blocked, that is, its execution is suspended, and it is executed after the second operation has completed its work on the second fine-grained region in the target storage area to which it is currently directed. That is, once the first fine-grained region no longer overlaps the second fine-grained region, the first operation is performed.

In this embodiment, when the first fine-grained region and the second fine-grained region overlap, the first operation is blocked, so that operation errors and inaccurate operation results caused by the overlap of the fine-grained regions of the operations can be avoided, and the correctness of the operations is ensured.

Fig. 3a and 3b show schematic diagrams of application scenarios of a method of operation of a processor according to an embodiment of the present disclosure. As shown in fig. 3a and 3b, the whole storage area 20 includes a target storage area 21, where the target storage area 21 is divided into 4 fine-grained areas, which are a fine-grained area 22, a fine-grained area 23, a fine-grained area 24, and a fine-grained area 25.

As shown in fig. 3a, only a write operation is currently involved, and a write pointer wp is used to indicate the fine-grained region in the target storage area 21 to which the write operation is currently directed. When the write operation starts, the write pointer wp points to the fine-grained region 22. It may first be determined whether there is an ongoing second operation on the target storage area 21; if there is none, the write operation on the fine-grained region 22 begins. After the write operation on the fine-grained region 22 is completed, the write pointer wp is incremented (wp++) to point to the next fine-grained region 23, and after the same judgment the write operation on the fine-grained region 23 begins. After the write operation on the fine-grained region 23 is completed, wp is incremented to point to the next fine-grained region 24, and after the same judgment the write operation on the fine-grained region 24 begins.

As shown in fig. 3b, two operations are currently involved: a read operation and a write operation, where the read operation is the first operation and the write operation is the second operation. A write pointer wp and a read pointer rp are used to indicate the fine-grained regions to which the write operation and the read operation are currently directed, respectively.

When the read operation (first operation) is to be performed, it is first determined whether there is an ongoing second operation for the target storage area 21. Upon determining that the write operation is an ongoing second operation on the target storage area 21, it is further determined whether the first fine-grained region in the target storage area 21 to which the read operation is currently directed (fine-grained region 22 in fig. 3b) overlaps the second fine-grained region to which the write operation is currently directed (fine-grained region 24 in fig. 3b). For example, it may be determined that the two regions do not overlap based on the numbers of the fine-grained regions (22 and 24), or based on the relationship between the pointers (rp = 0, wp = 2, so rp < wp), and the read operation (first operation) may then be performed.

When the read operation on the fine-grained region 22 is completed, rp is incremented (rp++) to point to the next fine-grained region 23, and after the same judgment the first operation begins to operate on the fine-grained region 23. When the read operation on the fine-grained region 23 is completed, rp is incremented to point to the next fine-grained region 24. At this point, overlap between the first and second fine-grained regions is judged again: since the region numbers are the same, or equivalently rp == wp, it can be determined that the first fine-grained region in the target storage area 21 to which the first operation is currently directed overlaps the second fine-grained region to which the second operation is currently directed, so the first operation cannot be executed and is blocked. When the second operation completes its work on the fine-grained region 24 and wp is incremented to point to the next fine-grained region 25, the region numbers differ (24 and 25), or equivalently rp < wp, and the first operation can be performed.
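The pointer-based check of figs. 3a and 3b can be sketched as follows. This is a minimal illustration, not the patent's implementation; the names `can_read` and `NUM_REGIONS` are assumptions. A read (first operation) at region index rp may proceed only while it trails the write (second operation) at region index wp, matching the rp < wp rule above.

```python
# Minimal sketch (assumed names) of the rp/wp fine-grained check:
# the target storage area is split into regions 0..NUM_REGIONS-1.
NUM_REGIONS = 4  # fine-grained regions 22, 23, 24, 25 in the figures

def can_read(rp: int, wp: int) -> bool:
    """Read may proceed iff its current region trails the write's region."""
    return rp < wp  # rp == wp means the two regions overlap -> block

trace = []
rp, wp = 0, 2                   # as in fig. 3b: read at region 22, write at 24
trace.append(can_read(rp, wp))  # True: regions differ, read proceeds
rp += 1                         # rp++ after finishing region 22
rp += 1                         # read now at region 24
trace.append(can_read(rp, wp))  # False: same region, read is blocked
wp += 1                         # write finishes region 24, moves to 25
trace.append(can_read(rp, wp))  # True again: read may resume
```

Blocking and resuming thus reduces to re-evaluating a single pointer comparison whenever either pointer advances.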

In one possible implementation, at least one of the first operation and the second operation may be a write operation. That is, when the operation on the operand is read after write (the second operation is a write operation, the first operation is a read operation), write after read (the second operation is a read operation, the first operation is a write operation), or write after write (both the second operation and the first operation are write operations), the method in the embodiment of the present disclosure may be adopted.

For example, suppose the first operation is a read operation for a descriptor, the second operation is a write operation, the tensor data indicated by the descriptor that the first operation needs to read must be the data written by the second operation, and the second fine-grained region in the target storage area to which the second operation is currently directed is numbered 8. Then the first operation may only read data from fine-grained regions numbered lower than 8; that is, if the first fine-grained region in the target storage area to which the first operation is currently directed is any of the fine-grained regions numbered 1 to 7, the first operation may be performed.

In a possible implementation manner, if the first operation and the second operation are both read operations, the relationship between the fine-grained regions of the first operation and the second operation does not affect the operation result, and the method in the embodiment of the present disclosure may be adopted, or the first operation may be directly executed without determining the fine-grained region.

In this embodiment, when at least one of the first operation and the second operation is a write operation, dividing the target storage area of the tensor data indicated by the descriptor into one or more fine-grained regions and executing operations in units of fine-grained regions allows operations such as write after read, read after write, and write after write to be executed correctly with accurate results, reduces the waiting time between operations, and improves the execution efficiency of the processor.

In a possible implementation manner, the size and/or the number of the fine-grained regions may be determined according to at least one of the region in which data of a set length is located and the region in which data of a set dimension is located.

It is understood that the size and/or the number of the fine-grained regions may be determined in advance, before the operations are generated, or in real time when each operation is generated. The size and/or the number may be determined according to at least one of the region in which data of a set length is located and the region in which data of a set dimension is located. The data of the set length and the data of the set dimension may be independent of the operands of the operations, or may be determined comprehensively according to those operands, as required. Determining the size and/or the number of the fine-grained regions in real time when each operation is generated may include determining the data of the set length or the data of the set dimension according to the operand of each operation; that is, the region in which the data of the set length or the set dimension is located is determined in real time according to the operand of each operation, and the size and/or the number of the fine-grained regions are determined accordingly.

For example, the size and/or the number of the fine-grained regions may be determined according to the size of the region in which data of a set length is located; for instance, the size of a fine-grained region may be set according to the size of the region, of a fixed bit width, occupied by data of the set length in the tensor data indicated by the descriptor in the target storage area. Suppose the tensor data B indicated by the descriptor is three-dimensional data of 20 × 10 × 5 and is stored in the target storage area as 40 × 25 (that is, 40 bits of data per line, 25 lines in total). The set length may then be 40 bits, and 1 line of the target storage area may be set as one fine-grained region, dividing the target storage area of the tensor data B into 25 fine-grained regions; alternatively, every 5 lines of the target storage area may be set as one fine-grained region, dividing the target storage area into 5 fine-grained regions. The present disclosure is not so limited.
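The division in this example can be sketched as follows; the helper name `partition` is a hypothetical choice, not from the source. The 25-line target storage area of tensor data B yields 25 fine-grained regions at 1 line (40 bits) each, or 5 regions at 5 lines each.

```python
# Sketch (assumed helper name) of dividing a target storage area into
# fine-grained regions by a set length, expressed in storage lines.
def partition(total_lines: int, lines_per_region: int) -> int:
    """Number of fine-grained regions when each holds a fixed line count."""
    assert total_lines % lines_per_region == 0, "region size must divide area"
    return total_lines // lines_per_region

# Tensor B stored as 40 bits x 25 lines:
print(partition(25, 1))  # 25 regions of 1 line (40 bits) each
print(partition(25, 5))  # 5 regions of 5 lines each
```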

It is understood that, according to at least one of the region in which data of the set length is located and the region in which data of the set dimension is located, the size and/or the number of the fine-grained regions may be determined within the target storage area, within the entire storage area in which the target storage area is located, or within other areas of the entire storage area. The above example shows only one case; the present disclosure does not limit the range of division over which the size and/or the number of the fine-grained regions are determined in this way.

In one possible implementation, the size and/or the number of the fine-grained regions may also be determined according to the size of the region in which data of a set dimension is located. For example, if the tensor data C indicated by the descriptor is two-dimensional data of 20 × 10, the target storage area of the tensor data C may be divided into 10 fine-grained regions according to data having a dimension of 1 and a length of 20.

In addition, the size and/or the number of the fine-grained regions may be determined according to both the size of the region in which data of the set length is located and the size of the region in which data of the set dimension is located within the target storage area of the tensor data indicated by the descriptor. For example, for the tensor data C indicated by the descriptor, the fine-grained regions may be divided according to data having a dimension of 2 and a size of 4 × 2, so that the target storage area of the tensor data C is divided into 25 fine-grained regions.

It should be understood that, the size and/or number of the divided fine-grained regions can be set by those skilled in the art according to practical situations, and the disclosure is not limited thereto.

In this embodiment, the size and/or the number of the fine-grained regions are determined according to the size of the region in which data of the set length is located and/or the size of the region in which data of the set dimension is located, so the fine-grained regions can be divided according to the characteristics of the data. This improves the flexibility of fine-grained division and the efficiency of executing multiple operations, lets the division result better match the characteristics of different operands, meets the processing requirements of different types of operands, and further improves the overall execution efficiency of multiple operations.

In one possible implementation, the size and/or the number of the fine-grained regions may be determined according to at least one of hardware computing power and hardware bandwidth.

The hardware computing capability may be the amount of data that the hardware processes in parallel in one computing cycle, and the hardware bandwidth may be the data transmission capability, for example the amount of data transmitted per unit time.

For example, suppose the processor using this processor operation method has a hardware computing capability of processing 100 bits of data in parallel in one computing cycle and a hardware bandwidth of transmitting 200 bits of data per unit time. For a target storage area of 1000 bits, the area may be divided into 10 fine-grained regions according to the hardware computing capability, each containing 100 bits of data; or into 5 fine-grained regions according to the hardware bandwidth, each containing 200 bits of data.
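A minimal sketch of this example, with assumed names: the same 1000-bit target storage area yields different fine-grained divisions depending on whether the grain follows the hardware computing capability (100 bits per cycle) or the hardware bandwidth (200 bits per unit time).

```python
# Sketch (assumed names) of choosing fine-grained region count from
# hardware characteristics rather than from the data layout.
def split(area_bits: int, grain_bits: int) -> int:
    """Number of fine-grained regions when each region holds grain_bits."""
    return area_bits // grain_bits

compute_grain = 100    # bits processed in parallel in one computing cycle
bandwidth_grain = 200  # bits transmitted per unit time

print(split(1000, compute_grain))    # 10 regions of 100 bits
print(split(1000, bandwidth_grain))  # 5 regions of 200 bits
```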

It should be understood that the hardware computing power and hardware bandwidth may vary according to the hardware of the processor, and the present disclosure does not limit the hardware computing power and hardware bandwidth.

It is to be understood that, according to at least one of the hardware computing capability and the hardware bandwidth, the size and/or the number of the fine-grained regions may be determined within the target storage area, within the entire storage area in which the target storage area is located, or within other areas of the entire storage area. The above example shows only one case; the present disclosure does not limit the range of division over which the size and/or the number of the fine-grained regions are determined in this way.

In this way, the size and/or the number of the fine-grained regions can be determined according to the processing capability (hardware computing capability and/or hardware bandwidth) of the processor, so that the division result better matches different hardware environments and operations executed on fine-grained regions keep pace with the processing capability of the processor. This exploits the hardware's execution efficiency as far as possible and improves the processing efficiency of the processor.

In one possible implementation, the first operation may be an operation in a first processing instruction, and the second operation may be an operation in a second processing instruction, the second processing instruction being a processing instruction preceding the first processing instruction in an instruction queue.

The first processing instruction and the second processing instruction may each include one or more of a data access instruction, an operation instruction, a synchronization instruction, and a communication instruction; for example, an instruction may be a read-and-operate instruction. The present disclosure does not limit the specific types of the first processing instruction and the second processing instruction.

In one possible implementation, the first processing instruction may have a dependency relationship with the second processing instruction; for example, the first processing instruction may need to use a calculation result of the second processing instruction. The first processing instruction may also have no dependency on the second processing instruction. Even when the two instructions have a dependency relationship, they can be executed in parallel by the method in the embodiment of the present disclosure.

In this embodiment, the first operation and the second operation may be operations in different instructions, and by using the method in the embodiment of the present disclosure, instruction execution efficiency may be improved.

In one possible implementation, the first operation and the second operation may also be two operations in the same processing instruction; the second operation may be independent of the first operation, or may be based on the result of the first operation. The processing instruction may include one or more of a data access instruction, an operation instruction, a synchronization instruction, and a communication instruction; for example, in a "read and add" operation instruction, the first operation is a read operation and the second operation is an add operation. The present disclosure does not limit the specific type of the processing instruction.

In one possible implementation, the method may further include: recording first position information of a first fine-grained region currently operated by the first operation and second position information of a second fine-grained region currently operated by the second operation,

step S14 may include: and when the first position information is inconsistent with the second position information, executing the first operation.

In one possible implementation, the position information may include identification information of the fine-grained region. The identification information may include any combination of numbers, letters, and symbols. The present disclosure is not limited thereto.

For example, the position information may be represented by the number information of the fine-grained regions: the target storage area may be divided into 6 fine-grained regions numbered 0-5 in order. When the first operation and the second operation on the target storage area are executed, the first position information of the first fine-grained region currently operated on by the first operation (for example, number 2) and the second position information of the second fine-grained region currently operated on by the second operation (for example, number 5) are recorded respectively. Then, from the relationship between the numbers (2 < 5), it can be determined that the first position information is inconsistent with the second position information, and the first operation may be performed.

In one possible implementation, the position information may include address information of a fine-grained region. The address information may include at least one of start address information, end address information, address information of a preset position within the fine-grained region, and address offset information of the fine-grained region. For example, the position information of a fine-grained region may be ADDR1515-ADDR1531.

In this way, the position information can be used to judge directly and efficiently whether the first fine-grained region and the second fine-grained region overlap, and thereby determine whether to execute the first operation, improving the execution efficiency of the processor.

In one possible implementation, the first position information may include a first number of fine-grained regions on which the first operation has been completed, and the second position information may include a second number of fine-grained regions on which the second operation has been completed.

Executing the first operation when the first position information is inconsistent with the second position information may then include: executing the first operation when the first operation is an operation issued after the second operation and the first number is smaller than the second number.

For example, suppose the first operation is issued after the second operation, the target storage area is divided into 6 fine-grained regions, and the position information of each operation is represented by the number of fine-grained regions it has completed. When the first operation and the second operation on the target storage area are executed, the first number of fine-grained regions completed by the first operation (for example, 3) and the second number completed by the second operation (for example, 5) are recorded respectively. Then, from the execution order of the two operations and the relationship between the two numbers, it is known that the first operation follows the second operation and that the first number 3 is smaller than the second number 5, so the first operation may be executed.
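The count-based check in this example can be sketched as follows; the function name is a hypothetical choice, not from the source. A first operation issued after the second operation may keep running only while its completed-region count trails the second operation's.

```python
# Sketch (assumed name) of the completed-count check: first_done and
# second_done are the numbers of fine-grained regions each operation
# has finished so far.
def may_run_later_op(first_done: int, second_done: int) -> bool:
    """Later-issued first operation runs while it trails the second."""
    return first_done < second_done

print(may_run_later_op(3, 5))  # True: 3 < 5, the first operation may run
print(may_run_later_op(5, 5))  # False: it has caught up and must wait
```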

By this method, whether the first fine-grained region and the second fine-grained region overlap can be judged intuitively from the numbers of fine-grained regions each operation has completed, and whether to execute the first operation determined accordingly, which simplifies the judgment process and improves the execution efficiency of the processor.

In one possible implementation, the target storage area may include an operable area and a non-operable area, and step S13 may include: when the second operation exists and a first fine-grained region in the target storage region currently targeted by the first operation is located in the operable region, judging whether an overlap exists between the first fine-grained region in the target storage region currently targeted by the first operation and a second fine-grained region in the target storage region currently targeted by the second operation.

In one possible implementation, the target storage area may include one or more non-operable areas, and may also include a continuous or discontinuous non-operable area.

In one possible implementation, the target storage area may include one or more operable areas, and may also include a continuous or discontinuous operable area. The present disclosure is not limited thereto.

In one possible implementation manner, when the first operation is an operation for a descriptor, a target storage area of tensor data indicated by the descriptor may be acquired first, and then it may be determined whether there is an ongoing second operation for the target storage area; when the second operation exists, judging whether a first fine-grained region in a target storage region aimed at by the first operation is located in an operable region; when a second operation exists and a first fine-grained region in a target storage region targeted by the first operation is located in an operable region, judging whether the first fine-grained region in the target storage region targeted by the first operation is overlapped with a second fine-grained region in the target storage region targeted by the second operation; when there is no overlap between the first fine-grained region and the second fine-grained region, the first operation is performed.

In one possible implementation, the non-operable area may include an operation-prohibited area and a non-prohibited (temporarily non-operable) area. If the first operation is a write operation for the descriptor and part of the data in the tensor data indicated by the descriptor must not be modified, the storage area in which that data is located may be set as an operation-prohibited area to avoid modifying it by mistake. If the ongoing second operation is a read operation that must read the data before the first operation writes it (write after read), the one or more fine-grained regions the second operation is working on may be set as a non-prohibited area, and when the second operation finishes reading that area, it may be changed back to an operable area. The present disclosure does not limit the classification and division of the non-operable area.

In this embodiment, whether the fine-grained region of the first operation is operable may be determined first, and only then the relationship between the fine-grained regions of the different operations. This improves the efficiency of judgment; it also protects designated data from erroneous operations, and allows designated space to be barred from reading and writing so that it is reserved for other operations, further improving the flexibility of the processor in executing fine-grained synchronization.

In one possible implementation, the non-operable area may be a plurality of fine-grained regions including the second fine-grained region, with the position of the second fine-grained region within the non-operable area updated along with the operation position of the second operation. The method may further include: updating the position of the non-operable area after the second fine-grained region in the target storage area to which the second operation is directed moves out of the non-operable area.

That is, the non-operable area comprising a plurality of fine-grained regions including the second fine-grained region need not be updated every time the second fine-grained region to which the second operation is directed advances; its position is updated only when that second fine-grained region moves out of the non-operable area. For example, the non-operable area may be R fine-grained regions including the second fine-grained region (R is an integer greater than 1), and the current non-operable area may include the 2nd to the (2+R-1)th fine-grained regions. After the second operation has executed on all R fine-grained regions of the non-operable area and moved out of it, the position of the non-operable area is updated to follow the fine-grained region to which the second operation is directed, so that the updated non-operable area includes the (2+R)th to the (2+2R-1)th fine-grained regions. The size of R can be determined arbitrarily as required.
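This lazy update rule can be sketched as follows, with assumed names and with indices matching the example (R = 3, window starting at the 2nd region): the window moves only after the second operation's current region steps past its upper end.

```python
# Sketch (assumed names) of the lazily updated non-operable window:
# R consecutive fine-grained regions starting at window_start.
def advance(window_start: int, R: int, second_pos: int) -> int:
    """New window start after the second operation moves to second_pos."""
    if second_pos >= window_start + R:  # moved out of the window
        return second_pos               # window jumps to follow it
    return window_start                 # otherwise it stays put

start, R = 2, 3               # regions 2..4 non-operable, as in fig. 4a
start = advance(start, R, 3)  # still inside -> unchanged
start = advance(start, R, 4)  # still inside -> unchanged
start = advance(start, R, 5)  # moved out -> window becomes 5..7 (fig. 4b)
print(start)  # 5
```

Because the window position changes only once per R regions, the bookkeeping cost of maintaining the non-operable area is amortized.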

Fig. 4a and 4b show schematic diagrams of application scenarios of a method of operation of a processor according to an embodiment of the present disclosure. As shown in fig. 4a, the target storage area 30 includes 8 fine-grained regions, wherein the operable area includes 5 fine-grained regions (fine-grained region 31, fine-grained region 35, fine-grained region 36, fine-grained region 37, and fine-grained region 38), and the non-operable area M0 includes 3 fine-grained regions (fine-grained region 32, fine-grained region 33, and fine-grained region 34). Wherein the second fine-grained region in the target storage region 30 to which the second operation is currently directed is the fine-grained region 32.

When the second operation has completed its operation on the fine-grained region 32, the second fine-grained region in the target storage area 30 to which it is currently directed becomes the fine-grained region 33; since this region has not moved out of the non-operable area, the position of the non-operable area is not updated. When the second operation has completed its operation on the fine-grained region 33, the current second fine-grained region becomes the fine-grained region 34; it has still not moved out of the non-operable area, so the position of the non-operable area is again not updated. When the second operation has completed its operation on the fine-grained region 34, the current second fine-grained region becomes the fine-grained region 35, which has moved out of the non-operable area, so the position of the non-operable area is updated to cover the fine-grained regions 35, 36, and 37. Note that the size of the non-operable area is not limited in the present disclosure.

As shown in fig. 4b, after updating the position of the non-operable area, in the target storage area 30, the operable area includes 5 fine-grained areas (fine-grained area 31, fine-grained area 32, fine-grained area 33, fine-grained area 34, and fine-grained area 38), and the non-operable area M0 includes 3 fine-grained areas (fine-grained area 35, fine-grained area 36, and fine-grained area 37).

In this way, the position of the non-operable area does not need to be updated in real time, and the overhead generated by updating the non-operable area can be reduced.

In a possible implementation manner, the non-operable area may be a plurality of fine-grained regions including the second fine-grained region, with the second fine-grained region located at a set position within the non-operable area, and the position of the non-operable area updated along with the operation position of the second operation.

That is, when the non-operable area is a plurality of fine-grained regions including the second fine-grained region, the position of the second fine-grained region within the non-operable area (for example, the middle position or the last position) may be set, and the position of the non-operable area is updated along with the operation position of the second operation. For example, the non-operable area may be R fine-grained regions including the second fine-grained region, the current non-operable area may include the 2nd to the (2+R-1)th fine-grained regions, and the set position of the second fine-grained region within the non-operable area may be the S-th (where S ≤ R). When the second operation finishes its operation on the current fine-grained region and begins to operate on the next one, the position of the non-operable area is updated along with the operation position of the second operation, so that the updated non-operable area includes the 3rd to the (2+R)th fine-grained regions. The size of R and the value of S can be determined as required. The present disclosure limits neither the number of fine-grained regions included in the non-operable area nor the position of the second fine-grained region within it.
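A sketch of this sliding variant, with assumed names: pinning the second operation's current region at set position S (1-based) inside an R-region window means the window advances one region each time the second operation advances.

```python
# Sketch (assumed names) of a non-operable window that slides with the
# second operation, keeping its current region at 1-based position S.
def window(second_pos: int, S: int, R: int) -> range:
    """Fine-grained region indices covered by the non-operable window."""
    start = second_pos - (S - 1)
    return range(start, start + R)

# fig. 5a: window covers regions 2..4, second operation at region 3 (S = 2)
print(list(window(3, 2, 3)))  # [2, 3, 4]
# second operation advances to region 4 -> window slides to 3..5 (fig. 5b)
print(list(window(4, 2, 3)))  # [3, 4, 5]
```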

Fig. 5a and 5b show schematic diagrams of application scenarios of a processor operation method according to an embodiment of the present disclosure. As shown in fig. 5a, the target storage area 40 includes 8 fine-grained regions, where the operable region includes 5 fine-grained regions (fine-grained region 41, fine-grained region 45, fine-grained region 46, fine-grained region 47, and fine-grained region 48), and the non-operable region M1 includes 3 fine-grained regions (fine-grained region 42, fine-grained region 43, and fine-grained region 44). Here, the set position of the second fine-grained region within the non-operable region M1 is the second position, so the second fine-grained region in the target storage region 40 to which the second operation is currently directed is the fine-grained region 43.

When the second operation has completed the operation on the fine-grained region 43, the second fine-grained region in the target storage region 40 to which the second operation is currently directed becomes the fine-grained region 44. At this time, the position of the non-operable region is updated with the operation position of the second operation, so that the fine-grained region currently targeted by the second operation remains at the second position within the non-operable region M1.

As shown in fig. 5b, after updating the position of the non-operable area, in the target storage area 40, the operable area includes 5 fine-grained areas (fine-grained area 41, fine-grained area 42, fine-grained area 46, fine-grained area 47, and fine-grained area 48), and the non-operable area M1 includes 3 fine-grained areas (fine-grained area 43, fine-grained area 44, and fine-grained area 45).

By this method, the position of the non-operable area can be updated in real time, which improves the degree of fine-grained synchronization and thereby further improves the efficiency of data synchronization processing.

In one possible implementation, the target storage area may include: a circular buffer memory area. The circular buffer memory area can be used for circularly storing data.

FIG. 6 shows a schematic diagram of a circular buffer storage area of a processor operation method according to an embodiment of the present disclosure. As shown in fig. 6, the target storage area 50 includes a circular buffer storage area 51 with addresses start_addr to end_addr.

For example, suppose the second operation is a write operation that writes an operand into the circular buffer storage area 51. The address pointer point stores data sequentially starting from start_addr and moving downwards until end_addr, at which point the storage space of the circular buffer storage area 51 is fully occupied. The address pointer point then jumps back to start_addr, and it is determined whether the data at that address has already been used by the first operation requiring synchronization; if it has, new data is stored at the address, overwriting the original data. The address pointer point then moves downwards sequentially until end_addr, at which point the data can be overwritten again, and the above process repeats in a cycle.
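The wraparound write behavior can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the class name, the per-address `consumed` set standing in for "already used by the first operation", and integer addresses are all assumptions:

```python
class CircularBuffer:
    """Sketch of the circular write described above: the write pointer
    advances from start_addr to end_addr, then wraps back and overwrites
    old data only once the synchronized reader has consumed it."""

    def __init__(self, start_addr, end_addr):
        self.start_addr, self.end_addr = start_addr, end_addr
        self.point = start_addr  # write address pointer
        # an empty buffer: every slot is free to be written
        self.consumed = set(range(start_addr, end_addr + 1))
        self.data = {}

    def write(self, value):
        if self.point not in self.consumed:
            return False  # reader has not used this slot yet: stall
        self.data[self.point] = value  # store (or overwrite) the slot
        self.consumed.discard(self.point)
        # advance, wrapping from end_addr back to start_addr
        self.point = (self.start_addr if self.point == self.end_addr
                      else self.point + 1)
        return True
```

For instance, with a 4-slot buffer, four writes succeed, a fifth stalls until the first slot has been consumed by the synchronized read operation.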

In this embodiment, the circular buffer storage area is used to store data, which not only saves the data storage space, but also improves the utilization rate of the storage space.

In one possible implementation, the circular buffer storage area may be divided into multiple fine-grained regions. For each fine-grained region, whether the data in it can be overwritten may be managed through a list, a flag bit, or other means; for example, a coverage flag bit can be set to indicate whether the data in the fine-grained region can be overwritten.

For example, suppose the first operation is a read operation and the second operation is a write operation, i.e., read after write. A write pointer wp and a read pointer rp may be used to indicate the fine-grained regions currently targeted by the second operation and the first operation, respectively. When the coverage flag bit of the second fine-grained region currently targeted by the second operation indicates coverable, the second operation is executed and data is written; after the write completes, the coverage flag bit of that region is set to non-coverable and wp is incremented (wp++), so that the second operation targets the next fine-grained region; if wp exceeds end_addr, wp is set back to start_addr. When the first fine-grained region currently targeted by the first operation does not overlap the second fine-grained region and the coverage flag bit of the first fine-grained region indicates non-coverable, the first operation is executed and data is read; after the read completes, the coverage flag bit of the first fine-grained region is set to coverable and rp is incremented (rp++), so that the first operation targets the next fine-grained region; if rp exceeds end_addr, rp is set back to start_addr. When the first fine-grained region and the second fine-grained region overlap, i.e., rp equals wp, the first operation cannot be executed and must wait until the second operation finishes operating on its current second fine-grained region.
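The read-after-write scheme with wp, rp, and coverage flags can be sketched as a small class. Region indices stand in for addresses, and the method names are illustrative assumptions:

```python
class FGCircularBuffer:
    """Sketch of the read-after-write scheme above: each fine-grained
    region carries a coverable flag; wp and rp track the regions the
    write (second) and read (first) operations currently target."""

    def __init__(self, n_regions):
        self.n = n_regions
        self.coverable = [True] * n_regions  # empty regions may be written
        self.data = [None] * n_regions
        self.wp = 0  # second operation (write) pointer
        self.rp = 0  # first operation (read) pointer

    def try_write(self, value):
        if not self.coverable[self.wp]:
            return False  # unread data present: cannot overwrite
        self.data[self.wp] = value
        self.coverable[self.wp] = False  # non-coverable until read
        self.wp = (self.wp + 1) % self.n  # wp++, wrapping past end_addr
        return True

    def try_read(self):
        # rp == wp means the two regions overlap: the read must wait
        if self.rp == self.wp or self.coverable[self.rp]:
            return None
        value = self.data[self.rp]
        self.coverable[self.rp] = True  # region may be overwritten again
        self.rp = (self.rp + 1) % self.n  # rp++, wrapping past end_addr
        return value
```

As long as rp trails wp, reads and writes on different fine-grained regions proceed concurrently; a read attempted at rp == wp simply stalls, as described above.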

In this embodiment, the circular buffer storage area is divided into a plurality of fine-grained regions, so that multiple operations can operate on the circular buffer storage area simultaneously, thereby improving the processing efficiency of the processor.

In one possible implementation, the fine-grained region may include a status identifier, which may indicate either an operation-completed status or an operation-uncompleted status of the fine-grained region.

step S14 may include: when the first fine-grained region and the second fine-grained region are not overlapped, judging whether the state identifier of the first fine-grained region is in an operation finished state; and if so, executing the first operation.

In one possible implementation, the fine-grained region may include a status identifier, and the status identifier may include an operation completed status or an operation uncompleted status of operating the fine-grained region. For example, the status flag may be represented using 0 and 1, where 0 represents an operation incomplete status of operating the fine-grained region, and 1 represents an operation completed status of operating the fine-grained region, or 0 represents an operation completed status of operating the fine-grained region, and 1 represents an operation incomplete status of operating the fine-grained region. The present disclosure does not limit the manner in which the status flags are presented.

In one possible implementation manner, the second operation may set the status identifiers of the fine-grained regions in the target storage region on which it has completed the operation to the operation-completed status, and set the status identifiers of the fine-grained regions that have not been operated on or are being operated on to the operation-uncompleted status. Alternatively, only part of the completed fine-grained regions may be marked as operation-completed, with all other fine-grained regions marked as operation-uncompleted. For example, if the second operation has completed 5 fine-grained regions, the status identifiers of the first 3 may be set to the operation-completed status and those of all other fine-grained regions to the operation-uncompleted status.

In a possible implementation manner, when there is an ongoing second operation directed at the target storage area, then for the first fine-grained region currently targeted by the first operation and the second fine-grained region currently targeted by the second operation, after it is determined that the two regions do not overlap, it may further be determined whether the state identifier of the first fine-grained region is the operation-completed state; if so, the first operation may be executed.
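The two-stage check (overlap first, then status identifier) can be sketched as follows, using the example encoding from above where 1 denotes operation-completed and 0 operation-uncompleted; the function name and list representation are assumptions:

```python
def may_execute_first_op(first_region, second_region, status):
    """Sketch of the check above: the first operation runs only if its
    fine-grained region differs from the one the second operation
    currently targets AND that region's status identifier reads
    operation-completed (1 = completed, 0 = not completed)."""
    if first_region == second_region:
        return False  # regions overlap: the first operation waits
    return status[first_region] == 1

# The second operation has completed regions 0-2 and works on region 3:
status = [1, 1, 1, 0, 0, 0]
assert may_execute_first_op(1, 3, status) is True
assert may_execute_first_op(3, 3, status) is False  # overlap
assert may_execute_first_op(4, 3, status) is False  # not yet completed
```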

In this embodiment, the fine-grained region includes a state identifier, and when the first fine-grained region and the second fine-grained region do not overlap, whether the first operation is executable or not is determined according to the state identifier of the first fine-grained region, so that the processing efficiency of the processor can be improved, and the accuracy of data processing can be improved.

In one possible implementation, the fine-grained region may include a status identifier, which may indicate either an operation-completed status or an operation-uncompleted status of the fine-grained region.

step S14 may include: judging whether the state identifier of the first fine-grained region is in an operation finished state or not; and if so, executing the first operation when the first fine-grained region and the second fine-grained region are not overlapped.

That is, when there is an ongoing second operation directed at the target storage area, then for the first fine-grained region currently targeted by the first operation and the second fine-grained region currently targeted by the second operation, after the state identifier of the first fine-grained region is determined to be the operation-completed state, whether the first operation is executable may be determined according to the overlap relationship between the two regions. The first operation may be executed when the first fine-grained region and the second fine-grained region do not overlap.

In this embodiment, the fine-grained region includes a state identifier, and after it is determined that the state identifier of the first fine-grained region is the operation completed state, it may be determined whether the first operation is executable according to an overlapping relationship between the first fine-grained region and the second fine-grained region, so that the accuracy of data processing may be improved, and the processing efficiency of the processor may be improved.

In one possible implementation, the second operation is an operation on the descriptor.

That is, the operand of the second operation is the same as the operand of the first operation: the second operation and the first operation are operations directed at the same descriptor, and the storage area of the operand of the second operation completely overlaps the target storage area. By dividing the target storage area of the tensor data indicated by the descriptor into a plurality of fine-grained areas, the method in the embodiments of the present disclosure allows two operations on the same descriptor to be executed in parallel without affecting the execution result of either operation.

In one possible implementation, when the operand of the second operation is the same as the operand of the first operation, the first fine-grained region and the second fine-grained region may be determined according to the execution progress of the operation, respectively.

In a possible implementation manner, when it is determined, before the first operation performs an operation on the target storage area, whether there is an ongoing second operation on the target storage area, the first fine-grained region in the target storage area currently targeted by the first operation is the fine-grained region on which the first operation is about to operate. For example, before the first operation starts operating on the target storage area, the first fine-grained region it currently targets is typically the first fine-grained region of the target storage area; at this time, the first operation has not yet operated on that region. The second fine-grained region currently targeted by the ongoing second operation depends on the execution progress of the second operation. If the second operation has also only just started executing, the second fine-grained region may likewise be the first fine-grained region of the target storage region, in which case the first fine-grained region and the second fine-grained region overlap. If the second operation has already completed the operation on the first fine-grained region and currently targets the P-th fine-grained region (P being an integer greater than 1), the first fine-grained region and the second fine-grained region do not overlap.
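Determining each operation's current fine-grained region from its execution progress can be sketched as follows; counting progress as the number of regions already finished (so the current region is a 0-based index) is an assumption for illustration:

```python
def current_region(regions_done):
    """Region an operation currently targets, given how many fine-grained
    regions it has already finished (0-based index)."""
    return regions_done

def first_op_may_start(first_done, second_done):
    """Per the discussion above: before the first operation touches the
    target storage area, compare the region it is about to operate on
    with the region the in-flight second operation currently targets."""
    return current_region(first_done) != current_region(second_done)

# Both operations just started: both target region 0, so the first waits.
assert first_op_may_start(0, 0) is False
# The second operation has finished region 0 and moved on (P = 2, i.e.
# 0-based index 1), so the regions no longer overlap:
assert first_op_may_start(0, 1) is True
```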

In a possible implementation manner, when it is determined whether there is an ongoing second operation on the target storage area in the operation process of the first operation on the target storage area, the first fine-grained region may be determined according to an execution process of the first operation, the second fine-grained region may be determined according to an execution process of the second operation, and then it is determined whether the first fine-grained region and the second fine-grained region overlap. The first operation may be performed when there is no overlap between the first fine-grained region and the second fine-grained region.

In a possible implementation manner, if the beats (paces) of the two operations' execution processes are consistent, it is sufficient to determine, only before the first operation executes on the target storage area, whether there is an ongoing second operation on the target storage area and whether the first fine-grained area and the second fine-grained area overlap. Consistent beats mean that, for fine-grained regions of the same size, the two operations take the same time to operate on one fine-grained region.

In a possible implementation manner, if the beats of the two operations' execution processes are inconsistent, or it cannot be determined whether they are consistent, then during the first operation's operation on the target storage area, each time the operation on the currently targeted first fine-grained region is completed, it is determined again whether a second operation on the target storage area is currently in progress and whether the first fine-grained region and the second fine-grained region overlap, so as to decide whether the first operation can continue.

In this embodiment, when the operand of the second operation is the same as the operand of the first operation, the method in the embodiments of the present disclosure may be used to execute two or more operations directed at the same descriptor in parallel: the fine-grained region currently targeted by each operation is determined according to its execution progress, and the first operation is executed when the first fine-grained region and the second fine-grained region do not overlap. In this way, two or more operations on the same descriptor can be executed in parallel without affecting the execution result of any of them, thereby improving the execution efficiency of the processor.

In one possible implementation, the storage area of the operand of the second operation overlaps with the target storage area.

In a possible implementation manner, when the storage area of the operand of the second operation overlaps the target storage area and the second operation is operating on the overlapping area, the storage area of the operand of the second operation and the target storage area may each be divided into a plurality of fine-grained regions, or the entire storage area containing the target storage area may be divided into a plurality of fine-grained regions. According to the method in the embodiments of the present disclosure, the fine-grained regions within the overlapping area are then examined to determine whether the first operation may be executed, so that two operations whose operand storage areas overlap can be executed in parallel without affecting the execution result of either operation.

In one possible implementation, when the second operation is an operation directed at a descriptor and that descriptor differs from the descriptor targeted by the first operation, whether the storage area of the operand of the second operation may overlap the target storage area can be determined by whether the storage areas of tensor data indicated by descriptors were allowed to overlap when the descriptors were created.

In one possible implementation, if the storage areas of tensor data indicated by descriptors are not allowed to overlap when the descriptors are created, then when the descriptor targeted by the second operation differs from the descriptor targeted by the first operation, the storage area of the tensor data indicated by the second operation's descriptor does not overlap the target storage area. In that case the fine-grained areas of the first operation and the second operation cannot overlap, and no fine-grained judgment is required for operations directed at different descriptors.

In one possible implementation, if the storage regions of tensor data indicated by descriptors are allowed to overlap when the descriptors are created, then when the descriptor targeted by the second operation differs from the descriptor targeted by the first operation, the storage region of the tensor data indicated by the second operation's descriptor may overlap the target storage region indicated by the first operation's descriptor. In that case, the method in the embodiments of the present disclosure may be adopted so that two operations directed at different descriptors with overlapping regions can be executed in parallel.

In this embodiment, when the storage area of the operand of the second operation overlaps the target storage area, the method in the embodiments of the present disclosure can be used to execute two or more operations in parallel, thereby improving the execution efficiency of the processor.

In one possible implementation, the method may further include: and dividing the whole storage area where the target storage area is located into a plurality of fine-grained areas.

In one possible implementation, the target storage area may be a partial storage area or a whole storage area in an overall storage area of the storage device, where the overall storage area includes a plurality of preset fine-grained areas.

For example, suppose the entire storage area containing the target storage area is RAM1, and RAM1 includes m preset fine-grained areas (m being a positive integer). The target storage area may occupy n fine-grained regions in RAM1 (n being a positive integer with n ≤ m). It should be noted that the target storage area may also include only part of a fine-grained area. Continuing the RAM1 example, assume each fine-grained region is one row of the overall storage area RAM1, with 100 bits per row. The target storage area may then comprise the first (n-1) full fine-grained regions plus a partial region of the last fine-grained region, for example the first 80 bits of the n-th row (the n-th fine-grained region) of RAM1.
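Mapping a target storage area onto the preset fine-grained rows can be sketched as a small helper; the bit-offset interface and 0-based row numbering are assumptions made for the example:

```python
def regions_touched(offset_bits, length_bits, row_bits=100):
    """Which preset fine-grained regions (rows of `row_bits` bits each) a
    target storage area occupies, given its bit offset and length within
    the overall storage area. A sketch of the RAM1 example above."""
    first = offset_bits // row_bits
    last = (offset_bits + length_bits - 1) // row_bits
    return list(range(first, last + 1))

# A target area starting at bit 0 that covers 4 full rows plus the first
# 80 bits of the 5th row touches rows 0..4 (n = 5 in the example above):
assert regions_touched(0, 4 * 100 + 80) == [0, 1, 2, 3, 4]
```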

In one possible implementation, when the entire storage area of the storage apparatus is divided into a plurality of fine-grained regions, any operation on any target storage area within that entire storage area can determine the fine-grained regions of the target storage area of tensor data indicated by the descriptor targeted by the first operation, or of the overlapping area between the storage area of the second operation's operand and the target storage area, according to the fine-grained division of the entire storage area. Any operand of any operation stored in the entire storage area thus has fine-grained areas of the same size.

In a possible implementation manner, the size and/or number of the fine-grained regions of the entire storage area may be determined according to the hardware characteristics of the storage device, that is, according to at least one of the hardware computing capacity and the hardware bandwidth of the storage device.

In this embodiment, the entire storage area containing the target storage area is divided into a plurality of fine-grained areas, so any operation on any target storage area within it is executed with the same fine-grained size. When different operations are run in parallel according to the method in the embodiments of the present disclosure, they can therefore be synchronized more conveniently, which improves the parallelism of the operations and, further, the processing efficiency of the processor.

In one possible implementation, the method may further include:

dividing the target storage area into a plurality of fine-grained regions according to first fine-grained division information carried in a first operation, and

dividing a storage area of an operand of a second operation into a plurality of fine-grained areas according to second fine-grained division information carried in the second operation.

In one possible implementation, fine-grained division information may be carried in an operation, and the information may include the size and/or number of the fine-grained regions. Different operations may carry different fine-grained division information, while operations of the same type may carry the same information. The fine-grained division information may be carried at the position where the operand is set in the operation, and the operation code or operand may carry identification information indicating whether fine-grained division is performed. The present disclosure limits neither the content nor the representation of the fine-grained division information.

In a possible implementation manner, the target storage area is divided into a plurality of first fine-grained regions according to the first fine-grained division information carried in the first operation. The other regions of the entire storage area containing the tensor data indicated by the first operation's descriptor may be left undivided, or may be divided according to fine-grained division information carried by other operations. The present disclosure is not limited in this respect.

It is to be understood that when the operand of the second operation coincides with the tensor data indicated by the descriptor targeted by the first operation, the storage area of the second operation's operand completely overlaps the target storage area. The first fine-grained division information and the second fine-grained division information may or may not be consistent. When they are inconsistent, the target storage area may simultaneously be divided according to the second fine-grained division information; that is, different operations may divide one target storage area into fine-grained regions of different sizes or numbers. In this case, whether the first fine-grained region and the second fine-grained region overlap may be determined according to the physical address of the first fine-grained region currently targeted by the first operation and the physical address of the second fine-grained region currently targeted by the second operation, and the first operation and the second operation are executed in parallel according to the result of this determination.
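When the two operations use different granularities, the overlap test on physical addresses can be sketched as follows; the helper names and half-open address ranges are illustrative assumptions:

```python
def region_addr_range(base, region_size, index):
    """Physical address range [start, end) of the `index`-th fine-grained
    region of a storage area starting at physical address `base`."""
    start = base + index * region_size
    return start, start + region_size

def regions_overlap(base, size_a, idx_a, size_b, idx_b):
    """When two operations divide the same target area with different
    fine-grained division information, overlap is judged on physical
    addresses rather than region indices, as described above."""
    a0, a1 = region_addr_range(base, size_a, idx_a)
    b0, b1 = region_addr_range(base, size_b, idx_b)
    return a0 < b1 and b0 < a1  # standard interval-intersection test

# First operation uses 64-byte regions, second uses 32-byte regions:
assert regions_overlap(0x1000, 64, 0, 32, 1) is True   # bytes 32-63 fall in region 0
assert regions_overlap(0x1000, 64, 0, 32, 2) is False  # byte 64 onward does not
```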

In a possible implementation manner, the fine-grained division information carried in each operation may include a size and/or number of fine-grained regions determined according to at least one of the region occupied by operation data of a set length and the region occupied by an operand of a set dimension, so that the fine-grained division result better matches the type or attributes of the operand in the operation.

In this embodiment, the target storage area is divided into a plurality of fine-grained regions according to the first fine-grained division information carried in the first operation, and the storage area of the second operation's operand is divided into a plurality of fine-grained regions according to the second fine-grained division information carried in the second operation. Performing fine-grained division according to information carried in the operations themselves lets the division results better meet the processing requirements of each operation, making parallel execution more flexible.

It should be understood that those skilled in the art can divide and set the target storage area into fine-grained areas according to actual situations, and the disclosure is not limited thereto.

In one possible implementation, the descriptor may be used to indicate a shape of tensor data of dimension N, N being an integer greater than or equal to zero, wherein the content of the descriptor may include at least one shape parameter representing the shape of the tensor data.

In one possible implementation, descriptors can be used to indicate the shape of the N-dimensional tensor data. The value of N may be determined according to the dimension (order) of the tensor data, or may be set according to the usage requirement of the tensor data. For example, the tensor data is three-dimensional data, N can take a value of 3 (determined by dimensions), and descriptors can be used to indicate the shape (e.g., offset, size, etc.) of the tensor data in three dimensional directions. It should be understood that the value of N can be set by those skilled in the art according to actual needs, and the disclosure does not limit this.

In one possible implementation, the descriptor may include an identifier and content, etc., and the identifier of the descriptor may be used to distinguish the descriptor, such as a number; the content of the descriptor may include at least one shape parameter (e.g., dimensions in various dimensional directions of the tensor, etc.) representing the shape of the tensor data. The present disclosure does not limit the specific shape parameters that the content of the descriptor includes.

In this embodiment, the shape of the tensor data can be indicated using the descriptor, and the correlation between the plurality of tensor data can be determined by the shape of the tensor data indicated by the descriptor, thereby improving the access efficiency of the processor.

In one possible implementation, the identity and content of the descriptor may be stored in a descriptor storage space, which may be a storage space in an internal memory of the control unit (e.g., a register, an on-chip SRAM, or other media cache, etc.). The target storage area of the tensor data indicated by the descriptor may be a storage space in an internal memory (e.g., an on-chip cache) of the control unit or an external memory (e.g., an off-chip memory) connected to the control unit. The data address in the target storage area may be an actual physical address or a virtual address. The present disclosure does not limit the location of the descriptor storage space and the target storage area and the type of data address.

In one possible implementation, the identification, content, and tensor data indicated by the descriptor may be located in the same block. For example, a contiguous block of on-chip cache with addresses ADDR0-ADDR1023 may be used to store the content associated with the descriptor: addresses ADDR0-ADDR31 may store the descriptor's identification, addresses ADDR32-ADDR63 may store the descriptor's content, and addresses ADDR64-ADDR1023 may store the tensor data indicated by the descriptor. Here, an address ADDR denotes one address unit, which need not be one bit or one byte. The storage area and its addresses can be determined by those skilled in the art according to the actual situation, and the present disclosure is not limited in this respect.
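The example layout can be written out as a simple lookup. The boundaries are exactly the example values from the text above; the dictionary representation and function name are assumptions:

```python
# Sketch of the contiguous on-chip layout described above: one block of
# 1024 address units holding the descriptor's identification, its
# content, and the tensor data it indicates.
LAYOUT = {
    "identification": range(0, 32),    # ADDR0  - ADDR31
    "content":        range(32, 64),   # ADDR32 - ADDR63
    "tensor":         range(64, 1024), # ADDR64 - ADDR1023
}

def field_at(addr):
    """Which part of the descriptor block an address unit belongs to."""
    for name, addrs in LAYOUT.items():
        if addr in addrs:
            return name
    raise ValueError(f"address {addr} outside the descriptor block")

assert field_at(0) == "identification"
assert field_at(40) == "content"
assert field_at(500) == "tensor"
```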

In one possible implementation, the identifier and content of the descriptor and the tensor data indicated by the descriptor may be stored in different areas of the internal memory. For example, a register may be used as the descriptor storage space to store the descriptor's identifier and content, while an on-chip cache serves as the overall storage area storing the tensor data indicated by the descriptor.

In a possible implementation, a special register (SR) dedicated to the descriptor may be provided, and the data in the descriptor may be an immediate value or may be fetched from the special register. When a register is used to store the descriptor's identifier and content, the identifier may be represented by the register number; for example, if the register number is 0, the identifier of the descriptor stored there is 0. When the descriptor in the register is valid, an area may be allocated in the buffer space according to the size of the tensor data indicated by the descriptor (for example, a tensor buffer unit may be created in the cache for each tensor data) for storing the tensor data. It should be understood that the tensor data may also be stored in a preset buffer space, which is not limited by the present disclosure.

In one possible implementation, the identity and content of the descriptors may be stored in an internal memory and the tensor data indicated by the descriptors may be stored in an external memory. For example, the identification and content of the descriptors may be stored on-chip, and the tensor data indicated by the descriptors may be stored under-chip.

In one possible implementation, the data address of the storage area corresponding to the identification of the descriptor may be a fixed address. For example, separate storage areas may be divided for tensor data, and the start address of each tensor data in the storage area corresponds to the identifier of the descriptor one by one. In this case, the target storage area of the tensor data indicated by the descriptor can be directly determined according to the identifier and the corresponding relation of the descriptor.

In one possible implementation, the descriptor may further be used to indicate an address of the N-dimensional tensor data, wherein the content of the descriptor may further include at least one address parameter representing the address of the tensor data.

In one possible implementation, the descriptor may also be used to indicate an address of the N-dimensional tensor data, where the address may be, for example, a start address of the tensor data, an address offset, a register address, or other address, and the disclosure does not limit the type of the address. For example, the address may be a start address of tensor data, and when a data address of a storage region corresponding to the identifier of the descriptor is a fixed address, the start address of the N-dimensional tensor data indicated by the descriptor is the fixed address; when the data address of the storage region corresponding to the identification of the descriptor is a variable address, the start address of the N-dimensional tensor data indicated by the descriptor may be determined according to its specific storage region. The present disclosure is not so limited.

In one possible implementation, the content of the descriptor may further include at least one address parameter representing an address of the tensor data. For example, where the tensor data is 3-dimensional data and a descriptor is used to indicate its address, the content of the descriptor may include one address parameter indicating the address of the tensor data (e.g., the start address of the tensor data), or may include a plurality of address parameters representing that address (e.g., a start address plus an address offset, or address parameters of the tensor data for each dimension). The address parameters can be set by those skilled in the art according to actual needs, and the disclosure does not limit this.

In this embodiment, the descriptor can be used to indicate the address of the tensor data, and the correlation among the plurality of tensor data can be determined by the address of the tensor data indicated by the descriptor, thereby improving the access efficiency of the processor.

In one possible implementation, the address parameter of the tensor data includes a reference address of a data reference point of the descriptor in the entire storage area where the tensor data is located;

wherein the shape parameters of the tensor data comprise at least one of:

The size of the whole storage area where the tensor data are located in at least one direction of N dimensional directions, the size of the storage area of the tensor data in at least one direction of the N dimensional directions, the offset of the storage area of the tensor data in at least one direction of the N dimensional directions, the positions of at least two vertexes at diagonal positions of the N dimensional directions relative to the data reference point, and the mapping relation between the data description position of the tensor data indicated by the descriptor and the data address.

In one possible implementation, the address parameter of the tensor data includes a reference address of a data reference point of the descriptor in the entire storage area in which the tensor data is located. The reference address may vary with the choice of the data reference point. The present disclosure does not limit the selection of the data reference point.

In one possible implementation, the reference address may comprise the start address of the entire storage area. When the data reference point of the descriptor is the first data block of the entire storage area, the reference address of the descriptor is the start address of the entire storage area. When the data reference point of the descriptor is a data block other than the first data block in the entire storage area, the reference address of the descriptor is the physical address of that data block in the entire storage area.

In one possible implementation, the shape parameters of the tensor data include at least one of: the size of the whole storage area in which the tensor data are located in at least one direction of the N dimensional directions, the size of the storage area of the tensor data in at least one direction of the N dimensional directions, the offset of the storage area of the tensor data in at least one direction of the N dimensional directions, the positions of at least two vertexes located at diagonal positions of the N dimensional directions relative to the data reference point, and the mapping relation between the data description position of the tensor data indicated by the descriptor and the data address. Where the data description position is a mapping position of a point or a region in the tensor data indicated by the descriptor, for example, when the tensor data is 3-dimensional data, the descriptor may represent a shape of the tensor data using three-dimensional space coordinates (x, y, z), and the data description position of the tensor data may be a position of a point or a region in the three-dimensional space to which the tensor data is mapped, which is represented using three-dimensional space coordinates (x, y, z).

It should be understood that the shape parameters representing tensor data can be selected by one skilled in the art based on practical circumstances, and the present disclosure is not limited thereto.

FIG. 7 shows a schematic diagram of an overall storage area of a processor operation method according to an embodiment of the present disclosure. As shown in FIG. 7, the entire storage area 71 stores two-dimensional data in a row-first manner, which can be represented by (X, Y) (where the X axis points horizontally to the right and the Y axis points vertically downward). The size in the X-axis direction (the size of each row) is ori_x (not shown in the figure), the size in the Y-axis direction (the total number of rows) is ori_y (not shown in the figure), and the start address PA_start (the reference address) of the entire storage area 71 is the physical address of the first data block 72. The data block 73 is tensor data in the entire storage area 71; its offset 75 in the X-axis direction is denoted offset_x, its offset 74 in the Y-axis direction is denoted offset_y, its size in the X-axis direction is denoted size_x, and its size in the Y-axis direction is denoted size_y.

In one possible implementation, when a descriptor is used to define the data block 73, the data reference point of the descriptor may be the first data block of the entire storage area 71, and the reference address of the descriptor is then the start address PA_start of the entire storage area 71. The content of the descriptor of the data block 73 can then be determined by combining the size ori_x of the entire storage area 71 along the X axis, its size ori_y along the Y axis, the offset offset_y of the data block 73 in the Y-axis direction, its offset offset_x in the X-axis direction, its size size_x in the X-axis direction, and its size size_y in the Y-axis direction.
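A minimal sketch of such descriptor content for data block 73, using the parameters named above; all numeric values are made-up assumptions for illustration:

```python
# Illustrative descriptor content for data block 73 of FIG. 7, using
# the parameters named in the text; the concrete values are assumptions.
descriptor_content = {
    "PA_start": 0x2000,  # reference address: start of storage area 71
    "ori_x": 16,         # size of the whole area in the X direction (per row)
    "ori_y": 8,          # size of the whole area in the Y direction (rows)
    "offset_x": 3,       # offset of data block 73 in the X direction
    "offset_y": 2,       # offset of data block 73 in the Y direction
    "size_x": 4,         # size of data block 73 in the X direction
    "size_y": 3,         # size of data block 73 in the Y direction
}

# The number of elements of the tensor follows from the two sizes.
num_elements = descriptor_content["size_x"] * descriptor_content["size_y"]
```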

In one possible implementation, the content of the descriptor can be represented using the following formula (1):

it should be understood that, although the descriptor describes a two-dimensional space in the above example, the dimension of the content representation of the descriptor can be set by those skilled in the art according to the actual situation, and the disclosure does not limit this.

In one possible implementation, the content of the descriptor of the tensor data can be determined according to a reference address of the data reference point of the descriptor in the whole storage area and the positions of at least two vertexes at diagonal positions in the N-dimensional directions relative to the data reference point.

For example, the content of the descriptor of the data block 73 in FIG. 7 may be determined using the reference address PA_base of the data reference point of the descriptor in the entire storage area, together with the positions of two diagonal vertices relative to the data reference point. First, the data reference point of the descriptor and its reference address PA_base in the entire storage area are determined; for example, one data element (e.g., the data at position (2, 2)) may be selected as the data reference point in the entire storage area 71, and its physical address in the entire storage area 71 taken as the reference address PA_base. Then, the positions of at least two diagonal vertices of the data block 73 relative to the data reference point are determined, for example the vertices in the top-left to bottom-right direction, where the relative position of the top-left vertex is (x_min, y_min) and the relative position of the bottom-right vertex is (x_max, y_max). The content of the descriptor of the data block 73 can then be determined from the reference address PA_base, the relative position (x_min, y_min) of the top-left vertex, and the relative position (x_max, y_max) of the bottom-right vertex.
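The vertex-based construction described above can be sketched as follows; the function name and the dictionary layout are illustrative assumptions:

```python
# Hedged sketch: building descriptor content for a 2-D data block from
# the reference address PA_base and two diagonal vertices given as
# positions relative to the data reference point.
def descriptor_from_vertices(pa_base, top_left, bottom_right):
    x_min, y_min = top_left        # relative position of top-left vertex
    x_max, y_max = bottom_right    # relative position of bottom-right vertex
    return {
        "PA_base": pa_base,
        "x_min": x_min, "y_min": y_min,
        "x_max": x_max, "y_max": y_max,
    }
```

For instance, `descriptor_from_vertices(0x2000, (1, 1), (4, 3))` describes a block spanning relative columns 1 to 4 and relative rows 1 to 3.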

In one possible implementation, the content of the descriptor can be represented using the following equation (2):


it should be understood that although the above examples use two vertices of the upper left corner and the lower right corner to determine the content of the descriptor, those skilled in the art can set the specific vertex of the at least two vertices according to actual needs, and the disclosure is not limited thereto.

In one possible implementation, the content of the descriptor of the tensor data can be determined according to the reference address of the data reference point of the descriptor in the whole storage area and the mapping relation between the data description position and the data address of the tensor data indicated by the descriptor. The mapping relationship between the data description position and the data address may be set according to actual needs, for example, when tensor data indicated by the descriptor is three-dimensional space data, the mapping relationship between the data description position and the data address may be defined by using a function f (x, y, z).
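As a toy illustration of such a mapping f(x, y, z), assuming row-major 3-D data with one address unit per element (the layout and parameter names are assumptions, not part of this disclosure):

```python
# A toy mapping f(x, y, z) from a data description position to a data
# address, for row-major 3-D data stored contiguously from `base`.
def f(x, y, z, base, dim_x, dim_y):
    # z varies slowest, x fastest; one address unit per element.
    return base + ((z * dim_y + y) * dim_x + x)
```

Here `f(0, 0, 0, base, dim_x, dim_y)` is the start address of the tensor, and neighboring x positions map to consecutive addresses.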

In one possible implementation, the content of the descriptor can be represented using the following equation (3):

it should be understood that, a person skilled in the art may set the mapping relationship between the data description location and the data address according to practical situations, and the disclosure does not limit this.

In this embodiment, the content of the descriptor can be determined using the reference address of the tensor data and at least one shape parameter, so that different descriptors can be used according to the processing needs of the respective operations.

In a possible implementation manner, the position of the descriptor in the descriptor storage space can be determined according to the identifier of the descriptor, and the content of the descriptor can then be obtained from the descriptor storage space; the data address of the tensor data indicated by the descriptor in the whole storage area can be determined according to the content of the descriptor. The calculation of the data address can be completed automatically by hardware or implemented in software. When the contents of descriptors differ, the way of calculating the data address of the tensor data indicated by the descriptor in the entire storage area may also differ.

For example, in the case where the content of the descriptor is expressed by equation (1), the offsets of the tensor data indicated by the descriptor in the entire storage area are offset_x and offset_y respectively, and the size is size_x × size_y; the start data address PA1(x,y) of the tensor data indicated by the descriptor in the entire storage area may then be determined using the following equation (4):

PA1(x,y) = PA_start + (offset_y - 1) * ori_x + offset_x    (4)

For any data point in the tensor data indicated by the descriptor, let its data description position be (x_q, y_q); then the data address PA2(x,y) of that data point in the entire storage area may be determined using the following equation (5):

PA2(x,y) = PA_start + (offset_y + y_q - 1) * ori_x + (offset_x + x_q)    (5)
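Equations (4) and (5) can be transcribed directly; note that equation (5) reduces to equation (4) when the data description position (x_q, y_q) is (0, 0). The function names below are assumptions:

```python
# Direct transcriptions of equations (4) and (5) from the text.
def pa1(PA_start, ori_x, offset_x, offset_y):
    # Equation (4): start data address of the tensor data.
    return PA_start + (offset_y - 1) * ori_x + offset_x

def pa2(PA_start, ori_x, offset_x, offset_y, x_q, y_q):
    # Equation (5): address of the data point at description position (x_q, y_q).
    return PA_start + (offset_y + y_q - 1) * ori_x + (offset_x + x_q)
```

Moving the description position by one row adds ori_x to the address, and moving it by one column adds one, consistent with the row-first layout of FIG. 7.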

it should be understood that the calculation method for determining the data address can be determined by those skilled in the art according to practical situations, and the present disclosure is not limited thereto.

In this embodiment, the content of the descriptor can be obtained according to the identifier of the descriptor; according to the content of the descriptor, the data address of the tensor data indicated by the descriptor in the whole storage area can be determined; and from the data address, the target storage area of the tensor data indicated by the descriptor can be determined. In this way, the descriptor can be used in the data access process, so that the complexity of data access can be reduced and the processing efficiency of the processor can be improved.

It should be noted that, although the above embodiments are described as examples of the processor operation method, those skilled in the art can understand that the disclosure should not be limited thereto. In fact, the user can flexibly set each step according to personal preference and/or actual application scene, as long as the technical scheme of the disclosure is met.

FIG. 8 shows a block diagram of a processor operating device, according to an embodiment of the present disclosure. As shown in fig. 8, the processor operating device includes:

a region determining module 81, configured to, when the first operation is an operation on a descriptor, acquire a target storage region of tensor data indicated by the descriptor;

an operation determination module 82, configured to determine whether there is an ongoing second operation on the target storage area;

an overlap determining module 83, configured to determine, when the second operation exists, whether there is an overlap between a first fine-grained region in the target storage region to which the first operation is currently directed and a second fine-grained region in the target storage region to which the second operation is currently directed;

an execution module 84, configured to execute the first operation when the first fine-grained region and the second fine-grained region do not overlap.
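The flow through the four modules above can be sketched as follows, representing a fine-grained region as a half-open interval (start, end) of addresses; this interval representation and the function names are assumptions for illustration:

```python
# Hedged sketch of the overlap-determination and execution logic.
def regions_overlap(a, b):
    # Two half-open intervals (start, end) overlap iff each starts
    # before the other ends.
    return a[0] < b[1] and b[0] < a[1]

def may_execute_first(first_region, second_region=None):
    # No ongoing second operation on the target storage area: execute.
    if second_region is None:
        return True
    # Otherwise execute only when the current fine-grained regions of
    # the two operations do not overlap.
    return not regions_overlap(first_region, second_region)
```

Adjacent but non-overlapping fine-grained regions, such as (0, 4) and (4, 8), permit the first operation to proceed, which is what allows the two operations to pipeline through the same target storage area.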

In one possible implementation, the apparatus further includes: an execution control module to block the first operation when the first fine-grained region overlaps the second fine-grained region.

In one possible implementation, at least one of the first operation and the second operation is a write operation.

In a possible implementation manner, the size and/or the number of the fine-grained regions are determined according to at least one of a region in which the data with the set length is located and a region in which the data with the set dimension is located.

In one possible implementation, the size and/or number of the fine-grained regions is determined according to at least one of hardware computing power and hardware bandwidth.

In one possible implementation, the first operation is an operation in a first processing instruction, the second operation is an operation in a second processing instruction, and the second processing instruction is a processing instruction in an instruction queue prior to the first processing instruction.

In one possible implementation, the apparatus further includes: a position recording module, configured to record first position information of a first fine-grained region currently operated by the first operation and second position information of a second fine-grained region currently operated by the second operation,

the execution module 84 includes: and the first execution submodule is used for executing the first operation when the first position information is inconsistent with the second position information.

In one possible implementation, the first location information includes a first number of fine-grained regions on which a first operation has completed, the second location information includes a second number of fine-grained regions on which a second operation has completed,

The first execution sub-module is further configured to: when the first operation is an operation subsequent to the second operation and the first number is smaller than the second number, the first operation is executed.

In a possible implementation manner, the target storage area includes an operable area and a non-operable area, and the overlap determining module 83 includes: a determining submodule, configured to determine, when the second operation exists and a first fine-grained region in the target storage region to which the first operation is currently directed is located in the operable region, whether the first fine-grained region in the target storage region to which the first operation is currently directed overlaps with a second fine-grained region in the target storage region to which the second operation is currently directed.

In one possible implementation manner, the non-operable region is a plurality of fine-grained regions including the second fine-grained region, and a position of the second fine-grained region in the non-operable region is updated with an operation position of the second operation, and the apparatus further includes: and the updating module is used for updating the position of the non-operable area after the second fine-grained area in the target storage area targeted by the second operation is moved out of the non-operable area.

In a possible implementation manner, the non-operable area is a plurality of fine-grained areas including the second fine-grained area, and the second fine-grained area is located at a set position within the non-operable area, and the position of the non-operable area is updated with an operation position of the second operation.

In one possible implementation, the target storage area includes: a circular buffer memory area.

In one possible implementation manner, the fine-grained region includes a status identifier, the status identifier includes an operation completed status or an operation uncompleted status for operating the fine-grained region, and the execution module 84 includes:

a first state judgment submodule, configured to judge whether a state identifier of the first fine-grained region is an operation completed state when the first fine-grained region and the second fine-grained region are not overlapped;

and the second execution submodule is used for executing the first operation when the state of the first fine-grained region is identified as an operation finished state.

In one possible implementation manner, the fine-grained region includes a status identifier, the status identifier includes an operation completed status or an operation uncompleted status for operating the fine-grained region, and the execution module 84 includes:

A second state judgment submodule, configured to judge whether a state identifier of the first fine-grained region is an operation completed state;

and the third execution submodule is used for executing the first operation when the state of the first fine-grained region is identified as an operation finished state and the first fine-grained region and the second fine-grained region are not overlapped.

In one possible implementation, the second operation is an operation on the descriptor.

In one possible implementation, the storage area of the operand of the second operation overlaps with the target storage area.

In one possible implementation, the apparatus further includes: the first setting module is used for dividing the whole storage area where the target storage area is located into a plurality of fine-grained areas.

In one possible implementation, the apparatus further includes:

a second setting module, configured to divide the target storage area into multiple fine-grained regions according to the first fine-grained division information carried in the first operation, and

and the third setting module is used for dividing a storage area of an operand of the second operation into a plurality of fine-grained areas according to second fine-grained division information carried in the second operation.

In one possible implementation, the descriptor is used to indicate a shape of tensor data of N dimensions, N being an integer greater than or equal to zero, wherein the content of the descriptor includes at least one shape parameter representing the shape of the tensor data.

In one possible implementation, the descriptor is further configured to indicate an address of the N-dimensional tensor data, wherein the content of the descriptor further includes at least one address parameter representing the address of the tensor data.

In one possible implementation, the address parameter of the tensor data includes a reference address of a data reference point of the descriptor in the entire storage area where the tensor data is located; wherein the shape parameters of the tensor data comprise at least one of: the size of the whole storage area where the tensor data are located in at least one direction of N dimensional directions, the size of the storage area of the tensor data in at least one direction of the N dimensional directions, the offset of the storage area of the tensor data in at least one direction of the N dimensional directions, the positions of at least two vertexes at diagonal positions of the N dimensional directions relative to the data reference point, and the mapping relation between the data description position of the tensor data indicated by the descriptor and the data address.

In a possible implementation manner, an artificial intelligence chip is further disclosed, and the artificial intelligence chip comprises the processor operating device.

In a possible implementation manner, a board card is further disclosed, which comprises a storage device, an interface device, a control device and the artificial intelligence chip; wherein, the artificial intelligence chip is respectively connected with the storage device, the control device and the interface device; the storage device is used for storing data; the interface device is used for realizing data transmission between the artificial intelligence chip and external equipment; and the control device is used for monitoring the state of the artificial intelligence chip.

In one possible implementation, the storage device includes: a plurality of groups of storage units, wherein each group of storage units is connected with the artificial intelligence chip through a bus, and the storage units are DDR SDRAM; the chip includes: a DDR controller for controlling data transmission to and data storage in each storage unit; and the interface device is a standard PCIE interface.

Fig. 9 shows a block diagram of a board card according to an embodiment of the present disclosure. Referring to fig. 9, the board card may include other supporting components besides the chip 389, where the supporting components include, but are not limited to: a memory device 390, an interface device 391, and a control device 392;

The storage device 390 is connected to the artificial intelligence chip through a bus for storing data. The memory device may include a plurality of groups of memory cells 393. Each group of the storage units is connected with the artificial intelligence chip through a bus. It is understood that each group of the memory cells may be a DDR SDRAM (Double Data Rate SDRAM).

DDR can double the speed of SDRAM without increasing the clock frequency. DDR allows data to be read out on both the rising and falling edges of the clock pulse. DDR is twice as fast as standard SDRAM. In one embodiment, the storage device may include 4 groups of the storage units. Each group of storage units may include a plurality of DDR4 chips. In one embodiment, the artificial intelligence chip may include four 72-bit DDR4 controllers, where 64 bits of each 72-bit DDR4 controller are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 chips are adopted in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
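The 25600 MB/s figure follows from a short calculation: DDR4-3200 performs 3200 million transfers per second, and each transfer on the 64-bit data path carries 8 bytes.

```python
# Worked check of the 25600 MB/s theoretical bandwidth quoted above.
transfer_rate_MTps = 3200   # DDR4-3200: mega-transfers per second
data_width_bits = 64        # data bits of the 72-bit controller (8 bits are ECC)
bandwidth_MBps = transfer_rate_MTps * data_width_bits // 8  # bits -> bytes
```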

In one embodiment, each group of the storage units includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is arranged in the chip and is used for controlling the data transmission and data storage of each storage unit.

The interface device is electrically connected with the artificial intelligence chip. The interface device is used for realizing data transmission between the artificial intelligence chip and an external device (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIE interface, and the data to be processed is transmitted to the chip by the server through the standard PCIE interface, so as to implement data transfer. Preferably, when a PCIe 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may also be another interface; the disclosure does not limit the specific form of the other interface as long as the interface unit can implement the transfer function. In addition, the calculation result of the artificial intelligence chip is transmitted back to the external device (e.g., a server) by the interface device.

The control device is electrically connected with the artificial intelligence chip. The control device is used for monitoring the state of the artificial intelligence chip. Specifically, the artificial intelligence chip and the control device can be electrically connected through an SPI interface. The control device may include a single-chip microcomputer (MCU). Since the artificial intelligence chip may comprise a plurality of processing chips, a plurality of processing cores, or a plurality of processing circuits, it can drive a plurality of loads. Therefore, the artificial intelligence chip can be in different working states such as multi-load and light load. The control device can regulate the working states of the plurality of processing chips, the plurality of processing cores, and/or the plurality of processing circuits in the artificial intelligence chip.

In one possible implementation, an electronic device is disclosed that includes the artificial intelligence chip described above. The electronic device comprises a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.

The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.

A1, a method of processor operation, the method comprising:

when the first operation is an operation for a descriptor, acquiring a target storage area of tensor data indicated by the descriptor;

judging whether a second operation which is performed aiming at the target storage area exists or not;

when the second operation exists, judging whether a first fine-grained region in the target storage region currently aimed at by the first operation is overlapped with a second fine-grained region in the target storage region currently aimed at by the second operation;

Performing the first operation when the first fine-grained region and the second fine-grained region do not overlap.

A2, the method of claim A1, the method further comprising:

blocking the first operation when the first fine-grained region overlaps the second fine-grained region.

A3, the method of claim A1, at least one of the first operation and the second operation being a write operation.

A4, the method according to claim A1, wherein the size and/or number of the fine-grained regions is/are determined according to at least one of the region of the data with set length and the region of the data with set dimension.

A5, the method of claim A1, wherein the size and/or number of fine-grained regions is determined according to at least one of hardware computing power and hardware bandwidth.

A6, the method of claim A1, the first operation being an operation in a first processing instruction, the second operation being an operation in a second processing instruction, the second processing instruction being a processing instruction in an instruction queue preceding the first processing instruction.

A7, the method of any one of claims A1-A6, the method further comprising:

Recording first position information of a first fine-grained region currently operated by the first operation and second position information of a second fine-grained region currently operated by the second operation,

when the first fine-grained region and the second fine-grained region do not overlap, performing the first operation, including:

and when the first position information is inconsistent with the second position information, executing the first operation.

A8, the method of claim A7, the first location information including a first number of fine-grained regions on which a first operation has been completed, the second location information including a second number of fine-grained regions on which a second operation has been completed,

when the first location information is inconsistent with the second location information, performing the first operation, including:

when the first operation is an operation subsequent to the second operation and the first number is smaller than the second number, the first operation is executed.

A9, the method of claim A1, the target storage area including an operable area and a non-operable area, and determining whether there is an overlap between a first fine-grained region in the target storage area currently targeted by the first operation and a second fine-grained region in the target storage area currently targeted by the second operation when the second operation exists, including:

When the second operation exists and a first fine-grained region in the target storage region currently targeted by the first operation is located in the operable region, judging whether the first fine-grained region in the target storage region currently targeted by the first operation is overlapped with a second fine-grained region in the target storage region currently targeted by the second operation.

A10, the method of claim A9, the non-operable area being a plurality of fine-grained areas including the second fine-grained area, and the location of the second fine-grained area within the non-operable area being updated with the operating location of the second operation, the method further comprising:

and updating the position of the non-operable area after the second fine-grained area in the target storage area targeted by the second operation is moved out of the non-operable area.

A11, the method of claim A9, the non-operable area being a plurality of fine grained areas including the second fine grained area, and the second fine grained area being located at a set position within the non-operable area,

the position of the non-operable area is updated with the operation position of the second operation.

A12, the method of claim A1, the target storage area comprising: a circular buffer memory area.

A13, the method of claim A1, the fine grained region including a state identification, the state identification including an operation completed state or an operation incomplete state of operating the fine grained region,

when the first fine-grained region and the second fine-grained region do not overlap, performing the first operation, including:

when the first fine-grained region and the second fine-grained region are not overlapped, judging whether the state identifier of the first fine-grained region is in an operation finished state;

and if so, executing the first operation.

A14, the method of claim A1, wherein the fine-grained region includes a state identifier, the state identifier indicating an operation-completed state or an operation-incomplete state of the fine-grained region,

and executing the first operation when the first fine-grained region and the second fine-grained region do not overlap comprises:

judging whether the state identifier of the first fine-grained region is the operation-completed state;

and if so, executing the first operation when the first fine-grained region and the second fine-grained region do not overlap.

A15, the method of claim A1, wherein the second operation is an operation on the descriptor.

A16, the method of claim A1, wherein a storage region of an operand of the second operation overlaps the target storage region.

A17, the method of any one of claims A1-A16, further comprising:

dividing the whole storage area in which the target storage area is located into a plurality of fine-grained regions.

A18, the method of any one of claims A1-A17, further comprising:

dividing the target storage area into a plurality of fine-grained regions according to first fine-grained division information carried in the first operation; and

dividing a storage area of an operand of the second operation into a plurality of fine-grained regions according to second fine-grained division information carried in the second operation.
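The division of a storage area into fine-grained regions (claims A17-A18) might look like the sketch below; encoding the division information as a byte granularity is an assumption, not the claimed format:

```python
def divide_fine_grained(base, size, granularity):
    """Split the storage area [base, base + size) into fine-grained regions
    of `granularity` bytes each; the final region absorbs any remainder as
    a shorter region."""
    regions = []
    start = base
    while start < base + size:
        end = min(start + granularity, base + size)
        regions.append((start, end))
        start = end
    return regions

regions = divide_fine_grained(0x1000, 100, 32)
assert len(regions) == 4
assert regions[0] == (0x1000, 0x1020)   # first full 32-byte region
assert regions[-1] == (0x1060, 0x1064)  # 4-byte remainder region
```

The first and second operations may carry different granularities, in which case each operand's storage area is divided independently.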

A19, the method of claim A1, wherein the descriptor indicates the shape of N-dimensional tensor data, N being an integer greater than or equal to zero,

and the content of the descriptor comprises at least one shape parameter representing the shape of the tensor data.

A20, the method of claim A19, wherein the descriptor further indicates the address of the N-dimensional tensor data, and the content of the descriptor further comprises at least one address parameter representing the address of the tensor data.

A21, the method of claim A20, wherein the address parameter of the tensor data comprises a reference address of a data reference point of the descriptor in the whole storage area in which the tensor data is located;

and the shape parameter of the tensor data comprises at least one of:

the size of the whole storage area in which the tensor data is located in at least one of N dimensional directions, the size of the storage area of the tensor data in at least one of the N dimensional directions, the offset of the storage area of the tensor data in at least one of the N dimensional directions, the positions of at least two vertices at diagonal positions of the N dimensional directions relative to the data reference point, and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address.
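With a reference address and per-dimension shape parameters such as those of claim A21, a descriptor can map an N-dimensional data description position to a flat data address. The stride-based layout below is one plausible mapping sketched for illustration, not the claimed encoding itself; all names and values are hypothetical:

```python
def element_address(reference_address, strides, indices, element_size=4):
    """Map an N-dimensional description position to a data address:
    address = reference + (sum over dimensions of index * stride) * element_size.
    The strides are derived from the size of the whole storage area in each
    dimensional direction (e.g. the stored row width for a 2-D tensor)."""
    if len(strides) != len(indices):
        raise ValueError("descriptor dimension mismatch")
    offset = sum(i * s for i, s in zip(indices, strides))
    return reference_address + offset * element_size

# 2-D tensor embedded in a wider buffer: 16 elements per stored row, so a
# position (row, col) maps to reference + (row*16 + col) * element_size.
addr = element_address(0x2000, strides=(16, 1), indices=(2, 3))
assert addr == 0x2000 + (2 * 16 + 3) * 4
```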

A22, a processor operation apparatus, the apparatus comprising:

a region determining module, configured to acquire, when a first operation is an operation for a descriptor, a target storage area of tensor data indicated by the descriptor;

an operation judging module, configured to judge whether there is a second operation being performed for the target storage area;

an overlap determining module, configured to judge, when the second operation exists, whether a first fine-grained region in the target storage area currently targeted by the first operation overlaps a second fine-grained region in the target storage area currently targeted by the second operation; and

an execution module, configured to execute the first operation when the first fine-grained region and the second fine-grained region do not overlap.

A23, the apparatus of claim A22, further comprising:

an execution control module, configured to block the first operation when the first fine-grained region overlaps the second fine-grained region.

A24, the apparatus of claim A22, at least one of the first operation and the second operation being a write operation.

A25, the apparatus of claim A22, wherein the size and/or number of the fine-grained regions is determined according to at least one of a region in which data of a set length is located and a region in which data of a set dimension is located.

A26, the apparatus of claim A22, wherein the size and/or number of the fine-grained regions is determined according to at least one of hardware computing capability and hardware bandwidth.

A27, the apparatus of claim A22, wherein the first operation is an operation in a first processing instruction, the second operation is an operation in a second processing instruction, and the second processing instruction precedes the first processing instruction in an instruction queue.

A28, the apparatus of any one of claims A22-A27, further comprising:

a position recording module, configured to record first position information of the first fine-grained region currently operated on by the first operation and second position information of the second fine-grained region currently operated on by the second operation,

the execution module comprising:

a first execution submodule, configured to execute the first operation when the first position information is inconsistent with the second position information.

A29, the apparatus of claim A28, wherein the first position information comprises a first number of fine-grained regions on which the first operation has been completed, and the second position information comprises a second number of fine-grained regions on which the second operation has been completed,

the first execution submodule being further configured to:

execute the first operation when the first operation is an operation subsequent to the second operation and the first number is smaller than the second number.
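The count comparison of claim A29 lets a later operation run concurrently with an earlier one as long as it stays strictly behind it. A minimal sketch, where the counts are numbers of completed fine-grained regions (the function name is illustrative):

```python
def later_op_may_proceed(first_completed, second_completed):
    """The first (later) operation may execute only while it has completed
    strictly fewer fine-grained regions than the second (earlier) operation,
    so it never operates on a region the earlier operation has not yet
    finished."""
    return first_completed < second_completed

assert later_op_may_proceed(2, 5) is True    # safely behind the earlier op
assert later_op_may_proceed(5, 5) is False   # would catch up: block
assert later_op_may_proceed(6, 5) is False   # would overtake: block
```

This is the classic producer/consumer ordering: a read that follows a write may be pipelined region by region instead of waiting for the whole write to finish.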

A30, the apparatus of claim A22, wherein the target storage area comprises an operable region and a non-operable region, the overlap determining module comprising:

a determining submodule, configured to judge, when the second operation exists and the first fine-grained region in the target storage area currently targeted by the first operation is located in the operable region, whether the first fine-grained region overlaps the second fine-grained region in the target storage area currently targeted by the second operation.

A31, the apparatus of claim A30, wherein the non-operable region comprises a plurality of fine-grained regions including the second fine-grained region, and the position of the second fine-grained region within the non-operable region is updated with the operation position of the second operation, the apparatus further comprising:

an updating module, configured to update the position of the non-operable region after the second fine-grained region in the target storage area targeted by the second operation moves out of the non-operable region.

A32, the apparatus of claim A30, wherein the non-operable region comprises a plurality of fine-grained regions including the second fine-grained region, the second fine-grained region being located at a set position within the non-operable region,

and the position of the non-operable region is updated with the operation position of the second operation.

A33, the apparatus of claim A22, wherein the target storage area comprises a circular buffer storage area.

A34, the apparatus of claim A22, wherein the fine-grained region includes a state identifier, the state identifier indicating an operation-completed state or an operation-incomplete state of the fine-grained region,

the execution module comprising:

a first state judgment submodule, configured to judge, when the first fine-grained region and the second fine-grained region do not overlap, whether the state identifier of the first fine-grained region is the operation-completed state; and

a second execution submodule, configured to execute the first operation when the state identifier of the first fine-grained region is the operation-completed state.

A35, the apparatus of claim A22, wherein the fine-grained region includes a state identifier, the state identifier indicating an operation-completed state or an operation-incomplete state of the fine-grained region,

the execution module comprising:

a second state judgment submodule, configured to judge whether the state identifier of the first fine-grained region is the operation-completed state; and

a third execution submodule, configured to execute the first operation when the state identifier of the first fine-grained region is the operation-completed state and the first fine-grained region and the second fine-grained region do not overlap.

A36, the apparatus of claim A22, wherein the second operation is an operation on the descriptor.

A37, the apparatus of claim A22, wherein a storage area of an operand of the second operation overlaps the target storage area.

A38, the apparatus of any one of claims A22-A37, further comprising:

a first setting module, configured to divide the whole storage area in which the target storage area is located into a plurality of fine-grained regions.

A39, the apparatus of any one of claims A22-A37, further comprising:

a second setting module, configured to divide the target storage area into a plurality of fine-grained regions according to first fine-grained division information carried in the first operation; and

a third setting module, configured to divide a storage area of an operand of the second operation into a plurality of fine-grained regions according to second fine-grained division information carried in the second operation.

A40, the apparatus of claim A22, wherein the descriptor is used to indicate the shape of N-dimensional tensor data, N being an integer greater than or equal to zero,

and the content of the descriptor comprises at least one shape parameter representing the shape of the tensor data.

A41, the apparatus of claim A40, wherein the descriptor is further used to indicate the address of the N-dimensional tensor data, and the content of the descriptor further comprises at least one address parameter representing the address of the tensor data.

A42, the apparatus of claim A41, wherein the address parameter of the tensor data comprises a reference address of a data reference point of the descriptor in the whole storage area in which the tensor data is located;

and the shape parameter of the tensor data comprises at least one of:

the size of the whole storage area in which the tensor data is located in at least one of N dimensional directions, the size of the storage area of the tensor data in at least one of the N dimensional directions, the offset of the storage area of the tensor data in at least one of the N dimensional directions, the positions of at least two vertices at diagonal positions of the N dimensional directions relative to the data reference point, and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address.

A43, an artificial intelligence chip, the chip comprising the processor operation apparatus of any one of claims A22-A42.

A44, an electronic device, comprising the artificial intelligence chip of claim A43.

A45, a board card, comprising: a storage device, an interface device, a control device, and the artificial intelligence chip of claim A43;

wherein the artificial intelligence chip is connected to the storage device, the control device, and the interface device, respectively;

the storage device is configured to store data;

the interface device is configured to implement data transfer between the artificial intelligence chip and an external device;

and the control device is configured to monitor the state of the artificial intelligence chip.

A46, the board card of claim A45, wherein the storage device comprises a plurality of groups of storage units, each group of storage units being connected to the artificial intelligence chip through a bus, and each storage unit being a DDR SDRAM;

the chip comprises a DDR controller, configured to control data transfer to and data storage in each storage unit;

and the interface device is a standard PCIE interface.

Having described embodiments of the present disclosure, the foregoing description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or technical improvements over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
