Storage and calculation integrated chip and data processing method

Document No.: 1963673 · Publication date: 2021-12-14

Reading note: This technology, "Storage and calculation integrated chip and data processing method" (存算一体芯片及数据处理方法), was designed and created by Wu Huaqiang, Jiang Yanhai, Yao Peng, Gao Bin, Tang Jianshi, and Qian He on 2021-09-22. Abstract: A storage and calculation integrated chip and a data processing method applied to the storage and calculation integrated chip are provided. The storage and calculation integrated chip comprises a plurality of processing units and a master control unit. Each processing unit includes a computing unit and a control unit. The computing unit is configured to store a parameter matrix, receive input data, and compute the input data using the parameter matrix to obtain output data. The control unit is configured to communicate with the computing unit, write the parameter matrix into the computing unit to obtain the configured computing unit, and control the configured computing unit to compute the input data. The master control unit is configured to communicate with the plurality of processing units and, according to the overall parameter set, provide the parameter matrices respectively intended for the plurality of processing units to their respective control units. By means of an efficient distributed architecture, the storage and calculation integrated chip achieves high computing power and high energy efficiency while preserving generality and flexibility.

1. A storage and computation integrated chip with a distributed control architecture, comprising:

a plurality of processing units, each processing unit comprising:

a calculation unit configured to store a parameter matrix, receive input data, and compute the input data using the parameter matrix to obtain output data; and

a control unit configured to communicate with the calculation unit, write the parameter matrix into the calculation unit to obtain a configured calculation unit, and control the configured calculation unit to compute the input data using the parameter matrix; and

a master control unit configured to communicate with the plurality of processing units and to provide the parameter matrices respectively intended for the plurality of processing units to the respective control units of the plurality of processing units according to an overall network parameter set.

2. The storage and computation integrated chip of claim 1, further comprising:

a first storage unit configured to communicate with the plurality of processing units and with the main control unit, to store overall input data and/or overall output data, and, under the control of the main control unit, to respectively provide the input data required by the plurality of processing units to the plurality of processing units and/or respectively receive, from the plurality of processing units, the output data they generate,

wherein the overall input data comprises the input data for the plurality of calculation units and the overall output data comprises the output data of the plurality of calculation units.

3. The storage and computation integrated chip of claim 2, further comprising:

an on-chip interconnect configured to connect the master control unit, the first storage unit, and the plurality of processing units for communication.

4. The storage and computation integrated chip of any one of claims 1-3, wherein the calculation unit comprises:

at least one memristor array, wherein the at least one memristor array is configured to store the parameter matrix and to compute the input data.

5. The storage and computation integrated chip of claim 4, wherein the control unit comprises:

a parameter configuration unit configured to configure the parameter matrix into the calculation unit and/or acquire, from the calculation unit, the values of the parameter matrix currently held by the calculation unit.

6. The storage and computation integrated chip of claim 5, wherein the control unit further comprises:

a read control unit configured to control reading of the input data to supply the input data to the calculation unit; and

a write control unit configured to control output of the output data.

7. The storage and computation integrated chip of any one of claims 1-3, wherein each of the processing units further comprises a functional unit,

wherein the functional unit is configured to process the calculation result of the calculation unit to obtain the output data.

8. The storage and computation integrated chip of claim 7, wherein the functional unit comprises:

a shift addition unit configured to acquire a calculation result from the calculation unit and perform a shift operation and an addition operation on the calculation result; or

an activation unit configured to perform activation function processing on the calculation result; or

a pooling unit configured to perform pooling processing on the calculation result.

9. The storage and computation integrated chip of claim 2 or 3, wherein each of the processing units further comprises:

a second storage unit configured to acquire the input data from the first storage unit or transmit the output data to the first storage unit;

an input buffer unit configured to buffer the input data; and

an output buffer unit configured to buffer the output data.

10. The storage and computation integrated chip of claim 2 or 3, wherein each of the processing units further comprises:

an interconnect interface configured to communicate with the master control unit and the first storage unit.

11. A data processing method applied to the storage and computation integrated chip of any one of claims 1 to 10, the data processing method comprising:

receiving, by the master control unit, the overall network parameter set, and providing, according to the overall network parameter set, the parameter matrices respectively intended for the plurality of processing units to the respective control units of the plurality of processing units;

configuring, by the control unit of each processing unit, the received parameter matrix into the calculation unit of that processing unit;

providing the input data to the plurality of processing units; and

independently computing, by each of the plurality of processing units, the received input data using the received parameter matrix to obtain the output data.

Technical Field

Embodiments of the present disclosure relate to a storage and calculation integrated chip and a data processing method applied to the storage and calculation integrated chip.

Background

As the recognition accuracy of deep learning algorithms keeps improving, their data volume and computation volume grow exponentially. Memristor-based storage and computation integrated technology is expected to break through the von Neumann bottleneck of classical computing systems, bring explosive growth in hardware computing power and energy efficiency, and further promote the development and deployment of artificial intelligence; it is one of the most promising next-generation hardware chip technologies. By virtue of its architectural advantages of high performance and low power consumption, a memristor-based storage and computation integrated system is expected to become a hardware carrier for post-Moore-era intelligent computing that breaks through the von Neumann bottleneck.

Disclosure of Invention

At least one embodiment of the present disclosure provides a storage and computation integrated chip with a distributed control architecture, including: a plurality of processing units, each processing unit comprising: a calculation unit configured to store a parameter matrix, receive input data, and compute the input data using the parameter matrix to obtain output data; and a control unit configured to communicate with the calculation unit, write the parameter matrix into the calculation unit to obtain a configured calculation unit, and control the configured calculation unit to compute the input data using the parameter matrix; and a master control unit configured to communicate with the plurality of processing units and to provide the parameter matrices respectively intended for the plurality of processing units to the respective control units of the plurality of processing units according to an overall network parameter set.

For example, the storage and computation integrated chip provided by at least one embodiment of the present disclosure further includes: a first storage unit configured to communicate with the plurality of processing units and with the main control unit, to store overall input data and/or overall output data, and, under the control of the main control unit, to respectively provide the input data required by the plurality of processing units and/or respectively receive the output data generated by the plurality of processing units, wherein the overall input data includes the input data for the plurality of calculation units and the overall output data includes the output data of the plurality of calculation units.

For example, the storage and computation integrated chip provided by at least one embodiment of the present disclosure further includes: an on-chip interconnect configured to connect the master control unit, the first storage unit, and the plurality of processing units for communication.

For example, in a storage and computation integrated chip provided in at least one embodiment of the present disclosure, the calculation unit includes: at least one memristor array, wherein the at least one memristor array is configured to store the parameter matrix and to compute the input data.

For example, in a storage and computation integrated chip provided in at least one embodiment of the present disclosure, the control unit includes: a parameter configuration unit configured to configure the parameter matrix into the calculation unit and/or acquire, from the calculation unit, the values of the parameter matrix currently held by the calculation unit.

For example, in the storage and computation integrated chip provided in at least one embodiment of the present disclosure, the control unit further includes: a read control unit configured to control reading of the input data to supply the input data into the calculation unit; a write control unit configured to control output of the output data.

For example, in the storage and computation integrated chip provided in at least one embodiment of the present disclosure, each processing unit further includes: a functional unit, wherein the functional unit is configured to process the calculation result of the calculation unit to obtain the output data.

For example, in the storage and computation integrated chip provided in at least one embodiment of the present disclosure, the functional unit includes: a shift addition unit configured to acquire the calculation result from the calculation unit and perform shift and addition operations on it; or an activation unit configured to perform activation function processing on the calculation result; or a pooling unit configured to pool the calculation result.

For example, in the storage and computation integrated chip provided in at least one embodiment of the present disclosure, each processing unit further includes: a second storage unit configured to acquire input data from the first storage unit or transmit output data to the first storage unit; an input buffer unit configured to buffer input data; an output buffer unit configured to buffer the output data.

For example, in the storage and computation integrated chip provided in at least one embodiment of the present disclosure, each processing unit further includes: an interconnect interface configured to communicate with the master control unit and the first storage unit.

At least one embodiment of the present disclosure provides a data processing method applied to the storage and computation integrated chip provided in at least one embodiment of the present disclosure, where the data processing method includes: receiving, by the master control unit, an overall network parameter set, and providing the parameter matrices respectively intended for the plurality of processing units to the respective control units of the plurality of processing units according to the overall network parameter set; configuring, by the control unit of each processing unit, the received parameter matrix into the calculation unit of that processing unit; providing input data to the plurality of processing units; and independently computing, by each of the plurality of processing units, the received input data using the received parameter matrix to obtain output data.

Drawings

To more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly introduced below; it is apparent that the drawings described below relate only to some embodiments of the present disclosure and do not limit the present disclosure.

FIG. 1A shows a schematic diagram of a matrix-vector multiplication;

FIG. 1B is a schematic block diagram of a memristor array provided in at least one embodiment of the present disclosure;

FIG. 2A shows a schematic diagram of a control flow for an architecture with a master unit as a central scheduling unit;

fig. 2B illustrates a schematic diagram of a distributed control architecture based on a bus mechanism provided by at least one embodiment of the present disclosure;

fig. 2C illustrates a schematic diagram of a control flow of a distributed control architecture provided by at least one embodiment of the present disclosure;

FIG. 2D illustrates a schematic diagram of the control flows of a conventional storage and computation integrated architecture and of the distributed control architecture provided by the present disclosure, under an on-chip routing mechanism;

fig. 3A is a schematic diagram of a storage-computation-integrated chip of a distributed control architecture according to at least one embodiment of the present disclosure;

FIG. 3B shows a schematic diagram of a data flow of a functional unit, exemplified by a memristor array outputting 8 bits;

fig. 4 is a flowchart illustrating a data processing method according to at least one embodiment of the present disclosure;

fig. 5 illustrates a work flow diagram of a storage and computation integrated chip provided by at least one embodiment of the present disclosure;

FIG. 6 illustrates a flow chart of an exemplary read control process;

fig. 7 shows a schematic diagram of reading data from the second storage unit to the input buffer unit.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described below clearly and completely with reference to the accompanying drawings of the embodiments of the present disclosure. It is to be understood that the described embodiments are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the disclosure without any inventive step, are within the scope of protection of the disclosure.

Unless otherwise defined, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in this disclosure is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises" and the like means that the element or item preceding the word covers the elements or items listed after the word and their equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Upper", "lower", "left", "right", etc. are used only to indicate relative positional relationships, which may change accordingly when the absolute position of the described object changes.

To keep the following description of the embodiments of the present disclosure clear and concise, a detailed description of some known functions and components has been omitted from the present disclosure.

Memristors (e.g., resistive random access memories, phase change memories, conductive bridge memories, etc.) are non-volatile devices whose conduction state can be adjusted by applying an external stimulus. The memristor is a two-terminal device with adjustable resistance and non-volatility, and is widely used in storage and computation integrated technology. According to Kirchhoff's current law and Ohm's law, an array formed of memristors can complete multiply-accumulate calculations in parallel, with both storage and computation occurring in each memristor of the array. Based on this computing architecture, storage and computation integrated computing without massive data movement can be realized.

At present, a crossbar array formed of memristors can complete matrix-vector multiplication operations in parallel and is widely applied to accelerating neural network algorithms, so a storage and computation integrated chip can adopt memristor arrays to complete computation. Existing storage and computation integrated chips are usually designed with customized architectures oriented to specific neural networks; they can process their target tasks efficiently, but, lacking an efficient data-flow control unit, they cannot effectively process multiple algorithm models simultaneously.

Fig. 1A shows a schematic diagram of a matrix-vector multiplication. As shown in fig. 1A, an input vector X (e.g., including x_1, x_2, ..., x_m as shown in fig. 1A) is multiplied by a parameter matrix G (e.g., including g_11, g_12, ..., g_1m, ..., g_n1, g_n2, ..., g_nm as shown in fig. 1A) to obtain an output vector I (e.g., including I_1, I_2, ..., I_n as shown in fig. 1A). For example, I_1 = x_1·g_11 + x_2·g_12 + ... + x_m·g_1m.

Fig. 1B is a schematic structural diagram of a memristor array provided in at least one embodiment of the present disclosure.

As shown in fig. 1B, the memristor array may include a plurality of memristors arranged in an array. The input vector X shown in fig. 1A (for example, encoded as voltages in the amplitude, width, or number of pulses) is applied to the input ends of the memristor array, and the parameter matrix G shown in fig. 1A is encoded into the memristor conductance values. The output current I, i.e., the output vector I shown in fig. 1A, is obtained by a highly parallel, low-power array read operation. In this way, the multiply-accumulate calculation common in deep learning can be implemented, and matrix-vector multiplication is thereby accelerated.

For example, according to Kirchhoff's law, the output current of the memristor array structure may be derived from the formula I = G × X; for example, I_1 = x_1·g_11 + x_2·g_12 + ... + x_m·g_1m. This multiply-accumulate calculation is realized by a physical law; unlike the Boolean-logic implementation of digital circuits, it requires no frequent access and movement of weight data, overcomes the von Neumann bottleneck of classical computing systems, and can realize intelligent computing tasks with high computing power and high energy efficiency.
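This ideal behavior is easy to check numerically. The following minimal sketch (numpy) models only the mathematics I = G × X of an n × m array, not the analog circuit, noise, or quantization; the conductance and voltage values are arbitrary example numbers.

```python
import numpy as np

def memristor_mvm(G, X):
    """Ideal multiply-accumulate of a memristor array: I = G @ X.

    G: (n, m) conductance matrix; X: (m,) input voltage vector.
    Each output current I_k = sum_j g_kj * x_j (Kirchhoff + Ohm).
    """
    return np.asarray(G, dtype=float) @ np.asarray(X, dtype=float)

G = np.array([[1.0, 2.0, 0.5],
              [0.2, 1.5, 1.0]])  # conductances g_kj (example values)
X = np.array([0.3, 0.1, 0.4])    # encoded input voltages x_j
print(memristor_mvm(G, X))       # [0.7  0.61]; I_1 = 0.3*1.0 + 0.1*2.0 + 0.4*0.5
```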

The memristors arranged in the array in the memristor array shown in fig. 1B may be in a 1T1R structure or a 2T2R structure, where the memristor unit in the 1T1R structure includes one switching transistor and one memristor, and the memristor unit in the 2T2R structure includes two switching transistors and two memristors, and the embodiments of the present disclosure do not limit the types of the memristors used.

It should be noted that the transistors used in the embodiments of the present disclosure may be thin film transistors or field effect transistors (e.g., MOS field effect transistors) or other switching devices with the same characteristics. The source and drain of the transistor used herein may be symmetrical in structure, so that there may be no difference in structure between the source and drain. Embodiments of the present disclosure do not limit the type of transistors employed.

To achieve generality, an architecture based on a master control unit may be adopted, in which the master control unit acts as a central scheduling unit to complete the control, configuration, and scheduling of a plurality of computing units, while the computing units themselves have no scheduling function. However, in such a master-unit-based architecture, frequent scheduling and data interaction between the master unit and the computing units consume a significant amount of time, resulting in low computational efficiency.

Fig. 2A shows a schematic diagram of a control flow of an architecture with a master unit as a central scheduling unit.

As shown in fig. 2A, in the architecture with the master control unit 102 as the central scheduling unit, data scheduling and task allocation are performed by the master control unit 102, and calculation is performed by the computing units 103. The dashed arrows in fig. 2A represent the control flow of the central control architecture, which flows from the master unit 102 through the bus 107 to the computing unit 103. For data-centric neural network algorithms, frequent scheduling and data interaction between the master control unit 102 and the computing unit 103 consume a great deal of time, resulting in low computational efficiency.

At least one embodiment of the present disclosure provides a storage and computation integrated chip with a distributed control architecture. The storage and computation integrated chip includes a plurality of processing units and a master control unit, each processing unit including: a calculation unit configured to store a parameter matrix, receive input data, and compute the input data using the parameter matrix to obtain output data; and a control unit configured to communicate with the calculation unit, write the parameter matrix into the calculation unit to obtain a configured calculation unit, and control the configured calculation unit to compute the input data using the parameter matrix. The master control unit is configured to communicate with the plurality of processing units and to provide the parameter matrices respectively intended for the plurality of processing units to the respective control units of the plurality of processing units according to an overall network parameter set.

At least one embodiment of the present disclosure further provides a data processing method applied to the storage and computation integrated chip.

The storage and computation integrated chip with the distributed control architecture provided by the embodiments of the present disclosure utilizes an efficient distributed architecture to achieve high computing power and high energy efficiency while ensuring generality and flexibility.

Compared with a storage and computation integrated chip adopting a central control architecture, the storage and computation integrated chip provided by the embodiments of the present disclosure adopts a distributed control architecture: a local control unit is integrated with the conventional computing unit 103 shown in fig. 2A to form a processing unit. In the distributed control architecture provided by the present disclosure, data scheduling and task allocation are completed by the local control units, and computing efficiency is greatly improved through parallel, distributed scheduling and configuration.

Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings, but the present disclosure is not limited to these specific embodiments.

Fig. 2B illustrates a schematic diagram of a distributed control architecture based on a bus mechanism provided by at least one embodiment of the present disclosure.

As shown in fig. 2B, the main control unit 102, the first storage unit 106, and the plurality of processing units 101 are connected via a bus 107, and a control unit 104 (see fig. 2C) is integrated inside each processing unit 101. Local control, flexible parameter-matrix configuration, and data-flow scheduling can thus be realized, and work tasks are distributed to each processing unit 101 for execution, overcoming the inefficiency of the central control architecture shown in fig. 2A while providing both flexibility and generality.

Fig. 2C is a schematic diagram of the control flow of the distributed control architecture shown in fig. 2B.

As shown in fig. 2C, each processing unit 101 includes a computing unit 103 and a control unit 104. In contrast to fig. 2A, data scheduling and task allocation for the computing unit 103 are performed not directly by the main control unit 102 but by the control unit 104 in the same processing unit. The dashed arrows represent the control flow; in the example of fig. 2C, for the distributed control architecture the control flow stays inside the processing unit 101, which improves computational efficiency and energy efficiency.

In fact, besides communicating via a bus, the plurality of processing units 101, the master control unit 102, and the first storage unit 106 may communicate with each other based on other on-chip interconnection mechanisms, such as a switch network or a network-on-chip (on-chip routing); the embodiments of the present disclosure do not limit this.

Fig. 2D shows a schematic diagram of the control flows of a conventional storage and computation integrated architecture and of the distributed control architecture provided by the present disclosure, under an on-chip routing mechanism. In the on-chip routing system shown in fig. 2D, the plurality of processing units 101, the master control unit 102, and the first storage unit 106 are each connected via a network interface to a specific router at a network node, converting communication between these units into communication between routers. Compared with other on-chip communication schemes, the on-chip routing mechanism is no longer limited by a bus architecture, so any number of routing nodes can be added. In addition, when the functions of the on-chip routing system need to be extended, the newly designed functional modules only need to be attached through network interfaces, without redesigning the overall architecture of the network. For a conventional storage and computation integrated architecture, the control flow is shown by the dashed arrow labeled 1: it passes from the routing node connected to the master control unit 102, through the plurality of processing units 101, to the processing unit 101 that the master control unit 102 needs to control. For the distributed architecture provided by the present disclosure, the control flow is shown by the dashed arrow labeled 2: it stays inside the processing unit 101, thereby avoiding frequent scheduling and data interaction between the main control unit 102 and the processing unit 101 and improving computational efficiency.

Because of the high efficiency of the distributed control architecture, the present disclosure proposes a storage and computation integrated chip employing a distributed control architecture, for example, for artificial intelligence operations.

Fig. 3A is a schematic diagram of a storage and computation integrated chip 100 with a distributed control architecture according to at least one embodiment of the present disclosure.

As shown in fig. 3A, the storage and computation integrated chip 100 is implemented by circuits (digital and analog circuits) and includes a plurality of processing units 101 (only one is shown in the figure) and a main control unit 102; each processing unit 101 includes a computing unit 103 and a control unit 104. The computing unit 103 is configured to store a parameter matrix, receive input data, and compute the input data using the parameter matrix to obtain output data. The control unit 104 is configured to communicate with the computing unit 103, write the parameter matrix into the computing unit 103 to obtain the configured computing unit 103, and control the configured computing unit 103 to compute the input data using the parameter matrix. The main control unit 102 is configured to communicate with the plurality of processing units 101 and to provide the parameter matrices respectively intended for the plurality of processing units 101 to the respective control units 104 according to an overall network parameter set.
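Before turning to the individual sub-units, these control relationships can be summarized in a short behavioral model. The following is a minimal Python sketch; the class and method names are inventions of this sketch, not interfaces of the chip. The master unit only splits and distributes parameters, while each processing unit's local control unit configures its own compute unit and drives the computation.

```python
import numpy as np

class ComputeUnit:
    def __init__(self):
        self.G = None                      # parameter matrix held "in place"
    def compute(self, x):
        return self.G @ x                  # in-memory matrix-vector product

class ControlUnit:
    def __init__(self, compute_unit):
        self.cu = compute_unit
    def configure(self, G):
        self.cu.G = np.asarray(G, float)   # write the parameter matrix
    def run(self, x):
        return self.cu.compute(np.asarray(x, float))

class ProcessingUnit:
    def __init__(self):
        self.compute_unit = ComputeUnit()
        self.control_unit = ControlUnit(self.compute_unit)

class MasterUnit:
    def __init__(self, processing_units):
        self.pus = processing_units
    def distribute(self, overall_params):  # one sub-matrix per processing unit
        for pu, G in zip(self.pus, overall_params):
            pu.control_unit.configure(G)

pus = [ProcessingUnit() for _ in range(2)]
MasterUnit(pus).distribute([np.eye(3), 2 * np.eye(3)])       # split parameter set
outs = [pu.control_unit.run([1.0, 2.0, 3.0]) for pu in pus]  # independent compute
print(outs)   # [array([1., 2., 3.]), array([2., 4., 6.])]
```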

According to at least one embodiment of the present disclosure, the computation unit 103 includes at least one memristor array 105, the at least one memristor array 105 being configured to store a parameter matrix and to compute input data.

For example, the memristor array 105 is structured as shown in fig. 1B, and a parameter matrix, such as the parameter matrix G shown in fig. 1A, is stored in the form of conductances in the at least one memristor array 105 of the computing unit 103. The computing unit 103 receives input data (e.g., input data corresponding to the input vector X shown in fig. 1A) and computes the input data using the parameter matrix (e.g., the matrix-vector multiplication shown in fig. 1A) to obtain output data (e.g., corresponding to the output vector I shown in fig. 1A).

In this embodiment, when the computing unit 103 includes a plurality of memristor arrays 105, the numbers of rows and columns of each memristor array 105 may be the same; however, the numbers of rows and columns of each memristor array 105 and the number of memristor arrays 105 may be freely designed according to the actual situation. Since computation inside a memristor array is realized with analog voltages and currents while data are transmitted within the chip as digital signals, one way to realize the computation is to introduce peripheral circuits such as a digital-to-analog converter (DAC) at the input end and an analog-to-digital converter (ADC) at the output end of each memristor array. Meanwhile, to save power consumption and area overhead, the DAC and ADC can be designed in a time-division-multiplexed manner; the detailed requirements and design of the DAC and ADC are not described in this disclosure.

According to an embodiment of the present disclosure, the control unit 104 includes a parameter configuration unit 1041, which is configured to configure the parameter matrix into the computing unit 103 and/or, as needed, acquire from the computing unit 103 the values of the parameter matrix it currently holds, so that the stored conductance values can be checked.

For example, the control unit 104 maps the part of the parameter matrix corresponding to the memristor array 105 onto the plurality of memristors of the memristor array 105, based on the mapping relationship between the parameters of the parameter matrix and the memristor array 105 in the computing unit 103, thereby configuring the parameter matrix into the computing unit 103.

For example, the memristors included in the memristor array 105 face reliability problems and exhibit unavoidable fluctuation, noise, state drift, and the like, which cause calculation errors and affect the normal function of the system. When the conductance values of the memristors are used for calculation, non-ideal characteristics such as random fluctuation, relaxation, and retention cause the actual conductance values to deviate from the ideal ones, so that the calculation results deviate. Therefore, when required, the parameter configuration unit 1041 acquires from the computing unit 103 the values of the parameter matrix it currently holds and compares them with the values that should actually be configured, so as to correct the values of the parameter matrix in the computing unit 103.
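A program-and-verify loop of this kind can be sketched as follows. This is a Python sketch only: the conductance window, the Gaussian noise model, and the tolerance are all assumptions of the sketch, not device data.

```python
import numpy as np

rng = np.random.default_rng(0)
G_MIN, G_MAX, TOL = 1e-6, 1e-4, 4e-6    # assumed conductance window (S) and tolerance

def map_weights(W):
    """Linearly map a weight matrix into the assumed conductance window."""
    W = np.asarray(W, float)
    return G_MIN + (W - W.min()) / (W.max() - W.min()) * (G_MAX - G_MIN)

def program_with_noise(target):
    """Model a non-ideal write: each programmed value deviates randomly."""
    return target + rng.normal(0.0, 2e-6, np.shape(target))

def program_and_verify(target, max_iters=50):
    """Write, read back, and re-program cells whose conductance drifted too far."""
    G = program_with_noise(target)
    for _ in range(max_iters):
        bad = np.abs(G - target) > TOL        # read back and compare
        if not bad.any():
            break
        G[bad] = program_with_noise(target[bad])  # re-program deviating cells only
    return G

G_target = map_weights(np.array([[-1.0, 0.5], [0.25, 1.0]]))
G = program_and_verify(G_target)
print(np.abs(G - G_target).max() <= TOL)  # expected True once verification converges
```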

According to an embodiment of the present disclosure, the control unit 104 further includes a read control unit 1042 and a write control unit 1043; the read control unit 1042 is configured to control reading of the input data to supply it to the computing unit 103, and the write control unit 1043 is configured to control output of the output data. For example, as shown in fig. 6, the process of reading input data by the read control unit 1042 mainly includes four parts: read-address update, data reading, input-buffer-unit filling, and input bit control, which are described in detail with reference to fig. 6.

For example, both reading input data and outputting output data require a storage unit to hold the data. According to an embodiment of the present disclosure, the storage and computation integrated chip 100 further includes a first storage unit 106 configured to communicate with the plurality of processing units 101 and with the main control unit 102, to store overall input data and/or overall output data, and, under the control of the main control unit 102, to respectively provide the input data required by the plurality of processing units 101 and/or respectively receive the output data they generate, wherein the overall input data includes the input data for the plurality of computing units 103 and the overall output data includes the output data of the plurality of computing units 103. The first storage unit 106 may be a semiconductor storage device, for example, a nonvolatile memory.

For example, the first storage unit 106 may be connected to and communicate with the plurality of processing units 101 and the master control unit 102 through a bus. According to an embodiment of the present disclosure, the storage and computation integrated chip 100 further includes an on-chip interconnect, such as a bus 107. A bus is a common communication trunk that transmits information among the functional components of a computer; it is a bundle of transmission lines composed of wires and can be used to transmit data, addresses, and control signals. The embodiments of the present disclosure do not limit the type or implementation of the bus.

According to an embodiment of the present disclosure, each processing unit 101 further comprises an interconnect interface configured to communicate with the master control unit 102 and the first storage unit 106. For example, the connection circuit between a processing unit 101 and the bus 107 is called a bus interface, through which the processing units 101 exchange data with the main control unit 102 and the first storage unit 106; for other on-chip interconnect types, the interconnect interface is the corresponding type of interface for data and instruction interaction.

For example, the overall input data refers to the set of input data required by the plurality of processing units 101, and the overall output data refers to the set of output data generated by the plurality of processing units 101. The input data required by each processing unit 101 and the output data it generates need to be stored in storage units inside the processing units 101.

According to an embodiment of the present disclosure, each processing unit 101 further includes a second storage unit 109, an input buffer unit 110, and an output buffer unit 111, the second storage unit 109 is configured to acquire input data from the first storage unit 106 or transmit output data to the first storage unit 106, the input buffer unit 110 is configured to buffer the input data, and the output buffer unit 111 is configured to buffer the output data. The sizes of the second storage unit 109, the input buffer unit 110, and the output buffer unit 111 can be set according to actual situations.

For example, under the control of the main control unit 102, the first storage unit 106 provides input data to the second storage unit 109 via the bus 107. The second storage unit 109 may be any of various types of storage devices, such as a dual-port SRAM (Static Random-Access Memory), which includes a read port and a write port: the write port is connected to the bus 107 through the bus interface, and the read port is connected to the input buffer unit 110 of the memristor array 105. To improve data-reading efficiency, the read port of the SRAM may be given a large bit width (e.g., 512 bits). The input buffer unit 110 temporarily stores data sent from the second storage unit 109, and the output buffer unit 111 temporarily stores output data generated by the memristor array 105, thereby coordinating, buffering, and synchronizing data transfer.

The whole network parameter set is processed by the main control unit and distributed to each processing unit; the overall network parameter set is, for example, all or part of a weight matrix corresponding to a processing layer of a certain neural network, and the weight matrix is split to obtain a parameter matrix for each processing unit.

For example, the overall network parameter set includes the number of network layers, the convolution kernel size, the network type, and the like, and the main control unit 102 is configured to provide the parameter matrices respectively intended for the plurality of processing units 101 to the control units 104 of the plurality of processing units 101 according to the compiled overall network parameter set. Table 1 below lists the network parameters that an exemplary specific neural network architecture needs to configure, together with their meanings.

TABLE 1

Parameter notation | Parameter meaning
NUM_LAYERS | number of convolutional layers of the network
NUM_KERNELS | number of convolution kernels of a convolutional layer
SIZE_IMAGE | size of the input picture (feature map) of a convolutional layer
K | size of the convolution kernel
STRIDE | step (stride) of the convolution
V | number of channels of the input picture
L | length of one stored picture row, L = V × SIZE_IMAGE
NUM_ARRAYS | number of memristor arrays required by the calculation of a convolutional layer
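By way of illustration, an overall network parameter set in the spirit of Table 1 might be organized as follows. This is a hypothetical Python sketch: the field names follow the parameters used in this description, while the concrete values and the nesting are invented (the halving of SIZE_IMAGE between layers assumes a 2× pooling after each layer).

```python
# Hypothetical overall network parameter set; values invented for illustration.
overall_network_params = {
    "NUM_LAYERS": 3,  # number of convolutional layers
    "layers": [
        {"SIZE_IMAGE": 32, "V": 3,  "K": 3, "STRIDE": 1, "NUM_KERNELS": 16, "NUM_ARRAYS": 1},
        {"SIZE_IMAGE": 16, "V": 16, "K": 3, "STRIDE": 1, "NUM_KERNELS": 32, "NUM_ARRAYS": 2},
        {"SIZE_IMAGE": 8,  "V": 32, "K": 3, "STRIDE": 1, "NUM_KERNELS": 64, "NUM_ARRAYS": 4},
    ],
}
for layer in overall_network_params["layers"]:
    layer["L"] = layer["V"] * layer["SIZE_IMAGE"]  # L = V x SIZE_IMAGE, as defined later
print(overall_network_params["layers"][0]["L"])    # 96
```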

Here, the concept of some network parameters in table 1 is explained.

Convolutional neural networks are a class of feed-forward neural networks that contain convolutional computations and have a deep structure. The structure of a convolutional neural network mainly comprises an input layer, hidden layers, and an output layer, where the hidden layers include three common structures: convolutional layers, pooling layers, and fully-connected layers. The function of a convolutional layer is to perform feature extraction on input data; a convolutional layer internally contains a plurality of convolution kernels. In the terminology of convolutional neural networks, a matrix such as a 3 × 3 matrix is called a "filter", and the matrix obtained by sliding the filter over an image and computing dot products is called a "feature map". Different filter values generate different feature maps for the same input image. In this embodiment, the number of groups of memristor arrays equals the number of convolutional layers, each group of memristor arrays is mapped with the parameter matrix corresponding to one convolutional layer, and the range of the index i of the memristor-array group used by each convolutional layer is 0 < i < NUM_LAYERS (NUM_LAYERS represents the number of convolutional layers).

The number of channels of a feature map represents a certain component of the image. For example, an image taken by a standard digital camera has three channels (red, green, and blue), each representing one color, with pixel values in the range 0 to 255 for each channel; the three channels of the image can be viewed as three stacked two-dimensional matrices, i.e., a matrix of depth 3.

Convolutional layer parameters include the convolution kernel size, the convolution stride, and the padding, which together determine the size of the convolutional layer's output feature map. The number of convolution kernels equals the number of channels of the output feature map. The convolution kernel size can be any value smaller than the input image size; for example, a size of 3 means that both the height and the width of the kernel are 3, and the larger the kernel, the more complex the input features that can be extracted. The convolution stride defines the distance between successive positions of the kernel as it sweeps the feature map: with a stride of 1, the kernel sweeps the feature-map elements one by one, and with a stride of n, it skips n-1 pixels at each step. After feature extraction in a convolutional layer, the output feature map is passed to a pooling layer for feature selection and information filtering. Pooling is a down-sampling process; common pooling methods include max pooling and average pooling. The fully-connected layers are located in the last part of the hidden layers of a convolutional neural network and combine the extracted features nonlinearly to produce the output.
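For reference, the standard relation between these parameters and the output size is O = ⌊(I − K + 2P) / S⌋ + 1, where I is the input size, K the kernel size, P the padding, and S the stride. This well-known formula is quoted here for illustration; it is not spelled out in the original text. A quick numerical check:

```python
def conv_output_size(i, k, stride, pad):
    """O = floor((I - K + 2P) / S) + 1, the standard convolution output size."""
    return (i - k + 2 * pad) // stride + 1

print(conv_output_size(32, 3, 1, 1))  # 32: 3x3 kernel, stride 1, padding 1 preserves size
print(conv_output_size(32, 3, 2, 1))  # 16: stride 2 halves the feature map
```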

According to an embodiment of the present disclosure, each processing unit 101 further comprises a functional unit 108, the functional unit 108 being configured to process the calculation result of the calculation unit 103 to obtain the output data.

For example, the functional unit 108 includes a shift addition unit 1081, an activation unit 1082, and/or a pooling unit 1083: the shift addition unit 1081 is configured to acquire the calculation result from the computing unit 103 and perform shift and addition operations on it, the activation unit 1082 is configured to perform activation function processing on the calculation result, and the pooling unit 1083 is configured to perform pooling processing on the calculation result.

For example, the input data buffered in the input buffer unit 110 are provided to the corresponding memristor array 105 for calculation. The input data may be provided to the memristor array 105 in two passes, the first pass providing the high bits of the input data and the second pass the low bits; this design ensures the accuracy of the calculation. For example, if the input data are 8 bits, the high 4 bits are first used as the input to the memristor array 105, and the low 4 bits are then used as the input. The two calculation results output by the memristor array correspond to the calculation result for the high-bit input and the calculation result for the low-bit input, respectively; the shift addition unit 1081 must then shift and add the two results to combine them into the complete output data.

Fig. 3B shows a schematic diagram of a data flow of a functional unit exemplified by a memristor array outputting 8 bits.

As shown in fig. 3B, the two results correspond to the 8-bit signed calculation results for the high-bit and low-bit input data, respectively. The result for the high-bit input is shifted left by 4 bits and added to the result for the low-bit input to obtain a 13-bit signed result, which represents the complete matrix-multiplication result of the array. In this embodiment, a convolutional layer corresponds to one group of memristor arrays, which may include several arrays; the matrix-multiplication results of these arrays are accumulated to obtain the final calculation result of the convolutional layer. The result is then quantized to a 9-bit signed result by operations such as right shifting and clipping (limiting the result to a certain range). Activation function processing then turns the result into an 8-bit unsigned result. Finally, pooling is performed, and the pooling result is sent to the first storage unit or the second storage unit through the write control unit.
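The arithmetic of this pipeline can be checked numerically. In the Python sketch below, the bit widths (8-bit array outputs, 13-bit combined result, 9-bit quantized result, 8-bit unsigned activation) follow the description above, while the right-shift amount is illustrative; the clip bounds are simply the 9-bit signed and 8-bit unsigned ranges.

```python
import numpy as np

def shift_add(high_result, low_result):
    """Combine the two array outputs: the high-nibble result weighs 2^4."""
    return (high_result << 4) + low_result      # fits in 13 signed bits

def quantize_9bit(x, shift=4):
    """Right-shift (illustrative amount) and clip to the 9-bit signed range."""
    return np.clip(x >> shift, -256, 255)

def relu_8bit(x):
    """Activation: clamp to the 8-bit unsigned range."""
    return np.clip(x, 0, 255).astype(np.uint8)

high = np.array([90, -35], dtype=np.int32)   # array output for the high 4 input bits
low  = np.array([12,   7], dtype=np.int32)   # array output for the low 4 input bits
combined = shift_add(high, low)              # [1452, -553], 13-bit signed
print(relu_8bit(quantize_9bit(combined)))    # [90  0] after shift, clip, and ReLU
```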

Activation functions and pooling are concepts from neural networks. If no activation function is introduced, the output of each layer of a neural network is a linear function of the input of the previous layer, so no matter how many layers the network has, its output is a linear combination of its inputs, as is easily verified. A nonlinear function is therefore introduced as the activation function, so that the output of the neural network is no longer a linear combination of the inputs and an arbitrary function can be approximated. In this embodiment, the activation unit 1082 applies activation function processing to the calculation result to increase its nonlinearity. Pooling is a down-sampling process; common pooling methods include max pooling and average pooling. The pooling unit 1083 pools the calculation results to reduce the number of parameters and the amount of computation in the neural network in a controllable way, which helps avoid overfitting.

It should be noted that the functional unit 108 in fig. 3A is only an example; the embodiments of the present disclosure are not limited to a specific implementation, and the shift addition unit 1081, the activation unit 1082, and the pooling unit 1083 may be added or removed according to actual needs. The functional unit 108 may include units other than these three, only some of them, or none of them.

Fig. 4 is a flowchart illustrating a data processing method provided by at least one embodiment of the present disclosure; the data processing method is applied to the storage and computation integrated chip illustrated in fig. 3A.

As shown in fig. 4, the data processing method includes steps S411 to S414.

Step S411: the master control unit receives the overall network parameter set and provides the parameter matrices respectively intended for the plurality of processing units to the respective control units of the processing units according to the overall network parameter set.

For example, for the contents of the overall network parameter set, refer to the parameter notation and parameter meanings shown in Table 1.

Step S412: the control unit of each of the plurality of processing units configures the received parameter matrix into the calculation unit of that processing unit.

For example, the calculation unit comprises at least one memristor array, and the control unit maps the part of the parameter matrix corresponding to the memristor array to the plurality of memristors of the memristor array based on the mapping relation between the parameter of the parameter matrix and the memristor array in the calculation unit.

Step S413: input data is provided to a plurality of processing units.

For example, the first storage unit is used for storing overall input data, and input data required by each of the plurality of processing units is provided from the first storage unit to the plurality of processing units via the bus under the control of the main control unit.

Step S414: each of the plurality of processing units independently computes the received input data using the received parameter matrix to obtain the output data.

For example, in each processing unit, the structure of the at least one memristor array included in the computing unit is as shown in fig. 1B; the received parameter matrix is stored in the memristor array and used to compute the input data to obtain the output data.
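The four steps can be condensed into a compact functional sketch. In the Python sketch below, the row-wise splitting rule and the data layout are assumptions of the sketch, chosen to mirror the weight-matrix splitting described earlier; it is not the chip's actual interface.

```python
import numpy as np

def run_chip(overall_params, split_sizes, x):
    """Behavioral sketch of steps S411-S414."""
    # S411: the master receives the overall parameter set and splits it per unit
    matrices, row = [], 0
    for n in split_sizes:
        matrices.append(overall_params[row:row + n])
        row += n
    # S412: each control unit writes its matrix into its compute unit
    units = [np.array(G, dtype=float) for G in matrices]
    # S413 + S414: input is provided and each unit computes independently
    return [G @ x for G in units]

W = np.arange(12, dtype=float).reshape(4, 3)    # overall parameter set (rows 0..3)
print(run_chip(W, split_sizes=[2, 2], x=np.array([1.0, 0.0, 1.0])))
# [array([2., 8.]), array([14., 20.])]
```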

Fig. 5 shows a work flow diagram of a storage and computation integrated chip provided in at least one embodiment of the present disclosure.

As shown in fig. 5, the storage and computation integrated chip first starts in its initial state and determines whether the overall network parameter set has been compiled; if so, the main control unit provides the parameter matrices respectively intended for the plurality of processing units to their respective control units according to the compiled overall network parameter set.

Next, the control unit of each processing unit configures the parameter matrix into the calculation unit of that processing unit.

Then, a read control unit in the control unit of each of the plurality of processing units performs read control so that input data is read to at least one memristor array in the calculation unit of each of the plurality of processing units.

Next, the read input data is calculated by at least one memristor array in a calculation unit of each of the plurality of processing units.

Then, the calculation result of the at least one memristor array is provided to the functional unit for further processing to obtain the output data.

Then, the write control unit in the control unit of each of the plurality of processing units performs write control.

For example, in the write-control process, the data of the first NUM_KERNELS (number of convolution kernels) channels output by the pooling unit are stored in the output buffer unit (the data of the other channels are all invalid) and then written to the second storage unit or the first storage unit through the bus interface: if the last convolutional layer is being calculated, the output data are written to the first storage unit; otherwise they are written to the second storage unit.
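This write-back decision can be summarized in a few lines of Python; plain lists stand in here for the output buffer and the two storage units.

```python
def write_back(pooled_channels, num_kernels, is_last_layer, first_mem, second_mem):
    """Keep only the first NUM_KERNELS channels, then route to the right memory."""
    valid = pooled_channels[:num_kernels]   # remaining channels carry invalid data
    (first_mem if is_last_layer else second_mem).append(valid)

first_mem, second_mem = [], []
write_back(list(range(8)), num_kernels=4, is_last_layer=False,
           first_mem=first_mem, second_mem=second_mem)   # intermediate layer
write_back(list(range(8)), num_kernels=4, is_last_layer=True,
           first_mem=first_mem, second_mem=second_mem)   # last layer
print(first_mem, second_mem)   # [[0, 1, 2, 3]] [[0, 1, 2, 3]]
```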

Finally, it is judged whether the current convolutional layer is the last layer; if so, the storage and computation integrated chip is reset, and otherwise read control continues with the next convolutional layer.

FIG. 6 illustrates a flow chart of an exemplary read control process. As shown in fig. 6, the process mainly includes four parts: read-address update, data reading, input-buffer-unit filling, and input bit control; pseudocode describing each part is also shown in fig. 6.

First, the control unit of each processing unit gives the first address (StartA) of the input data and updates it each time a complete matrix operation has been computed (in the pseudocode, SIZE_IMAGE indicates the size of the input picture, K the size of the convolution kernel, STRIDE the convolution stride, V the number of channels of the picture, and L = V × SIZE_IMAGE; the meanings of these parameters have been described above). When the computation of one convolutional layer is completed, the read address of the next convolutional layer is updated, and SIZE_IMAGE, K, STRIDE, V, and L are updated to the values corresponding to the next layer.

Then, when the ready signal of the block in the second storage unit corresponding to the read address is asserted, the data read operation starts. The data-reading process is shown in fig. 7.

In fig. 7, the input data (input pictures) are stored in the second storage unit 109 row by row (or column by column). For a convolution kernel of size K × K, K data segments need to be read in total (each read is characterized by a head address and a data length); one segment has length K × V (V corresponds to the number of channels of the input picture), and each segment corresponds to the input data of one row within the convolution kernel.
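The address arithmetic implied by this layout can be sketched as follows. The text fixes only that K segments of length K × V are read per window, each given by a head address and a length; the row-major address formula in this Python sketch is an inference of the sketch.

```python
def window_segments(start_row, start_col, K, V, L):
    """Head address and length of each of the K reads for one KxK window,
    assuming row-major storage with L = V * SIZE_IMAGE bytes per picture row."""
    return [((start_row + r) * L + start_col * V, K * V) for r in range(K)]

SIZE_IMAGE, V, K, STRIDE = 8, 3, 3, 1
L = V * SIZE_IMAGE                           # bytes per stored picture row
print(window_segments(0, 0, K, V, L))        # [(0, 9), (24, 9), (48, 9)]
print(window_segments(0, STRIDE, K, V, L))   # [(3, 9), (27, 9), (51, 9)]
```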

The second storage unit 109 is divided into a plurality of units (called blocks). The number of bytes written into the second storage unit 109 is counted, and when it equals the block length, or when the last data item has been written, the ready signal of the block is asserted and the block no longer accepts writes. The purpose of this design is to ensure that valid data can be read in time during the control unit's read-control process. In theory, the smaller the block length of the second storage unit 109, the better (the time to obtain valid data is shortened), but smaller blocks increase the area and power consumption of the second storage unit 109, so the block length should be chosen reasonably.

When data are read from the second storage unit 109, the control unit computes the head address and the tail address of the currently required data, queries the ready signals of the blocks corresponding to those addresses, performs the read operation once all those ready signals are valid, and then starts the computation of the corresponding memristor array. When the data of a block of the second storage unit 109 are no longer used in the computation, the control unit initiates a clear operation on the block and de-asserts its ready signal, after which the block accepts new data again.
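This ready/clear handshake can be modeled behaviorally; in the Python sketch below, the block count and block length are arbitrary.

```python
class BlockedBuffer:
    """Behavioral model of the per-block ready/clear handshake."""
    def __init__(self, num_blocks, block_len):
        self.block_len = block_len
        self.count = [0] * num_blocks
        self.ready = [False] * num_blocks

    def write(self, block, nbytes):
        if self.ready[block]:
            raise RuntimeError("block is ready; writing not allowed")
        self.count[block] += nbytes
        if self.count[block] >= self.block_len:  # full, or last data written
            self.ready[block] = True

    def clear(self, block):
        """Data no longer needed: drop ready so the block can be reused."""
        self.count[block] = 0
        self.ready[block] = False

buf = BlockedBuffer(num_blocks=2, block_len=4)
buf.write(0, 4)
print(buf.ready)   # [True, False] -> reads covering block 0 may proceed
buf.clear(0)
print(buf.ready)   # [False, False]
```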

Returning to fig. 6: next, the input buffer units are filled. Each memristor array has a corresponding input buffer unit, and the K data segments are evenly distributed into the input buffer unit of each memristor array required by the convolutional-layer calculation (NUM_ARRAYS in the pseudocode indicates the number of memristor arrays required by that calculation).

Finally, the data in the input buffer units are provided to the corresponding memristor arrays for calculation; the data may be input in two passes, the high bits first and the low bits second.

The storage and computation integrated chip with the distributed control architecture of the embodiments of the present disclosure can be used in other electronic devices, such as artificial intelligence processors (AI processors); it not only offers generality and flexibility but also achieves high computing power and high energy efficiency.

For the present disclosure, there are also the following points to be explained:

(1) the drawings of the embodiments of the disclosure only relate to the structures related to the embodiments of the disclosure, and other structures can refer to the common design.

(2) Without conflict, embodiments of the present disclosure and features of the embodiments may be combined with each other to arrive at new embodiments.

The above description is only for the specific embodiments of the present disclosure, but the scope of the present disclosure is not limited thereto, and the scope of the present disclosure should be subject to the scope of the claims.
