Circuit, corresponding device, system and method

Document No.: 168653. Publication date: 2021-10-29.

Abstract: The present technology, "Circuit, corresponding device, system and method," by G. Borgonovo and L. Re Fiorentin, is dated 2021-04-28. Embodiments of the present disclosure relate to circuits, corresponding devices, systems, and methods. An embodiment circuit includes: a set of input terminals configured to receive an input digital signal carrying input data; a set of output terminals configured to provide an output digital signal carrying output data; and computing circuitry configured to generate the output data from the input data. The computing circuitry includes a set of multiplier circuits, a set of adder-subtractor circuits, a set of accumulator circuits, and a configurable interconnection network. The configurable interconnection network is configured to selectively couple the multiplier circuits, the adder-subtractor circuits, the accumulator circuits, the input terminals, and the output terminals in at least two processing configurations. In a first configuration, the computing circuitry is configured to compute the output data according to a first set of functions, and in a second configuration, the computing circuitry is configured to compute the output data according to a different set of functions.

1. A circuit, comprising:

a set of input terminals configured to receive a set of respective input digital signals carrying input data;

a set of output terminals configured to provide a set of respective output digital signals carrying output data; and

computing circuitry configured to generate the output data from the input data, wherein the computing circuitry comprises:

a set of multiplier circuits;

a set of adder-subtractor circuits;

a set of accumulator circuits; and

a configurable interconnection network configured to selectively couple the multiplier circuits, the adder-subtractor circuits, the accumulator circuits, the input terminals, and the output terminals in at least two processing configurations;

wherein:

in a first processing configuration, the computing circuitry is configured to compute the output data according to a first set of functions; and

in at least one second processing configuration, the computing circuitry is configured to compute the output data according to a respective second set of functions, the respective second set of functions being different from the first set of functions.

2. The circuit of claim 1, further comprising:

a respective configurable read direct memory access controller coupled to a first subset of the set of input terminals to receive a respective first subset of the input digital signals carrying a first subset of input data, wherein the configurable read direct memory access controller is configured to control retrieval of the first subset of input data from a memory; and

a respective configurable write direct memory access controller coupled to the set of output terminals to provide the output digital signals carrying output data, wherein the configurable write direct memory access controller is configured to control storage of the output data into the memory.

3. The circuit of claim 2, further comprising: a respective input buffer register coupled to the configurable read direct memory access controller and a respective output buffer register coupled to the configurable write direct memory access controller.

4. The circuit of claim 1, further comprising:

read-only memory (ROM) address generator circuitry configured to control retrieval of a second subset of input data from at least one read-only memory via a second subset of the input digital signals; and/or

memory address generator circuitry configured to control retrieval of a third subset of input data from at least one local configurable memory via a third subset of the input digital signals.

5. The circuit of claim 1, wherein the set of multiplier circuits comprises a first multiplier circuit, a second multiplier circuit, a third multiplier circuit, and a fourth multiplier circuit, the set of adder-subtractor circuits comprises a first adder-subtractor circuit and a second adder-subtractor circuit, the set of accumulator circuits comprises a first accumulator circuit and a second accumulator circuit, and wherein:

the first multiplier circuit receives a first input signal of the set of respective input digital signals as a first operand and a second input signal of the set of respective input digital signals as a second operand;

the second multiplier circuit receives a third input signal of the set of respective input digital signals as a first operand and a selectable signal from a fourth input signal and a fifth input signal of the set of respective input digital signals as a second operand;

the third multiplier circuit receives as a first operand a signal selectable from the output signal from the first multiplier circuit and the second input signal, and receives as a second operand a signal selectable from a sixth input signal, a seventh input signal, and the fifth input signal;

the fourth multiplier circuit receives as a first operand a signal selectable from the output signal from the second multiplier circuit and the third input signal, and receives as a second operand a signal selectable from the fifth input signal and the seventh input signal;

the first adder-subtractor circuit receives, as a first operand, a signal selectable from the output signal from the first multiplier circuit, the second input signal, and the output signal from the third multiplier circuit, and receives, as a second operand, a signal selectable from the third input signal, the output signal from the second multiplier circuit, and a zero signal;

the second adder-subtractor circuit receives as a first operand a signal selectable from the output signal from the third multiplier circuit and the output signal from the fourth multiplier circuit, and receives as a second operand a signal selectable from the output signal from the fourth multiplier circuit, the output signal from the second multiplier circuit, and a zero signal;

the first accumulator circuit receives as input the output signal from the first adder-subtractor circuit;

the second accumulator circuit receives as input the output signal from the second adder-subtractor circuit; and

the first accumulator circuit is selectively activatable to provide a first output signal and the second accumulator circuit is selectively activatable to provide a second output signal.

6. The circuit of claim 5, wherein the computing circuitry comprises a set of functional circuits configured to compute a non-linear function.

7. The circuit of claim 6, wherein the set of functional circuits configured to compute a non-linear function comprises: a first circuit configured to calculate a non-linear function, and a second circuit configured to calculate a non-linear function, and wherein:

the first circuit configured to calculate a non-linear function receives as input an output signal from the first accumulator circuit;

the second circuit configured to calculate a non-linear function receives as input the output signal from the second accumulator circuit;

the first output signal is selectable between the output signal from the first accumulator circuit and the output signal from the first circuit configured to calculate a non-linear function; and

the second output signal is selectable between the output signal from the second accumulator circuit and the output signal from the second circuit configured to calculate a non-linear function.

8. An apparatus, comprising:

a set of data memory banks;

a control unit;

an interconnection network; and

a set of circuits, each circuit comprising:

a set of input terminals configured to receive a set of respective input digital signals carrying input data;

a set of output terminals configured to provide a set of respective output digital signals carrying output data; and

computing circuitry configured to generate the output data from the input data, wherein the computing circuitry comprises:

a set of multiplier circuits;

a set of adder-subtractor circuits;

a set of accumulator circuits; and

wherein:

in a first processing configuration, the computing circuitry is configured to compute the output data according to a first set of functions; and

wherein the set of circuits is configurable to read data from and write data to the set of data memory banks via the interconnection network as a function of configuration data stored in the control unit.

9. The apparatus of claim 8, wherein each of the data memory banks comprises a buffer register.

10. The apparatus of claim 9, wherein the buffer register is a double buffer register.

11. The apparatus of claim 8, further comprising:

12. The apparatus of claim 11, further comprising: a respective input buffer register coupled to the configurable read direct memory access controller and a respective output buffer register coupled to the configurable write direct memory access controller.

13. The apparatus of claim 8, further comprising:

memory address generator circuitry configured to control retrieval of a third subset of input data from at least one local configurable memory via a third subset of the input digital signals.

14. The apparatus of claim 8, wherein the set of multiplier circuits comprises a first multiplier circuit, a second multiplier circuit, a third multiplier circuit, and a fourth multiplier circuit, the set of adder-subtractor circuits comprises a first adder-subtractor circuit and a second adder-subtractor circuit, the set of accumulator circuits comprises a first accumulator circuit and a second accumulator circuit, and wherein:

the first accumulator circuit receives as input the output signal from the first adder-subtractor circuit;

the second accumulator circuit receives as input the output signal from the second adder-subtractor circuit; and

the first accumulator circuit is selectively activatable to provide a first output signal and the second accumulator circuit is selectively activatable to provide a second output signal.

15. The apparatus of claim 14, wherein the computing circuitry comprises a set of functional circuits configured to compute a non-linear function.

16. The apparatus of claim 15, wherein the set of functional circuits configured to compute a non-linear function comprises: a first circuit configured to calculate a non-linear function, and a second circuit configured to calculate a non-linear function, and wherein:

the first circuit configured to calculate a non-linear function receives as input an output signal from the first accumulator circuit;

the second circuit configured to calculate a non-linear function receives as input the output signal from the second accumulator circuit;

the first output signal is selectable between the output signal from the first accumulator circuit and the output signal from the first circuit configured to calculate a non-linear function; and

the second output signal is selectable between the output signal from the second accumulator circuit and the output signal from the second circuit configured to calculate a non-linear function.

17. A system, comprising:

a system interconnect;

a processing unit;

a device coupled to the processing unit via the system interconnect, wherein the device comprises:

a set of data memory banks;

a control unit;

an interconnection network; and

a set of circuits, each circuit comprising:

a set of input terminals configured to receive a set of respective input digital signals carrying input data;

a set of output terminals configured to provide a set of respective output digital signals carrying output data; and

computing circuitry configured to generate the output data from the input data, wherein the computing circuitry comprises:

a set of multiplier circuits;

a set of adder-subtractor circuits;

a set of accumulator circuits; and

wherein:

in a first processing configuration, the computing circuitry is configured to compute the output data according to a first set of functions; and

wherein the set of circuits is configurable to read data from and write data to the set of data memory banks via the interconnection network as a function of configuration data stored in the control unit; and

wherein the set of circuits is configurable in at least two processing configurations according to control signals received from the processing unit.

18. The system of claim 17, wherein each of the data memory banks comprises a buffer register.

19. The system of claim 18, wherein the buffer register is a double buffer register.

20. A method of operating a circuit, the circuit comprising: a set of input terminals configured to receive a set of respective input digital signals carrying input data; a set of output terminals configured to provide a set of respective output digital signals carrying output data; and computing circuitry configured to generate the output data from the input data, the computing circuitry comprising a set of multiplier circuits, a set of adder-subtractor circuits, a set of accumulator circuits, and a configurable interconnection network configured to selectively couple the multiplier circuits, the adder-subtractor circuits, the accumulator circuits, the input terminals, and the output terminals in at least two processing configurations, the computing circuitry being configured to compute the output data according to a first set of functions in a first processing configuration, and the computing circuitry being configured to compute the output data according to a respective second set of functions in at least one second processing configuration, the respective second set of functions being different from the first set of functions, the method comprising:

dividing an operating time of the computing circuitry into at least a first operating interval and a second operating interval;

operating the computing circuitry in the first processing configuration in the first operating interval; and

operating the computing circuitry in the at least one second processing configuration in the second operating interval.

Technical Field

The present description relates to digital signal processing circuits, such as hardware accelerators, and related methods, apparatus, and systems.

Background

Various real-time digital signal processing systems (which are in increasing demand in the automotive field, e.g., for processing video data and/or image data, radar data, or wireless communication data) may involve processing a considerable amount of data per unit time. In various applications, such processing may become too demanding for purely core-based implementations (i.e., implementations involving a general-purpose microprocessor or microcontroller running processing software).

Thus, the use of hardware accelerators becomes increasingly important in certain areas of data processing because it helps to speed up the computation of certain algorithms. A properly designed hardware accelerator may reduce the processing time for a particular operation compared to a core-based implementation.

Conventional hardware accelerators described in the literature or available as commercial products may include different types of processing elements (also referred to as "mathematical units" or "mathematical operators"), where each processing element is dedicated to the computation of a particular operation. For example, such processing elements may include multiply-and-accumulate (MAC) circuitry and/or circuitry configured to calculate an activation function, such as an activation nonlinear function (ANLF) (e.g., coordinate rotation digital computer (CORDIC) circuitry).

Each of the processing elements described above is typically designed to implement a particular function (e.g., a radix-2 butterfly algorithm, multiplication of complex vectors, vector/matrix products, trigonometric or exponential or logarithmic functions, convolution, etc.). Thus, conventional hardware accelerators typically include a variety of such different processing elements connected together by some sort of interconnection network. In some cases, different processing elements are activated one at a time, resulting in inefficient use of silicon area and available hardware resources due to data dependencies and/or architectural limitations.

On the other hand, a purely software-implemented, core-based approach (e.g., using a Single Instruction Multiple Data (SIMD) processor) may involve a high clock frequency to meet the usual bandwidth requirements of a real-time system, since in this case each processing element performs only basic operations.

Disclosure of Invention

It is an object of one or more embodiments to provide a hardware accelerator device that addresses one or more of the above disadvantages.

In particular, one or more embodiments are directed to providing a memory-based hardware accelerator device (also referred to in the present disclosure by the acronym EDPA, enhanced data processing architecture) that includes one or more processing elements. The processing elements in the hardware accelerator device may be reconfigured at runtime to provide increased flexibility of use and to facilitate efficient computation of various signal processing operations that may be resource-intensive (e.g., fast Fourier transforms, digital filtering, implementation of artificial neural networks, etc.).

One or more embodiments may find application in real-time processing systems where acceleration of computationally demanding operations (e.g., vector/matrix products, convolution, FFT, radix-2 butterfly, complex vector multiplication, trigonometric, exponential or logarithmic functions, etc.) may help meet certain performance requirements (e.g., in terms of processing time). This may be the case, for example, in the automotive field.

According to one or more embodiments, such an object is achieved by means of a circuit (e.g. a runtime reconfigurable processing unit) having the features set forth in the claims below.

One or more embodiments may relate to a corresponding apparatus (e.g., a hardware accelerator circuit including one or more runtime reconfigurable processing units).

One or more embodiments may be directed to a corresponding system (e.g., a system-on-a-chip integrated circuit including hardware accelerator circuitry).

One or more embodiments may be directed to a corresponding method.

The claims are an integral part of the technical teaching provided herein in respect of the embodiments.

In accordance with one or more embodiments, a circuit is provided that may include a set of input terminals configured to receive an input digital signal carrying input data, and a set of output terminals configured to provide an output digital signal carrying output data. The circuit may comprise computing circuitry configured to generate the output data from the input data. The computing circuitry may include a set of multiplier circuits, a set of adder-subtractor circuits, a set of accumulator circuits, and a configurable interconnection network. The configurable interconnection network may be configured to selectively couple the multiplier circuits, the adder-subtractor circuits, the accumulator circuits, the input terminals, and the output terminals in at least two processing configurations. In a first processing configuration, the computing circuitry is configured to compute the output data according to a first set of functions, and in at least one second processing configuration, the computing circuitry is configured to compute the output data according to a respective second set of functions. The second set of functions is different from the first set of functions.

Accordingly, one or more embodiments may provide increased flexibility, improved hardware resource usage, and/or improved parallel computing performance.
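By way of illustration only, the idea of one datapath computing different sets of functions in different processing configurations may be sketched in software. The following Python model is a minimal sketch under stated assumptions: the class name `Datapath`, the two methods, and the choice of a complex multiply-accumulate versus two parallel real multiply-accumulates as the two configurations are hypothetical and are not taken from the claims or the drawings.

```python
from dataclasses import dataclass


@dataclass
class Datapath:
    """Toy model (hypothetical): multipliers, two adder-subtractors, two accumulators."""
    acc0: float = 0.0
    acc1: float = 0.0

    def complex_mac(self, a: complex, b: complex) -> complex:
        # Assumed configuration A: four multipliers feed the two
        # adder-subtractors, yielding one complex product per step.
        m0 = a.real * b.real
        m1 = a.imag * b.imag
        m2 = a.real * b.imag
        m3 = a.imag * b.real
        self.acc0 += m0 - m1  # adder-subtractor 0 in subtract mode (real part)
        self.acc1 += m2 + m3  # adder-subtractor 1 in add mode (imaginary part)
        return complex(self.acc0, self.acc1)

    def dual_real_mac(self, x0: float, w0: float, x1: float, w1: float):
        # Assumed configuration B: two multipliers are used, each feeding its
        # own accumulator; the second adder-subtractor operand is a zero signal.
        self.acc0 += x0 * w0 + 0.0
        self.acc1 += x1 * w1 + 0.0
        return self.acc0, self.acc1


dp_a = Datapath()
print(dp_a.complex_mac(1 + 2j, 3 + 4j))

dp_b = Datapath()
print(dp_b.dual_real_mac(2.0, 3.0, 4.0, 5.0))
```

The same arithmetic resources are reused in both methods; only the routing between them changes, which is the role the description assigns to the configurable interconnection network.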

Drawings

One or more embodiments will now be described, by way of example only, with reference to the accompanying drawings, in which:

FIG. 1 is an exemplary circuit block diagram of an electronic system (such as a system on a chip) in accordance with one or more embodiments;

FIG. 2 is an exemplary circuit block diagram of an electronic device implementing a hardware accelerator in accordance with one or more embodiments;

FIG. 3 is an exemplary circuit block diagram of processing circuitry for an electronic device according to one or more embodiments;

FIG. 4 is another exemplary circuit block diagram of processing circuitry for an electronic device according to one or more embodiments; and

FIG. 5 is an exemplary diagram of a multi-layer perceptron network structure.

Detailed Description

In the following description, one or more specific details are set forth in order to provide a thorough understanding of the examples of embodiments described herein. Embodiments may be practiced without one or more of the specific details, or with other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail so as not to obscure certain aspects of the embodiments.

Reference to "an embodiment" or "one embodiment" in the framework of the description is intended to indicate that a particular configuration, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, phrases such as "in an embodiment" or "in one embodiment" that may be present at one or more points of the present description do not necessarily refer to one and the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

In the drawings attached hereto, like parts or elements are denoted with like references/numerals and, for the sake of brevity, the corresponding description will not be repeated.

The references used herein are provided for convenience only and therefore do not define the scope of protection or the scope of the embodiments.

FIG. 1 is an example of an electronic system 1, such as a system on a chip (SoC), in accordance with one or more embodiments. The electronic system 1 may comprise various electronic circuits, such as: a central processing unit 10 (CPU, e.g., a microprocessor), a main system memory 12 (e.g., system RAM, random access memory), a direct memory access (DMA) controller 14, and a hardware accelerator device 16.

As shown in FIG. 1, the electronic circuits in the electronic system 1 may be connected by a system interconnection network 18 (e.g., an SoC interconnect).

It is an object of one or more embodiments to provide a (runtime) reconfigurable hardware accelerator circuit 16 designed to support the execution of various (basic) arithmetic functions and having improved flexibility of use. Accordingly, one or more embodiments may help improve the use of silicon area and provide satisfactory processing performance, for example, to meet the processing time requirements of a real-time data processing system.

As shown in FIG. 1, in one or more embodiments, hardware accelerator circuit 16 may include at least one (runtime) configurable processing element 160, preferably a number P of (runtime) configurable processing elements 160_0, 160_1, …, 160_{P-1}, and a set of local data memory banks M, preferably a number Q = 2P of local data memory banks M_0, …, M_{Q-1}.

In one or more embodiments, hardware accelerator circuit 16 may further include a local control unit 161, a local interconnect network 162, a local data memory controller 163, a local ROM controller 164 (the local ROM controller 164 being coupled to a set 165 of local read-only memories, preferably a number P of local read-only memories 165_0, 165_1, …, 165_{P-1}), and a local configuration memory controller 166 (the local configuration memory controller 166 being coupled to a set 167 of local configurable coefficient memories, preferably a number P of local configurable coefficient memories 167_0, 167_1, …, 167_{P-1}). For example, the memories 167 may include volatile memory (e.g., RAM) and/or non-volatile memory (e.g., PCM memory).

Different embodiments may include different numbers P of processing elements 160 and/or different numbers Q of local data memory banks M_0, …, M_{Q-1}. For example, P may equal 8 and Q may equal 16.

In one or more embodiments, processing elements 160 may be configured to support different (basic) processing functions with different levels of computational parallelism. For example, processing elements 160 may support (e.g., based on an appropriate static configuration) different types of arithmetic (e.g., 32-bit single-precision floating point, 32-bit fixed point/integer, or 16-bit or 8-bit with parallel computation or vectorized modes).

Processing elements 160 may include respective low-complexity internal direct memory access (DMA) controllers 168_0, 168_1, …, 168_{P-1}. In particular, processing elements 160 may be configured to retrieve input data from the local data memory banks M_0, …, M_{Q-1} and/or from the main system memory 12 through the respective direct memory access controllers 168. The processing elements 160 may then process the retrieved input data to generate processed output data. The processing elements 160 may be configured to store the processed output data in the local data memory banks M_0, …, M_{Q-1} and/or in the main system memory 12 through the respective direct memory access controllers 168.

Further, the processing elements 160 may be configured to retrieve input data from the local read-only memories 165 and/or from the local configurable coefficient memories 167 to perform such processing.

In one or more embodiments, providing a set of local data memory banks M_0, …, M_{Q-1} may facilitate processing data in parallel and reducing memory access conflicts.

Preferably, the local data memory banks M_0, …, M_{Q-1} may be provided with buffering (e.g., double buffering), which may help hide the memory upload time (write operations) and/or download time (read operations). In particular, each local data memory bank may be duplicated, so that data may be read (e.g., for processing) from one of the two memory banks while (new) data is stored (e.g., for later processing) in the other memory bank. Thus, moving data may not negatively impact computational performance, because the transfer time may be masked.
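The double-buffering idea described above can be sketched in a few lines of Python. This is a minimal software illustration, not the hardware implementation: the class `DoubleBufferedBank` and its methods `read`, `write`, and `swap` are hypothetical names introduced here.

```python
class DoubleBufferedBank:
    """Hypothetical model of one duplicated (ping-pong) local data memory bank."""

    def __init__(self, size: int):
        # Two copies of the bank: one is read by the processing side while
        # the other is being filled with new data.
        self._halves = [[0] * size, [0] * size]
        self._read_idx = 0  # index of the half currently visible to readers

    def read(self, offset: int) -> int:
        # Processing side: always reads the "active" half.
        return self._halves[self._read_idx][offset]

    def write(self, offset: int, value: int) -> None:
        # Loading side: always writes the half that is not being read.
        self._halves[1 - self._read_idx][offset] = value

    def swap(self) -> None:
        # Exchange the roles of the two halves once loading has finished.
        self._read_idx = 1 - self._read_idx


bank = DoubleBufferedBank(size=4)
bank.write(0, 7)      # new data goes to the hidden half
print(bank.read(0))   # readers still see the old content
bank.swap()
print(bank.read(0))   # after the swap the new data becomes visible
```

Because reads and writes never target the same half, loading the next data set can overlap with processing the current one.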

In one or more embodiments, such buffering of the local data memory banks M_0, …, M_{Q-1} may be advantageous in combination with streaming-mode or back-to-back data processing (e.g., as applied to an N-point FFT processor configured to process continuous sequences of N data inputs).

In one or more embodiments, the local data memory banks M_0, …, M_{Q-1} may include memory banks having limited storage capacity (and therefore limited silicon footprint). In the exemplary case of an FFT processor, each local data memory bank may have a storage capacity of at least (maxN)/Q, where maxN is the longest FFT that the hardware is capable of handling. Typical values in applications involving a hardware accelerator may be as follows:

4096 points, where each point is, for example, a single-precision floating-point complex number (real, imaginary) with a size of 64 bits (or 8 bytes),

P = 8, resulting in Q = 16,

so that the storage capacity of each local data memory bank may be equal to (4096 × 8 bytes)/16 = 2 KB (KB = kilobyte).
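The sizing arithmetic above can be checked with a short Python sketch. The function name `bank_capacity_bytes` and its parameters are hypothetical; the figures (4096 points, 8 bytes per point, P = 8, Q = 2P = 16) are those given in the text.

```python
def bank_capacity_bytes(max_fft_points: int, bytes_per_point: int, num_banks: int) -> int:
    """Minimum per-bank capacity so that num_banks banks hold one full FFT frame."""
    total_bytes = max_fft_points * bytes_per_point
    # Ceiling division, in case the frame does not divide evenly across banks.
    return -(-total_bytes // num_banks)


P = 8            # number of processing elements (example value from the text)
Q = 2 * P        # number of local data memory banks
capacity = bank_capacity_bytes(max_fft_points=4096, bytes_per_point=8, num_banks=Q)
print(capacity, "bytes =", capacity // 1024, "KB")
```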

In one or more embodiments, local control unit 161 may include a register file that includes information for setting the configuration of processing elements 160. For example, the local control unit 161 may set the processing element 160 to execute a particular algorithm as directed by a host application running on the central processing unit 10.

In one or more embodiments, the local control unit 161 may thus comprise controller circuitry of the hardware accelerator circuitry 16. Such controller circuitry may configure (e.g., dynamically) each processing element 160 for computing a particular (basic) function, and may configure a corresponding internal direct memory access controller 168 with a particular memory access scheme and cycle period.

In one or more embodiments, local interconnect network 162 may comprise a low-complexity interconnect system, e.g., based on a bus network of a known type, such as an AXI4-based interconnect. For example, the data parallelism of the local interconnect network 162 may be 64 bits and the address width may be 32 bits.

The local interconnect network 162 may be configured to connect the processing elements 160 to the local data memory banks M_0, …, M_{Q-1} and/or to the main system memory 12. In addition, local interconnect network 162 may be configured to connect the local control unit 161 and the local configuration memory controller 166 to the system interconnection network 18.

In particular, the interconnection network 162 may include: a set of P master ports MP_0, MP_1, …, MP_{P-1}, each of which may be coupled to a respective processing element 160; a set of P slave ports SP_0, SP_1, …, SP_{P-1}, each of which may be coupled to the local data memory banks M_0, …, M_{Q-1} via the local data memory controller 163; a further pair of ports, comprising a system master port MP_P and a system slave port SP_P, configured to couple to the system interconnection network 18 (e.g., to receive instructions from the central processing unit 10 and/or to access data stored in the system memory 12); and a further slave port SP_{P+1} coupled to the local control unit 161 and the local configuration memory controller 166.

In one or more embodiments, the interconnection network 162 may be fixed (i.e., non-reconfigurable).

In an exemplary embodiment (see, e.g., Table I-1 provided below, where the "X" symbol indicates an existing connection between two ports), the interconnection network 162 may implement the following connections: each of the P master ports MP_0, MP_1, …, MP_{P-1} coupled to a processing element 160 may be connected to a corresponding slave port SP_0, SP_1, …, SP_{P-1} coupled to the local data memory controller 163; and the system master port MP_P coupled to the system interconnection network 18 may be connected to the slave port SP_{P+1} coupled to the local control unit 161 and the local configuration memory controller 166.

Table I-1 provided below summarizes such exemplary connections made through the interconnection network 162.

TABLE I-1

         | SP_0 | SP_1 | … | SP_{P-1} | SP_P | SP_{P+1}
MP_0     | X    |      |   |          |      |
MP_1     |      | X    |   |          |      |
…        |      |      | … |          |      |
MP_{P-1} |      |      |   | X        |      |
MP_P     |      |      |   |          |      | X

In another exemplary embodiment (see, e.g., Table I-2 provided below), the interconnection network 162 may further implement the following connections: each of the P master ports MP_0, MP_1, …, MP_{P-1} may be connected to the system slave port SP_P coupled to the system interconnection network 18. In this manner, connectivity may be provided between any processing element 160 and the SoC via the system interconnection network 18.

Table I-2 provided below summarizes such exemplary connections made through the interconnection network 162.

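TABLE I-2

         | SP_0 | SP_1 | … | SP_{P-1} | SP_P | SP_{P+1}
MP_0     | X    |      |   |          | X    |
MP_1     |      | X    |   |          | X    |
…        |      |      | … |          | …    |
MP_{P-1} |      |      |   | X        | X    |
MP_P     |      |      |   |          |      | X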

In another exemplary embodiment (see, e.g., Table I-3 provided below, where the "X" symbol indicates an existing connection between two ports, and an "X" in parentheses indicates an optional connection), the interconnection network 162 may further implement the following connection: the system master port MP_P coupled to the system interconnection network 18 may be connected to at least one of the slave ports SP_0, SP_1, …, SP_{P-1} (here, the first slave port SP_0 of the set of P slave ports SP_0, SP_1, …, SP_{P-1}). In this way, a connection may be provided between the master port MP_P and (any) slave port. Depending on the specific application of the system 1, the connection of the master port MP_P may be extended to a plurality (e.g., all) of the slave ports SP_0, SP_1, …, SP_{P-1}. The connection of the master port MP_P to at least one of the slave ports SP_0, SP_1, …, SP_{P-1} may be used (only) for loading input data to be processed into the local data memory banks M_0, …, M_{Q-1}, because all memory banks are accessible via a single slave port. Loading input data may be done using only one slave port, whereas processing data by means of parallel computations may advantageously use a plurality (e.g., all) of the slave ports SP_0, SP_1, …, SP_{P-1}.

Table I-3 provided below summarizes such exemplary connections made via the interconnection network 162.

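TABLE I-3

| | SP₀ | SP₁ | … | SP_P-1 | SP_P | SP_P+1 |
| MP₀ | X | | | | X | |
| MP₁ | | X | | | X | |
| … | | | … | | X | |
| MP_P-1 | | | | X | X | |
| MP_P | X | (X) | (X) | (X) | | X |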

In one or more embodiments, the local data memory controller 163 may be configured to arbitrate access (e.g., by the processing elements 160) to the local data memory banks M₀, …, M_Q-1. For example, the local data memory controller 163 may use a memory access scheme (e.g., one suited to computing a particular algorithm) that is selectable according to signals received from the local control unit 161.

In one or more embodiments, the local data memory controller 163 may convert an input read/write transaction burst (e.g., an AXI burst) generated by the read/write direct memory access controller 168 into a sequence of read/write memory accesses according to the specified burst type, burst length, and memory access scheme.
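The burst-to-access conversion described above can be illustrated with a minimal Python sketch. The address arithmetic follows the usual AXI-style FIXED, INCR and WRAP burst semantics with an aligned start address; the function names and the word-interleaved bank mapping in `to_bank_access` are assumptions made only for illustration, since the text does not specify the memory access schemes actually used.

```python
from typing import List, Tuple


def burst_addresses(start: int, length: int, size: int, burst_type: str) -> List[int]:
    """Byte address of each beat of an AXI-style burst (aligned start assumed)."""
    if burst_type == "FIXED":
        # Every beat targets the same address.
        return [start] * length
    if burst_type == "INCR":
        # Each beat advances by the transfer size.
        return [start + n * size for n in range(length)]
    if burst_type == "WRAP":
        total = length * size
        base = (start // total) * total  # lower wrap boundary
        # Addresses increment and wrap back to the boundary.
        return [base + (start - base + n * size) % total for n in range(length)]
    raise ValueError(f"unknown burst type: {burst_type}")


def to_bank_access(addr: int, size: int, num_banks: int) -> Tuple[int, int]:
    """Hypothetical access scheme: consecutive words map to consecutive banks."""
    word = addr // size
    return word % num_banks, word // num_banks  # (bank index, offset in bank)


def burst_to_accesses(start: int, length: int, size: int,
                      burst_type: str, num_banks: int) -> List[Tuple[int, int]]:
    """Expand one burst into a sequence of (bank, offset) memory accesses."""
    return [to_bank_access(a, size, num_banks)
            for a in burst_addresses(start, length, size, burst_type)]
```

With two banks, for instance, a four-beat INCR burst of 4-byte words starting at address 0 alternates between bank 0 and bank 1, which is the kind of conflict-free pattern a selectable access scheme may aim for.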

Thus, one or more embodiments of hardware accelerator circuitry 16 as shown in FIG. 1 may aim to reduce the complexity of local interconnect network 162 by delegating the implementation of (reconfigurable) connections between processing elements and local data memory banks to local data memory controller 163.

In one or more embodiments, the local read-only memories 165₀, 165₁, …, 165_P-1, accessible by the processing elements 160 via the local ROM controller 164, may be configured to store numeric factors and/or fixed coefficients (e.g., twiddle factors or other complex coefficients for FFT computation) for implementing a particular algorithm or operation. The local ROM controller 164 may implement a particular addressing scheme.

In one or more embodiments, the local configurable coefficient memories 167₀, 167₁, …, 167_P-1, accessible by the processing elements 160 via the local configuration memory controller 166, may be configured to store application-dependent numeric factors and/or coefficients that may be configured by software (e.g., coefficients for implementing FIR filters or beamforming operations, weights of neural networks, etc.). The local configuration memory controller 166 may implement a particular addressing scheme.

In one or more embodiments, the local read-only memories 165₀, 165₁, …, 165_P-1 and/or the local configurable coefficient memories 167₀, 167₁, …, 167_P-1 may advantageously be partitioned into a number P of banks equal to the number of processing elements 160 included in the hardware accelerator circuit 16. This helps to avoid conflicts during parallel computations. For example, each local configurable coefficient memory may be configured to provide, in parallel, the complete set of coefficients needed by a respective processing element 160.

FIG. 2 is an exemplary circuit block diagram of one or more embodiments of a processing element 160 and of its connections to the local ROM controller 164, the local configuration memory controller 166, and the local data memory banks M₀, …, M_Q-1 (where the dashed lines schematically indicate reconfigurable connections between the processing element 160 and the local data memory banks M₀, …, M_Q-1).

The processing element 160 as shown in FIG. 2 may be configured to receive: a first input signal P (e.g., a digital signal indicative of a binary value from the local data memory banks M₀, …, M_Q-1, possibly complex data having a real part and an imaginary part) via a respective read direct memory access 200₀ and a buffer register 202₀ (e.g., a FIFO register); a second input signal Q (e.g., a digital signal indicative of a binary value from the local data memory banks M₀, …, M_Q-1, possibly complex data having a real part and an imaginary part) via a respective read direct memory access 200₁ and a buffer register 202₁ (e.g., a FIFO register); a first input coefficient W0 (e.g., a digital signal indicative of a binary value from the local read-only memory 165); and second, third, fourth and fifth input coefficients W1, W2, W3, W4 (e.g., digital signals indicative of respective binary values from the local configurable coefficient memory 167).

In one or more embodiments, the processing element 160 may include a number of read direct memory accesses 200 equal to the number of input signals P, Q.

It should be understood that the number of input signals and/or input coefficients received at processing element 160 may vary in different embodiments.

The processing element 160 may include a computational circuit 20, which may be configured to process (possibly at run time) the input values P, Q and the input coefficients W0, W1, W2, W3, W4 to generate a first output signal X0 (e.g., a digital signal indicative of a binary value to be stored in the local data memory banks M₀, …, M_Q-1 via a respective write direct memory access 204₀ and a buffer register 206₀, such as a FIFO register) and a second output signal X1 (e.g., a digital signal indicative of a binary value to be stored in the local data memory banks M₀, …, M_Q-1 via a respective write direct memory access 204₁ and a buffer register 206₁, such as a FIFO register).

In one or more embodiments, processing element 160 may include a number of write direct memory accesses 204 equal to the number of output signals X0, X1.

In one or more embodiments, the programming of the read and/or write direct memory accesses 200, 204 (included in the direct memory access controller 168) may be performed via an interface (e.g., an AMBA interface) that may allow access to internal control registers located in the local control unit 161.

Further, the processing element 160 may include a ROM address generator circuit 208 coupled to the local ROM controller 164 and a memory address generator circuit 210 coupled to the local configuration memory controller 166 to manage data retrieved therefrom.

Fig. 3 is an exemplary circuit block diagram of computing circuitry 20 that may be included in one or more embodiments of processing element 160.

As shown in fig. 3, the computation circuit 20 may include a set of processing resources, for example four complex/real multiplier circuits (30a, 30b, 30c, 30d), two complex adder-subtractor circuits (32a, 32b) and two accumulator circuits (34a, 34b), which may be reconfigurably coupled as shown in fig. 3. For example, the reconfigurable coupling of the processing resources may be achieved by means of multiplexer circuits (e.g., 36a to 36j), each receiving a respective control signal (e.g., S0 to S7), so as to form different data paths corresponding to different mathematical operations.

In one or more embodiments, the multiplier circuits 30a, 30b, 30c, 30d may be configured (e.g., by means of internal multiplexer circuits not visible in the figure) to operate according to two different configurations, which may be selected according to a control signal S8 provided to the multiplier. In a first configuration (e.g., if S8 = 0), the multiplier may compute the result of two real products on four real operands per clock cycle (i.e., each input signal carries two real values). In a second configuration (e.g., if S8 = 1), the multiplier may compute the result of one complex product on two complex operands per clock cycle (i.e., each input signal carries two values, where the first value is the real part of the operand and the second value is the imaginary part of the operand).
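The two multiplier configurations can be sketched in Python as follows. The function name `multiply` and the lane pairing used in the real configuration (first value with first value, second with second) are assumptions for illustration; the text only states that each input signal carries two values.

```python
from typing import Tuple

Pair = Tuple[float, float]  # two values carried by one input signal


def multiply(a: Pair, b: Pair, s8: int) -> Pair:
    """Model of one complex/real multiplier controlled by S8."""
    if s8 == 0:
        # Real configuration: two independent real products per cycle.
        return (a[0] * b[0], a[1] * b[1])
    # Complex configuration: (a_re + j*a_im) * (b_re + j*b_im).
    re = a[0] * b[0] - a[1] * b[1]
    im = a[0] * b[1] + a[1] * b[0]
    return (re, im)
```

The same pair of operands thus yields either two real results or the real and imaginary parts of a single complex result, depending only on S8.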

Table II provided below summarizes exemplary possible configurations of the multiplier circuits 30a, 30b, 30c, 30d.

TABLE II

By way of example and with reference to fig. 3, the processing resources may be arranged as described below.

The first multiplier 30a may receive a first input signal W1 and a second input signal P (e.g., a complex operand).

The second multiplier 30b may receive the first input signal Q and the second input signal selected from the input signals W2 and W4 via the first multiplexer 36a, the first multiplexer 36a receiving the corresponding control signal S2. For example, if S2 is 0, the multiplier 30b receives the signal W2 as the second input, and if S2 is 1, the multiplier 30b receives the signal W4 as the second input.

The third multiplier 30c may receive a first input signal selected from the output signal from the first multiplier 30a and the input signal P.

For example, as shown in fig. 3, the second multiplexer 36b may provide as an output either the output signal from the first multiplier 30a (e.g., if S0 = 0) or the input signal P (e.g., if S0 = 1) according to the respective control signal S0. The third multiplexer 36c may provide either the output signal from the second multiplexer 36b (e.g., if S3 = 1) or the input signal P (e.g., if S3 = 0) as an output to the first input of the third multiplier 30c according to the respective control signal S3.

The third multiplier 30c may receive a second input signal selected from the input signal W3, the input signal W4, and the input signal W0.

For example, as shown in fig. 3, the fourth multiplexer 36d may provide as an output either the input signal W4 (e.g., if S3 = 0) or the input signal W0 (e.g., if S3 = 1) according to the respective control signal S3. The fifth multiplexer 36e may provide either the input signal W3 (e.g., if S3 = 0) or the output signal from the fourth multiplexer 36d (e.g., if S3 = 1) as an output to the second input of the third multiplier 30c according to the respective control signal S3.

The fourth multiplier 30d may receive a first input signal selected from the input signal Q and the output signal from the second multiplier 30b.

For example, as shown in fig. 3, the sixth multiplexer 36f may provide either the input signal Q (e.g., if S1 = 0) or the output signal from the second multiplier 30b (e.g., if S1 = 1) as an output to the first input of the fourth multiplier 30d according to the respective control signal S1.

The fourth multiplier 30d may receive a second input signal selected from the input signal W4 and the input signal W0.

For example, as shown in fig. 3, a second input of the fourth multiplier 30d may be coupled to an output of a fourth multiplexer 36 d.

The first adder-subtractor 32a may receive a first input signal selected from the output signal from the first multiplier 30a, the input signal P, and the output signal from the third multiplier 30c.

For example, as shown in fig. 3, the seventh multiplexer 36g may provide either the output signal from the second multiplexer 36b (e.g., if S7 = 1) or the output signal from the third multiplier 30c (e.g., if S7 = 0) as an output to the first input of the first adder-subtractor 32a.

The first adder-subtractor 32a may receive a second input signal selected from the input signal Q, the output signal from the second multiplier 30b, and a zero signal (i.e., a binary signal equal to zero).

For example, as shown in fig. 3, the eighth multiplexer 36h may provide as an output either the input signal Q (e.g., if S6 = 0) or the output signal from the second multiplier 30b (e.g., if S6 = 1) according to the respective control signal S6. The first AND gate 38a may receive the output signal from the eighth multiplexer 36h as a first input signal and the control signal G0 as a second input signal. The output of the first AND gate 38a may be coupled to the second input of the first adder-subtractor 32a.

The second adder-subtractor 32b may receive a first input signal selected from the output signal from the third multiplier 30c and the output signal from the fourth multiplier 30d.

For example, as shown in fig. 3, the ninth multiplexer 36i may provide either the output signal from the third multiplier 30c (e.g., if S5 = 0) or the output signal from the fourth multiplier 30d (e.g., if S5 = 1) as an output to the first input of the second adder-subtractor 32b according to the respective control signal S5.

The second adder-subtractor 32b may receive a second input signal selected from the output signal from the fourth multiplier 30d, the output signal from the second multiplier 30b, and a zero signal (i.e., a binary signal equal to zero).

For example, as shown in fig. 3, the tenth multiplexer 36j may provide as an output either the output signal from the fourth multiplier 30d (e.g., if S4 = 0) or the output signal from the second multiplier 30b (e.g., if S4 = 1) according to the respective control signal S4. The second AND gate 38b may receive the output signal from the tenth multiplexer 36j as a first input signal and the control signal G1 as a second input signal. The output of the second AND gate 38b may be coupled to the second input of the second adder-subtractor 32b.

The first accumulator 34a may receive an input signal from the output of the first adder-subtractor 32a and a control signal EN to provide a first output signal X0 of the calculation circuit 20.

The second accumulator 34b may receive an input signal from the output of the second adder-subtractor 32b and the control signal EN to provide a second output signal X1 of the calculation circuit 20.

In one or more embodiments, the operation of the adder-subtractors 32a, 32b may be "bypassed" by means of the AND gates 38a, 38b, which may be used to force a zero signal at the second inputs of the adder-subtractors 32a, 32b.
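The arrangement of fig. 3 described above can be summarized as a small behavioral model. The following Python sketch follows the multiplexer polarities stated in the text, with several assumptions labeled in the code: the add/subtract selectors `SUB0` and `SUB1` are hypothetical (the text does not name the signals that choose between addition and subtraction), the accumulators are assumed to pass their input through when EN is 0, and the two example control settings are derived from the multiplexer description rather than taken from the configuration tables.

```python
from dataclasses import dataclass


@dataclass
class Ctrl:
    """Control signals named in the text, plus hypothetical SUB0/SUB1."""
    S0: int = 0
    S1: int = 0
    S2: int = 0
    S3: int = 0
    S4: int = 0
    S5: int = 0
    S6: int = 0
    S7: int = 0
    G0: int = 1    # 0 forces a zero at the second input of 32a
    G1: int = 1    # 0 forces a zero at the second input of 32b
    EN: int = 0    # accumulator enable
    SUB0: int = 0  # hypothetical: 1 makes adder-subtractor 32a subtract
    SUB1: int = 0  # hypothetical: 1 makes adder-subtractor 32b subtract


class ComputeCircuit:
    """Behavioral sketch of the data path of fig. 3 (one call = one cycle)."""

    def __init__(self):
        self.acc0 = 0  # accumulator 34a
        self.acc1 = 0  # accumulator 34b

    def step(self, c, P, Q, W0=0, W1=0, W2=0, W3=0, W4=0):
        m30a = W1 * P                          # multiplier 30a
        m30b = Q * (W4 if c.S2 else W2)        # multiplier 30b, mux 36a
        mux36b = P if c.S0 else m30a
        mux36c = mux36b if c.S3 else P
        mux36d = W0 if c.S3 else W4
        mux36e = mux36d if c.S3 else W3
        m30c = mux36c * mux36e                 # multiplier 30c
        mux36f = m30b if c.S1 else Q
        m30d = mux36f * mux36d                 # multiplier 30d
        mux36g = mux36b if c.S7 else m30c
        mux36h = m30b if c.S6 else Q
        in_a = mux36h if c.G0 else 0           # AND gate 38a
        as32a = mux36g - in_a if c.SUB0 else mux36g + in_a
        mux36i = m30d if c.S5 else m30c
        mux36j = m30b if c.S4 else m30d
        in_b = mux36j if c.G1 else 0           # AND gate 38b
        as32b = mux36i - in_b if c.SUB1 else mux36i + in_b
        if c.EN:
            self.acc0 += as32a
            self.acc1 += as32b
            return self.acc0, self.acc1
        return as32a, as32b                    # assumed pass-through when EN = 0


# Example settings derived from the description (assumptions, not table values).
BUTTERFLY = Ctrl(S0=1, S3=1, S7=1, SUB1=1)     # X0 = P + Q, X1 = P*W0 - Q*W0
SCALAR_PRODUCT = Ctrl(S6=1, S7=1, EN=1)        # X0 = ACC(P*W1 + Q*W2), X1 = ACC(P*W3 + Q*W4)
```

Under these assumptions a single set of multipliers, adder-subtractors and accumulators yields different functions purely by changing the control word, which is the point of the configurable interconnection.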

Fig. 4 is an exemplary circuit block diagram of other embodiments of the computing circuitry 20 that may be included in one or more embodiments of the processing element 160.

One or more embodiments as shown in fig. 4 may include the same arrangement of processing resources and multiplexer circuits as discussed with reference to fig. 3, with the addition of two circuits 40a, 40b configured to compute an activation non-linear function (ANLF) and corresponding multiplexer circuits 36k, 36m.

By way of example and with reference to fig. 4, the additional processing resources 40a, 40b may be arranged as described below.

The first ANLF circuit 40a may receive an input signal from the output of the first accumulator 34a. The eleventh multiplexer 36k may provide the first output signal X0 of the calculation circuit 20 by selecting either the output signal from the first accumulator 34a (e.g., if S9 = 0) or the output signal from the first ANLF circuit 40a (e.g., if S9 = 1) according to the respective control signal S9.

The second ANLF circuit 40b may receive an input signal from the output of the second accumulator 34b. The twelfth multiplexer 36m may provide the second output signal X1 of the calculation circuit 20 by selecting either the output signal from the second accumulator 34b (e.g., if S9 = 0) or the output signal from the second ANLF circuit 40b (e.g., if S9 = 1) according to the respective control signal S9.

Thus, in one or more embodiments as shown in fig. 4, ANLF circuits 40a and 40b may be "bypassed" by multiplexer circuits 36k and 36m, thereby providing operation similar to the embodiment shown in fig. 3.
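The output selection performed by the multiplexers 36k, 36m can be sketched as follows. The function name `output_stage` is hypothetical, and the use of the absolute value as the default non-linear function is an assumption made only because it is easy to check; the set of functions actually supported by the ANLF circuits is not reproduced here.

```python
from typing import Callable, Union

Number = Union[float, complex]


def output_stage(acc_out: Number, s9: int,
                 anlf: Callable[[Number], float] = abs) -> Number:
    """Select the accumulator output (S9 = 0) or the ANLF output (S9 = 1)."""
    return anlf(acc_out) if s9 else acc_out
```

With S9 = 0 the stage is transparent, so the circuit behaves like the embodiment without ANLF circuits.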

Thus, as shown with reference to figs. 3 and 4, the data paths in the computational circuitry 20 may be configured to support parallel computations and may facilitate the execution of different functions. In one or more embodiments, the internal pipeline may be designed to meet timing constraints (e.g., clock frequency) while keeping latency to a minimum.

In the following, various non-limiting examples of possible configurations of the computing circuit 20 are provided. In each example, the calculation circuitry 20 is configured to calculate an algorithm-dependent (basic) function.

In the first example, the configuration of the calculation circuit 20 for executing a Fast Fourier Transform (FFT) algorithm is described.

In the case where the hardware accelerator circuitry 16 is required to compute an FFT algorithm, a single processing element 160 may be programmed (e.g., using signals from the internal control unit 161) to implement a radix-2 DIF (decimation in frequency) butterfly algorithm, performing the following complex operations:

X0 = P + Q

X1 = P*W0 - Q*W0

where W0 may be a twiddle factor stored in the local read-only memory 165.
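To show how the butterfly operations above build a complete transform, the following Python sketch applies them in a standard iterative radix-2 decimation-in-frequency FFT. The function names are illustrative, the twiddle factors are computed on the fly rather than read from a ROM, and the final bit-reversal reordering is part of the textbook algorithm, not something the text attributes to the processing element.

```python
import cmath
from typing import List, Sequence


def butterfly(p: complex, q: complex, w0: complex):
    """Radix-2 DIF butterfly: X0 = P + Q, X1 = P*W0 - Q*W0."""
    return p + q, p * w0 - q * w0


def fft_dif(x: Sequence[complex]) -> List[complex]:
    """Radix-2 DIF FFT; the length of x must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = list(x)
    span = n // 2
    while span >= 1:
        for start in range(0, n, 2 * span):
            for k in range(span):
                # Twiddle factor for a sub-transform of size 2*span.
                w0 = cmath.exp(-2j * cmath.pi * k / (2 * span))
                i, j = start + k, start + k + span
                data[i], data[j] = butterfly(data[i], data[j], w0)
        span //= 2
    # DIF leaves the results in bit-reversed order; restore natural order.
    bits = n.bit_length() - 1
    out = [0j] * n
    for i, value in enumerate(data):
        rev = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        out[rev] = value
    return out
```

Each stage consists of n/2 independent butterflies, which is why several processing elements may work on the same stage in parallel.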

In the present first example, the input signals (P, Q, W0, W1, W2, W3, W4) and the output signals (X0, X1) may be complex data types.

Optionally, in order to reduce the effect of discontinuities at the edges of the data blocks of the computed FFT algorithm on the spectrum, a window function may be applied to the input data prior to computing the FFT algorithm. Processing elements 160 may support such windowing, for example, by using four multiplier circuits.

Alternatively, the modulus or phase of the spectral components may be used in place of the complex values (e.g., in radar target detection and similar applications). In this case, the internal (optional) ANLF circuits may be used during the last FFT stage. For example, the input complex vector may be rotated to align with the x-axis in order to compute the modulus.

Table III, provided below, summarizes some exemplary configurations of the computation circuit 20 for computing different radix-2 algorithms.

TABLE III

Thus, the data flow corresponding to the function "radix-2 butterfly algorithm" exemplified above may be:

X0 = P + Q

X1 = P*W0 - Q*W0

The data flow corresponding to the function "radix-2 butterfly algorithm + window" exemplified above may be:

X0 = W1*P + W2*Q

X1 = (W1*P)*W0 - (W2*Q)*W0

The data flow corresponding to the function "radix-2 butterfly + modulus" exemplified above may be:

X0 = abs(P + Q)

X1 = abs(P*W0 - Q*W0)
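The three data flows above differ only in what is applied before and after the basic butterfly. A minimal Python sketch, with illustrative function names, makes this explicit: the windowed variant is the plain butterfly applied to pre-scaled inputs, and the modulus variant is the plain butterfly followed by an absolute value.

```python
def butterfly(p, q, w0):
    """X0 = P + Q, X1 = P*W0 - Q*W0."""
    return p + q, p * w0 - q * w0


def butterfly_window(p, q, w0, w1, w2):
    """X0 = W1*P + W2*Q, X1 = (W1*P)*W0 - (W2*Q)*W0."""
    a, b = w1 * p, w2 * q  # window coefficients applied first
    return a + b, a * w0 - b * w0


def butterfly_modulus(p, q, w0):
    """X0 = abs(P + Q), X1 = abs(P*W0 - Q*W0)."""
    x0, x1 = butterfly(p, q, w0)
    return abs(x0), abs(x1)
```

This matches the resource counts that follow: the window costs two extra multiplications, while the modulus costs two non-linear function evaluations.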

In the first example considered herein, a configuration corresponding to "radix-2 butterfly algorithm" may involve the use of two multiplier circuits and two adder-subtractor circuits, with no accumulator and no ANLF circuit.

In the first example considered herein, a configuration corresponding to "radix-2 butterfly algorithm + window" may involve the use of four multiplier circuits and two adder-subtractor circuits, with no accumulator and no ANLF circuit.

In the first example considered herein, a configuration corresponding to "radix-2 butterfly + modulus" may involve the use of two multiplier circuits, two adder-subtractor circuits and two ANLF circuits, with no accumulator.

In the second example, the configuration of the calculation circuit 20 for performing a scalar product of complex data vectors is described.

The hardware accelerator circuitry 16 may be required to compute the scalar product of complex data vectors. This may be the case, for example, in applications involving filtering operations, such as phased array radar systems that include a processing stage known as beamforming. Beamforming may help a radar system to resolve targets in angle (azimuth), in addition to range and radial velocity.

In the present second example, the input signals (P, Q, W0, W1, W2, W3, W4) and the output signals (X0, X1) may be complex data types.

In this second example, two different scalar vector product operations (e.g., beamforming operations) may be performed simultaneously by a single processing element 160 (e.g., by utilizing all internal hardware resources).

During beamforming operations, the local configurable coefficient memory 167 may be used to store phase shifts for different array antenna elements.

Similar to the first example, in this second example the ANLF circuits may optionally be used if a modulus rather than a complex value is to be calculated.

Table IV provided below illustrates a possible configuration of the calculation circuit 20 for calculating a scalar product of two vectors simultaneously.

TABLE IV

Thus, the data flow corresponding to the function "scalar product of vectors" exemplified above may be:

X0 = ACC(P*W1 + Q*W2)

X1 = ACC(P*W3 + Q*W4)

The data flow corresponding to the function "scalar product of vectors + modulus" exemplified above may be:

X0 = abs(ACC(P*W1 + Q*W2))

X1 = abs(ACC(P*W3 + Q*W4))
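A minimal Python sketch of these data flows is given below. It assumes that P and Q carry two consecutive elements of the same input vector on each cycle, and that W1, W2 and W3, W4 carry the matching elements of two different coefficient vectors (for instance, the phase shifts of two beams). The function name and this pairing of elements are assumptions for illustration.

```python
from typing import Sequence, Tuple


def dual_scalar_product(x: Sequence[complex],
                        w_a: Sequence[complex],
                        w_b: Sequence[complex],
                        modulus: bool = False) -> Tuple:
    """Two scalar products of x computed together, two elements per cycle."""
    if not (len(x) == len(w_a) == len(w_b)) or len(x) % 2:
        raise ValueError("vectors must have the same, even length")
    acc0 = 0j  # accumulator 34a
    acc1 = 0j  # accumulator 34b
    for n in range(0, len(x), 2):
        p, q = x[n], x[n + 1]
        acc0 += p * w_a[n] + q * w_a[n + 1]  # P*W1 + Q*W2
        acc1 += p * w_b[n] + q * w_b[n + 1]  # P*W3 + Q*W4
    if modulus:
        return abs(acc0), abs(acc1)          # optional ANLF stage
    return acc0, acc1
```

A vector of length L is thus consumed in L/2 cycles while producing two results, which is what allows two beamforming operations to share one pass over the input data.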

In the second example considered herein, a configuration corresponding to "scalar product of vectors" may involve the use of four multiplier circuits, two adder-subtractor circuits and two accumulators, with no ANLF circuit.

In the second example considered herein, a configuration corresponding to "scalar product of vectors + modulus" may involve the use of four multiplier circuits, two adder-subtractor circuits, two accumulators, and two ANLF circuits.

In the third example, the configuration of the calculation circuit 20 for performing a scalar product of real number data vectors is described.

The hardware accelerator circuit 16 may be required to compute the scalar product of real data vectors on large real data structures, e.g., for computing digital filters. For example, in many applications, real world (e.g., analog) signals may be filtered after being digitized in order to extract (only) relevant information.

In the digital domain, the convolution operation between the input signal and the impulse response of a finite impulse response (FIR) filter may take the form of a scalar product of two real data vectors. One of the two vectors may hold the input data and the other may hold the coefficients defining the filtering operation.

In the present third example, the input signals (P, Q, W0, W1, W2, W3, W4) and the output signals (X0, X1) are real data types.

In this third example, two different filtering operations may be performed simultaneously on the same data set by a single processing element 160, for example by processing four different input data per clock cycle using all internal hardware resources.

Table V provided below illustrates a possible configuration of the calculation circuit 20 for calculating two filtering operations simultaneously on real data vectors.

TABLE V

Thus, the data flow corresponding to the function exemplified above may be as follows, where the subscript "h" denotes the MSB part and the subscript "l" denotes the LSB part:

X0_h = ACC(P_h*W1_h + Q_h*W2_h)

X0_l = ACC(P_l*W1_l + Q_l*W2_l)

X1_h = ACC(P_h*W3_h + Q_h*W4_h)

X1_l = ACC(P_l*W3_l + Q_l*W4_l)
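The four accumulations above can be sketched in Python as follows. The packing order (four consecutive real samples mapped to P_h, P_l, Q_h, Q_l on each cycle) and the function name are assumptions for illustration. The text also does not state how the "h" and "l" partial results are combined; summing them to obtain a complete scalar product, as in the usage note, is likewise an assumption.

```python
from typing import Dict, Sequence


def dual_real_mac(x: Sequence[float],
                  coeff_a: Sequence[float],
                  coeff_b: Sequence[float]) -> Dict[str, float]:
    """Four real samples per cycle, accumulated against two coefficient sets."""
    if not (len(x) == len(coeff_a) == len(coeff_b)) or len(x) % 4:
        raise ValueError("vectors must have the same length, a multiple of 4")
    acc = {"X0_h": 0.0, "X0_l": 0.0, "X1_h": 0.0, "X1_l": 0.0}
    for n in range(0, len(x), 4):
        p_h, p_l, q_h, q_l = x[n:n + 4]              # assumed packing of P, Q
        w1_h, w1_l, w2_h, w2_l = coeff_a[n:n + 4]    # W1, W2
        w3_h, w3_l, w4_h, w4_l = coeff_b[n:n + 4]    # W3, W4
        acc["X0_h"] += p_h * w1_h + q_h * w2_h
        acc["X0_l"] += p_l * w1_l + q_l * w2_l
        acc["X1_h"] += p_h * w3_h + q_h * w4_h
        acc["X1_l"] += p_l * w3_l + q_l * w4_l
    return acc
```

Under the stated assumption, `acc["X0_h"] + acc["X0_l"]` equals the scalar product of `x` and `coeff_a`, and the same holds for `coeff_b` with the X1 lanes.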

In the third example considered herein, a configuration corresponding to "scalar product of real vectors" may involve the use of four multiplier circuits, two adder-subtractor circuits and two accumulators, with no ANLF circuit.

In the fourth example, the configuration of the calculation circuit 20 for calculating the nonlinear function is described.

Multilayer perceptrons (MLPs) are a class of fully connected feedforward artificial neural networks that may include at least three layers of nodes/neurons. Except for the neurons in the input layer, each neuron computes a weighted sum of all the nodes of the previous layer and then applies a non-linear activation function to the result. A processing element 160 as disclosed herein may compute such non-linear functions, for example, using the internal ANLF circuits. Typically, neural networks process data from the real world and use real weights and functions to compute class membership probabilities (the outputs of the last layer). Thus, for such artificial networks, the scalar product of real data vectors may be the most computationally demanding and most frequently used operation.

Fig. 5 is an exemplary diagram of the general structure of a multi-layer perceptron network 50.

As shown in FIG. 5, the multi-layer perceptron network 50 may include an input layer 50a comprising N inputs U¹, …, U^N (Uⁱ, i = 1, …, N), a hidden layer 50b comprising M hidden nodes X¹, …, X^M (X^k, k = 1, …, M), and an output layer 50c comprising P output nodes Y¹, …, Y^P (Y^j, j = 1, …, P).

It should be understood that in one or more embodiments, a multi-layer perceptron network may include more than one hidden layer 50 b.

As shown in FIG. 5, the multi-layer perceptron network 50 may include a first set of N x M weights W^i,k between the inputs U¹, …, U^N and the hidden nodes X¹, …, X^M, and a second set of M x P weights W^k,j between the hidden nodes X¹, …, X^M and the output nodes Y¹, …, Y^P.

The values stored in the inputs Uⁱ, the hidden nodes X^k and the output nodes Y^j may be computed, for example, using single-precision floating-point MAC operations.

The values of the hidden nodes X^k and of the output nodes Y^j may be calculated according to the following equations:
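The equations themselves are not reproduced in this text. A form consistent with the description above (a weighted sum over the previous layer followed by a non-linear activation function), written with the same symbols and assuming no bias terms and the same activation function f in both layers, would be:

```latex
X^{k} = f\!\left(\sum_{i=1}^{N} W^{i,k}\, U^{i}\right), \qquad k = 1, \dots, M

Y^{j} = f\!\left(\sum_{k=1}^{M} W^{k,j}\, X^{k}\right), \qquad j = 1, \dots, P
```

Each sum is a scalar product of real data vectors, which is why the accumulating data path followed by an ANLF circuit fits this computation.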

In this fourth example, the trained real weights associated with all the edges of the MLP may be stored in the local configurable coefficient memory 167. The real layer inputs may be retrieved from the local data memory of the hardware accelerator circuit 16 (e.g., the local data memory banks M₀, …, M_Q-1), and the real layer outputs may be stored into the local data memory of the hardware accelerator circuit 16.

When the MLP model is mapped onto the hardware accelerator circuit 16, each processing element 160 included therein (e.g., P processing elements) may be used to compute the scalar products and activation function outputs associated with two different neurons of the same layer, e.g., processing four edges per clock cycle. Thus, all the processing elements 160₀, 160₁, …, 160_P-1 may be used simultaneously.

Table VI provided below illustrates a possible configuration of the computation circuit 20 for computing two activation function outputs associated with two different neurons simultaneously.

TABLE VI

Thus, the data flow corresponding to the function exemplified above may be as follows, where the subscript "h" denotes the MSB part and the subscript "l" denotes the LSB part:

X0_h = f(ACC(P_h*W1_h + Q_h*W2_h))

X0_l = f(ACC(P_l*W1_l + Q_l*W2_l))

X1_h = f(ACC(P_h*W3_h + Q_h*W4_h))

X1_l = f(ACC(P_l*W3_l + Q_l*W4_l))
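The following Python sketch follows these equations literally for one output (X0); X1 is obtained in the same way with W3, W4 in place of W1, W2. The function names, the representation of each signal as a sequence of (h, l) pairs (one pair per clock cycle) and the choice of a rectifier as the example activation function are assumptions made for illustration, and the sketch does not take a position on how the lanes map to neurons.

```python
from typing import Callable, Sequence, Tuple

Lanes = Tuple[float, float]  # (h, l) values carried in one cycle


def relu(v: float) -> float:
    """Example activation function (an assumption, chosen for simplicity)."""
    return v if v > 0.0 else 0.0


def mac_with_activation(p: Sequence[Lanes], q: Sequence[Lanes],
                        w_a: Sequence[Lanes], w_b: Sequence[Lanes],
                        f: Callable[[float], float]) -> Lanes:
    """Return (f(ACC(P_h*Wa_h + Q_h*Wb_h)), f(ACC(P_l*Wa_l + Q_l*Wb_l)))."""
    acc_h = 0.0
    acc_l = 0.0
    for (p_h, p_l), (q_h, q_l), (a_h, a_l), (b_h, b_l) in zip(p, q, w_a, w_b):
        acc_h += p_h * a_h + q_h * b_h
        acc_l += p_l * a_l + q_l * b_l
    # The non-linear function is applied once, after the accumulation ends.
    return f(acc_h), f(acc_l)
```

The activation is evaluated only on the final accumulated value, so the ANLF circuit adds one evaluation per output rather than one per input sample.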

In the fourth example considered herein, a configuration corresponding to the function "MLP computation engine" (which may include computing two scalar products of vectors and applying a non-linear activation function thereto) may involve the use of four multiplier circuits, two adder-subtractor circuits, two accumulators, and two ANLF circuits.

Table VII provided below illustrates non-linear functions that may be implemented in one or more embodiments. Some functions, denoted by "algorithm NN", may be used exclusively in the context of neural networks.

TABLE VII

Thus, one or more embodiments of a hardware accelerator circuit 16 including at least one computing circuit 20 as exemplified herein and/or in the examples above may facilitate implementing a digital signal processing system having one or more of the following advantages: flexibility (e.g., the ability to handle different types of algorithms); improved usage of hardware resources; improved performance of parallel computations; extended connectivity and high bandwidth from each processing element 160 to the local data memory banks M₀, …, M_Q-1 and/or to the system memory 12, by means of a simple local interconnection network 162 and internal direct memory access controllers 168₀, 168₁, …, 168_P-1; and support for additional algorithms through a scalable architecture that integrates different processing elements.

In one or more embodiments, the electronic system 1 may be implemented as an integrated circuit on a single silicon chip (e.g., as a system on a chip). Alternatively, the electronic system 1 may be a distributed system comprising a plurality of integrated circuits interconnected together, for example by means of a printed circuit board (PCB).

As shown herein, a circuit (e.g., 160) may include a set of input terminals configured to receive a set of input digital signals (e.g., P, Q, W0, W1, W2, W3, W4) carrying input data; a set of output terminals configured to provide a set of output digital signals (e.g., X0, X1) carrying output data; and computing circuitry (e.g., 20) configured to generate output data from the input data. The computing circuitry may include: a set of multiplier circuits (e.g., 30a, 30b, 30c, 30d), a set of adder-subtractor circuits (e.g., 32a, 32b), a set of accumulator circuits (e.g., 34a, 34b), and a configurable interconnection network (e.g., 36a, …, 36j) configured to selectively couple (e.g., S1, …, S7) the multiplier circuits, the adder-subtractor circuits, the accumulator circuits, the input terminals, and the output terminals in at least two processing configurations.

As shown herein, in a first processing configuration, the computing circuitry may be configured to compute the output data according to a first set of functions, and in at least one second processing configuration, the computing circuitry may be configured to compute the output data according to a corresponding second set of functions, the corresponding second set of functions being different from the first set of functions.

As shown herein, the circuit may include respective configurable read direct memory access controllers (e.g., 200₀, 200₁) coupled to a first subset of the set of input terminals to receive (e.g., 162, 163) a respective first subset of the input digital signals carrying a first subset of the input data (e.g., P, Q). The configurable read direct memory access controllers may be configured to control fetching of the first subset of the input data from a memory (e.g., M₀, …, M_Q-1).

As shown herein, the circuit may include respective configurable write direct memory access controllers (e.g., 204₀, 204₁) coupled to the set of output terminals to provide the output digital signals carrying the output data. The configurable write direct memory access controllers may be configured to control storage of the output data into the memory.

As shown herein, the circuit may include respective input buffer registers (e.g., 202₀, 202₁) coupled to the configurable read direct memory access controllers, and respective output buffer registers (e.g., 206₀, 206₁) coupled to the configurable write direct memory access controllers.

As shown herein, the circuitry may include ROM address generator circuitry (e.g., 208) configured to control retrieval of a second subset of input data (e.g., W0) from at least one read-only memory (e.g., 164, 165) via the second subset of input digital signals, and/or memory address generator circuitry (e.g., 210) configured to control retrieval of a third subset of input data (e.g., W1, W2, W3, W4) from at least one configurable memory (e.g., 166, 167) via the third subset of input digital signals.

As shown herein, in a circuit according to an embodiment, the set of multiplier circuits may include a first multiplier circuit (e.g., 30a), a second multiplier circuit (e.g., 30b), a third multiplier circuit (e.g., 30c), and a fourth multiplier circuit (e.g., 30d). The set of adder-subtractor circuits may include a first adder-subtractor circuit (e.g., 32a) and a second adder-subtractor circuit (e.g., 32b). The set of accumulator circuits may include a first accumulator circuit (e.g., 34a) and a second accumulator circuit (e.g., 34b).

As shown herein, the first multiplier circuit may receive a first input signal (e.g., W1) of the set of input digital signals as a first operand and a second input signal (e.g., P) of the set of input digital signals as a second operand. The second multiplier circuit may receive a third input signal (e.g., Q) of the set of input digital signals as a first operand and, as a second operand, a signal selectable between a fourth input signal (e.g., W2) and a fifth input signal (e.g., W4) of the set of input digital signals. The third multiplier circuit may receive, as a first operand, a signal selectable between the output signal from the first multiplier circuit and the second input signal and, as a second operand, a signal selectable from among a sixth input signal (e.g., W3), a seventh input signal (e.g., W0), and the fifth input signal. The fourth multiplier circuit may receive, as a first operand, a signal selectable between the output signal from the second multiplier circuit and the third input signal and, as a second operand, a signal selectable between the fifth input signal and the seventh input signal.

The first adder-subtractor circuit may receive, as a first operand, a signal selectable from among the output signal from the first multiplier circuit, the second input signal, and the output signal from the third multiplier circuit and, as a second operand, a signal selectable from among the third input signal, the output signal from the second multiplier circuit, and a zero signal. The second adder-subtractor circuit may receive, as a first operand, a signal selectable between the output signal from the third multiplier circuit and the output signal from the fourth multiplier circuit and, as a second operand, a signal selectable from among the output signal from the fourth multiplier circuit, the output signal from the second multiplier circuit, and a zero signal.

The first accumulator circuit may receive as input the output signal from the first adder-subtractor circuit, and the second accumulator circuit may receive as input the output signal from the second adder-subtractor circuit. The first accumulator circuit may be selectively activated (e.g., EN) to provide a first output signal (e.g., X0), and the second accumulator circuit may be selectively activated to provide a second output signal (e.g., X1).
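
The operand selections described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the circuit itself: the field names (`m2_b`, `as1_a`, and so on), the `Config` and `Datapath` classes, the per-unit add/subtract flags, and the reading of the enable signal as gating the accumulator update are all hypothetical choices made for illustration. Signal names W0 to W4, P, Q, X0, and X1 follow the text.

```python
from dataclasses import dataclass

# Allowed sources for each selectable operand, as listed in the text.
# The keys are hypothetical labels introduced for this sketch.
ALLOWED = {
    "m2_b": ("W2", "W4"),           # second multiplier, second operand
    "m3_a": ("M1", "P"),            # third multiplier, first operand
    "m3_b": ("W3", "W0", "W4"),     # third multiplier, second operand
    "m4_a": ("M2", "Q"),            # fourth multiplier, first operand
    "m4_b": ("W4", "W0"),           # fourth multiplier, second operand
    "as1_a": ("M1", "P", "M3"),     # first adder-subtractor, first operand
    "as1_b": ("Q", "M2", "ZERO"),   # first adder-subtractor, second operand
    "as2_a": ("M3", "M4"),          # second adder-subtractor, first operand
    "as2_b": ("M4", "M2", "ZERO"),  # second adder-subtractor, second operand
}


@dataclass
class Config:
    m2_b: str = "W2"
    m3_a: str = "M1"
    m3_b: str = "W3"
    m4_a: str = "M2"
    m4_b: str = "W4"
    as1_a: str = "M1"
    as1_b: str = "M2"
    as2_a: str = "M3"
    as2_b: str = "M4"
    as1_sub: bool = False  # assumed control: True subtracts, False adds
    as2_sub: bool = False

    def __post_init__(self):
        # Reject selections that the described multiplexers do not offer.
        for name, choices in ALLOWED.items():
            if getattr(self, name) not in choices:
                raise ValueError(f"{name} must be one of {choices}")


class Datapath:
    def __init__(self):
        self.acc0 = 0  # first accumulator, provides X0
        self.acc1 = 0  # second accumulator, provides X1

    def step(self, inputs, cfg, en0=True, en1=True):
        s = dict(inputs, ZERO=0)
        s["M1"] = s["W1"] * s["P"]
        s["M2"] = s["Q"] * s[cfg.m2_b]
        s["M3"] = s[cfg.m3_a] * s[cfg.m3_b]
        s["M4"] = s[cfg.m4_a] * s[cfg.m4_b]
        a, b = s[cfg.as1_a], s[cfg.as1_b]
        as1 = a - b if cfg.as1_sub else a + b
        a, b = s[cfg.as2_a], s[cfg.as2_b]
        as2 = a - b if cfg.as2_sub else a + b
        # Assumption: an accumulator only updates while its enable is set.
        if en0:
            self.acc0 += as1
        if en1:
            self.acc1 += as2
        return self.acc0, self.acc1


dp = Datapath()
cfg = Config(as1_a="M1", as1_b="M2", as2_a="M3", as2_b="ZERO")
x = {"W0": 1, "W1": 2, "W2": 5, "W3": 7, "W4": 1, "P": 3, "Q": 4}
x0, x1 = dp.step(x, cfg)  # x0 = W1*P + W2*Q, x1 = (W1*P)*W3
```

With this selection, the first accumulator sums W1·P + W2·Q over successive steps, while the second accumulates the chained product (W1·P)·W3. Changing only the `Config` fields yields a different function from the same arithmetic units, which is the point of the selectable operands.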

As shown herein, the computation circuitry may include a set of circuits (e.g., 40a, 40b) configured to compute a non-linear function.

As shown herein, the set of circuits configured to compute a non-linear function may include a first circuit (e.g., 40a) configured to compute a non-linear function and a second circuit (e.g., 40b) configured to compute a non-linear function. The first circuit configured to compute a non-linear function may receive as an input the output signal from the first accumulator circuit. The second circuit configured to compute a non-linear function may receive as an input the output signal from the second accumulator circuit. The first output signal may be selectable (e.g., 36k) between the output signal from the first accumulator circuit and the output signal from the first circuit configured to compute a non-linear function, and the second output signal may be selectable (e.g., 36m) between the output signal from the second accumulator circuit and the output signal from the second circuit configured to compute a non-linear function.
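
The output selection just described amounts to a two-way multiplexer per output. The following minimal sketch models it; the function names are hypothetical, and the rectifier used here is only a stand-in, since this passage does not say which non-linear function the circuits 40a and 40b compute.

```python
def relu(x):
    """Hypothetical stand-in for the unspecified non-linear function."""
    return x if x > 0 else 0


def select_output(acc_value, use_nonlinear, nonlinear=relu):
    """Model of an output multiplexer such as 36k or 36m.

    Returns either the accumulator output unchanged or the same value
    passed through the non-linear function.
    """
    return nonlinear(acc_value) if use_nonlinear else acc_value


x0 = select_output(-5, use_nonlinear=True)   # non-linear path selected
x1 = select_output(-5, use_nonlinear=False)  # accumulator output passed through
```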

As shown herein, a device (e.g., 16) may include a set of circuits according to one or more embodiments, a set of data memory banks (e.g., M_0, …, M_Q-1), and a control unit (e.g., 161). The circuits may be configurable (e.g., 161, 168), according to configuration data stored in the control unit, to read data from and write data to the data memory banks via an interconnect network (e.g., 162, 163).

As shown herein, the data memory bank may include buffer registers, preferably double buffer registers.
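
This passage does not detail how the double buffer registers operate. As a general illustration of double buffering (not a description of the disclosed memory banks), the sketch below keeps two storage halves so that one can be read while the other is being filled. The class and method names are hypothetical.

```python
class DoubleBuffer:
    """Two storage halves: one readable, one writable, swapped on demand."""

    def __init__(self, size):
        self._banks = [[0] * size, [0] * size]
        self._front = 0  # index of the half currently visible to readers

    def write(self, index, value):
        # Writes always go to the hidden half.
        self._banks[1 - self._front][index] = value

    def read(self, index):
        # Reads always come from the visible half.
        return self._banks[self._front][index]

    def swap(self):
        # Make the freshly written half visible to readers.
        self._front = 1 - self._front


buf = DoubleBuffer(2)
buf.write(0, 9)
before = buf.read(0)  # still the old content of the visible half
buf.swap()
after = buf.read(0)   # the value written before the swap
```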

As shown herein, a system (e.g., 1) may include a device according to one or more embodiments and a processing unit (e.g., 10) coupled to the device via a system interconnect (e.g., 18). The circuits in the set of circuits of the device may be configurable in the at least two processing configurations according to control signals received from the processing unit.

As shown herein, a method of operating a circuit according to one or more embodiments, a device according to one or more embodiments, or a system according to one or more embodiments may include dividing the operating time of the computing circuitry into at least a first operating interval and a second operating interval, wherein the computing circuitry operates in a first processing configuration during the first operating interval and in at least one second processing configuration during the second operating interval.
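
The time division described above can be sketched as a schedule of intervals, each run in one processing configuration. This is an illustrative model only: `run_schedule`, the configuration names, the two example functions, and the choice to restart the accumulation at each interval are assumptions, not details taken from the disclosure.

```python
def run_schedule(schedule, configurations, data):
    """Run each interval of the schedule in its processing configuration.

    schedule: list of (configuration name, number of cycles) pairs.
    configurations: maps a name to a function of (p, q, w1, w2).
    data: iterator yielding one (p, q, w1, w2) tuple per cycle.
    """
    results = []
    for name, n_cycles in schedule:
        compute = configurations[name]
        acc = 0  # assumption: accumulation restarts in each interval
        for _ in range(n_cycles):
            p, q, w1, w2 = next(data)
            acc += compute(p, q, w1, w2)
        results.append((name, acc))
    return results


# Two hypothetical configurations built from the same multipliers and
# adder-subtractor: a sum of products and a difference of products.
configurations = {
    "first": lambda p, q, w1, w2: w1 * p + w2 * q,
    "second": lambda p, q, w1, w2: w1 * p - w2 * q,
}
samples = iter([(1, 2, 3, 4), (2, 1, 1, 1), (5, 5, 2, 1)])
schedule = [("first", 2), ("second", 1)]
results = run_schedule(schedule, configurations, samples)
```

Here the first interval spans two cycles in the first configuration and the second interval spans one cycle in the second configuration, so the same hardware resources serve both functions at different times.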

Without prejudice to the underlying principles, the details and the embodiments may vary, even significantly, with respect to what has been described purely by way of example, without thereby departing from the scope of protection.

The scope of protection is defined by the appended claims.

While the invention has been described with reference to exemplary embodiments, this description is not intended to be construed in a limiting sense. Various modifications and combinations of the illustrative embodiments, as well as other embodiments of the invention, will be apparent to persons skilled in the art upon reference to the description. It is therefore intended that the appended claims encompass any such modifications or embodiments.
