Systems, methods, and apparatus for near data processing

Document No.: 228426    Publication date: 2021-11-09

Reading note: This technology, "Systems, methods, and apparatus for near data processing" (用于近数据处理的系统、方法和设备), was designed and created by 皇甫文沁, K.T. Malladi, and 姜冬岩 on 2021-05-07. Its main content includes: The memory module may include one or more memory devices, and a near memory computing module coupled to the one or more memory devices; the near memory computing module includes one or more processing elements configured to process data from the one or more memory devices, and a memory controller configured to coordinate access to the one or more memory devices from the host and the one or more processing elements. A method of processing a data set may include: distributing a first portion of the data set to a first memory module, distributing a second portion of the data set to a second memory module, building a first local data structure at the first memory module based on the first portion of the data set, building a second local data structure at the second memory module based on the second portion of the data set, and merging the first and second local data structures.

1. A memory module, comprising:

one or more memory devices; and

a near memory computation module coupled to the one or more memory devices, the near memory computation module comprising:

one or more processing elements configured to process data from the one or more memory devices; and

a memory controller configured to coordinate access to the one or more memory devices from a host and the one or more processing elements.

2. The memory module of claim 1, wherein at least one of the one or more processing elements is configured to process data from the one or more memory devices by performing a count operation on the data.

3. The memory module of claim 1, wherein the near memory computation module is configured to control a first memory device of the one or more memory devices based on a first chip select signal and to control a second memory device of the one or more memory devices based on a second chip select signal.

4. The memory module of claim 1, wherein the one or more memory devices and the near memory computation module are arranged in a first rank, the memory module further comprising:

a second rank; and

a hierarchical bus structure configured to transfer data between the first rank and the second rank.

5. The memory module of claim 4, wherein:

the one or more memory devices include one or more first memory devices, the near memory computation module includes a first near memory computation module, and the one or more processing elements include one or more first processing elements; and

the second rank comprises:

one or more second memory devices; and

a second near memory computation module coupled to the one or more second memory devices, the second near memory computation module comprising:

one or more second processing elements configured to process data from the one or more second memory devices; and

a second memory controller configured to coordinate access to the one or more second memory devices from a host and the one or more second processing elements.

6. The memory module of claim 1, wherein:

the memory module further includes a hierarchical bus structure; and

the near memory computing module further comprises:

an input buffer coupled between the hierarchical bus structure and the one or more processing elements; and

an output buffer coupled between the hierarchical bus structure and the one or more processing elements.

7. The memory module of claim 1, wherein the near memory computing module further comprises a workload monitor configured to balance a first workload of a first of the one or more processing elements and a second workload of a second of the one or more processing elements.

8. A method of processing a data set, the method comprising:

distributing a first portion of the data set to a first memory module;

distributing a second portion of the data set to a second memory module;

building a first local data structure at the first memory module based on the first portion of the data set;

building a second local data structure at the second memory module based on the second portion of the data set; and

merging the first local data structure and the second local data structure.

9. The method of claim 8, wherein:

merging the first local data structure and the second local data structure forms a merged data structure; and

the method further comprises performing a count operation on the merged data structure at the first memory module and the second memory module.

10. The method of claim 8, wherein:

merging the first local data structure and the second local data structure comprises simplifying the first local data structure and the second local data structure;

simplifying the first local data structure and the second local data structure forms a merged data structure; and

the method further comprises scattering the merged data structure to the first memory module and the second memory module.

11. The method of claim 8, further comprising distributing the first portion of the data set to two or more memory devices at the first memory module.

12. The method of claim 11, further comprising distributing the first portion of the data set to two or more ranks at the first memory module.

13. The method of claim 8, wherein:

building the first local data structure is performed by a first processing element and a second processing element; and

the method further comprises balancing a first workload of the first processing element and a second workload of the second processing element.

14. The method of claim 8, further comprising interleaving memory accesses to the first portion of the data set between a first task and a second task.

15. The method of claim 14, further comprising switching between the first task and the second task between memory accesses to the first portion of the data set.

16. The method of claim 8, wherein merging the first local data structure and the second local data structure forms a first merged data structure, the method further comprising:

scattering the first merged data structure to the first memory module and the second memory module;

building a third local data structure at the first memory module based on the first merged data structure;

building a fourth local data structure at the second memory module based on the first merged data structure;

merging the third local data structure and the fourth local data structure to form a second merged data structure; and

performing a count operation on the second merged data structure at the first memory module and the second memory module.

17. The method of claim 8, wherein:

the data set comprises a gene sequence;

the first local data structure comprises a bloom filter; and

the bloom filter includes one or more k-mers of the gene sequence.

18. A system for processing a data set, comprising:

a first memory module configured to build a first local data structure based on a first portion of a data set;

a second memory module configured to construct a second local data structure based on a second portion of the data set; and

a host coupled to the first memory module and the second memory module by one or more memory channels, wherein the host is configured to:

distribute the first portion of the data set to the first memory module;

distribute the second portion of the data set to the second memory module; and

merge the first local data structure and the second local data structure.

19. The system of claim 18, wherein:

the host is configured to merge the first local data structure and the second local data structure by simplifying the first local data structure and the second local data structure to form a merged data structure; and

the host is further configured to scatter the merged data structure to the first memory module and the second memory module.

20. The system of claim 19, wherein the first memory module is configured to perform a count operation on the merged data structure.

Technical Field

The present disclosure relates generally to data processing, and more particularly, to systems, methods, and devices for near data processing.

Background

Some data processing applications, such as k-mer counting, may involve accessing and/or processing small data units randomly dispersed throughout a data set stored in a working memory space.

The above information disclosed in this Background section is only for enhancement of understanding of the background of the invention, and therefore it may contain information that does not constitute prior art.

Disclosure of Invention

A memory module may include one or more memory devices, and a near memory computing module coupled to the one or more memory devices; the near memory computing module includes one or more processing elements configured to process data from the one or more memory devices, and a memory controller configured to coordinate access to the one or more memory devices from the host and the one or more processing elements. The near memory computation module may be configured to control a first memory device of the one or more memory devices based on a first chip select signal and to control a second memory device of the one or more memory devices based on a second chip select signal. The one or more memory devices and the near memory computation module may be arranged in a first rank, and the memory module may further include a second rank, and a hierarchical bus structure configured to transfer data between the first rank and the second rank. The one or more memory devices may include one or more first memory devices, the near memory computation module may include a first near memory computation module, and the one or more processing elements may include one or more first processing elements, and the second rank may include one or more second memory devices, and a second near memory computation module coupled to the one or more second memory devices, the second near memory computation module including one or more second processing elements configured to process data from the one or more second memory devices, and a second memory controller configured to coordinate access to the one or more second memory devices from the host and the one or more second processing elements. The memory module may further include a hierarchical bus structure, and the near memory computing module may further include an input buffer coupled between the hierarchical bus structure and the one or more processing elements, and an output buffer coupled between the hierarchical bus structure and the one or more processing elements. The near memory computing module may also include a workload monitor configured to balance a first workload of a first processing element of the one or more processing elements and a second workload of a second processing element of the one or more processing elements.

A method of processing a data set may include: distributing a first portion of the data set to a first memory module, distributing a second portion of the data set to a second memory module, building a first local data structure at the first memory module based on the first portion of the data set, building a second local data structure at the second memory module based on the second portion of the data set, and merging the first local data structure and the second local data structure. Merging the first local data structure and the second local data structure may form a merged data structure, and the method may further include performing a count operation on the merged data structure at the first memory module and the second memory module. Merging the first local data structure and the second local data structure may include simplifying (reducing) the first local data structure and the second local data structure. Simplifying the first local data structure and the second local data structure may form a merged data structure, and the method may further include scattering the merged data structure to the first memory module and the second memory module. The method may also include distributing the first portion of the data set to two or more memory devices at the first memory module. The method may also include distributing the first portion of the data set to two or more ranks at the first memory module. The first local data structure may be constructed by a first processing element and a second processing element at the first memory module, and the method may further include balancing a first workload of the first processing element and a second workload of the second processing element. The method may also include interleaving memory accesses of the first portion of the data set between a first task and a second task. The method may also include switching between the first task and the second task between memory accesses of the first portion of the data set. Merging the first local data structure and the second local data structure may form a first merged data structure, and the method may further include scattering the first merged data structure to the first memory module and the second memory module, building a third local data structure at the first memory module based on the first merged data structure, building a fourth local data structure at the second memory module based on the first merged data structure, merging the third local data structure and the fourth local data structure to form a second merged data structure, and performing a count operation on the second merged data structure at the first memory module and the second memory module. The data set may include a gene sequence, the first local data structure may include a bloom filter, and the bloom filter may include one or more k-mers of the gene sequence.

A system may include: a first memory module configured to build a first local data structure based on a first portion of a data set; a second memory module configured to build a second local data structure based on a second portion of the data set; and a host coupled to the first memory module and the second memory module through one or more memory channels, wherein the host may be configured to distribute the first portion of the data set to the first memory module, distribute the second portion of the data set to the second memory module, and merge the first local data structure and the second local data structure. The first local data structure and the second local data structure may be simplified to form a merged data structure, and the host may be further configured to scatter the merged data structure to the first memory module and the second memory module. The first memory module may be configured to perform a count operation on the merged data structure.

Drawings

The figures are not necessarily drawn to scale and, for purposes of illustration, elements of similar structure or function may generally be represented by like reference numerals or portions thereof throughout the figures. The drawings are only intended to facilitate the description of the various embodiments described herein. The drawings do not depict every aspect of the teachings disclosed herein and do not limit the scope of the claims. To prevent the drawings from becoming obscured, not all components, connections, etc. may be shown, and not all components may have reference numerals. However, the manner in which components may be configured can be readily apparent from the drawings. The drawings illustrate example embodiments of the disclosure and, together with the description, serve to explain the principles of the disclosure.

FIG. 1 illustrates an embodiment of a k-mer counting method according to an example embodiment of the present disclosure.

FIG. 2 illustrates an embodiment of a k-mer counting method and data structure according to an example embodiment of the present disclosure.

FIG. 3 illustrates an embodiment of a bloom filter for a k-mer counting method according to an example embodiment of the present disclosure.

FIG. 4 illustrates an embodiment of a near data processing system according to an example embodiment of the present disclosure.

FIG. 5 illustrates an embodiment of a memory module with near data processing capability according to an example embodiment of the present disclosure.

FIG. 6 illustrates an embodiment of a memory module having multi-level and near data processing capabilities according to an example embodiment of the present disclosure.

FIG. 7 illustrates an example embodiment of a near data processing system according to an example embodiment of the present disclosure.

FIG. 8 illustrates an example embodiment of a memory module according to an example embodiment of the present disclosure.

FIG. 9 illustrates a more detailed example embodiment of a rank of a memory module, or a portion of a memory module, according to an example embodiment of the present disclosure.

FIG. 10 illustrates an example embodiment of a processing element according to an example embodiment of the present disclosure.

Fig. 11A illustrates an embodiment of a first bloom filter building operation and a second bloom filter building operation according to an example embodiment of the present disclosure.

Fig. 11B illustrates an embodiment of a first merge operation and a second merge operation according to an example embodiment of the present disclosure.

Fig. 11C illustrates an embodiment of a counting operation according to an example embodiment of the present disclosure.

FIG. 12 illustrates an embodiment of a k-mer counting method using a global dataset according to an example embodiment of the present disclosure.

FIG. 13 illustrates an embodiment of a k-mer counting method workflow according to an example embodiment of the present disclosure.

Fig. 14A illustrates an embodiment of a counting bloom filter build operation, according to an example embodiment of the present disclosure.

Fig. 14B illustrates an embodiment of a merge operation according to an example embodiment of the present disclosure.

Fig. 14C illustrates an embodiment of a counting operation according to an example embodiment of the present disclosure.

FIG. 15A illustrates an embodiment of a merge address mapping scheme for a memory device according to an example embodiment of the present disclosure.

FIG. 15B illustrates another embodiment of a scattered address mapping scheme for a memory device, according to an example embodiment of the present disclosure.

FIG. 16A illustrates an example embodiment of a sequential memory access method of a bloom filter according to an example embodiment of the present disclosure.

FIG. 16B illustrates an example embodiment of a method of scattered memory access by a bloom filter according to an example embodiment of the present disclosure.

Fig. 17 shows an embodiment of a method of processing a data set according to an example embodiment of the present disclosure.

Detailed Description

Overview

Some data processing applications may involve accessing and/or processing many small data units (e.g., one bit) randomly dispersed throughout a large data set stored in a working memory space. This type of fine-grained data access and processing may be inefficient and time consuming when implemented with a Central Processing Unit (CPU) or other processing unit having a large data width (e.g., 32 or 64 bits), particularly when the data is stored remotely from the processor in a memory system having a bus configured to sequentially access large units of data (e.g., a plurality of 64-bit words stored in contiguous storage locations).

A Near Data Processing (NDP) system according to an example embodiment of the present disclosure may include a memory module having one or more processing resources configured to process data at the module, thereby reducing or eliminating data transmission to a host processing unit. In some embodiments, a near data processing system may include one or more features that may facilitate accessing and/or processing fine-grained data units, and/or efficient communication within a memory module.

For example, some embodiments may include one or more near-memory computing (NMC) modules, which may include one or more processing elements configured to process multiple relatively small units of data in parallel and/or to implement one or more particular algorithms or portions thereof. In some embodiments, the near memory compute module may include one or more features, such as a memory controller, a workload monitor, a bus controller, one or more buffers, multiplexers, etc., to support data communication and/or synchronization, implement task scheduling and/or memory access, etc. In some embodiments, a near memory compute module may be included for each rank of memory devices in the memory module.

As another example, some embodiments may implement a hierarchical bus architecture that may enable efficient data transfer between ranks within a module, between memory devices within a rank, and so forth. Some embodiments may implement separate chip select (CS) signals that may enable fine-grained memory access to a single memory device.

A workflow according to an example embodiment of the present disclosure may distribute portions of a data set to multiple memory modules with near data processing capabilities. The memory modules may process their respective portions of the data set to build local data structures, which may then be merged and/or redistributed to the memory modules for further processing.

For example, an embodiment of the workflow may implement a k-mer counting process, wherein each memory module may build a local bloom filter. The data from the bloom filters may then be merged (e.g., by a simplification process) and scattered to the memory modules. The memory modules may then perform a local counting process on their copies of the merged and scattered bloom filter. In some embodiments, the memory modules may implement counting bloom filters, which may facilitate the use of local processing.

Some embodiments may implement a distributed data mapping scheme to distribute data across memory devices, ranks, memory modules, and the like. Depending on implementation details, this may improve memory bandwidth utilization and/or energy utilization, for example, by reducing or eliminating concentration of data within memory devices, ranks, memory modules, and so forth.

Some embodiments may implement a task scheduling scheme based on monitoring the workload of one or more computing resources. For example, a task scheduling scheme may balance the workload between processing elements in a near memory compute module. Depending on implementation details, this may increase utilization of one or more computing resources.

Some embodiments may implement a scattered memory access and/or task switching scheme. For example, memory accesses of one task may be interleaved with memory accesses of one or more other tasks. Depending on implementation details, this may reduce or eliminate accesses to unavailable data. Additionally or alternatively, tasks may be switched between memory accesses, e.g., filling memory access slots with accesses to available data. Depending on implementation details, this may improve memory bandwidth utilization and/or reduce power consumption.
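
For illustration only, the following sketch (in Python) models one way such interleaving and task switching might work: on each access slot, the scheduler prefers to alternate between two tasks and falls back to whichever task has a ready access, so slots are not wasted waiting on unavailable data. The queue format, the "ready" flag, and the alternation policy are assumptions for illustration and are not specified by this disclosure.

```python
from collections import deque

def schedule_slots(task_a: deque, task_b: deque) -> list:
    """Issue one access per slot, alternating between tasks when possible and
    switching to the other task when the preferred one has no ready access."""
    issued, prefer = [], 0
    tasks = [task_a, task_b]
    while task_a or task_b:
        for i in (prefer, 1 - prefer):
            if tasks[i] and tasks[i][0].get("ready", True):
                issued.append((f"task_{i}", tasks[i].popleft()["addr"]))
                prefer = 1 - i          # interleave: try the other task next
                break
        else:
            # No task had a ready access this slot; model the data arriving.
            for t in tasks:
                if t:
                    t[0]["ready"] = True
    return issued

a = deque([{"addr": 0x10, "ready": True}, {"addr": 0x18, "ready": False}])
b = deque([{"addr": 0x20, "ready": True}, {"addr": 0x28, "ready": True}])
print(schedule_slots(a, b))
# [('task_0', 16), ('task_1', 32), ('task_1', 40), ('task_0', 24)]
```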

The principles disclosed herein have independent utility and may be embodied separately and not every embodiment may utilize every principle. However, these principles may also be embodied in various combinations, some of which may amplify the benefits of each principle in a synergistic manner.

K-mer counting

K-mer counting can be used to determine the number of subsequences of length k in a data set comprising gene sequences (e.g., deoxyribonucleic acid (DNA) sequences). For example, a k-mer counting process with k = 3 can be used to find the number of 3-mers of each type in the DNA sequence shown at the top of FIG. 1. The result of the 3-mer counting process with k = 3 is shown in the table at the bottom of FIG. 1. Thus, the DNA sequence may comprise one ATC 3-mer, two TCT 3-mers, one CTC 3-mer, etc. In some embodiments, k-mer counting may be used for applications such as bioinformatics, drug development, evolution research, crop improvement, forensic analysis, gene therapy, Next Generation Sequencing (NGS), and the like. In some embodiments, applications such as NGS may help characterize coronavirus disease 2019 (COVID-19), the cause of a global pandemic.
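
For purposes of illustration only, the following sketch (in Python) shows a naive k-mer count over a short string. The example sequence is a hypothetical placeholder chosen to reproduce the counts described above (one ATC, two TCT, one CTC); it is not necessarily the sequence shown in FIG. 1.

```python
from collections import Counter

def count_kmers(sequence: str, k: int) -> Counter:
    """Count every length-k subsequence (k-mer) of the input sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

print(count_kmers("ATCTCTAGG", 3))
# Counter({'TCT': 2, 'ATC': 1, 'CTC': 1, 'CTA': 1, 'TAG': 1, 'AGG': 1})
```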

In some applications, unique k-mers (e.g., k-mers that occur only once in a data set) are likely to contain errors. Thus, some k-mer counting processes may filter out unique k-mers. For example, in some applications, up to 75 percent of k-mers may be unique, and thus, eliminating unique k-mers may significantly reduce the memory footprint of a DNA sequence data set.

FIG. 2 illustrates an embodiment of a k-mer counting method and data structure according to an example embodiment of the present disclosure. The method shown in fig. 2 may receive as input a DNA sequence 211. The method may read and process k-mers in the DNA sequence 211 to construct a series of three data structures, which may be implemented as a first bloom filter 213, a second bloom filter 215, and a hash table 217 in this embodiment.

The first bloom filter 213 may include a bit array having a capacity of m bits, and n independent hash functions may be used to insert k-mers into the array. The bloom filter may initially be filled with all zeros. When a k-mer is read from the DNA sequence 211, the n hash functions may be applied to the k-mer to calculate n hash values. The n bits in the bit array corresponding to the n hash values may then be set. For example, if n = 3, applying three separate hash functions h1, h2, and h3 to a first k-mer (x) may generate hash values 1, 5, and 13, respectively (e.g., h1(x) = 1, h2(x) = 5, and h3(x) = 13). Thus, the bits at positions 1, 5, and 13 may be set as shown in FIG. 3. Applying the three hash functions to a second k-mer (y) may generate hash values 4, 11, and 16, respectively. Thus, the bits at positions 4, 11, and 16 may be set as shown in FIG. 3. Applying the three hash functions to a third k-mer (z) may generate hash values 3, 5, and 11, respectively. Thus, the bit at position 3 may be set (the bits at positions 5 and 11 may already have been set) as shown in FIG. 3.

To check for the presence of a particular k-mer in the first bloom filter 213, the n hash functions may be applied to the particular k-mer to calculate n hash values. The bit array may then be examined at the n bit positions corresponding to the n hash values. If any of the n entries is zero, it may indicate that the particular k-mer is not present in the bloom filter 213. (In some embodiments, the bloom filter 213 may have a zero false negative rate.) If all n entries are 1, it may indicate with high certainty that the particular k-mer is present in the array. (In some embodiments, the bloom filter 213 may have a low false positive rate.)

For example, if n = 3, applying the three separate hash functions to a particular k-mer (w) may generate hash values 4, 13, and 15, as shown in FIG. 3. Since the bloom filter 213 may contain a zero at bit position 15, it may be inferred that the particular k-mer (w) is not present.
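
Purely as an illustrative sketch, the following Python class captures the insert and membership-check behavior described above for a bit-array bloom filter. The n hash functions are derived here by salting SHA-256, which is an assumption for illustration (this disclosure does not prescribe particular hash functions), and the array size m = 16 is arbitrary.

```python
import hashlib

class BloomFilter:
    """Minimal bit-array bloom filter with n hash functions (illustrative)."""
    def __init__(self, m_bits: int, n_hashes: int):
        self.m, self.n = m_bits, n_hashes
        self.bits = [0] * m_bits          # initially filled with all zeros

    def _positions(self, kmer: str):
        # Derive n hash values by salting a standard hash (an assumption).
        for i in range(self.n):
            digest = hashlib.sha256(f"{i}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.m

    def add(self, kmer: str) -> None:
        for pos in self._positions(kmer):
            self.bits[pos] = 1            # set the bit for each hash value

    def __contains__(self, kmer: str) -> bool:
        # Any zero -> definitely absent (no false negatives);
        # all ones -> probably present (small false positive rate).
        return all(self.bits[pos] for pos in self._positions(kmer))

bf = BloomFilter(m_bits=16, n_hashes=3)
bf.add("ATC")
print("ATC" in bf)   # True
print("GGG" in bf)   # usually False; may occasionally be a false positive
```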

Referring again to FIG. 2, the second bloom filter 215 may be constructed and operated in a manner similar to the first bloom filter 213. In some embodiments, the series of two bloom filters 213 and 215 may be used to implement a pruning process that may filter out unique k-mers as follows. Whenever a new k-mer (w) is read from the data set 211, the method may check for the presence of the k-mer (w) in the first bloom filter 213. If the k-mer (w) is present in the first bloom filter 213, this may indicate that it is not a unique k-mer. Thus, the k-mer (w) may be written to the second bloom filter 215. However, if the k-mer (w) is not present in the first bloom filter 213, it may be added to the first bloom filter 213 such that future checks of the first bloom filter 213 will indicate that an instance of the k-mer (w) has been encountered.

After all k-mers in the input DNA sequence dataset 211 are read and processed in this manner, all non-unique k-mers may be stored in the second bloom filter 215, where the unique k-mers have been filtered out. In some embodiments, the first bloom filter 213 may then be discarded.

The non-unique k-mers stored in the second bloom filter 215 may then be counted as follows. For each input k-mer (w) read from the DNA sequence data set 211, the second bloom filter 215 may be checked for the presence of the input k-mer (w). If the input k-mer (w) is present in the second bloom filter 215, the corresponding entry (e.g., frequency counter) for the input k-mer (w) in the hash table 217 may be incremented. In some embodiments, the value of the frequency counter for the k-mer (w) may indicate the number of instances of the k-mer (w) in the DNA sequence data set 211. Thus, after all k-mers have been read and processed, the hash table 217 may store the frequency of occurrence of all non-unique k-mers in the data set 211.
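
The two-pass flow of FIG. 2 may be summarized, for illustration only, with the short sketch below. To keep the sketch self-contained, Python sets stand in for the two bloom filters 213 and 215 (unlike real bloom filters, sets have no false positives), and a Counter stands in for the hash table 217; the function name and example sequence are assumptions for illustration.

```python
from collections import Counter

def count_nonunique_kmers(sequence: str, k: int) -> Counter:
    # Python sets stand in for the two bloom filters 213 and 215 of FIG. 2;
    # real bloom filters would be probabilistic, as described above.
    seen_once = set()      # role of the first bloom filter 213
    seen_twice = set()     # role of the second bloom filter 215
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

    # First pass: prune unique k-mers.
    for w in kmers:
        if w in seen_once:
            seen_twice.add(w)    # non-unique: record in the second filter
        else:
            seen_once.add(w)     # first sighting: record in the first filter

    # Second pass: count only k-mers present in the second filter (hash table 217).
    table = Counter()
    for w in kmers:
        if w in seen_twice:
            table[w] += 1
    return table

print(count_nonunique_kmers("ATCTCTAGG", 3))   # Counter({'TCT': 2})
```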

In some embodiments, the method illustrated in FIG. 2 may involve many fine-grained memory access and/or processing operations performed on relatively small data units for both bloom filters 213 and 215 and hash table 217.

Near data processing architecture

FIG. 4 illustrates an embodiment of a near data processing system according to an example embodiment of the present disclosure. The system shown in FIG. 4 may include a host 402 and two or more memory modules 404. The host 402 may be connected to the memory modules 404 through one or more memory channels 406. The system shown in FIG. 4 may also include logic 408 and/or 410 that may cause the components to distribute portions of a data set to the memory modules 404, process the portions of the data set to build local data structures at the memory modules 404, merge the local data structures, and/or perform one or more count operations on the merged data structures.

Logic 408 and/or 410 and/or portions thereof may be located at any of the components shown in fig. 4. For example, in some embodiments, logic 408 may be located at host 402 and may control the distribution of data sets to memory modules 404 and/or the consolidation of local data structures, while logic 410 may be located at each memory module 404 and may control the processing of the distributed portions of data sets, the building of local data structures, and/or the counting operations at the memory modules 404.

Host 402 may be implemented with any type of processing device, such as one or more CPUs, Graphics Processing Units (GPUs), Neural Processing Units (NPUs), Tensor Processing Units (TPUs), and so on, including Complex Instruction Set Computer (CISC) processors (such as x86 processors) and/or Reduced Instruction Set Computer (RISC) processors (such as ARM processors) that execute instructions stored in any type of memory, and so on.

The one or more memory channels 406 may be implemented with any type of interface suitable for memory interconnects, such as any generation of a Double Data Rate (DDR) interface, an Open Memory Interface (OMI), peripheral component interconnect express (PCIe), a compute express link (CXL), an advanced extensible interface (AXI), an open coherent accelerator processor interface (OpenCAPI), Gen-Z, and so forth. The one or more memory channels 406 may be arranged in any configuration, e.g., with both memory modules 404 connected to the host 402 by a single multi-drop bus, with each memory module 404 connected to the host 402 by a separate point-to-point bus, etc.

The memory module 404 may be implemented with any type and/or configuration of memory devices, buffers, interfaces (including those described above), etc., arranged in any physical configuration. For example, the memory module 404 may be implemented with any type of memory device, including volatile memory (such as Dynamic Random Access Memory (DRAM) and/or Static Random Access Memory (SRAM)), non-volatile memory (such as flash memory, including NAND (not-AND) memory), persistent memory (such as cross-gridded non-volatile memory, memory with bulk resistance change, and/or the like), and/or any combination thereof. The memory module 404 may be configured as a Single Inline Memory Module (SIMM), a Dual Inline Memory Module (DIMM), a small outline DIMM (SO-DIMM), a load-reduced DIMM (LRDIMM), or the like. In some embodiments, one or more of the memory modules 404 may not be implemented as separate physical components, but may simply be implemented as part of, for example, a circuit board having one or more memory devices and any supporting circuitry, traces, etc., and capable of functioning as a memory module.

Logic 408 and/or 410 and/or portions thereof may be implemented in hardware, software, or any combination thereof. For example, in some embodiments, any logic may be implemented with combinational logic, sequential logic, one or more timers, counters, registers, state machines, volatile memories (such as DRAM and SRAM), non-volatile memories (such as flash memory), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), CISC and/or RISC processors that execute instructions, etc., as well as CPUs, GPUs, NPUs, TPUs, etc.

FIG. 5 illustrates an embodiment of a memory module with near data processing capability according to an example embodiment of the present disclosure. The memory module 504 shown in FIG. 5 may be used, for example, to implement any of the memory modules 404 shown in FIG. 4. The memory modules 504 shown in FIG. 5 may include one or more ranks of memory devices 510, a near memory compute module 512, a bus structure 514, and a memory channel interface 516.

The one or more memory devices 510 may be implemented with any type and/or configuration of memory device, for example, as described above with respect to memory module 404.

The near memory computing module 512 may include one or more processing elements capable of processing any data units received from a rank of the memory device 510, another rank of the memory device, and/or another memory module via the bus structure 514, etc. In some embodiments, near memory compute module 512 may include one or more features, such as any type of general and/or special purpose controller (such as a memory controller, bus controller, etc.), a workload monitor, one or more input and/or output buffers, one or more multiplexers and/or demultiplexers, etc., to implement task scheduling, memory access, etc. In some embodiments, near memory computation module 512 may perform any of a variety of functions, such as k-mer counting, encryption and/or decryption, deduplication, encoding and/or decoding, error correction, any type of data filtering, and so forth. In some embodiments, the near memory computing module 512 may implement any type of general controller functionality, such as, for example, input and/or output (I/O) control, one or more control algorithms, supervisory control, baseboard control, programmable logic control, process control, and the like. Near memory computing module 512 and/or any portion and/or features thereof may be implemented in hardware, software, or any combination thereof, as described above with respect to logic 408 and/or 410.

The bus structure 514 may be implemented, for example, with a hierarchical bus architecture that may transfer data between memory devices within a rank, between a memory device and the memory channel interface 516, and so on. In some embodiments, independent chip select signals may enable fine-grained memory access of a single memory device 510. Bus structure 514 may be implemented using any type of signaling and/or configuration, including multi-drop, point-to-point, etc.

FIG. 6 illustrates another embodiment of a memory module with near data processing capability according to an example embodiment of the present disclosure. The memory module 604 shown in fig. 6 may include components similar to those shown in fig. 5, however, multiple instances of one or more memory devices 610, near memory computing modules 612, and/or bus structures 614 may be configured as multiple levels 618, for example.

In the embodiment shown in FIG. 6, one or more bus structures 614 may be implemented with a hierarchical bus architecture that may transfer data between different ranks 618, between memory devices within a rank, between a memory device and the memory channel interface 616, and so forth.

The embodiments illustrated in fig. 4-6 are not limited to any particular application and may be used to implement a variety of processing methods, workflows, etc. However, depending on implementation details, they may be particularly effective for implementing data processing applications that may be memory limited and/or may involve fine-grained memory accesses, such as k-mer counting, DNA seeding (seeding), and the like.

To illustrate the principles of the present disclosure, some example embodiments of systems, memory modules, near memory computing modules, methods, workflows, etc. may be described in the context of k-mer counting applications, with some specific implementation details, such as LRDIMMs, DDR memory interfaces, etc. However, these principles are not limited to k-mer counting applications and may be applied to a wide variety of other applications and/or may be implemented with many different implementation details.

For purposes of illustration, fig. 7-10 collectively illustrate an example architecture of a near data processing system and some example components that may be used therewith, according to an example embodiment of the present disclosure.

FIG. 7 illustrates an example embodiment of a near data processing system according to an example embodiment of the present disclosure. The system 700 shown in FIG. 7 may include a host CPU 702 and one or more memory controllers 703, the memory controllers 703 being configured to implement DDR memory channels, which in the embodiment shown in FIG. 7 may include memory channels CH1, CH2, CH3, and/or CH 4. One or more of the memory channels may mount one or more DIMMs 704. The number of components shown in fig. 7 is merely illustrative, and fewer or more components may be used. To prevent the drawing from becoming obscured, all components may not be displayed.

FIG. 8 illustrates an example embodiment of a memory module according to an example embodiment of the present disclosure. The embodiment shown in FIG. 8 may, for example, be used to implement any of DIMMs 704 shown in FIG. 7. The memory module 804 shown in FIG. 8 may be implemented, for example, with an LRDIMM, which may include one or more stages 820, each of which may include one or more memory devices 810, and a near memory computing module 824. The memory module 804 can also include a Register Clock Driver (RCD)826 and one or more data buffers 828, which one or more data buffers 828 can collectively form a memory buffer. Register clock driver 826 may buffer and/or repeat command and/or address signals (C/a) received from a host (e.g., 702) over a DDR memory channel. One or more data buffers 828 may buffer data signals DQ to and/or from the memory device 810 to maintain and/or improve signal integrity. The memory module 804 may be physically connected to one of the memory channels (e.g., CH1, CH2 … …) through a card edge connector.

FIG. 9 illustrates a more detailed example embodiment of a rank of a memory module or portion thereof according to an example embodiment of the present disclosure. The embodiment shown in FIG. 9 may be used, for example, to implement the portion shown by dashed line 830 of one of the stages 820 shown in FIG. 8. The stage 920 shown in FIG. 9 may include one or more memory devices 910, each of which memory devices 910 may have a region (DNA) for storing a portion of a data set having an input DNA sequence, a region (Hash) for storing a portion of one or more Hash tables as an output of a k-mer counting operation, and a region (Filter) for storing one or more bloom filters for the k-mer counting operation.

The near memory computing module 924 may include one or more processing elements 932 and a control module 934, and the control module 934 may include one or more of a memory controller 936, a workload monitor 938, and/or a bus controller 940. Near memory computing module 924 may also include an input buffer 942 and an output buffer 944 for processing elements 932. The input buffer 942 may be coupled to the hierarchical bus structure 914 through a multiplexer 946 and the output buffer 944 may be coupled to the hierarchical bus structure 914 through a demultiplexer 948. The multiplexer 946 and demultiplexer 948 may be controlled by one or more enable signals EN from the control module 934.

The memory devices 910 may be controlled by one or more C/A signals and/or CS signals generated by the control module 934, at least partially in response to one or more host C/A signals received from a host through a register clock driver (RCD) 926. Data may be transferred to and/or from the memory devices 910 through a data bus 914a, which may be part of the hierarchical bus structure 914. The hierarchical bus structure 914 may couple the rank 920 to the DDR memory channel through one or more data buffers 928.

FIG. 10 illustrates an example embodiment of a processing element according to an example embodiment of the present disclosure. The processing elements shown in fig. 10 may be used, for example, to implement any of the processing elements 932 shown in fig. 9. The processing element 1032 shown in fig. 10 may include an input buffer 1050 and a hash module 1052, which may receive input from the input buffer 1050 and may provide a feedback output to the input buffer 1050 over a feedback connection 1051. The processing elements 1032 may also include an address translation module 1054, which address translation module 1054 may receive input from the hash module 1052 and provide output to the memory controller 936 and/or the output buffer 944.

Referring to FIGS. 7-10, in some embodiments, and depending on implementation details, centralizing logic (such as one or more processing elements) and/or control functions (such as memory control, workload monitoring, workload control, and/or bus control) within each rank may improve communication and/or synchronization between components within a rank and/or between ranks. The centralized logic may also improve memory access management, task management, etc.

The use of a hierarchical bus may increase the speed and/or efficiency of communication within the memory modules (e.g., enhance intra-DIMM communication) and/or may reduce communication between the memory modules (e.g., simplify inter-DIMM communication). In some embodiments, the hierarchical bus may enable data to be transferred between memory devices (e.g., inter-chip communication). In some embodiments, the hierarchical bus may include a rank-level C/A bus that may transfer C/A signals between different ranks within the same memory module, and/or a rank-level data bus that may transfer data between different ranks within the same memory module. In some embodiments, intra-DIMM communication may be implemented locally using one or more inter-rank buses without sending data to the host over a memory channel, which may reduce or eliminate a source of communication bottlenecks.

Referring to FIG. 9, in some embodiments, the number of processing elements (PEs) 932 within each near memory computing module may be configurable. The processing elements 932 may read input data from the input buffer 942 and write output data to the output buffer 944 in the near memory computing module 924. In some embodiments, one or more of the processing elements 932 may be configured to process multiple relatively small units of data in parallel and/or to implement one or more particular algorithms or portions thereof. For example, in some embodiments, the primary function of one or more of the processing elements 932 may be to apply one or more hash functions (e.g., MurmurHash3) to input data, such as k-mers read from a portion of a DNA sequence data set stored in the memory devices 910.

Referring to FIG. 10, in some embodiments of the processing element 1032, the buffer 1050 may store an input k-mer from a DNA sequence, the hash module 1052 may include logic to perform one or more hash functions (e.g., independent hash functions) on the k-mer, and the address translation module 1054 may include logic to translate virtual addresses to local memory device (e.g., DRAM) addresses. In some embodiments, the address translation module 1054 may be used to implement, at least in part, an address mapping scheme to distribute fine-grained data, e.g., to improve memory bandwidth utilization.
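
As a hypothetical illustration of such a pipeline, the sketch below hashes a k-mer and then translates the resulting bloom filter bit index into a scattered location (module, rank, device, offset). The hash (salted SHA-256 standing in for MurmurHash3), the field ordering, and the sizes are all assumptions for illustration and do not reproduce the address mappings of FIGS. 15A-15B.

```python
import hashlib

NUM_MODULES, NUM_RANKS, NUM_DEVICES = 4, 2, 8      # illustrative sizes
FILTER_BITS_PER_DEVICE = 1 << 20

def kmer_to_location(kmer: str, hash_id: int = 0):
    """Hash a k-mer and scatter its filter bit index across modules/ranks/devices."""
    digest = hashlib.sha256(f"{hash_id}:{kmer}".encode()).digest()
    total_bits = NUM_MODULES * NUM_RANKS * NUM_DEVICES * FILTER_BITS_PER_DEVICE
    bit_index = int.from_bytes(digest[:8], "little") % total_bits
    # Low-order fields select module/rank/device so that consecutive filter
    # bits land on different hardware, spreading fine-grained accesses.
    module = bit_index % NUM_MODULES
    rank = (bit_index // NUM_MODULES) % NUM_RANKS
    device = (bit_index // (NUM_MODULES * NUM_RANKS)) % NUM_DEVICES
    offset = bit_index // (NUM_MODULES * NUM_RANKS * NUM_DEVICES)
    return module, rank, device, offset

print(kmer_to_location("ATC"))
```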

Referring to fig. 9, in some embodiments, a memory controller 936 may coordinate access to one or more memory devices 910 from a host and one or more processing elements 932. Coordination between a host-side memory controller (e.g., one of memory controllers 703 in fig. 7) and memory controller 936 within one or more near memory compute modules 924 may be implemented, for example, with host-first request scheduling.

In some embodiments, the host-side memory controller may be unaware that the module-side memory controller 936 in the memory module may send requests to the memory devices 910. Thus, timing issues may arise if both the host-side memory controller and the module-side memory controller 936 send overlapping requests to the same memory device 910. Host-first request scheduling may solve this type of timing problem, e.g., by implementing a closed-page policy on the host side and host-first request scheduling on the module side. For example, with a closed-page policy, the host-side memory controller may expect its memory requests to the memory devices 910 to comply with one or more DDR timing constraints. Because the module-side memory controller 936 may also issue requests to the memory devices 910, the latency of memory requests from the host-side memory controller may be unpredictable and thus may conflict with one or more DDR timing constraints. Thus, with host-first request scheduling, once a memory device 910 completes its current task, memory requests from the host may be serviced first. In some embodiments, one or more host-side DDR timing parameters may be modified so that the host-side memory controller may expect a longer data return time, allowing the module-side memory controller 936 to schedule requests.
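
One way to picture host-first request scheduling is the toy scheduler below, in which pending host requests are always issued before requests from the local processing elements, so a host request waits at most for the device's current task. The queue structure and method names are assumptions for illustration only.

```python
from collections import deque

class ModuleSideScheduler:
    """Toy host-first scheduler for a module-side memory controller (sketch)."""
    def __init__(self):
        self.host_queue = deque()   # requests forwarded from the host
        self.pe_queue = deque()     # requests from local processing elements

    def enqueue_host(self, req):
        self.host_queue.append(req)

    def enqueue_pe(self, req):
        self.pe_queue.append(req)

    def next_request(self):
        # Host-first: once the current device task completes, pending host
        # requests are serviced before any processing-element requests.
        if self.host_queue:
            return self.host_queue.popleft()
        if self.pe_queue:
            return self.pe_queue.popleft()
        return None

sched = ModuleSideScheduler()
sched.enqueue_pe("PE read: filter bit 0x1234")
sched.enqueue_host("Host read: row 0x40")
print(sched.next_request())   # the host request is issued first
```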

The workload monitor 938 may monitor and/or cooperate with the input buffer 942 and/or one or more processing elements 932 to balance the workload of the processing elements, e.g., during k-mer counting, which may improve utilization of the processing elements 932.
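
A minimal sketch of such balancing is shown below, assuming the monitor tracks per-processing-element buffer occupancy and dispatches each new item to the least-loaded processing element; this bookkeeping scheme is an assumption for illustration, not necessarily the disclosed design.

```python
class WorkloadMonitor:
    """Dispatch items to the least-loaded processing element (illustrative)."""
    def __init__(self, num_pes: int):
        self.pending = [0] * num_pes     # outstanding items per PE input buffer

    def dispatch(self) -> int:
        pe = min(range(len(self.pending)), key=self.pending.__getitem__)
        self.pending[pe] += 1            # item enqueued in that PE's buffer
        return pe

    def complete(self, pe: int) -> None:
        self.pending[pe] -= 1            # PE finished one item

monitor = WorkloadMonitor(num_pes=4)
assignments = [monitor.dispatch() for _ in range(8)]
print(assignments)   # [0, 1, 2, 3, 0, 1, 2, 3] with this simple policy
```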

A bus controller 940, which may be implemented as a bus arbiter, for example, may regulate data and/or C/A transfers over the hierarchical bus structure 914. In some embodiments, the bus controller 940 may allocate resources of the hierarchical bus structure to one or more processing elements 932.

Near data processing workflow

FIGS. 11A-11C illustrate example embodiments of a k-mer counting workflow according to example embodiments of the present disclosure. FIG. 11A illustrates an embodiment of a first bloom filter building operation and a second bloom filter building operation according to an example embodiment of the present disclosure. FIG. 11B illustrates an embodiment of a first merge operation and a second merge operation according to an example embodiment of the present disclosure. FIG. 11C illustrates an embodiment of a counting operation according to an example embodiment of the present disclosure. For purposes of illustration, the embodiments shown in FIGS. 11A-11C may be described in the context of the system shown in FIG. 7, but the workflow may be used with any suitable system. The embodiments shown in FIGS. 11A-11C may each include a host CPU 1102, one or more memory controllers 1103, and one or more memory modules 1104.

To achieve parallel processing in a k-mer counting workflow, an input data set, which may include, for example, a DNA sequence, may be evenly partitioned into portions that may be distributed to different memory modules 1104; in this example, the memory modules 1104 may be implemented as DIMMs. During the first bloom filter construction operation (1), as shown in FIG. 11A, each memory module 1104 can independently construct a first local bloom filter, as indicated by the arrows. During this operation, each memory module can read a local portion of the data set and use the data to build a first local bloom filter that can also be stored locally. During this operation, inter-module communication may be reduced or eliminated by using task partitioning and/or data localization.

After the first bloom filter build is complete, the local bloom filters may be simplified and scattered during a first merge operation (2) that may synchronize the bloom filters in the different memory modules 1104, as shown in FIG. 11B. In some embodiments, a bloom filter may contain only 0s and 1s, and thus, the bloom filters may be simplified as indicated by the unshaded arrows, e.g., by using an OR operation between corresponding bloom filter entries. After simplification, the simplified bloom filter may be scattered as indicated by the shaded arrows, for example, by broadcasting the simplified bloom filter to the memory modules 1104. In some embodiments, the simplification and/or scatter operations may involve only sequential read and/or write operations for data transfer.
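
A minimal sketch of this reduce-and-scatter step is shown below, assuming each DIMM's local bloom filter is represented as a list of bits; the four-bit filters are placeholders for illustration.

```python
def merge_bloom_filters(local_filters):
    """Reduce local bloom-filter bit arrays with a bitwise OR (sketch)."""
    merged = [0] * len(local_filters[0])
    for bits in local_filters:
        merged = [m | b for m, b in zip(merged, bits)]
    return merged

local_filters = [[1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 0]]   # one per DIMM
merged = merge_bloom_filters(local_filters)
scattered_copies = [list(merged) for _ in local_filters]     # broadcast step
print(merged)   # [1, 1, 1, 1]
```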

After synchronization of the first bloom filters in the memory modules 1104, each memory module 1104 may independently and/or in parallel construct a second bloom filter during a second bloom filter construction operation (3), as indicated by the arrows in FIG. 11A. The operation may be similar to the first bloom filter construction operation, except that the memory module may use the merged bloom filter from the first merge operation as input to construct the second bloom filter. The second bloom filter constructed at each memory module 1104 may be stored locally at the corresponding memory module. As with the first bloom filter build operation, inter-module communication may be reduced or eliminated by using task partitioning and/or data localization.

After the second bloom filter is constructed, the second bloom filters may be simplified and scattered during a second merge operation (4) that may synchronize the second bloom filters in the different memory modules 1104, as shown in FIG. 11B. In some embodiments, the second bloom filter may contain only 0s and 1s, and thus, the second bloom filters may be simplified as indicated by the unshaded arrows, e.g., by using an OR operation between corresponding bloom filter entries. After simplification, the second simplified bloom filter may be scattered as indicated by the shaded arrows, for example, by broadcasting the simplified bloom filter to the memory modules 1104. As with the first merge operation, in some embodiments, the second simplification and/or scatter operations may involve only sequential read and/or write operations for data transfer.

After the second merge operation is complete, each memory module 1104 may contain a copy of the merged second bloom filter. The k-mer counting operations may then be performed in parallel at the memory modules. During the k-mer counting operation, for each k-mer, the memory module may first check the second bloom filter, as indicated by the unshaded arrows in FIG. 11C, to determine whether the current k-mer is non-unique. If the current k-mer is non-unique, the entry in the hash table corresponding to the current k-mer, which may be distributed among the different memory modules 1104, may be accessed and incremented by 1 as indicated by the shaded arrows in FIG. 11C. In some embodiments, this operation may reduce or eliminate inter-module communication, for example, because inter-module communication may only be used for verified non-unique k-mers.
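
For illustration only, the counting operation of FIG. 11C might be pictured as in the sketch below, in which each module scans its own partition, checks its local copy of the merged second bloom filter (a Python set stands in for it), and increments the corresponding hash table entry, which may be owned by another module. The owner-selection rule is a hypothetical example of how the hash table might be distributed.

```python
def count_partition(partition, merged_filter, tables, num_modules):
    """Count non-unique k-mers in one module's partition (illustrative)."""
    for w in partition:
        if w in merged_filter:                    # local check, no bus traffic
            owner = hash(w) % num_modules         # hypothetical owner mapping
            tables[owner][w] = tables[owner].get(w, 0) + 1   # may cross modules

tables = [{} for _ in range(3)]
merged_filter = {"TCT"}                           # stand-in for the merged filter
for partition in (["ATC", "TCT"], ["TCT", "CTC"], ["TAG"]):
    count_partition(partition, merged_filter, tables, num_modules=3)
print(sum(t.get("TCT", 0) for t in tables))       # 2
```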

Thus, in the embodiments illustrated in FIGS. 11A-11C, the processing for k-mer counting may be performed in parallel by the memory module 1104, depending on implementation details. Furthermore, the local construction of the first bloom filter and the second bloom filter may improve system performance because it may involve little or no inter-module communication. The simplification and decentralization of the first and second bloom filters may involve only sequential read and/or write operations, which may have little or no impact on performance. Furthermore, unnecessary inter-module memory accesses may be avoided by checking a second bloom filter that may be stored locally during the k-mer count operation.

In some embodiments, and depending on implementation details, using two bloom filters in series in a system and/or workflow where an input data set may be distributed to multiple memory modules may result in one or more false negative results. This can be illustrated in FIG. 12, which shows the k-mer counting method using a global data set on the left side of the figure. The DNA sequence 1211, which includes three instances of the 3-mer ATC, can be applied as input without any partitioning. The first global bloom filter 1213 may read the three instances of the ATC 3-mer and pass the 3-mer to the second global bloom filter 1215 because the first global bloom filter 1213 may recognize that the ATC 3-mer is non-unique. The second global bloom filter 1215 may then be used to increment the count in the hash table 1217 corresponding to the ATC 3-mer.

However, if a data set including the DNA sequence is distributed to three different memory modules, and only one instance of the ATC 3-mer 1211a is distributed to each memory module as shown on the right side of FIG. 12, the first local bloom filter 1219 at each memory module may determine that the ATC 3-mer is unique, and thus not pass the ATC 3-mer to the second local bloom filter 1221 at that memory module. Thus, there may be no entry for the ATC 3-mer in the final hash table 1223.

In some embodiments, such false negatives may be reduced or eliminated by using a counting bloom filter at each memory module. The counting bloom filter may use n independent hash functions in a manner similar to a bloom filter having a single bit array. However, instead of a single bit array, the counting bloom filter may contain an array of counters corresponding to the possible hash values that may be computed by the n hash functions. For example, a counting bloom filter with an array of 4-bit counters may be able to track zero to 15 instances of a particular k-mer. Thus, for a particular k-mer (x), if n = 3, the counters corresponding to the computed hash values h1(x), h2(x), and h3(x) may each be incremented by 1. To look up a particular k-mer in the counting bloom filter, n separate hash values for the particular k-mer may be computed using the n separate hash functions, and the counters corresponding to the computed hash values may be read. The value of the counter with the smallest count may be taken as the number of instances of the particular k-mer.
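
A minimal counting bloom filter sketch follows, again deriving the n hash functions by salting SHA-256 (an assumption for illustration; any n independent hash functions could be used). The minimum over the addressed counters is an upper bound on the number of instances, since collisions can only inflate counters.

```python
import hashlib

class CountingBloomFilter:
    """Minimal counting bloom filter: an array of counters instead of bits."""
    def __init__(self, m_counters: int, n_hashes: int):
        self.m, self.n = m_counters, n_hashes
        self.counters = [0] * m_counters

    def _positions(self, kmer: str):
        for i in range(self.n):
            d = hashlib.sha256(f"{i}:{kmer}".encode()).digest()
            yield int.from_bytes(d[:8], "little") % self.m

    def add(self, kmer: str) -> None:
        for pos in self._positions(kmer):
            self.counters[pos] += 1

    def count(self, kmer: str) -> int:
        # The smallest addressed counter bounds the number of instances seen.
        return min(self.counters[pos] for pos in self._positions(kmer))

cbf = CountingBloomFilter(m_counters=64, n_hashes=3)
for _ in range(2):
    cbf.add("ATC")
print(cbf.count("ATC"))   # 2 (collisions could only make this larger)
```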

FIG. 13 illustrates an embodiment of a k-mer counting method workflow according to an example embodiment of the present disclosure. In the embodiment illustrated in FIG. 13, a data set comprising three instances of the ATC 3-mer may be distributed to three different memory modules, and only one instance of the ATC 3-mer 1311 may be distributed to each memory module. However, each memory module may include a local counting bloom filter 1319, which may count its one instance of ATC, and the counts of 1 may be merged and passed to a global bloom filter 1321. Thus, ATC may be identified as a non-unique 3-mer and added to the final hash table 1323.

FIGS. 14A-14C illustrate an example embodiment of a k-mer counting workflow using a counting bloom filter according to an example embodiment of the present disclosure. FIG. 14A illustrates an embodiment of a counting bloom filter build operation, according to an example embodiment of the present disclosure. FIG. 14B illustrates an embodiment of a merge operation according to an example embodiment of the present disclosure. FIG. 14C illustrates an embodiment of a counting operation according to an example embodiment of the present disclosure. The embodiments shown in FIGS. 14A-14C can each include a host CPU 1402, one or more memory controllers 1403, and one or more memory modules 1404. For purposes of illustration, the embodiments shown in FIGS. 14A-14C may be described in the context of the system shown in FIG. 7, but the workflow may be used with any suitable system.

As with the embodiment shown in FIGS. 11A-11C, to enable parallel processing in the k-mer counting workflow in the embodiment shown in FIGS. 14A-14C, an input data set, which may include, for example, a DNA sequence, may be evenly divided into portions that may be distributed to different memory modules 1404; in this example, the memory modules 1404 may be implemented as DIMMs. During a counting bloom filter build operation as shown in FIG. 14A, each memory module 1404 may build a local counting bloom filter independently, as indicated by the arrows. During this operation, each memory module may read its local portion of the data set and use the data to build a local counting bloom filter, which may also be stored locally. During this operation, inter-module communication may be reduced or eliminated through task partitioning and/or data localization.
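Continuing the illustrative Python sketch above, the build phase may be pictured roughly as follows; the even division of the input and the k-1 character overlap at portion boundaries are assumptions made so the sketch loses no boundary k-mers, not details of the disclosure.

```python
def build_local_filters(sequence, k, num_modules):
    """Evenly divide the input and build one local counting bloom filter per module."""
    step = len(sequence) // num_modules
    portions = []
    for i in range(num_modules):
        start = i * step
        # Overlap by k-1 characters (an assumption) so no boundary k-mer is lost or duplicated.
        end = len(sequence) if i == num_modules - 1 else (i + 1) * step + k - 1
        portions.append(sequence[start:end])

    local_filters = []
    for portion in portions:                 # in hardware, each module would do this in parallel
        cbf = CountingBloomFilter()
        for j in range(len(portion) - k + 1):
            cbf.insert(portion[j:j + k])     # local reads only; no inter-module traffic
        local_filters.append(cbf)
    return portions, local_filters
```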

After the counting bloom filter build is complete, the local counting bloom filters may be reduced and scattered during the merge operation, which may synchronize the counting bloom filters in the different memory modules 1404, as shown in FIG. 14B. In some embodiments, the counting bloom filters may be reduced by adding the corresponding entries (e.g., counters) of the counting bloom filters, as shown by the unshaded arrows. After the entries have been added, the corresponding entry in the merged bloom filter may be set to 1 if the summed counter entry in the reduced bloom filter is greater than 2, and may be set to 0 otherwise. In some embodiments, the merged counting bloom filter may be scattered by distributing the merged bloom filter to each memory module 1404, as indicated by the shaded arrows. In some embodiments, the reduce and/or scatter operations may involve only sequential read and/or write operations for data transfer.
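A corresponding sketch of the reduce and scatter steps is shown below, again in illustrative Python; the exact comparison used to decide whether a summed counter marks a non-unique k-mer is an assumption based on the description above.

```python
def merge_and_scatter(local_filters):
    """Add corresponding counters of the local filters, threshold them into a
    single-bit merged bloom filter, and give every module its own copy."""
    m = local_filters[0].m
    merged_bits = [0] * m
    for idx in range(m):                                  # sequential reads of all filters
        total = sum(f.counters[idx] for f in local_filters)
        merged_bits[idx] = 1 if total >= 2 else 0         # assumed non-uniqueness criterion
    # Scatter: each module receives a copy of the merged bloom filter (sequential writes).
    return [list(merged_bits) for _ in local_filters]
```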

After the merged bloom filter has been scattered, each memory module 1404 may contain a copy of the merged bloom filter. The k-mer count operations may then be performed in parallel at the memory modules 1404. During the k-mer count operation, for each k-mer, the memory module may first check the merged bloom filter, as indicated by the unshaded arrows, to determine whether the current k-mer is non-unique. If the current k-mer is non-unique, the entry corresponding to the current k-mer in the hash table, which may be distributed among the different memory modules 1404, may be accessed and incremented by 1, as indicated by the shaded arrows. In some embodiments, this operation may reduce or eliminate unnecessary inter-module communication, for example, because inter-module communication may only be used for verified non-unique k-mers.
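The count phase may then be sketched as follows in illustrative Python; the single Counter object stands in for the hash table that, per the description above, would in practice be distributed among the memory modules.

```python
from collections import Counter

def count_kmers(portions, merged_copies, k):
    """Each module scans its own portion; only k-mers marked non-unique in the
    local copy of the merged bloom filter touch the shared hash table."""
    hash_table = Counter()
    probe = CountingBloomFilter()            # reused here only for its hash functions
    for portion, merged_bits in zip(portions, merged_copies):
        for j in range(len(portion) - k + 1):
            kmer = portion[j:j + k]
            if all(merged_bits[h] for h in probe._hashes(kmer)):
                hash_table[kmer] += 1        # inter-module access only for non-unique k-mers
    return hash_table
```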

Thus, in the embodiment illustrated in FIGS. 14A-14C, and depending on implementation details, the processing for k-mer counting may be performed in parallel by the memory modules 1404. Furthermore, the local construction of the counting bloom filters may improve system performance, as it may involve little or no inter-module communication. The reduction and scattering of the counting bloom filters may involve only sequential read and/or write operations with little or no impact on performance. Furthermore, unnecessary inter-module memory accesses may be avoided by checking the merged bloom filter, which may be stored locally, during the k-mer count operations.

In some embodiments, and depending on implementation details, systems, methods, and/or devices according to example embodiments of the present disclosure may reduce or prevent communication bottlenecks by implementing workflows that may reduce or eliminate unnecessary inter-module memory accesses, for example, by partitioning an input data set for k-mer counting into multiple partitions and localizing the partitions and/or the operations on the partitions to multiple memory modules that may operate with a large degree of parallelism. In some embodiments, and depending on implementation details, systems, methods, and/or devices according to example embodiments of the present disclosure may reduce or prevent communication bottlenecks by implementing a hierarchical bus that may include inter-level C/A buses and/or inter-level data buses, thereby reducing or minimizing inter-module communication.

Bandwidth utilization

FIG. 15A illustrates an embodiment of an address mapping for a memory device according to an example embodiment of the present disclosure. The mapping shown in FIG. 15A may start with a 2-bit channel address at the Most Significant Bit (MSB) position, followed by a 4-bit level address, a 4-bit device address, a 4-bit bank address, a 16-bit row address, a 10-bit column address, a 3-bit burst address, and a 2-bit width at the Least Significant Bit (LSB) position. The mapping shown in FIG. 15A may concentrate data within a memory device, which may take advantage of data locality in some embodiments. However, applications such as k-mer counting may involve a large number of fine-grained (e.g., 1-bit or 2-bit) random memory accesses, which may have little or no locality. Thus, depending on implementation details, the memory bandwidth may not be fully utilized.
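For illustration only, the mapping of FIG. 15A can be modeled as packing the address fields MSB-first, as in the following Python sketch; the field names and the packing helper are assumptions, not part of the disclosure.

```python
# Field widths follow FIG. 15A, ordered from MSB to LSB.
FIELDS_15A = [("channel", 2), ("level", 4), ("device", 4), ("bank", 4),
              ("row", 16), ("column", 10), ("burst", 3), ("width", 2)]

def pack_address(fields, values):
    """Pack named address fields MSB-first into a single physical address."""
    addr = 0
    for name, bits in fields:
        value = values.get(name, 0)
        assert value < (1 << bits), f"{name} does not fit in {bits} bits"
        addr = (addr << bits) | value
    return addr
```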

Some embodiments according to example embodiments of the present disclosure may implement a distributed data mapping scheme to distribute data across memory devices, ranks, memory modules, and the like.

FIG. 15B illustrates another embodiment of an address mapping for a memory device according to an example embodiment of the present disclosure. The mapping shown in FIG. 15B may start with a 2-bit channel address at the MSB position, followed by a 16-bit row address, a 4-bit bank address, a 4-bit level address, a 4-bit device address, a 10-bit column address, a 3-bit burst address, and a 2-bit width at the LSB position. The mapping shown in FIG. 15B may reorder the address bits of the memory devices to prioritize distributing data among different memory devices (e.g., DRAM chips). Depending on implementation details, this may improve memory bandwidth utilization, for example, by reducing or eliminating the concentration of data within individual memory devices, ranks, memory modules, and so on.
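Continuing the sketch above, the FIG. 15B ordering changes only the field order; reading the figure this way (an interpretation, not a statement from the disclosure), the level and device bits sit below the row and bank bits, so addresses beyond a single column range map to different devices and levels rather than filling one chip first. The example field values below are arbitrary.

```python
# Field widths follow FIG. 15B; only the ordering differs from FIELDS_15A.
FIELDS_15B = [("channel", 2), ("row", 16), ("bank", 4), ("level", 4),
              ("device", 4), ("column", 10), ("burst", 3), ("width", 2)]

example = {"channel": 1, "level": 2, "device": 5, "bank": 3,
           "row": 0x1234, "column": 0x55, "burst": 1, "width": 0}
print(hex(pack_address(FIELDS_15A, example)))   # baseline mapping of FIG. 15A
print(hex(pack_address(FIELDS_15B, example)))   # distributed mapping of FIG. 15B
```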

Workload balancing

In some embodiments, a workload monitor, such as workload monitor 938 shown in FIG. 9, may implement a task scheduling scheme based on monitoring the workload of one or more computing resources. For example, the task scheduling scheme may balance the workload among the processing elements 932 in the near memory computing module 924.

In some embodiments, the task scheduling scheme may utilize the input buffer 942 as a queue for incoming tasks. The workload monitor 938 may monitor the status of the one or more processing elements 932 and the input buffer 942. If one of the processing elements 932 completes a task, the workload monitor 938 may select an input data unit from the input buffer 942 and dispatch it as a task to that processing element 932 to keep it busy. Thus, in some embodiments, the workload monitor 938 may implement the task scheduling scheme by dynamically dispatching tasks to the processing elements 932 at a fine granularity. Depending on implementation details, this may increase the utilization of one or more computing resources.
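One way to picture this scheduling loop is the following cycle-level Python sketch; it is entirely illustrative, and the ProcessingElement class, per-task latencies, and loop structure are assumptions rather than a description of the disclosed hardware.

```python
from collections import deque

class ProcessingElement:
    """Illustrative stand-in for a near-memory processing element."""
    def __init__(self, pe_id):
        self.pe_id = pe_id
        self.busy_cycles = 0                        # remaining cycles of the current task

    def idle(self):
        return self.busy_cycles == 0

    def dispatch(self, data_unit):
        self.busy_cycles = 1 + len(data_unit) % 4   # arbitrary per-task latency

    def tick(self):
        if self.busy_cycles:
            self.busy_cycles -= 1

def run_workload_monitor(data_units, num_pes=4):
    input_buffer = deque(data_units)                # the input buffer acts as the task queue
    pes = [ProcessingElement(i) for i in range(num_pes)]
    while input_buffer or any(not pe.idle() for pe in pes):
        for pe in pes:
            if pe.idle() and input_buffer:
                pe.dispatch(input_buffer.popleft())  # fine-grained dispatch keeps each PE busy
            pe.tick()
```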

Redundant memory access

FIG. 16A illustrates an example embodiment of a first memory access method for a bloom filter in accordance with an example embodiment of the present disclosure. The embodiment shown in FIG. 16A may use four hash functions (n = 4). Thus, four bloom filter entries may be examined for each k-mer. During the k-mer count operation, all of the bloom filter entries in the merged bloom filter that are associated with the current k-mer may be checked to verify that they are 1. If all of the corresponding bloom filter entries are 1, the hash table may be updated; otherwise, the write operation may not be performed. However, in some embodiments, sequential memory accesses to the merged bloom filter may result in wasted memory bandwidth, as described below.

In the embodiment shown in FIG. 16A, the memory accesses to the merged bloom filter may be issued sequentially for each task (where each task may correspond to one k-mer and may be performed by a different processing element). The top sequence in FIG. 16A shows the sequence of bloom filter accesses, where requests belonging to the same task (k-mer and processing element) are represented by the same type of shading. The bottom sequence in FIG. 16A shows the data returned to the processing elements for each read request, where the data returned for a particular task is represented by the same type of shading as the task. Thus, the four memory accesses for each k-mer may be issued sequentially, and the data may also be returned sequentially.

The first bloom filter entry (not shaded) returned for the first task may be zero. However, if the first entry is zero, the next three entries (represented by the thick outline) may be irrelevant (e.g., "don't care" terms). Thus, the next three accesses may be unnecessary and memory bandwidth may be wasted.

Some embodiments may implement a scattered memory access and/or task switching scheme. For example, the memory accesses of one task may be interleaved with the memory accesses of one or more other tasks. Depending on implementation details, this may reduce or eliminate accesses that return unneeded data. Additionally or alternatively, tasks may be switched between memory accesses, for example, to fill memory access slots with accesses that return useful data. Depending on implementation details, this may improve memory bandwidth utilization.

FIG. 16B illustrates an example embodiment of a second memory access method for a bloom filter according to an example embodiment of the present disclosure. In the embodiment shown in FIG. 16B, the memory accesses for different tasks (k-mer and bloom filter) may be scattered and/or issued at intervals. For example, a subsequent memory access for a particular task may be issued only if all of its previous accesses return 1. Thus, as shown in FIG. 16B, subsequent memory accesses may not be issued after the first accesses of the first two tasks, because the first access of each of those tasks returns 0. However, because the first memory access of the third task may return 1, the second memory access (BF-1) of the third task may be issued after the time interval. In some embodiments, and depending on implementation details, this may reduce or eliminate redundant memory accesses and/or more efficiently utilize the available memory bandwidth.

Although redundant memory accesses may be reduced or eliminated with scattered memory accesses, in some embodiments memory bandwidth may still be wasted, for example, due to a lack of memory accesses to fill the time interval between memory accesses of a task. Thus, some embodiments may switch tasks between memory accesses, as shown in FIG. 16B. For example, after issuing a memory access for a first task of a first k-mer, a processing element may switch to a second task belonging to a second k-mer while waiting for return data of the first task. Thus, the time interval incurred between scattered memory accesses may be filled with memory accesses belonging to different tasks and/or processing elements. In some embodiments, and depending on implementation details, this may further improve memory bandwidth utilization. In some embodiments, a combination of scattered memory accesses and task switching may produce a synergistic effect that may reduce or eliminate redundant memory accesses and/or efficiently utilize memory bandwidth.
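The combination of scattered accesses and task switching can be pictured with the following Python sketch, which retires a task as soon as one of its bloom filter entries reads back 0 and otherwise re-queues it so that other tasks' accesses fill the intervening slots; the round-robin queue and the hash-function interface are assumptions for illustration.

```python
from collections import deque

def interleaved_lookup(kmers, merged_bits, hash_fns):
    """Issue at most one bloom filter read per task per turn; drop a task on the
    first 0 entry, and interleave the remaining tasks' accesses in between."""
    tasks = deque((kmer, 0) for kmer in kmers)       # (k-mer, index of next hash function)
    non_unique = []
    while tasks:
        kmer, idx = tasks.popleft()                  # task switch to the next ready task
        if merged_bits[hash_fns[idx](kmer)] == 0:
            continue                                 # remaining accesses would be redundant
        if idx + 1 == len(hash_fns):
            non_unique.append(kmer)                  # all n entries were 1
        else:
            tasks.append((kmer, idx + 1))            # re-queue; other tasks run meanwhile
    return non_unique
```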

FIG. 17 shows an embodiment of a method of processing a data set according to an example embodiment of the present disclosure. The embodiment shown in FIG. 17 may begin at operation 1702. At operation 1704, the method may distribute a first portion of the data set to a first memory module. At operation 1706, the method may distribute a second portion of the data set to a second memory module. At operation 1708, the method may build a first local data structure at the first memory module based on the first portion of the data set. At operation 1710, the method may build a second local data structure at the second memory module based on the second portion of the data set. At operation 1712, the method may merge the first local data structure and the second local data structure. The method may end at operation 1714.
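Tying the pieces together, the flow of FIG. 17 can be sketched end to end by reusing the helper functions from the earlier illustrative sketches (build_local_filters, merge_and_scatter, count_kmers); the example input, module count, and the trailing count phase (which corresponds to FIG. 14C rather than to FIG. 17) are assumptions for the demo.

```python
def process_data_set(data_set, k=3, num_modules=3):
    """Distribute the data set, build local structures, merge them, then count."""
    portions, local_filters = build_local_filters(data_set, k, num_modules)  # distribute and build (1704-1710)
    merged_copies = merge_and_scatter(local_filters)                         # merge (1712)
    return count_kmers(portions, merged_copies, k)                           # subsequent count phase (FIG. 14C)

# The 3-mer ATC appears three times in this arbitrary input, so it should
# appear in the resulting table with a count of 3.
print(process_data_set("ATCGGATCTTATC"))
```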

The embodiment shown in FIG. 17, as well as all other embodiments described herein, is composed of example operations and/or components. In some embodiments, some operations and/or components may be omitted, and/or other operations and/or components may be included. Moreover, in some embodiments, the temporal and/or spatial order of the operations and/or components may be varied. Although some components and/or operations may be illustrated as separate components, in some embodiments some components and/or operations illustrated separately may be integrated into a single component and/or operation, and/or some components and/or operations illustrated as a single component and/or operation may be implemented with multiple components and/or operations.

The embodiments disclosed above have been described in the context of various implementation details, but the principles of the disclosure are not limited to these or any other specific details. For example, some functions have been described as being performed by certain components, but in other embodiments the functions may be distributed among different systems and components in different locations and with various user interfaces. Certain embodiments have been described as having particular processes, operations, etc., but these terms also encompass embodiments in which a particular process, operation, etc. may be implemented using multiple processes, operations, etc., or in which multiple processes, operations, etc. may be integrated into a single process, operation, etc. A reference to a component or element may refer to only a portion of the component or element. For example, a reference to an integrated circuit may refer to all or only a portion of the integrated circuit, and a reference to a block may refer to the entire block or one or more sub-blocks. Terms such as "first" and "second" in the present disclosure and claims may be used solely to distinguish the things they modify and may not indicate any spatial or temporal order unless apparent from the context. In some embodiments, a reference to a thing may refer to at least a portion of the thing; for example, "based on" may mean "based at least in part on," "access" may mean "accessed at least in part," etc. A reference to a first element does not imply the existence of a second element. For convenience, various organizational aids may be provided, such as section headings and the like, but the subject matter arranged in accordance with these aids and the principles of the present disclosure is not limited to these organizational aids.

The various details and embodiments described above may be combined to produce additional embodiments in accordance with the inventive principles of this patent disclosure. Since the inventive principles of this patent disclosure may be modified in arrangement and detail without departing from the inventive concepts, such changes and modifications are intended to fall within the scope of the appended claims.
