Optimizing storage and retrieval of compressed data

Document No.: 1909827    Publication date: 2021-11-30

Note: This technique, Optimizing storage and retrieval of compressed data, was created by Ronald Ray Trimble on 2019-04-29. Its main content is as follows: In some examples, a computer system may receive a plurality of data blocks of a data object. The system may compress the plurality of data blocks to obtain a plurality of compressed blocks, and may determine whether the plurality of compressed blocks together are less than a threshold size. Based on determining that the plurality of compressed blocks together are less than the threshold size, the system may add respective sizes of the plurality of compressed blocks to respective entries in a mapping data structure. In addition, the system may compress the mapping data structure by combining values of at least two of the respective entries, and may store the plurality of compressed blocks and the compressed mapping data structure.

1. A system, comprising:

one or more processors; and

one or more non-transitory computer-readable media storing executable instructions that, when executed by the one or more processors, configure the one or more processors to perform operations comprising:

receiving a plurality of data blocks of a data object;

compressing the plurality of data blocks to obtain a plurality of compressed blocks;

determining whether the plurality of compressed blocks together are less than a first threshold size;

based on determining that the plurality of compressed blocks together are less than the first threshold size, adding respective sizes of the plurality of compressed blocks to respective entries in a mapping data structure;

reducing the mapping data structure by combining values of at least two of the entries; and

storing the plurality of compressed blocks and the reduced mapping data structure.

2. The system of claim 1, wherein the plurality of data blocks is a first plurality of data blocks of the data object and the plurality of compressed blocks is a first plurality of compressed blocks, the operations further comprising:

receiving a second plurality of data blocks of the data object;

compressing the second plurality of data blocks to obtain a second plurality of compressed blocks;

determining whether the second plurality of compressed blocks together are less than the first threshold size;

based on determining that the second plurality of compressed blocks together are not less than the first threshold size, adding a value to each entry in the mapping data structure indicating that the second plurality of blocks are uncompressed; and

storing the second plurality of data blocks in uncompressed form in place of the second plurality of compressed blocks.

3. The system of claim 1, wherein at least one of the plurality of data blocks included with the plurality of compressed blocks is incompressible, the plurality of compressed blocks being compressible in their entirety, the operations further comprising:

including predetermined values for the at least one block into corresponding entries in the mapping data structure, the predetermined values having a size indicating that the at least one block is uncompressed; and

assigning an overflow value corresponding to a difference between the predetermined value and an actual size value of the at least one block to at least one other entry in the mapping data structure corresponding to at least one of the plurality of compressed blocks.

4. The system of claim 3, the operations further comprising:

determining whether each of the compressed blocks individually is less than a second threshold size;

determining that the at least one block is incompressible based on determining that the actual size value of the at least one block of the plurality of blocks exceeds the second threshold size after attempting compression;

inputting the predetermined value as a compressed size into a corresponding entry of the at least one block in the mapping data structure; and

assigning the overflow value by adding a difference between the actual size value and the predetermined value to the compressed size of the at least one compressed block, the compressed size of the at least one compressed block being less than the second threshold size.

5. The system of claim 3, wherein the predetermined value is combined with values of adjacent entries in the mapping data structure during compaction of the mapping data structure.

6. The system of claim 1, the operations further comprising generating a header for at least one of the mapping data structure and the reduced mapping data structure, the header comprising an indication of a storage location and an indication of a block size of the reduced mapping data structure.

7. The system of claim 6, wherein storing the reduced mapping data structure comprises at least one of:

storing the reduced mapping data structure in association with the header in a metadata database; or

storing the reduced mapping data structure in association with the data object at a network storage location after compressing at least a portion of the data object.

8. The system of claim 1, the operations further comprising:

receiving a request to access data in the data object after at least partial compression of the data object;

determining a data range to be retrieved from the data object;

based on the range, reading a portion of the data object using the reduced mapping data structure after at least partial compression of the data object;

decompressing compressed data in the portion of the data object to obtain the data range; and

returning the data range in response to the request.

9. The system of claim 8, wherein the mapping data structure includes at least one offset indicator as a reference point for the entry in the mapping data structure, each offset indicator being associable with a respective location in the data object in an uncompressed state, wherein reading a portion of the data object using the reduced mapping data structure comprises:

determining a location in the data object in uncompressed form corresponding to the request;

associating the location with an offset indicator in the mapping data structure as a starting location in the data object after at least partial compression of the data object; and

reading data based on the starting location and the entry in the reduced mapping data structure.

10. The system of claim 8, wherein reading a portion of the data object using the reduced mapping data structure comprises:

determining a location in the data object in uncompressed form corresponding to the request;

associating the location in the data object with a location in the reduced mapping data structure based on a compressed size of at least one block indicated in at least one entry in the reduced mapping data structure to determine a starting location in the data object; and

reading data of the data object based at least on the starting location.

11. The system of claim 1, the operations further comprising compacting the mapping data structure a plurality of times until a size of the compacted mapping data structure is less than a threshold mapping size.

12. The system of claim 1, wherein the mapping data structure includes an offset indicator as a reference point for the plurality of compressed blocks, the offset indicator being associable with a location in the data object in an uncompressed state, the offset indicator indicating a total size of compressed blocks and uncompressed blocks corresponding to the entry preceding the offset indicator in the mapping data structure.

13. One or more non-transitory computer-readable media storing instructions that, when executed by one or more processors, configure the one or more processors to perform operations comprising:

receiving a plurality of data blocks of a data object;

compressing the plurality of data blocks to obtain a plurality of compressed blocks;

determining whether the plurality of compressed blocks together are less than a threshold size;

based on determining that the plurality of compressed blocks together are less than the threshold size, adding respective sizes of the plurality of compressed blocks to respective entries in a mapping data structure;

reducing the mapping data structure by combining values of at least two of the entries; and

storing the plurality of compressed blocks and the reduced mapping data structure.

14. A method, comprising:

receiving, by one or more processors, a plurality of data blocks of a data object;

compressing the plurality of data blocks to obtain a plurality of compressed blocks;

determining whether the plurality of compressed blocks together are less than a threshold size;

based on determining that the plurality of compressed blocks together are less than the threshold size, adding respective sizes of the plurality of compressed blocks to respective entries in a mapping data structure;

reducing the mapping data structure by combining values of at least two of the entries; and

storing the plurality of compressed blocks and the reduced mapping data structure.

15. The method of claim 14, further comprising:

including at least one offset indicator into the mapping data structure as a reference point for the plurality of compressed blocks, the offset indicator being capable of being correlated to a location in the data object in an uncompressed state,

wherein the offset indicator indicates a total size of compressed blocks and uncompressed blocks corresponding to an entry preceding the offset indicator in the mapping data structure.

Technical Field

The present disclosure relates to the field of compressed data storage and retrieval.

Background

Data stored in a storage system may sometimes be compressed to increase the amount of data that the storage system can hold. As one example, a compressed data object may be stored over a network at one or more network storage locations. If a user wishes to access a portion of a stored data object through a range read, typically all object data preceding the requested portion must be retrieved from network storage and decompressed. For example, conventional techniques may not be able to begin decompression at a particular requested location without first determining how the preceding data was compressed. Thus, it is generally necessary to decompress all of the preceding data to decompress and access the desired portion. This process may be computationally inefficient and time consuming.

Disclosure of Invention

Some implementations include a computer system that can receive multiple data blocks of a data object. The system may compress the plurality of data blocks to obtain a plurality of compressed blocks, and may determine whether the plurality of compressed blocks together are less than a threshold size. Based on determining that the plurality of compressed blocks are smaller than the threshold size together, the system may add respective sizes of the plurality of compressed blocks to respective entries in the mapping data structure. In addition, the system may reduce the mapping data structure by combining values of at least two of the respective entries, and may store a plurality of compressed blocks and the reduced mapping data structure.

Drawings

The detailed description is set forth with reference to the accompanying drawings. In the drawings, the left-most digit(s) of a reference number identify the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items or features.

FIG. 1 illustrates an example architecture of a system capable of storing and accessing compressed data according to some implementations.

FIG. 2 illustrates an example data structure including block sizes and map sizes for various data sizes according to some implementations.

FIG. 3 illustrates example data structures representing distribution of special-value overflow and map compaction according to some implementations.

Fig. 4 illustrates an example of the operation of an allocation algorithm according to some implementations.

Fig. 5 illustrates an example data structure and layout of a map header according to some implementations.

FIG. 6 illustrates an example data structure configuration of a map according to some implementations.

FIG. 7 illustrates an example data structure configuration of a map according to some implementations.

FIG. 8 illustrates an example data structure configuration of a mapping after one compaction according to some implementations.

Fig. 9 is a flow diagram illustrating an example process for compression and map generation according to some implementations.

FIG. 10 is a flow diagram illustrating an example process that is a continuation of the process of FIG. 9, according to some implementations.

Fig. 11 is a flow diagram illustrating an example process for map-based decompression according to some implementations.

FIG. 12 is a flow diagram illustrating an example process that is a continuation of the process of FIG. 11, according to some implementations.

Detailed Description

Some implementations herein relate to techniques and arrangements for enabling high-performance range reads of compressed data and/or compressed data mixed with uncompressed data. In some cases, the data may be stored at one or more network storage locations in a cloud or otherwise accessed over a network. In some examples, a mapping data structure (hereinafter "map") may be generated and maintained to associate compressed and uncompressed locations in the data. For example, as data is being compressed, a map may be created to associate locations in the original uncompressed object with offsets in the compressed data. Furthermore, when access to a range of object data of a compressed object is requested, e.g., based on a user request or the like, the map may be used to begin decompression at an offset just before the requested range, rather than having to decompress the entire compressed object.
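The map-based lookup described above can be sketched as follows. This is a minimal illustration, not the patented implementation; the map is assumed to be a simple list of per-block compressed sizes, with a zero entry marking a block stored uncompressed (a convention described later in this disclosure):

```python
def compressed_offset_for(block_map, block_size, uncompressed_pos):
    """Return (block_index, byte offset in the compressed object) at which
    decompression can begin to serve a read starting at uncompressed_pos."""
    block_index = uncompressed_pos // block_size  # block containing the position
    offset = 0
    for entry in block_map[:block_index]:
        # A zero entry marks a block stored uncompressed at full block size.
        offset += entry if entry != 0 else block_size
    return block_index, offset
```

Only the data from the returned offset onward needs to be read and decompressed, rather than the entire object.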

In some examples, the data may be divided and compressed into blocks (e.g., having a specified default size, such as 32K, 64K, 128K, 256K, etc.). For example, a block may have a specified block size that corresponds to the number of bytes of the original object in each compression request. The block size of the last block in the object may be smaller than the block sizes of the other blocks in the object. After each block is compressed, an entry indicating the compressed size of the block may be added to the map. In addition, a map header may be generated to describe the map, including information such as the block size. The map header may be accessed before the map and, in some cases, may be used to locate the map. The map itself may be appended to or otherwise follow the map header, or the map may be appended to the compressed object.
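As a rough sketch of the per-block compression and map-entry generation described above (Python's zlib is used purely for illustration; the disclosure does not mandate a particular codec, and the function name is assumed):

```python
import zlib

BLOCK_SIZE = 64 * 1024  # an example default block size

def compress_and_map(data: bytes):
    """Compress fixed-size blocks and record each compressed size as a map entry."""
    block_map = []            # one entry per block: its compressed size
    compressed = bytearray()
    for start in range(0, len(data), BLOCK_SIZE):
        block = data[start:start + BLOCK_SIZE]   # the last block may be short
        cblock = zlib.compress(block)
        block_map.append(len(cblock))
        compressed += cblock
    return bytes(compressed), block_map
```

Because each entry records a compressed size, the sum of the entries locates any block boundary inside the compressed object.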

In some cases, the map may be reduced to a smaller size, for example, to enable the map to fit entirely in memory. To compact the map, a compacted region cannot contain a mixture of compressed and uncompressed data. To achieve this, streaming object data may be compressed in blocks (e.g., 64K) grouped into larger streaming sets (e.g., 512K). If an entire set of blocks forming a region (called an "operation" in some examples) is incompressible, every block size in the set may be recorded in the map as a compressed size of "0", and compression is not used for any of the corresponding blocks. Compacting a set of uncompressed blocks in the map is then a matter of adding the all-zero entries, which yields a zero-sized compacted entry that still indicates uncompressed data.

Implementations herein provide optimized range-read performance for compressed files and other data objects. Examples may include a compression map that associates compressed and uncompressed locations, along with techniques to minimize the size of the map, e.g., limiting it to 2 bytes per 64K of compressed data. Further, examples herein provide the ability to compress streaming files and other streaming data without knowing the initial size and compressibility of the data, and to use compression only over regions where compression is beneficial.

Furthermore, with the map-compaction techniques herein, the entire map may be kept in memory during compression, which may not be possible with conventional mapping techniques because the map can become too large. Additionally, some examples allow an initially optimal block size (e.g., 64K) and enable reduction of the map as it grows larger, so that the map can be kept in memory. In addition, compressible regions and incompressible regions may be combined during compaction without using additional bytes in the map. In some examples, an impossible size of zero or another suitable indicator may be used to indicate that compression is not used for a given data region.

Additionally, examples herein overcome challenges presented when a set of blocks is compressible but one or more individual blocks are not. For example, when using block sizes such as 64K or 16M, the optimized size field of 2 or 3 bytes, respectively, exactly fits the maximum possible size, leaving no room to record a size increase. Given that each set (operation) must save at least 2 bytes per 64K for compression to be used, a special compressed size of 64K-1 (or multiples thereof for larger blocks) may be used in the map to indicate that compression provided no benefit for a particular block. The excess bytes may be added to neighboring entries in the set using a distribution algorithm that still allows simple addition of neighbors during compaction. If an entry holds the special size, the individual block location in the compressed object cannot be determined, and decompression may instead begin at a set boundary, where the location can be determined.
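One possible reading of this special-size scheme, sketched with assumed names (SPECIAL, record_incompressible): the incompressible block's entry is set to 64K-1, and the shortfall is pushed onto a compressed neighbor in the same set, so the entries still sum to the bytes actually stored:

```python
BLOCK_SIZE = 64 * 1024
SPECIAL = BLOCK_SIZE - 1   # largest value a 2-byte entry can hold; flags "stored raw"

def record_incompressible(block_map, i, actual_size, neighbor):
    """Mark block i as stored uncompressed and distribute the overflow bytes
    onto a compressed neighbor in the same set, keeping the set total exact."""
    overflow = actual_size - SPECIAL        # e.g. 65536 - 65535 = 1
    block_map[i] = SPECIAL
    block_map[neighbor] += overflow         # neighbor is well below the limit, so it has room
```

Because plain addition is preserved, compacting the set later still just sums adjacent entries.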

Additionally, some examples include optimizations that allow incompressible blocks to avoid compression. For example, such a block may be represented by a compressed size of 0 to indicate that compression is not used. If compressing the block does not reduce its size enough to make compression worthwhile, the uncompressed data may be included in the compressed object instead of the compressed data, and a 0 may be associated with that data in the map to indicate that compression is not used.

Further, some examples include map reduction to minimize map size. For example, each entry in the map may be a representation of the compressed size of a block. This means that each entry only needs to be large enough to hold the block size of the corresponding block, on the assumption that compression saves at least some space. Thus, a block size of 64K or less may use only two bytes per block in the map. For block sizes larger than 64K but smaller than 16MB, three bytes may be used per block in the map. Furthermore, where the block size is larger than 16MB, four bytes may be used per data block in the map.
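The entry-width rule above can be expressed directly (the handling of a block size of exactly 16MB is an assumption; the text only addresses sizes below and above it):

```python
def entry_width(block_size: int) -> int:
    """Bytes per map entry needed to hold a compressed size for this block size."""
    if block_size <= 64 * 1024:
        return 2            # 64K or less fits in two bytes
    if block_size < 16 * 1024 * 1024:
        return 3            # under 16MB fits in three bytes
    return 4                # larger blocks need four bytes
```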

Additionally, some examples include optimizations that allow streaming of data through periodic map reduction. Initially, in the case of a compressed streaming object, the full size of the streaming object is typically unknown. Streaming data may be compressed and stored at a network storage location (also referred to as cloud storage in some examples). To handle map creation efficiently, the map may be kept in memory until it is written to the cloud with the compressed object, and in some cases the map may be reduced periodically to keep its size below a specified threshold. By keeping the entire map in memory, it is unnecessary to write the map to local storage (e.g., a hard disk drive or solid-state drive) during creation and then read it back after compression completes in order to send the completed map to a network storage location for storage in association with the compressed data. In some examples, the map may be appended to the compressed object after compression of the object is complete, such that the map may be included in the compressed size of the data object (see, e.g., FIG. 2).
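A sketch of keeping a streaming map bounded in memory, under the assumption (per the compaction scheme described herein) that reduction sums adjacent pairs and doubles the represented block size; MAX_ENTRIES and the function name are illustrative:

```python
MAX_ENTRIES = 8  # illustrative in-memory threshold; a real system would use far more

def append_streaming_entry(block_map, compressed_size, block_size):
    """Append one map entry for a newly compressed block; if the map grows past
    the threshold, compact it by summing adjacent pairs and doubling the
    represented block size. Returns the (possibly doubled) block size."""
    block_map.append(compressed_size)
    if len(block_map) > MAX_ENTRIES and len(block_map) % 2 == 0:
        it = iter(block_map)
        block_map[:] = [a + b for a, b in zip(it, it)]  # pairwise sums, in place
        block_size *= 2
    return block_size
```

In this sketch compaction waits for an even entry count so every pair lines up on a block boundary.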

As one non-limiting example, when a 64KB block size is used, the map for a 2TB object can occupy approximately 62MB. To handle compression of such large objects while keeping the map in memory, the maps herein may be reducible. Compacting the map may be accomplished by combining adjacent entries and doubling the block size. For example, suppose the 62MB map of a 2TB object is reduced until the block size represented by each map entry is 1MB instead of 64KB. In this case, the map may use less than 6MB instead of 62MB.
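The arithmetic behind this example, assuming decimal terabytes and 2-byte entries, which is consistent with the figures quoted above:

```python
TB = 10**12
block = 64 * 1024

entries = (2 * TB) // block            # ~30.5 million 64KB blocks in a 2TB object
map_bytes = entries * 2                # two bytes per entry, roughly 61MB
# Four pairwise compactions raise the block size 64KB -> 1MB (a factor of 16)
# and cut the entry count by the same factor:
compacted_bytes = (entries // 16) * 2  # under 4MB, comfortably below 6MB
```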

In the examples herein, compaction must not result in a mixture of compressed and uncompressed blocks within a single entry. Therefore, the block size at maximum reduction may be used as the operation size. For example, an operation may include a set of blocks that all use compression or that all skip compression. Thus, the system may determine whether compression should be used after collecting the results from all blocks included in the operation. For example, each operation may be a power-of-2 set of blocks. All blocks in an operation are compressed before determining whether the operation should be stored compressed. If the operation is incompressible (e.g., compression results in less than a threshold amount of overall size reduction), the map is updated with a 0 for each block in the operation to indicate that no compression is used. If compression is used for any block in an operation, compression is used for all blocks in the operation.

For purposes of discussion, some example implementations are described in the context of one or more serving computing devices in communication with one or more network storage locations and one or more user devices capable of optimizing data compression. However, implementations herein are not limited to the specific examples provided and may extend to other types of computing system architectures, other types of storage environments, other types of client configurations, other types of data, and so forth, as will be apparent to those of skill in the art in light of the disclosure herein.

FIG. 1 illustrates an example architecture of a system 100 capable of storing and accessing compressed data according to some implementations. The system 100 includes one or more service computing devices 102 capable of communicating with, or otherwise being coupled to, at least one storage system 104, such as through one or more networks 106. Further, the one or more service computing devices 102 can communicate over the one or more networks 106 with one or more user computing devices 108 (e.g., user devices 108(1), 108(2), …), which can be any of various types of computing devices, as discussed further below.

In some examples, the one or more service computing devices 102 may include one or more servers that may be implemented in any number of ways. For example, the programs, other functional components, and at least a portion of the data store of the one or more service computing devices 102 may be implemented on at least one server, such as in a server cluster, server farm, data center, cloud-hosted computing service, and so forth, although other computer architectures may additionally or alternatively be used. In the illustrated example, the one or more service computing devices 102 include, or may be associated with, one or more processors 110, one or more computer-readable media 112, and one or more communication interfaces 114.

Each processor 110 may be a single processing unit or multiple processing units and may include a single or multiple computing units or multiple processing cores. The one or more processors 110 may be implemented as one or more central processing units, microprocessors, microcomputers, microcontrollers, digital signal processors, state machines, logic circuitry, and/or any device that manipulates signals based on operational instructions. As one example, the one or more processors 110 may include one or more hardware processors and/or any suitable type of logic circuitry specifically programmed or configured to perform the algorithms and processes described herein. The one or more processors 110 may be configured to retrieve and execute computer-readable instructions stored in the computer-readable medium 112, which may program the one or more processors 110 to perform the functions described herein.

The computer-readable media 112 may include memory 113 and local storage 115. For example, the memory 113 may include volatile and nonvolatile memory and/or removable and non-removable media implemented in any type of technology for the storage of information, such as computer-readable instructions, data structures, program modules, or other data. For example, the memory 113 may include, but is not limited to, RAM, ROM, EEPROM, flash memory, or other memory technology. Furthermore, the local storage 115 and other computer-readable media 112 may include optical storage, solid-state storage, magnetic tape, disk storage, RAID storage systems, storage arrays, network-attached storage, storage area networks, cloud storage, or any other medium that can be used to store the desired information and that can be accessed by a computing device. Depending on the configuration of the service computing device 102, the computer-readable media 112 may be tangible, non-transitory media; as mentioned herein, non-transitory computer-readable media exclude media such as energy, carrier signals, electromagnetic waves, and signals per se. In some cases, the computer-readable media 112 may be co-located with the service computing device 102, while in other examples a portion of the computer-readable media 112 may be located partially remote from the service computing device 102. For example, in some cases, the computer-readable media 112 may include a portion of storage in the one or more storage systems 104.

The computer-readable medium 112 may be used to store any number of functional components that may be executed by the one or more processors 110. In many implementations, these functional components include instructions or programs that are executable by the one or more processors 110, which when executed, specifically program the one or more processors 110 to perform the actions ascribed herein to the service computing device 102. Functional components stored in computer-readable medium 112 may include a server program 116 and a storage program 118, each of which may include one or more computer programs, applications, executable code, or portions thereof. For example, the server program 116 may provide communication functionality with the user device 108 and one or more storage systems 104.

The storage program 118 may be configured to perform the data compression/decompression and map generation and management functions described herein. In addition, the storage program 118 may include database management functionality for creating and managing a metadata database (DB) 120 containing metadata related to data stored and managed by the one or more service computing devices 102. For example, the storage program 118 may include executable instructions configured to cause the storage program 118 to maintain a file system, object information, data management information, and other information as part of the metadata database 120. The storage program 118 may also perform management functions for other types of information included in the metadata database 120, such as user information. The storage program 118 may further manage the storage of data at the one or more network storage systems 104, e.g., managing retention periods, storage protection levels, copying of stored data to other sites for disaster recovery, and the like.

In addition, the computer-readable media 112 may store data, data structures, and other information for performing the functions and services described herein. For example, during object compression, the computer-readable media 112 may store the map 122 of a compressed object 124 in the memory 113; if the compressed object is too large to fit in the memory 113, the compressed object may be stored in the local storage 115. Additionally, the computer-readable media 112 may store the metadata DB 120, which is used by the storage program 118 in performing some of the functions described herein, as discussed further below. In addition, the memory 113 may include an ingest buffer 126 and an output buffer 128, which may be used to temporarily store at least a portion of the data, such as during compression and/or decompression of the data.

The service computing device 102 may also include or maintain other functional components and data (which may include programs, drivers, etc.) and data used or generated by the functional components. Further, the service computing device 102 may include many other logical, program, and physical components, of which the above are merely examples relevant to the discussion herein.

The one or more communication interfaces 114 may include one or more software and hardware components for enabling communication with various other devices, e.g., over the one or more networks 106. Thus, the communication interfaces 114 may include, or may be coupled to, one or more ports that provide connection through the one or more networks 106 to the one or more storage systems 104, one or more other service computing devices 102, and the user devices 108. For example, the one or more communication interfaces 114 may enable communication through one or more of a LAN, the Internet, cable networks, cellular networks, wireless networks (e.g., Wi-Fi) and wired networks (e.g., Fibre Channel, fiber optic, Ethernet), direct connections, and short-range communications such as BLUETOOTH®, as enumerated additionally elsewhere herein.

The one or more networks 106 may include any suitable network, including: a wide area network, such as the Internet; a local area network (LAN), such as an intranet; wireless networks, such as cellular networks, local wireless networks (e.g., Wi-Fi), and/or short-range wireless communications (e.g., BLUETOOTH®); a wired network including Fibre Channel, fiber optic, Ethernet, or any other such network; a direct wired connection; or any combination thereof. Thus, the one or more networks 106 may include wired and/or wireless communication technologies. The components used for such communications may depend at least in part on the type of network, the environment selected, or both. Protocols for communicating over such networks are well known and will not be discussed in detail herein. Accordingly, the one or more service computing devices 102, the one or more network storage systems 104, and the user devices 108 can communicate over the one or more networks 106 using wired or wireless connections, and combinations thereof.

Each user device 108 may be any suitable type of computing device, such as a desktop, laptop, tablet computing device, mobile device, smartphone, wearable device, and/or any other type of computing device capable of sending data over a network. Users 130(1), 130(2), … may be associated with user devices 108(1), 108(2), …, respectively, e.g., through respective user accounts, user login credentials, or the like. Further, the user devices 108 may be capable of communicating with the one or more service computing devices 102 over the one or more networks 106, over a separate network, or over any other suitable type of communication connection. Many other variations will be apparent to those of ordinary skill in the art having the benefit of the disclosure herein.

Further, each user device 108(1), 108(2), … may include a respective instance of a user application 136(1), 136(2), … that may be executed on the respective user device 108(1), 108(2), …, e.g., for communicating with the server program 116, such as for sending user data for storage on the one or more network storage systems 104 and/or receiving stored data from the one or more network storage systems 104. In some cases, the application 136 may include, or may be operated on by, a browser, and the server program 116 may include a web application for enabling a user to access stored data through the one or more service computing devices 102. Alternatively, in other cases, the application 136 may include any other type of application having communication functionality for communicating with the server program 116 over the one or more networks 106.

The one or more storage systems 104 may include one or more storage computing devices 140, and the storage computing devices 140 may include one or more servers or any other suitable computing device, such as any of the examples discussed above with respect to the service computing device 102. The one or more storage computing devices 140 may each include one or more processors 142, one or more computer-readable media 144, and one or more communication interfaces 146. For example, processor 142 may correspond to any of the examples discussed above with respect to processor 110, computer-readable medium 144 may correspond to any of the examples discussed above with respect to computer-readable medium 112, and communication interface 146 may correspond to any of the examples discussed above with respect to communication interface 114.

Additionally, the computer-readable media 144 may include a storage program 148 as a functional component executed by the one or more processors 142 for managing the storage of data on storage 150 included in the one or more storage systems 104. The storage 150 may include one or more controllers 152 associated with the storage 150 for storing data on one or more arrays 154 of storage devices 156, and the like. For example, a controller 152 may control an array 154, e.g., to configure the array 154 in a RAID configuration, a JBOD configuration, or the like, and/or to present logical units based on the storage devices 156 to the storage program 148, and to manage data, e.g., data objects or other data, stored as cloud data 158 on the underlying physical storage devices 156. The storage devices 156 may be any type of storage device, such as hard disk drives, solid state drives, optical drives, magnetic tape, combinations thereof, and the like. In some examples, the one or more network storage systems 104 may include commercially available cloud storage as is known in the art, while in other examples, the one or more network storage systems 104 may include private or enterprise storage systems, accessible only by entities associated with the one or more service computing devices 102, or combinations thereof.

In the system 100, users 130 may store data to, and receive data from, the one or more service computing devices 102 with which their respective user devices 108 are in communication. Accordingly, the one or more service computing devices 102 may provide local storage for the users 130 and their corresponding user devices 108. During steady state operation, there may be user devices 108 that periodically communicate with the one or more service computing devices 102.

In some cases, the one or more service computing devices 102 may be arranged into one or more groups, clusters, systems, or the like. For example, in some cases (not shown in FIG. 1), the one or more service computing devices 102 may be arranged in pairs, in which a first service computing device 102 may be coupled to a second service computing device 102 so that together they form a computing node for providing storage and data management services to a plurality of user devices 108. For instance, at least for maintaining the metadata database 120, the first service computing device 102 may act as a primary computing device, while the second service computing device 102 may act as a secondary computing device. Additionally or alternatively, in some examples, the one or more service computing devices 102 may be arranged in a plurality of clusters at geographically separate and remote locations to enable replication and disaster recovery. Many other configurations will be apparent to those of skill in the art having the benefit of the disclosure herein.

In the example of FIG. 1, suppose that the first user 130(1) streams a data object, as indicated at 160, to the service computing device 102 for storage at the network storage system 104. The storage program 118 may be executed on the service computing device 102 to compress the streamed data object 160. As the object is streamed to the service computing device 102, the map 122 can be created and maintained in the memory 113 without storing the map 122 to disk in the local storage 115. As one example, the storage program 118 may initially use a default chunking when generating the map 122, such as 64 KB blocks in 512 KB operations. This may provide better performance during data retrieval than, e.g., a single map setting of 256 KB blocks generally expected to always fit in memory: for most objects, the default arrangement may reduce the map memory required for large objects by half, while providing four times the range-read performance at the cost of only doubling the memory or map size requirements. Additional examples of map sizes associated with various object sizes and block sizes are discussed below, e.g., with respect to FIG. 2. Furthermore, these parameters may be tuned by the user based on the user's requirements.

When the object has finished streaming to, and been compressed by, the service computing device 102, in some examples the map 122 may be appended to the compressed object 124, and the compressed object 124 and map 122 may be sent to the network storage system 104 over the one or more networks 106. The network storage system 104 may store the compressed object 124 and the map 122 in the storage 150. In addition, a map header (not shown in FIG. 1) may be created for the map 122 and may be stored in the metadata database 120 in association with other metadata for the object 124. Further, while in this example the map 122 may be appended to the compressed object 124, in other examples the map 122 may be stored in the network storage system 104 separately from the associated compressed object 124. In still other cases, the map 122 may be appended to the map header and stored in the metadata database 120.

Further, in some cases, one or more portions of the compressed object 124 may be sent to the network storage system 104 separately from one or more other portions of the compressed object 124 and/or the map 122. For example, as a portion of the streamed data object 160 is compressed, the compressed portion may be sent to the network storage system 104 independently, such as where the object is very large, or where it is desirable to send data directly from the memory 113 instead of storing the compressed object to the local storage 115 and then reading the compressed object back from the local storage 115. Many other variations will be apparent to those of ordinary skill in the art having the benefit of the disclosure herein.

Further, as an example, suppose that the user 130(2) subsequently sends a data request 170 requesting that a portion of the compressed object 124 be delivered to the user device 108(2). The storage program 118 may access the map header, determine that the map is stored at the network storage system 104, and may obtain the map 122. Based on determining the location in the uncompressed object from which to begin reading data to obtain the requested portion of data, and based on the offsets in the map 122 and other information discussed further below, the storage program 118 may send a read request 174 for a portion of the data object to the network storage system 104. In response, the network storage system 104 may send the requested data portion to the service computing device 102. The storage program 118 may decompress one or more portions of the requested data portion and may reply to the data request 170 with the uncompressed data portion 178.

FIG. 2 illustrates an example data structure 200 including map sizes for various block sizes and data sizes according to some implementations. The data structure 200 in this example shows the map size for each combination of block size and data size (e.g., object size) when serialized with an offset frequency of 1024 entries. The offset frequency may indicate the number of map entries (compressed sizes) that occur between absolute offsets in the map. An absolute offset may include an offset into the compressed object at which there is a block to which the associated map entry applies. In addition, a compressed size entry may include the compressed size of a single block represented in the map. In some cases, a compressed size entry may be 0 to indicate that compression is not used, or may be a special value to indicate that the compressed size of the individual block is unknown.

In the example shown, the data structure 200 includes a data size column 202 that lists a range of data sizes from 10 KB to 2,000,000,000 KB. The data structure also includes seven map size columns 204, 206, 208, 210, 212, 214, and 216, listing the map sizes for block sizes of 32 KB, 64 KB, 128 KB, 256 KB, 512 KB, 1024 KB, and 2048 KB, respectively, for the corresponding data sizes listed in column 202. For example, when the data size is 100,000 KB and the block size is 64 KB, the map size is 3134 bytes.
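As a consistency check, the 3134-byte figure is reproducible under one plausible reading of the serialization described later (two-byte size entries for 64 KB blocks, plus an eight-byte absolute offset inserted after every 1024 entries). The helper below is an illustrative sketch under that assumption, not code from the implementation:

```python
import math

def map_size_bytes(data_kb, block_kb, entry_bytes=2, offset_freq=1024):
    """Estimated serialized map size: one size entry per block, plus an
    8-byte absolute offset inserted after each full offset_freq entries."""
    n_blocks = math.ceil(data_kb / block_kb)
    n_offsets = (n_blocks - 1) // offset_freq  # no offset before the first segment
    return n_blocks * entry_bytes + n_offsets * 8

print(map_size_bytes(100_000, 64))  # 3134, matching the FIG. 2 example
```

For 100,000 KB of data in 64 KB blocks this yields 1563 two-byte entries plus one eight-byte offset, i.e., 3134 bytes.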

In some examples herein, an entire operation (i.e., a group of contiguous blocks) may or may not be compressed. The compressibility of an operation may be determined by the total space savings across all blocks in the operation. For example, if the compression does not achieve at least a threshold level of savings, then the entire operation may be left uncompressed.

On the other hand, even though one or more individual blocks in an operation may be incompressible or only minimally compressible, the entire operation may still be compressed if the other blocks in the operation are sufficiently compressible to reach the threshold compression level for the entire operation. In this case, there can be a problem in that the size of an incompressible block does not fit into the optimally sized map entry. For example, 64K-1 is the largest number that two bytes can represent, and two bytes is the optimal entry size in the map for a 64 KB block. To address this issue, a special size of 64K-1 may be used to indicate that the actual compressed size is at least 64K-1.

The additional bytes may be allocated to other compressed size entries in the operation using an allocation algorithm, discussed further below, that ensures the correct combinations of entries during reduction. When this occurs for at least one block in an operation, the size entries for the other blocks in the operation may be incorrect; however, the total compressed operation size is still the sum of the compressed block size entries in the operation. In this case, when any map entry for an operation has the special size, decompression can only be aligned on the first block in the operation. When this is the case, it may be necessary to discard decompressed data blocks until the start of the desired range is reached.

Because the special size is maintained through map reduction, its value doubles with each reduction. A special size is only needed when the entry size is optimal for the block size, e.g., two bytes for a 64K block or three bytes for a 16 MB block, but in such cases the special size is maintained through the reduction. As another example, when a special size is used for a map with a 16 MB block size, the special size for a 16 MB block is ultimately 16M-256. As one non-limiting example, the minimum average space savings required for an operation to be considered compressible may be 2 bytes per 64K, regardless of block or operation size. A higher compression threshold can be configured, but not a lower one. Further, the operation size needs to be known during reading, so the operation size value may be included in the map header, as discussed further below.

FIG. 3 illustrates an example data structure 300 representing allocation of special value overflow and map reduction according to some implementations. The data structure 300 includes a block size 302, a compressed size (actual size) 304, the compressed size in the map (with 2-byte entries) 306, the compressed size in the once-reduced map (3-byte entries) 308, and the compressed size in the twice-reduced map (3-byte entries) 310. In FIG. 3, the block size is 64K, as shown at 302. Further, the actual size after compression is shown at 304, which indicates that the first four blocks are not compressed (as shown by the "0" compressed sizes in the map at column 306).

In this example, assume that the first four entries form a first, uncompressed operation; a second operation starts at row 312 and includes multiple special values; and a third operation starts at row 320 and also includes multiple special values. In particular, the actual size after compression shown in column 304 exceeds 64K at rows 312, 316, 318, 322, and 326. Thus, the extra bytes are subtracted and reallocated so that each of the values in column 306 for rows 312, 316, 318, 322, and 326 equals 64K-1, i.e., 65535. For example, adding the 6 extra bytes (2 each from rows 312, 316, and 318) to the value of row 314 increases that value from 500 in column 304 to 506 in column 306. Similarly, two bytes (from row 322) are added to the value 1000 of row 320, and 3 bytes (from row 326) are added to the value of row 324. Thus, the extra bytes are allocated among the other compressed block values to create the special values in column 306.

As described above, a particular block may be incompressible in the middle of a compressible operation. The compressed size entries in the map minimize the map size by using the minimum number of bytes per entry. When the number of bytes for an entry only barely accommodates the block size (e.g., in the case of a 64K block), the compressed size will not fit in the entry unless the block is compressible. Two bytes (the map entry size for a 64K block) can represent a number from 0 to 65535 (64K-1); a size of 64K or above cannot be represented in 2 bytes. Thus, in implementations herein, to still use compression for such operations, a special size may be used in the map entry to indicate that the actual compressed size is unknown.

When an entire operation (i.e., a group of blocks) is compressible, but one or more blocks within it are incompressible, the overflow above the special size from the one or more incompressible blocks may be allocated to other map entries of the operation. This keeps the post-compression operation size correct and allows navigation at the operation level. For example, if the map would never be reduced, it would be sufficient to allocate the extra bytes to any other block with a smaller compressed size. However, because the map may be reduced later, the overflow is allocated to correspond exactly to the manner in which entries will later be combined during reduction. During reduction of an operation with special sizes, the actual compressed sizes of the blocks in the operation cannot be determined; this allocation means that the reduction can be as simple as combining adjacent pairs. When neither of two adjacent combined entries has a special size, the result will not have a special size, and the combined entry represents the compressed size of the corresponding combined blocks. In the case where all the compressed sizes are still known, the original allocation determines how the map entries may be combined. By using a special size of (64K-1) x 2^n (where n is the number of reductions beyond a 64K block size), reduced block sizes combined from 64K block sizes adjacent to a special size may be combined with the special size to generate the special size in the correct entry of the reduced map.

As shown in column 308, during the first reduction, the first four entries may be combined into two resulting entries with a value of "0". Further, when the values of rows 312 and 314 in column 306 are combined, the result in column 308 is 66041, i.e., 65535+506 (where 4 of the reallocated bytes still come from rows 316 and 318, whose pair cannot yet be resolved). Additionally, at rows 316 and 318, the two special sizes combine to produce a new special size, 131070. Rows 320 and 322 are combined in a manner similar to rows 312 and 314, as are rows 324 and 326. During a second reduction, the values in column 308 may be similarly combined in adjacent pairs to further reduce the map, producing the values shown in column 310.
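The pairwise combination can be sketched as simple addition of adjacent entries, since a special entry plus a normal entry (66041 above) and two special entries (131070 above) both come out correctly. The snippet below is an illustrative sketch with names of my choosing:

```python
def reduce_map(entries):
    """One reduction pass: sum each adjacent pair of size entries.
    '0' pairs stay '0', two special sizes sum to the doubled special size,
    and an odd trailing entry is carried over unchanged."""
    out = [entries[i] + entries[i + 1] for i in range(0, len(entries) - 1, 2)]
    if len(entries) % 2:          # odd count: last entry is not combined
        out.append(entries[-1])
    return out

# Column 306 of FIG. 3 (first two operations) -> column 308:
print(reduce_map([0, 0, 0, 0, 65535, 506, 65535, 65535]))  # [0, 0, 66041, 131070]
```

Running it once more on the result reproduces the second reduction into column 310.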

To ensure that overflow from a special size is allocated such that all future reductions maintain the integrity of the compressed sizes, an allocation algorithm such as that described below may be used to allocate the overflow. The allocation algorithm may operate on the array of actual compressed sizes for the operation, the position of the oversized entry, the special size value (e.g., 64K-1), the operation power (the blocks per operation, expressed as a power of 2), and the number of blocks in the final operation, which cannot be larger than the blocks per operation. Candidates for receiving the overflow can be determined by examining power-of-2 groups of blocks, starting with 1 and increasing to the power of the operation. The starting entry of the group to be examined can be determined by representing the position of the oversized entry as a binary value and zeroing the current power-of-2 low-order bits. In effect, this starts the examination at the beginning of the current power-of-2 group of entries, i.e., the entries that may be combined with the oversized entry by future potential reductions. These entries are adjusted to take on the overflow, without exceeding the special size, until the overflow is fully allocated.

In the example of FIG. 3, the second operation (rows 312-318) and the third operation (rows 320-326) represent example results of allocating overflow among the two-byte compressed size entries using the allocation algorithm herein.
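As a reconstruction of the algorithm described above (the function name and the exact splitting policy are assumptions of this sketch, not the implementation's code), the following Python snippet reproduces the column 306 values of FIG. 3 for the second operation:

```python
def allocate_overflow(sizes, big_index, special, op_power):
    """Move the bytes above `special` out of the oversized entry into other
    entries of the operation, visiting power-of-2 groups (pairs first) so
    that future pairwise reductions keep combined sizes correct."""
    overflow = sizes[big_index] - special
    sizes[big_index] = special
    for p in range(1, op_power + 1):
        # Zero the p low-order bits of the position to find the start of the
        # 2^p-entry group that future reductions would combine together.
        start = (big_index >> p) << p
        for i in range(start, min(start + (1 << p), len(sizes))):
            if i == big_index or sizes[i] >= special:
                continue  # skip the entry itself and other special entries
            take = min(overflow, special - 1 - sizes[i])
            sizes[i] += take
            overflow -= take
            if overflow == 0:
                return

# Actual compressed sizes for the second operation of FIG. 3 (rows 312-318):
op = [65537, 500, 65537, 65537]
for idx in range(len(op)):
    if op[idx] > 65535:
        allocate_overflow(op, idx, 65535, op_power=2)
print(op)  # [65535, 506, 65535, 65535] -- column 306 of FIG. 3
```

Here the two extra bytes of row 312 land on the adjacent pair partner (row 314), while the overflow of rows 316 and 318 falls through to the wider group because their own pair holds only special entries, matching the pass ordering of FIG. 4.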

FIG. 4 illustrates examples of the operation of the allocation algorithm according to some implementations. The allocation algorithm may make one or more passes to allocate the overflow from a special size to other block representations (operation entries). For example, the first example 401 includes operation entries 402, a first pass 404, a second pass 406, and a third pass 408. As shown at 410, the operation entries 402 include an incompressible entry. The allocation algorithm may perform the first pass 404 to check the immediately adjacent entry in the same operation 412. Next, if the first pass did not result in allocating the overflow, the allocation algorithm may perform the second pass 406 to check other entries in the same operation 412. Next, if the second pass did not result in allocating the overflow, the allocation algorithm may check entries in the neighboring operation 414. The allocation algorithm may continue until the overflow is allocated.

Similarly, in the second example 420, the operation entries 402 include an incompressible entry as shown at 422. The allocation algorithm makes a first pass 404 to check the immediately adjacent entry in the same operation 422. Next, if the first pass did not result in allocating the overflow, the allocation algorithm may make a second pass 406 to check other entries in the same operation 422. Next, if the second pass did not result in allocating the overflow, the allocation algorithm may check entries in the neighboring operation 424. The allocation algorithm may continue until the overflow is allocated.

FIG. 5 illustrates an example data structure and layout of a map header 500 according to some implementations. The details of the map may fit into two bytes: a first byte 502 and a second byte 504. For example, in addition to the block size, the operation size (represented as blocks per operation), and the offset frequency, additional details recorded during ingestion may also be included in the map header to improve read performance. Thus, the first byte 502 may include a reserved bit 506, a special value indicator 508 (1 bit), a map status 510 (2 bits), and a block size 512 (4 bits) expressed as a power of 2, minus 10. The second byte 504 may include the blocks per operation 514 (2 bits), expressed as a power of 2, and the offset frequency 516 (4 bits), expressed as a power of 2.

The special value indicator 508 may indicate whether any special values are used in the map. If none are used, an entry like 64K-1 denotes the actual compressed size, and decompression may begin at the calculated block rather than at the beginning of the operation. For example, "0" may indicate that there is no special value, so 64K-1 may be treated as a true size, and "1" may indicate that at least one special value is used (or vice versa).

The map status 510 may indicate whether the map actually exists, whether the map follows the header or the compressed object, and whether the map includes any uncompressed blocks. If no block is compressed, there may be no map header at all, because the object is not compressed and thus no map is needed. In addition, if all blocks are compressed and range requests are not expected for an object of this size, no map need be specified. Alternatively, if the map is designated inline and thus follows the header, the extra cloud read to obtain the map may be eliminated. In addition, if all blocks in the map are compressed, a read from location 0 does not require the map. For example, "0" may indicate no map; "1" may indicate that the map is inline (the map content follows the header instead of the compressed object); "2" may indicate that the map is external and all blocks are compressed; and "3" may indicate that the map is external and some blocks are not compressed.

The block size 512 (4 bits), represented as ((power of 2) - 10) and covering block sizes from 1 KB to 32 MB, may indicate the size of the blocks represented by the map. In this example, the value in the header is "6", which equals 16-10; since 2^16 is 64K, the block size is 64K.

The blocks per operation 514, represented as a power of 2, may be a default value or may be specified by a user. Further, the offset frequency 516 (4 bits), expressed as a power of 2 blocks, may also be a default value, indicating the number of map entries (compressed sizes) that occur between absolute offsets in the map.

Extending the example of the data structure 300 discussed above, the block size is 64K, the operation size is 256K (4 blocks), and the absolute offset frequency is 8. Furthermore, the map includes special values, and some blocks are not compressed.
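For illustration, the two header bytes for this example could be packed as below. The field order follows FIG. 5, but the exact bit positions (fields packed from high to low bits, unused bits zero) are an assumption of this sketch:

```python
import math

def pack_map_header(special_used, map_status, block_size, blocks_per_op, offset_freq):
    """Pack the two-byte map header of FIG. 5 (bit positions assumed)."""
    b1 = ((1 if special_used else 0) << 6        # special value indicator (1 bit)
          | (map_status & 0b11) << 4             # map status (2 bits)
          | int(math.log2(block_size)) - 10)     # block size: power of 2, minus 10 (4 bits)
    b2 = (int(math.log2(blocks_per_op)) << 4     # blocks per operation, power of 2 (2 bits)
          | int(math.log2(offset_freq)))         # offset frequency, power of 2 (4 bits)
    return bytes([b1, b2])

# 64K blocks, 4 blocks/operation, offset frequency 8, special values present,
# map external with some blocks uncompressed (status 3):
hdr = pack_map_header(True, 3, 64 * 1024, 4, 8)
```

Decoding reverses the arithmetic, e.g., the block size is recovered as 2 raised to (the low 4 bits of the first byte plus 10).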

In addition, the header 500 may reflect an optimization that enables partial map reads to satisfy range requests. For example, since the map for a large object may be several megabytes in size, reading the entire map to serve a small range may be too expensive. To solve this problem, absolute offsets into the compressed object may be placed in the map at a predetermined offset frequency. This frequency is based on the final number of blocks and is not affected by the block size or by whether the map has been reduced. The offset frequency creates separate fixed-size segments in the map that can be accessed individually to determine a range location in the compressed object. A map segment may include an 8-byte absolute offset into the compressed object for its first block, followed by a compressed size entry for each of the subsequent (offset frequency) blocks. Accordingly, the offset frequency 516 may be included in the map header 500 so that the map can be parsed.

FIG. 6 illustrates an example data structure configuration of a map 600 according to some implementations. The map 600 includes an operation size indication 602 indicating the number of compressed block size representations per operation, and also includes a list of compressed block size representations 604 and periodic absolute offsets 606 into the compressed object. The number of bytes used for each compressed block size representation 604 is minimized based on the block size, e.g., 2 bytes unless the block size is greater than 64K, in which case 3 bytes 608 may be used. As described above, a compressed size representation may be the actual compressed size of the corresponding block, may be "0" to indicate that no compression is used, or may be a special value to indicate that the compressed size of the individual block is unknown and that the sum of the entries in the operation is the compressed operation size.

An absolute offset 606 is the sum of all compressed and uncompressed block sizes since the previous absolute offset 606; it may be 8 bytes, and it is not simply the sum of the preceding entries. For example, when calculating the absolute offset, any "0" entry is replaced by the uncompressed block size of the corresponding block, so the absolute offset 606 can be used to locate, in the compressed object, the block associated with a subsequent entry. If the maximum object size is less than 256 TB, two of the eight bytes used for the absolute offset 606 can be hard coded, since six bytes suffice to represent any such offset.
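That substitution of uncompressed sizes for "0" entries can be expressed as in the following illustrative sketch (the names are mine):

```python
def absolute_offset(prev_offset, entries, block_size):
    """Advance an absolute offset past a run of map size entries, counting
    each '0' (uncompressed) entry at the full uncompressed block size."""
    return prev_offset + sum(block_size if e == 0 else e for e in entries)

# Four uncompressed 64K blocks followed by a 506-byte compressed block:
print(absolute_offset(0, [0, 0, 0, 0, 506], 64 * 1024))  # 262650
```

Summing the raw entries here would give only 506, which is why the offset must be computed with the substitution rather than taken as the sum of preceding entries.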

Therefore, in the map, each compressed block size is expressed in 2 bytes if the uncompressed block size is at most 64K, in 3 bytes if it is greater than 64K and at most 16M, or in 4 bytes if it is greater than 16M. Further, for each operation (group of blocks), if its blocks are compressed, the compressed size of each block is added to the map. If the blocks of the operation are not compressed, zero is used as each block's representation. Otherwise, the sum of the operation's entries equals the compressed size of the operation, but an individual block's representation within the operation does not necessarily indicate the compressed size of that block.

The map herein may be configured by the user or the like to best suit the user's needs. If the user needs fast access to very small ranges, the map can start with a small block size, such as 1 KB. If an optimized map with 2-byte entries is the target for an unknown workload, an initial block size of 64 KB may be used. A larger operation size, e.g., 4 MB, may be used if available memory is small or if the access speed for very large objects is less important. If the compression ratio is more important than the time to read the map, a high offset frequency, e.g., 32768, can be used. Reduction guidance may also be specified: a maximum map memory footprint may be set, at which map reduction is attempted during ingestion, and because the resulting smaller maps may improve efficiency, whether and how much map reduction should be attempted after completion may also be specified. The configuration options thus include the block size, the operation size, the offset frequency, the maximum map memory, and a target map size worth reducing to. The operation size may be a power of 2 times the block size. The offset frequency may be a multiple of the blocks per operation. Further, while an example data structure configuration of a map is shown in FIG. 6, many other possible data structure configurations for performing the same functions will be apparent to those of skill in the art having the benefit of the disclosure herein.
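The configuration options listed above might be gathered and validated as in the following sketch; the names, defaults, and checks are assumptions drawn only from the constraints stated in this paragraph:

```python
from dataclasses import dataclass

@dataclass
class MapConfig:
    block_size: int = 64 * 1024            # e.g., 1 KB for fast access to tiny ranges
    operation_size: int = 512 * 1024       # must be a power-of-2 multiple of block_size
    offset_frequency: int = 1024           # must be a multiple of blocks per operation
    max_map_memory: int = 8 * 1024 * 1024  # footprint at which reduction is attempted
    reduction_target: int = 256 * 1024     # map size considered worth reducing to

    def __post_init__(self):
        per_op, rem = divmod(self.operation_size, self.block_size)
        if rem or per_op < 1 or per_op & (per_op - 1):
            raise ValueError("operation size must be a power-of-2 multiple of block size")
        if self.offset_frequency % per_op:
            raise ValueError("offset frequency must be a multiple of blocks per operation")
```

A workload favoring small range reads might then use `MapConfig(block_size=1024, operation_size=4096)`, while one favoring compression ratio might raise `offset_frequency`.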

FIG. 7 illustrates an example data structure configuration of a map 700 according to some implementations. In this example, the map 700 includes a block size representation 702 for each block, and a corresponding description 704 provided for illustration purposes. For example, the blocks of rows 706-712 are uncompressed, and thus, for rows 706-712, the value assigned to the block size representation is "0". Further, rows 714, 718, 720, 726, and 730 have special size entries, while rows 716, 724, and 728 include the size of the corresponding block plus some overflow from the special sizes. Additionally, rows 732-738 and 742-746 have normal compressed sizes, while rows 722 and 740 are used as absolute offsets.

In this example, assume that the block size is 64K, the operation size is 256K (4 blocks), and the absolute offset frequency is 8. Thus, the map 700 includes five operations: a first operation 750 comprising block size representations 706-712; a second operation 752 comprising block size representations 714-720; a third operation 754 comprising block size representations 724-730; a fourth operation 756 comprising block size representations 732-738; and a fifth operation 758 comprising block size representations 742-746.

FIG. 8 illustrates an example data structure configuration of a map 800 after one reduction pass according to some implementations. In this example, after one pass of reduction, the block size is 128K (3-byte size entries) and the absolute offset frequency remains 8. Further, the reduction produces two block size representations per operation, each representing a combination of two block sizes from the map 700 of FIG. 7. For example, the first operation 750 has been reduced from four block size representations to two block size representations 802 and 804; the second operation 752 has been reduced from four block size representations to two block size representations 806 and 808; the third operation 754 has been reduced from four block size representations to two block size representations 810 and 812; the fourth operation 756 has been reduced from four block size representations to two block size representations 814 and 816; and the fifth operation 758 has been reduced from three block size representations to two block size representations 820 and 746. Because there is an odd number of block size representations in the fifth operation 758, the final block size representation 746 in this example is not combined with another block size representation during the reduction, and therefore retains the same value as before the reduction.

As a result of the reduction, the block size representation values of 802 and 804 in the first operation 750 are both "0", based on combining "0" values during the reduction. The second operation 752 includes, at 806, a size carrying an overflow value, and a special size at 808. The total of operations 750, 752, 754, and 756 given at absolute offset 818 is 782330 in the reduced map 800, which is the same as the value of absolute offset 740 in the map 700 of FIG. 7.

FIGS. 9-12 are flow diagrams illustrating example processes according to some implementations. The processes are illustrated as collections of blocks in logical flow diagrams representing sequences of operations, some or all of which may be implemented in hardware, software, or a combination thereof. In the context of software, the blocks may represent computer-executable instructions stored on one or more computer-readable media that, when executed by one or more processors, program the processors to perform the recited operations. Generally, computer-executable instructions include routines, programs, objects, components, data structures, and the like, that perform particular functions or implement particular data types. The order in which the blocks are described should not be construed as a limitation. Any number of the described blocks may be combined in any order and/or in parallel to implement the process, or alternative processes, and not all of the blocks need be executed. For purposes of discussion, the processes are described with reference to the environments, frameworks, and systems described in the examples herein, although the processes may be implemented in a wide variety of other environments, frameworks, and systems.

FIG. 9 is a flow diagram illustrating an example process 900 for compression and map generation according to some implementations. In some cases, process 900 may be performed, at least in part, by a service computing device or other suitable computing device. As one example, a stored program executed by one or more of the processors 110 of the one or more service computing devices 102 discussed above with respect to FIG. 1 may include executable instructions for performing at least a portion of the process 900 of FIG. 9.

As the data of each block is compressed, the result of the compression may be determined and a map may be created based on operations of multiple blocks. Each block in an operation is compressed and the results are stored in an in-memory map. If compression of the operation is not useful, compression is not used and all of the compressed block sizes are marked as "0" in the map. Alternatively, if compression is useful for the operation as a whole (e.g., the compression threshold level is achieved for the operation) but compression of one or more blocks in the operation is not useful, the compressed size does not fit in the compressed block size field, and a special size is entered in the map entry for those one or more blocks, e.g., [64K-1] in the case of a two-byte block representation field, or [16M-256] in the case of a three-byte block representation field. Furthermore, an overflow (e.g., the amount by which a compressed size exceeds a particular size) may be allocated to a neighboring block size representation that has sufficient room and with which the special-size block representation may be combined.
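The special-size convention above might be sketched as follows. The constant names and the helper function are illustrative only, not part of the described system; they simply show how an entry for one block within a compressible operation could be chosen:

```python
# Special markers for blocks that did not compress; the values follow the
# [64K-1] (two-byte field) and [16M-256] (three-byte field) conventions
# described in the text.
SPECIAL_SIZE_2B = 64 * 1024 - 1           # 65535
SPECIAL_SIZE_3B = 16 * 1024 * 1024 - 256  # 16776960

def block_entry(compressed_size, block_size=64 * 1024, special=SPECIAL_SIZE_2B):
    """Return the map entry for one block of a compressible operation.

    If the block itself did not compress (its compressed size is at least
    the uncompressed block size), store the special marker instead.
    """
    if compressed_size >= block_size:
        return special
    return compressed_size
```

For example, a block that compressed to 1000 bytes is recorded as 1000, while an incompressible 64K block is recorded as the 64K-1 marker.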

In other cases, if the operation compression is useful and the compressed block size fits into a mapping entry of the block size representation, the block size representation is marked as the actual compressed size. In some cases, the mapping may be reduced between sets or after data compression is completed. After all operations are completed, the mapping header is finally determined and saved. In some cases, the map header may be appended to the map or may be retained separately. Further, the mapping may be appended to the compressed object in some examples, or the mapping may be stored separately in other examples. The uncompressed object size and the compressed object size are now known and can be recorded.

At 902, prior to beginning compression of a data object, a computing device may receive and/or assign configuration variables, such as: block size, operation size, absolute offset frequency, maximum mapping memory size, maximum mapping size in a database. In some examples, default values for these parameters may have been established by a user, such as an administrator or other user managing the compression operations performed by the service computing device. For the present example, assume a block size of 64K, an operation size of 512K (8 blocks), an absolute offset frequency of 1024 entries, a maximum mapping memory of 1M, and a maximum mapping in DB of 300 bytes. These values are for discussion purposes only, and implementations herein are not limited to any particular values of these parameters.
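The configuration assumed in this example could be collected as follows; the dictionary keys are illustrative names, not an API, and the values are exactly those stated above:

```python
# Example configuration matching the values assumed in the text.
CONFIG = {
    "block_size": 64 * 1024,             # 64K blocks
    "operation_size": 512 * 1024,        # 512K operations
    "absolute_offset_frequency": 1024,   # an absolute offset every 1024 entries
    "max_map_memory": 1024 * 1024,       # reduce the in-memory map beyond 1M
    "max_map_db_size": 300,              # maps at or under 300 bytes go in the DB
}

# 512K / 64K = 8 blocks per operation, as stated in the example.
BLOCKS_PER_OPERATION = CONFIG["operation_size"] // CONFIG["block_size"]
```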

At 904, the computing device may receive data for compression and store at least a portion of the received data in an ingest buffer, e.g., up to an operational size of the data. In some examples, the data may be received as streaming data over a network, while in other cases the data may be read from a storage location or the like.

At 906, the computing device may compress the received data up to the block size into an output buffer and may store the compressed size of the block in a temporary set of operations.

At 908, the computing device may determine whether there is more data in the ingest buffer. For example, after an operation's worth of data has been added to the ingest buffer (e.g., eight blocks in this example, corresponding to the 512K operation size specified at 902), the computing device may operate on individual blocks of data in the ingest buffer until the ingest buffer is empty. When the ingest buffer is empty, the operation's worth of data has been processed, and the computing device may determine whether the operation has met the compression threshold level. If there is still data in the ingest buffer, the process returns to 906. Otherwise, the process goes to 910.

At 910, the computing device may determine whether the set of blocks in the operation compressed to less than the block size minus one, times the number of blocks in the operation. For example, the specific value of the compression threshold in this example is (64K-1) × 8. If the operation compresses to below the threshold, the process passes to 912. If not, the process passes to 914.
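The threshold test at 910 might look like the following sketch; the function name is illustrative:

```python
def operation_compresses(compressed_sizes, block_size=64 * 1024):
    """Check whether an operation met the compression threshold: the total
    compressed size must be below (block_size - 1) * number_of_blocks,
    i.e. (64K-1) * 8 for a 512K operation of 64K blocks."""
    threshold = (block_size - 1) * len(compressed_sizes)
    return sum(compressed_sizes) < threshold
```

With eight 64K blocks, an operation whose blocks each compressed to 60000 bytes passes the test, while one whose blocks are all incompressible does not.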

At 912, when the compression threshold is met, the computing device may update the set of compressed sizes: values exceeding 64K-1 are reduced to 64K-1, and the overflow (the difference) may be added to another block size representation entry, e.g., as described above with respect to FIGS. 3 and 4. In addition, the computing device may move the values to the in-memory block size list in the map, and may append the compressed data in the output buffer to any already-processed object data.
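The capping-and-overflow step might be sketched as follows. The neighbor-selection policy (try the next entry, then the previous one) is an assumption for illustration; the text only says the overflow may be added to another block size representation entry:

```python
def cap_with_overflow(sizes, cap=64 * 1024 - 1):
    """Cap any entry above `cap` at `cap` and push the overflow (the
    difference) into a neighboring entry that still has room."""
    sizes = list(sizes)
    for i, size in enumerate(sizes):
        if size > cap:
            overflow = size - cap
            sizes[i] = cap
            # Illustrative policy: next neighbor first, then previous.
            for j in (i + 1, i - 1):
                if 0 <= j < len(sizes) and sizes[j] + overflow <= cap:
                    sizes[j] += overflow
                    break
    return sizes
```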

On the other hand, at 914, when the compression threshold is not met, the computing device may save all 0 s to the in-memory block size list in the map and may append the ingest buffer data to any already processed object data.

At 916, the computing device may determine whether there is more data to ingest into the ingest buffer. If so, the process passes to 918. If not, the process passes to 922.

At 918, when there is more data to ingest into the ingest buffer, the computing device may determine whether compaction of the existing mapping is desired by determining whether the block size list is greater than 1 MB. For example, at 902 above, the maximum mapped memory is designated as 1 MB. Thus, at this time, if the mapping size exceeds the amount, the mapping may be reduced. If the amount is not exceeded, the process returns to 904 to add more data to the ingest buffer.

At 920, if the map size exceeds a specified threshold amount (e.g., 1MB in this example), the computing device may reduce the list of block sizes in the map by combining adjacent pairs and doubling the block size. If this operation is feasible (e.g., the block size is still smaller than the operation size), then one reduction is expected to be sufficient to reduce the map size to less than 1MB specified in the configuration information. After compaction mapping, the process returns to 904 to add more data to the ingest buffer.
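One compaction pass as described at 920 — combining adjacent pairs, which doubles the effective block size and roughly halves the list — might look like this sketch. It assumes reduction stays within an operation, so that "0" (uncompressed) entries only pair with other "0" entries, matching the FIG. 8 example where an odd trailing entry is kept unchanged:

```python
def reduce_map(block_sizes):
    """One compaction pass: combine adjacent pairs of block-size entries.

    Two zeros (an uncompressed pair) combine to zero automatically, since
    0 + 0 == 0; an odd trailing entry is retained as-is.
    """
    reduced = []
    for i in range(0, len(block_sizes) - 1, 2):
        reduced.append(block_sizes[i] + block_sizes[i + 1])
    if len(block_sizes) % 2:
        reduced.append(block_sizes[-1])
    return reduced
```

Each pass halves the number of entries, so repeated passes can be applied until the map fits the configured limit.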

At 922, when no more data is added to the ingest buffer, this indicates that compression of the data object is complete, and the computing device may determine whether the mapping may be reduced to the maximum DB size. If so, the process passes to 924. If not, the process goes to the process of FIG. 10.

At 924, the computing device may reduce the list of block sizes in the map by combining adjacent pairs and doubling the block size until the map size is less than the maximum map DB size (e.g., less than 300 bytes in this example). After compacting the map, the process goes to the process of FIG. 10.

Fig. 10 is a flow diagram illustrating an example process 1000 that is a continuation of process 900 of fig. 9 according to some implementations.

At 1002, the computing device may finalize a mapping header for the mapping with the block size, operation size, offset frequency, and other parameters discussed above with respect to fig. 5, and may store at least the mapping header in a database, such as metadata database 120 discussed above with respect to fig. 1. In some cases, the mapping may also be stored in a database and/or may be appended to the object for storage in network storage.

At 1004, the computing device may determine whether the mapping size list is less than 300 bytes, e.g., a maximum DB size. If not, the process passes to 1006. If so, the process passes to 1014.

At 1006, the computing device may append a size list entry to the compressed object and add the size to the running sum, using the full block size in place of any 0 entry so that the total size at each absolute offset is accurate.

At 1008, the computing device may determine whether there are more entries in the map. If so, the process passes to 1010. If not, the process passes to 1016.

At 1010, the computing device may determine whether the additional entry is at an offset frequency. If so, the process passes to 1012. If not, the process passes to 1006.

At 1012, the computing device may append the running sum of the size list entries to the compressed object and return to 1006.

At 1014, when the map size list is less than the database maximum threshold (e.g., less than 300 bytes in this example), the computing device can append the map header, with the size list as the map, to the metadata database.

At 1016, the computing device can send the compressed object (and the map, if the map is appended to the compressed object) to a storage location for storage. For example, as discussed above with respect to fig. 1, in some examples, the compressed object and the map may be stored in a network storage location.

Fig. 11 is a flow diagram illustrating an example process 1100 for map-based decompression according to some implementations. In some cases, process 1100 may be performed, at least in part, by a service computing device or other suitable computing device. As one example, a stored program executed by one of the processors of the one or more service computing devices 102 discussed above with respect to fig. 1 may include executable instructions for performing at least a portion of process 1100 of fig. 11.

Upon receiving a request to access at least a portion of a compressed data object, the service computing device may access and read the mapping header of the object to determine parameters including a block size, an operation size, an offset frequency, a mapping status, and a special value indicator, e.g., as discussed above with respect to fig. 5. If there is no mapping or all blocks are compressed and the request starts from position 0, the computing device may decompress all data for the object without a mapping. For example, the object may not be large enough to generate a map, etc.

In some examples, the map may be found following the header if it is stored in the metadata database with the header; in other examples, the map is appended to the end of the compressed object or stored in a separate storage location. The map size may be calculated as the number of blocks (the uncompressed object size divided by the block size) multiplied by the number of bytes in a compressed block size entry, plus the number of absolute offsets (the number of blocks divided by the offset frequency) multiplied by the number of blocks in an operation (e.g., 8 in the examples of FIGS. 9 and 10). If the map is appended to the compressed object, the map can be found at (map size) bytes before the end of the compressed object.
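The map-size calculation might be sketched as follows. It follows the formula in the text literally, including the assumption that the absolute-offset term is (number of absolute offsets) × (blocks per operation); the parameter defaults are the example values from FIGS. 9 and 10:

```python
def map_size_bytes(uncompressed_size, block_size=64 * 1024,
                   entry_bytes=2, offset_frequency=1024,
                   blocks_per_operation=8):
    """Map size per the text: one entry per block, plus a term for the
    absolute offsets interleaved every `offset_frequency` entries."""
    num_blocks = -(-uncompressed_size // block_size)  # ceiling division
    num_offsets = num_blocks // offset_frequency
    return num_blocks * entry_bytes + num_offsets * blocks_per_operation
```

For a 64 MB object with 64K blocks and two-byte entries, this gives 1024 entries (2048 bytes) plus one absolute-offset term.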

The computing device may determine the mapped segment of the request start based on, for example, (request start ÷ (offset frequency × block size)). The computing device may read the mapped segments from the computed first segment through the final segment for the requested size. The computing device may skip ahead to the last operation that does not begin after the request start. If any block in that operation has a special size (a multiple of (64K-1)), the computing device can decompress starting with the first block in the operation, discarding the data before the request start. If no block in the operation has a 0 or special size, the computing device may skip blocks until the next block location would be past the request start, begin decompression there, and discard the data before the requested start location. If the compressed size entries are 0, the computing device may copy data from the compressed object without decompression, starting from the location within the requested operation through the rest of the operation. The computing device may then continue to decompress or copy the data of subsequent operations as needed until the requested size is satisfied.
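The segment lookup described above might be sketched as follows; the function name is illustrative. Each segment spans `offset_frequency` blocks of the uncompressed object and begins with an absolute offset:

```python
def locate_segment(request_start, offset_frequency=1024, block_size=64 * 1024):
    """Return the index of the map segment covering the request start and
    the uncompressed position at which that segment begins."""
    segment = request_start // (offset_frequency * block_size)
    position = segment * offset_frequency * block_size
    return segment, position
```

With the example parameters, each segment covers 64 MB of uncompressed data, so a request starting at 100 MB falls in segment 1, which begins at uncompressed position 64 MB.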

At 1102, a computing device can receive a request to access data and can obtain parameters, such as the block size, operation size, and offset frequency, from the corresponding map header. For example, a user, application, or the like may request access to one or more portions of one or more stored compressed objects. For instance, a read request may be for a range of data from a compressed object. The starting position and size may be specified in the request based on, e.g., the starting position and size in the uncompressed object.

At 1104, the computing device may determine a size of the mapping based on the known uncompressed object size, e.g., the mapping may begin at the compressed object size minus the mapping size. For example, during ingestion and compression of an object, the compressed object size and the uncompressed object size may already be stored in the metadata database.

At 1106, the computing device can read the mapped segment, set the compressed offset to the first value (the absolute offset), and set the uncompressed position to (segment number × offset frequency × block size). For example, a mapped segment can be a collection of map entries that can be used without reading the rest of the map, since each mapped segment starts with an absolute offset that establishes a bearing in the compressed object.

At 1108, the computing device may determine whether (location + operation size) is less than the request start. If so, the process passes to 1110. If not, the process passes to 1112. For example, the location may be a starting location, e.g., based on a location determined from an uncompressed object.

At 1110, the computing device may obtain the compressed sizes of the blocks of the operation, add them to the compressed offset (using the full block size for any 0 entries), and add the operation size to the location. The process then returns to 1108 to determine whether (location + operation size) is less than the request start. The operations of 1108 and 1110 are repeated until the process passes to 1112.
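The skip-ahead loop of 1108 and 1110 might be sketched as follows; the function and its return shape are illustrative. A 0 entry means the block was stored uncompressed, so the full block size is added to the compressed offset:

```python
def skip_to_request(ops, request_start, operation_size, block_size=64 * 1024):
    """Walk whole operations that end before the request start.

    `ops` is a list of per-operation lists of compressed block sizes.
    Returns the uncompressed position, the compressed offset, and the
    compressed sizes of the current operation (or None past the end).
    """
    position = offset = 0
    for op_sizes in ops:
        if position + operation_size >= request_start:
            return position, offset, op_sizes  # proceed to 1112
        # 1110: advance past this whole operation.
        offset += sum(s if s else block_size for s in op_sizes)
        position += operation_size
    return position, offset, None
```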

At 1112, the computing device may obtain all of the compressed sizes for the current operation and may proceed to the process of fig. 12.

FIG. 12 is a flow diagram illustrating an example process 1200 that is a continuation of process 1100 of FIG. 11 according to some implementations.

At 1202, the computing device may determine whether the compressed sizes are all zero. If so, the process passes to 1218. If not, the process passes to 1204.

At 1204, the computing device may determine whether any special values (multiples of (64K-1)) exist. If so, the process passes to 1214. If not, the process passes to 1206.

At 1206, the computing device may determine whether (location + block size) < request start. If so, the process passes to 1208. If not, the process passes to 1210.

At 1208, the computing device may add the compressed size to the offset and may add the block size to the location.

At 1210, the computing device may decompress the remaining blocks in the operation, discarding the first (request start − position) bytes.

At 1212, the computing device may add the remaining compressed sizes to the offset, add the remaining block sizes to the location, add the amount of data read to the request start, and subtract it from the request size.

At 1214, when a special value was found to exist at 1204, the computing device can decompress all of the operation data, discarding the first (request start − location) bytes.

At 1216, the computing device may add all of the compressed sizes to the offset, add the operation size to the location, add the amount of data read to the request start, and subtract it from the request size.

At 1218, if the compressed sizes of the operation at 1202 are all zero, the computing device may begin reading the data without decompression at (offset + request start − position) and may read through the rest of the operation data.

At 1220, the computing device may add the operation size to the location and the offset, add the amount of data read to the request start, and subtract it from the request size.

At 1222, the computing device may determine whether the request size is greater than zero. If so, the process passes to 1224. If not, the process passes to 1228.

At 1224, the computing device may determine whether there are more block representations in the segment. If so, the process passes to block 1112 of the process of FIG. 11. If not, the process passes to 1226.

At 1226, the computing device may read the next mapped segment and pass to block 1112 in FIG. 11.

At 1228, when the request size has reached zero, the computing device may decompress any compressed data that has not been decompressed and may reply to the request.

The example processes described herein are merely examples of processes that are provided for discussion purposes. Many other variations will be apparent to those of skill in the art in light of the disclosure herein. Furthermore, although the disclosure herein sets forth several examples of suitable frameworks, architectures, and environments for performing processes, implementations herein are not limited to the specific examples shown and discussed. Further, the present disclosure provides various example implementations as described and illustrated in the drawings. However, the present disclosure is not limited to the embodiments described and illustrated herein, but may be extended to other implementations known or to be known to those skilled in the art.

The various instructions, processes, and techniques described herein may be considered in the general context of computer-executable instructions (e.g., computer programs) that are stored on a computer-readable medium and executed by one or more processors herein. Generally, programs include applications, instructions, routines, objects, components, data structures, executable code, and the like for performing particular tasks or implementing particular abstract data types. These programs, etc., may be executed as native code or may be downloaded and executed, for example, in a virtual machine or other just-in-time compilation execution environment. Typically, the functionality of the programs may be combined or distributed as desired in various implementations. An implementation of these modules and techniques may be stored on a computer storage medium or transmitted over some form of communications medium.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as example forms of implementing the claims.