Data processing system and method

Document No. 1602702. Published 2020-01-07.

Abstract: This technology, "Data processing system and method", was devised by Daniel Greenfield on 2018-05-24. A data processing system comprises a data processing apparatus, wherein the data processing apparatus comprises computing hardware for executing one or more software products, and execution of the one or more software products configures the data processing apparatus to access data from a file system apparatus. The data processing apparatus is operable to load a dynamic linker, the dynamic linker being operable to include an interception library that intercepts file access operations of an executable software product, wherein: (1) virtual files not present in the file system apparatus are accessible by the executable software product; (2) the virtual file is the result of any of the following: (a) an instant conversion of one or more real files in the file system apparatus, wherein the instant conversion is from one file format to another file format, or (b) a conversion of an access operation into an equivalent cloud object storage access operation on a real object located in cloud object storage, or (c) a combination of the instant conversion (a) and the object access conversion (b). Optionally, at least one of: (3) the virtual file is in a different file format from the real file/object, the real file/object being in a compressed file format and the virtual file being in another file format; and (4) the file format is a genome file format.

1. A data processing system comprising data processing apparatus, wherein the data processing apparatus comprises computing hardware for executing one or more software products, wherein execution of the one or more software products configures the data processing apparatus to access data from a file system apparatus,

wherein the data processing apparatus is operable to load a dynamic linker, the dynamic linker being operable to include an interception library that intercepts file access operations of an executable software product, wherein:

(1) virtual files not present in the file system apparatus are accessible by the executable software product;

(2) the virtual file is the result of any of the following: (a) an instant conversion of one or more real files in the file system apparatus, wherein the instant conversion is from one file format to another file format, or (b) a conversion of an access operation into an equivalent cloud object storage access operation on a real object located in cloud object storage, or (c) a combination of the instant conversion (a) and the object access conversion (b).

2. The data processing system (10) of claim 1, characterized by at least one of:

(3) the virtual file is in a different file format from the real file/object, the real file/object being in a compressed file format and the virtual file being in another file format; and

(4) the file format is a genome file format.

3. The data processing system of claim 1, wherein the compressed file format is a compressed genome file format and the other file format is another genome file format.

4. The data processing system of claim 1, wherein the dynamic linker is forced to include the interception library.

5. A data processing system according to claim 3, wherein the executable software product is operable to access genomic data by using a ptrace call, wherein the ptrace call allows manipulation of file descriptors, data memory and data registers.

6. A data processing system according to claim 5, wherein the ptrace call is operable to intercept file system calls of a child process, the data processing system forcing the child process to be executed under an executable tracing function provided by the kernel, wherein:

(1) virtual files that do not exist in the file system apparatus are accessible by the data processing system;

(2) the virtual file is an instant conversion of one or more real files in the file system apparatus from one file format to another file format;

(3) the real file is in a compressed genome file format and the virtual file is in another genome file format; and

(4) in operation, a system call to open the virtual file is intercepted and handled by first ensuring that a mounted virtual file system is available for use, and then redirecting the system call to a file on the virtual file system.

7. A data processing system according to claim 6, wherein in (4) the mounted virtual file system is implemented in a temporary directory, wherein if no mount point already exists, the data processing system is operable to automatically create the mount point so that the virtual file system exists.

8. The data processing system of any preceding claim, wherein the dynamic linker is operable to intercept system calls of a child process executable by just-in-time recompilation of a portion of its binary code prior to running that binary code, wherein:

(1) the virtual file not present in the file system apparatus is accessible;

(2) the virtual file is an instant conversion of one or more real files in the file system apparatus, wherein the instant conversion is from one file format to another file format;

(3) the real file is in a compressed genome file format and the virtual file is in another genome file format; and

(4) a system call to open the virtual file is intercepted and handled by first ensuring that a mounted virtual file system is available for use, and then redirecting the system call to a file on the virtual file system.

9. The data processing system of claim 8, wherein the mounted virtual file system is implemented in a temporary directory, wherein a mount point is automatically created for the virtual file system so that it exists.

10. A data processing system according to any preceding claim, wherein the instant conversion providing transparent access to genomic data is operable to combine content from a plurality of genomic files and present it as one genomic file, in any one or a combination of the following cases:

(1) wherein the merged content is quality score data;

(2) wherein the merged content is read name information;

(3) wherein the merged content is an ancillary tag of the mapped genomic read content;

(4) wherein the merged content comprises different genomic regions;

(5) wherein the merged content comprises a plurality of genomic samples/specimens; and

(6) where different genome files represent different regions, samples, or other separable portions of a particular genome.

11. A data processing system according to any preceding claim, wherein the dynamic linker is forcibly loaded and in operation uses an interception library which intercepts file access operations of an executable software product, wherein:

(1) creation of a new child process retains the interception library in the associated interception environment variable.

12. A data processing system according to any preceding claim, wherein the dynamic linker is operable to employ an interception library which intercepts file access operations of an executable software product, wherein:

(1) the interception library detects whether a program is being submitted to a work submission system, and if so:

(2) creates a temporary shell script, wherein the shell script preserves the interception environment variable before calling the original program; and

(3) submits the new temporary script to the work submission system in place of the original program.

13. The data processing system of claim 12, wherein prior to implementing (3), the data processing system is operable to:

(1) detect whether the original program is a script containing metadata specific to the work submission system, and if so, copy that metadata into the new temporary shell script.
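By way of illustration only, the wrapper-script step of claims 12 and 13 might be sketched in Python as follows. The function name `build_wrapper_script`, the use of `LD_PRELOAD` as the interception environment variable, and the directive prefixes (`#SBATCH`, `#PBS`, `#$ `) are assumptions made for the example and are not specified by the claims.

```python
import shlex


def build_wrapper_script(original_path, original_text, preload_value,
                         directive_prefixes=("#SBATCH", "#PBS", "#$ ")):
    """Build the text of a temporary shell script that wraps a submitted job.

    Scheduler metadata lines found in the original script are copied across,
    the interception variable is re-exported, and the original program is
    then executed with the original arguments.
    """
    directives = [line for line in original_text.splitlines()
                  if line.startswith(tuple(directive_prefixes))]
    lines = ["#!/bin/sh"]
    lines += directives  # metadata the work submission system reads
    # Assumption: LD_PRELOAD is the variable that carries the interception library.
    lines.append("export LD_PRELOAD=" + shlex.quote(preload_value))
    lines.append("exec " + shlex.quote(original_path) + ' "$@"')
    return "\n".join(lines) + "\n"


def demo():
    original = "#!/bin/bash\n#SBATCH --mem=4G\necho hello\n"
    return build_wrapper_script("/home/u/job.sh", original,
                                "/opt/lib/intercept.so")
```

The returned text would be written to a temporary file and submitted in place of the original program; the submission itself is outside the scope of this sketch.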

14. The data processing system of any preceding claim, wherein the data processing system is operable to provide transparent access to genomic data such that access under a virtual path (e.g. /pgs3/) is redirected to cloud storage by converting operations into equivalent requests, which are sent via the Internet to a provider of the cloud storage.
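As a minimal sketch of the conversion described in this claim, a byte-range read under a virtual path could be translated into the description of an equivalent HTTP request. The prefix `/cloud/`, the endpoint `https://objects.example.com`, the path-style URL layout and the function name `translate_read` are all hypothetical; the claim does not specify a provider or a request format, and no request is actually sent here.

```python
from urllib.parse import quote


def translate_read(path, offset, size, prefix="/cloud/",
                   endpoint="https://objects.example.com"):
    """Translate a read of `size` bytes at `offset` into a ranged GET request.

    Returns None when the path is not under the virtual prefix, meaning the
    access should go to the ordinary file system unchanged.
    """
    if not path.startswith(prefix):
        return None
    if size <= 0:
        raise ValueError("size must be positive")
    # Assumed layout: /<prefix>/<bucket>/<object key>
    bucket, _, key = path[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError("expected <prefix><bucket>/<key>")
    return {
        "method": "GET",
        "url": f"{endpoint}/{quote(bucket)}/{quote(key)}",
        # HTTP byte ranges are inclusive at both ends.
        "headers": {"Range": f"bytes={offset}-{offset + size - 1}"},
    }
```

For example, `translate_read("/cloud/mybucket/runs/s1.bam", 100, 50)` describes a GET of bytes 100 to 149 of the object `runs/s1.bam`.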

15. The data processing system of claim 14, wherein:

(1) corresponding virtual files not present in the cloud storage are accessible by the data processing system;

(2) the virtual file is an immediate conversion of one or more corresponding real objects on the cloud storage from one file format to another file format;

(3) the real object is in a compressed genome file format and the virtual file is in another genome file format.

16. A data processing system according to any preceding claim, wherein the data processing system is operable to provide transparent access to genomic data such that a dynamic linker is operable to provide the intercepting library for memory mapped file access operations of an executable file to a virtual file by:

(1) registering a page fault interrupt handler;

(2) creating a virtual area, but protecting the virtual area from reading and writing, the virtual area having a size required by a memory mapped file mapping operation;

(3) upon read access to a particular protected page or pages, replacing the page or pages with corresponding translated content from the real file and allowing the particular protected page or pages to be accessed, and thus read and/or written; and

(4) maintaining a list of the one or more pages of translated content and releasing memory occupied by translated content when a memory consumption limit is reached, selecting one or more pages of translated content, releasing memory of the one or more pages and again protecting these page regions from further reads and writes; and

(5) where the selection of which page to release is made by using LRU (least recently used), LFU (least frequently used), or other replacement heuristics.
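The bookkeeping in steps (3) to (5) can be sketched, under simplifying assumptions, as a least-recently-used cache of translated pages. The Python below models only the page list and the eviction choice; registering a page-fault handler and protecting memory regions are not shown, and the names `TranslatedPageCache` and `translate_page` are hypothetical.

```python
from collections import OrderedDict


class TranslatedPageCache:
    """Keep at most `max_pages` translated pages, evicting by LRU order."""

    def __init__(self, max_pages, translate_page):
        self.max_pages = max_pages
        self.translate_page = translate_page  # page number -> translated bytes
        self.pages = OrderedDict()            # least recently used entry first
        self.evicted = []                     # pages that would be re-protected

    def access(self, page_no):
        """Simulate an access to a page, translating it on first use."""
        if page_no in self.pages:
            self.pages.move_to_end(page_no)   # mark as most recently used
            return self.pages[page_no]
        data = self.translate_page(page_no)   # fill the page on a "fault"
        self.pages[page_no] = data
        while len(self.pages) > self.max_pages:
            old, _ = self.pages.popitem(last=False)
            self.evicted.append(old)          # memory released, page protected again
        return data
```

An LFU policy would differ only in how the page to release is chosen, for instance by keeping an access count per page instead of an ordering.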

17. A data processing system according to any preceding claim, operable to provide transparent access to genomic data such that the dynamic linker is operable to provide the interception library, thereby enabling memory-mapped file access operations of an executable to virtual files, wherein:

(1) the system call to memory-map the virtual file is intercepted and handled by first ensuring that a mounted virtual file system is available for use (possibly in a temporary directory, where "ensuring" means that if the mount point does not already exist, it is automatically created so that the virtual file system exists), and then redirecting the memory-mapping operation to the file on the virtual file system.

18. A method of using a data processing system comprising data processing apparatus, wherein the data processing apparatus comprises computing hardware for executing one or more software products, wherein execution of the one or more software products configures the data processing apparatus to access data from a file system apparatus,

wherein the method comprises operating the data processing apparatus to load a dynamic linker operable to include an interception library that intercepts file access operations of an executable software product, wherein:

(1) virtual files not present in the file system apparatus are accessible by the executable software product;

(2) the virtual file is the result of any of the following: (a) a just-in-time translation of one or more real files in the file system apparatus, wherein the just-in-time translation converts from one file format to another file format, or (b) a translation of access operations into equivalent cloud object storage access operations on real objects located in cloud object storage, or (c) a combination of the just-in-time translation (a) and the object access translation (b).

19. The method of claim 18, further comprising at least one of:

(3) the virtual file is in a different file format from the real file/object, the real file/object being in a compressed file format and the virtual file being in another file format; and

(4) the file format is a genome file format.

20. The method of claim 18 or 19, wherein the compressed file format is a compressed genome file format and the other file format is another genome file format.

21. A computer program product comprising a non-transitory computer readable storage medium having computer readable instructions stored thereon, the computer readable instructions being executable by a computing device comprising processing hardware to perform the method of claim 18.

Technical Field

The present disclosure relates to data processing systems. Furthermore, the present disclosure relates to a method of processing data, for example, a method of processing genomic data, using the above-described data processing system. Additionally, the present application relates to a computer program product comprising a non-transitory computer readable storage medium having computer readable instructions stored thereon, the computer readable instructions being executable by a computing device comprising processing hardware to perform the above method.

Background

Current data processing systems often need to access data stored in one or more databases while performing data processing functions; such a database may, for example, include genomic data. A data processing system operates under software control, implementing its functions by executing one or more software products.

In current practice, when a software product is generated, a linker is employed to combine a plurality of items of software into executable software code, referred to as a software product. Many types of linker are known, for example as described on Wikipedia. Dynamic linkers can be influenced to modify their behavior when a particular program is executed or linked; examples may be found in the manual pages for the run-time linkers of various Unix-type systems. Typical modifications to the behavior of such dynamic linkers are made through the LD_LIBRARY_PATH and LD_PRELOAD environment variables, which adjust the run-time linking process by searching for shared libraries at alternative locations and by forcibly loading and linking libraries that would otherwise not be loaded and linked.

An example of modifying the behavior of an executable via a dynamic linker is zlibc, also known as "uncompress.so". When used with the LD_PRELOAD hack, zlibc provides transparent decompression, so that pre-compressed (gzipped) file data can be read on BSD and other such systems as if the files were uncompressed. In effect, it allows a user to add transparent decompression on top of the underlying file system, albeit with some caveats. This is a flexible mechanism: the same code can be fine-tuned so that additional or alternative processing is performed on the data when a specific file is read, before the data is provided to the user process that requested it.

However, currently known dynamic linkers are not yet sophisticated enough to modify a wide variety of data in a repository in a dynamic manner.

Disclosure of Invention

The present disclosure seeks to provide an improved data processing system that is capable of translating file access to, for example, compressed genomic data or cloud object storage in a more flexible and dynamic manner.

In a first aspect, there is provided a data processing system comprising data processing apparatus, wherein the data processing apparatus comprises computing hardware for executing one or more software products, wherein execution of the one or more software products configures the data processing apparatus to access data from a file system apparatus,

wherein the data processing apparatus is operable to load a dynamic linker operable to include an interception library to intercept file access operations of an executable software product, wherein:

(1) virtual files that do not exist in the filesystem device are accessible by the executable software product;

(2) the virtual file is the result of any of the following: (a) an instant conversion of one or more real files in the file system apparatus, wherein the instant conversion is from one file format to another file format, or (b) a conversion of an access operation into an equivalent cloud object storage access operation on a real object located in cloud object storage, or (c) a combination of the instant conversion (a) and the object access conversion (b).

Optionally, in the data processing system, at least one of:

(3) the virtual file is in a different file format from the real file/object, the real file/object being in a compressed file format and the virtual file being in another file format; and

(4) the file format is a genome file format.

An advantage of the present invention is that by using an interception library, the dynamic linker can be operated to make the data processing system more versatile in its utilization of dynamically changing data files and data file formats when performing calculations (e.g. calculations on data acquired from sensor devices, e.g. relating to genome reads).

In a second aspect, there is provided a method of using a data processing system comprising data processing apparatus, wherein the data processing apparatus comprises computing hardware for executing one or more software products, wherein execution of the one or more software products configures the data processing apparatus to access data from a file system device,

characterised in that the method comprises operating the data processing apparatus to load a dynamic linker operable to include an interception library to intercept file access operations of an executable software product, wherein:

(1) virtual files that do not exist in the file system device are accessible by the executable software product;

(2) the virtual file is the result of any of the following: (a) an instant translation of one or more real files in the file system device, wherein the instant translation converts from one file format to another file format, or (b) a translation of access operations into equivalent cloud object storage access operations on real objects located in cloud object storage, or (c) a combination of the instant translation (a) and the object access translation (b).

Optionally, in such a method, at least one of:

(3) the virtual file is in a different file format from the real file/object, the real file/object being in a compressed file format and the virtual file being in another file format; and

(4) the file format is a genome file format.

In a third aspect, embodiments of the present disclosure provide a computer program product comprising a non-transitory computer-readable storage medium having computer-readable instructions stored thereon, the computer-readable instructions being executable by a computing device comprising processing hardware to perform a method according to the foregoing second aspect.

Other aspects, advantages, features and objects of the present disclosure will become apparent from the following drawings and detailed description of illustrative embodiments, when read in conjunction with the appended claims.

It should be understood that features of the present disclosure are susceptible to being combined in various combinations without departing from the scope of the disclosure as defined by the accompanying claims.

Drawings

The foregoing summary, as well as the following detailed description of illustrative embodiments, will be better understood when read in conjunction with the appended drawings. For the purpose of illustrating the disclosure, there is shown in the drawings exemplary constructions of the disclosure. However, the present disclosure is not limited to the specific methods and apparatus disclosed herein. Further, those skilled in the art will appreciate that the drawings are not drawn to scale. Wherever possible, like elements are designated with like numerals.

Embodiments of the present disclosure will now be described, by way of example only, with reference to the following drawings, in which:

FIG. 1 is a schematic diagram of a data processing system according to the present disclosure; and

FIG. 2 is an illustration of a flow chart of a method of operating the data processing system of FIG. 1.

In the drawings, underlined numerals are used to indicate items on or adjacent to the underlined numerals.

Detailed Description

The following detailed description illustrates embodiments of the disclosure and methods by which it can be practiced. Although some ways of implementing the present disclosure have been disclosed, those skilled in the art will recognize that other embodiments for implementing or practicing the present disclosure are possible.

A virtual file is a file (or directory) that does not actually exist in the file system of a computer system. However, the virtual file system may include the entire file system tree of virtual files, which may otherwise appear to be located on paths within the existing real file system, although the virtual files do not actually exist in the real file system.

On a POSIX-compliant operating system, file systems (including virtual file systems) need to be mounted on a path, which is typically done by a user with the necessary privileges. In contrast, the FUSE system allows unprivileged users to mount file systems (including virtual file systems) on paths with the help of the FUSE kernel module. However, FUSE may be unavailable under the stricter security permissions of a container environment unless those restrictions are explicitly relaxed. It is also undesirable in another respect: once one user has mounted a file system, all other users can typically see its presence.

In some cases, it is helpful to be able to use a virtual file system that: (a) operates without system mount points; (b) operates within existing mounted file systems; and/or (c) operates with limited permissions in a restricted environment.

An alternative to using a mounted virtual file system is to use dynamic linking to intercept and modify the file system accesses that an executable performs. Such mechanisms include LD_PRELOAD on Linux-based systems and DYLD_INSERT_LIBRARIES on BSD-based systems (including Mac OS). Here, when an executable is loaded, the resolution of its symbols to a dynamic library (such as glibc.so, which most Linux executables use to access the file system) can be intercepted by an alternative library that provides matching symbol names. The alternative library can intercept the executable's calls to the matching symbols, thereby altering the function of those symbols and so changing the overall behavior of the executable. For example, the executable assumes that the `open64` symbol opens a file and returns a file handle for subsequent access. The alternative library can intercept calls to this symbol to provide alternative behavior, such as opening a different file from the one originally specified. This mechanism for intercepting file system access has the advantages that it requires no special permissions and that it enables virtual files to appear to reside in a mounted file system. However, it also has a number of deficiencies, which have prevented it from being put to practical use.

These deficiencies include the following:

1. Although file system access through operations such as open, read and write can easily be intercepted, read/write operations on memory-mapped files are handled by the operating system, which reads or writes data directly from or to the file system through the kernel; such operations are therefore not intercepted by the library.

2. Statically compiled binaries, and binaries that access the file system through direct operating system calls rather than through a library such as glibc.so, are not intercepted by the dynamic linking mechanism.
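The redirection idea described above, in which a call to open one file is answered by opening a different file, can be illustrated with a rough Python analogy. This is only an analogy: a real interception library would be a shared object exporting symbols such as `open64` and loaded through the dynamic linker. The names `make_intercepting_open` and `redirects` are hypothetical.

```python
import builtins
import os
import tempfile


def make_intercepting_open(redirects, real_open=builtins.open):
    """Return an open()-like function that redirects virtual paths.

    `redirects` maps a virtual path (no file exists there) to the real path
    whose content is served instead.
    """
    def intercepting_open(path, *args, **kwargs):
        if isinstance(path, (str, bytes, os.PathLike)):
            key = os.fspath(path)
        else:
            key = path  # e.g. an integer file descriptor, passed through
        return real_open(redirects.get(key, path), *args, **kwargs)
    return intercepting_open


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        real_path = os.path.join(tmp, "sample.real")
        virtual_path = os.path.join(tmp, "sample.virtual")
        with open(real_path, "w") as f:
            f.write("ACGT\n")
        my_open = make_intercepting_open({virtual_path: real_path})
        with my_open(virtual_path) as f:  # the virtual path is never created
            content = f.read()
        return content, os.path.exists(virtual_path)
```

The caller reads the content through the virtual path even though no file exists at that path, which mirrors how an intercepted symbol can make a virtual file appear to exist.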

In overview, referring to FIG. 1, there is provided a data processing system 10 comprising a data processing apparatus 20, wherein the data processing apparatus 20 comprises computing hardware 30 for executing one or more software products 40, wherein execution of the one or more software products 40 configures the data processing apparatus 20 to access data from a file system device, characterized in that the data processing apparatus 20 is operable to load a dynamic linker 50, the dynamic linker 50 being operable to include an interception library 50 that intercepts file access operations of an executable software product, wherein:

(1) virtual files that do not exist in the filesystem device are accessible by the executable software product;

(2) the virtual file is the result of any of the following: (a) an instant conversion of one or more real files in the file system apparatus, wherein the instant conversion is from one file format to another file format, or (b) a conversion of an access operation into an equivalent cloud object storage access operation on a real object located in cloud object storage, or (c) a combination of the instant conversion (a) and the object access conversion (b).

Optionally, in the data processing system, at least one of:

(3) the virtual file is in a different file format from the real file/object, the real file/object being in a compressed file format and the virtual file being in another file format; and

(4) the file format is a genome file format.

Optionally, the compressed file format is a compressed genome file format and the other file format is another genome file format.

A large amount of genomic data has already been compressed, using ZLIB/GZIP-based techniques, in standard formats such as BAM or fastq.gz. Because a single file can be many GB (gigabytes), and sometimes TB (terabytes), in size, storing and transferring these files remains a challenge for organizations. Better compression would help to reduce storage costs and data transfer times. However, compressing into a new or updated file format necessarily breaks compatibility with existing bioinformatics tools and pipelines that do not support the new format. Although a file can be decompressed back into its original format (e.g. BAM or fastq.gz) before being fed into the tools and pipelines, this must be added as an explicit step, and the whole file may have to be decompressed before processing even if only a small portion of it is actually used for analysis. It is therefore desirable to be able to convert transparently from a new, better-compressed file format to a compatible, less-compressed file format.

Another problem in processing genomic data is that pipelines consist of executables that read and write storage through POSIX file access; they are not built to read from or write to cloud storage, which is typically accessed through REST requests. It would be beneficial if these tools and pipelines could access cloud storage as though it consisted of regular files.

Optionally, in the data processing system, the dynamic linker is forced to include the interception library.

Optionally, in the data processing system, the executable software product is operable to access the genomic data by using a ptrace call, wherein the ptrace call allows manipulation of file descriptors, data memory and data registers. Optionally, in the data processing system, the ptrace call is operable to intercept file system calls of a child process, the data processing system forcing the child process to be executed under an executable tracing function provided by the kernel, wherein:

(1) the data processing system may access virtual files that do not exist in the file system device;

(2) virtual files are the immediate conversion of one or more files in a file system device from one file format to another;

(3) the real file is a compressed genome file format, and the virtual file is another genome file format; and

(4) in operation, a system call to open a virtual file is intercepted and handled by first ensuring that a mounted virtual file system is available for use, and then redirecting the system call to the file on the virtual file system.

Optionally, the data processing system is operable to intercept filesystem access of the child process to provide the child process with access to the virtual file, wherein the data processing system is operable to:

(1) use an interception library via a dynamic-link interception mechanism (e.g. LD_PRELOAD);

(2) intercept, within that library, calls that create new child processes (e.g. by intercepting the exec variants, fork and vfork on Linux);

(3) examine the executable of the new child process to determine whether it depends on the library that is intercepted (e.g. by checking whether it has a dynamic dependency on glibc.so);

and, wherein:

(4) if the child executable depends on the intercepted library, the child process is allowed to be created, while ensuring that the dynamic-link interception mechanism remains in use (e.g. by ensuring that the LD_PRELOAD environment variable includes the necessary interception library, which intercepts file system accesses to provide virtual files); or

(5) if the child executable does not depend on the intercepted library, a check is made as to whether a backup interception mechanism is available (i.e. whether the parent process has sufficient privileges to apply it), wherein if the backup interception mechanism is available, it is applied to the child process.

Optionally, if a plurality of such backup interception mechanisms are available, the data processing system is operable to select one of the available backup interception mechanisms.
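The decision described in steps (3) to (5), together with the choice among several backup mechanisms, can be sketched as follows. This is a sketch under stated assumptions: the list of dynamic dependencies is supplied by the caller rather than read from the executable, `libc.so.6` is assumed as the name of the intercepted library, and the function names and mechanism labels are hypothetical.

```python
def ensure_preload(env, library, var="LD_PRELOAD"):
    """Return a copy of `env` in which `var` lists `library`.

    Entries in LD_PRELOAD may be separated by colons or spaces.
    """
    new_env = dict(env)
    entries = new_env.get(var, "").replace(":", " ").split()
    if library not in entries:
        entries.insert(0, library)
    new_env[var] = ":".join(entries)
    return new_env


def plan_child_interception(dynamic_deps, env, library, fallbacks=(),
                            intercepted_lib="libc.so.6"):
    """Choose how file access of a new child process will be intercepted.

    `dynamic_deps` lists the child's dynamic dependencies; `fallbacks` is an
    ordered sequence of (mechanism name, is_available) pairs.
    Returns (mechanism, environment for the child).
    """
    if intercepted_lib in dynamic_deps:
        # Step (4): keep using dynamic-link interception in the child.
        return "preload", ensure_preload(env, library)
    for name, available in fallbacks:
        # Step (5): apply the first backup mechanism that is available.
        if available:
            return name, dict(env)
    return None, dict(env)  # no interception is possible for this child
```

For a statically linked child, for instance, `plan_child_interception([], {}, "/opt/x/intercept.so", [("ptrace", False), ("jit", True)])` selects the second backup mechanism because the first is reported as unavailable.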

Optionally, the backup interception mechanism allows the child process to continue running but sets up interception of its system calls (e.g. by using the ptrace mechanism on Linux). Optionally, for performance reasons, interception is limited to only the system calls of interest (e.g. by using seccomp filters on Linux). In this manner, all file system operations can be intercepted, and any operation on a virtual file can be translated.

Optionally, the backup interception mechanism is to run the child process under a just-in-time recompilation library (such as Intel's Pin tool or DynamoRIO from HP and MIT), wherein the just-in-time recompilation library is configured to detect and intercept system calls made by the child process. In this manner, system calls that access the file system can be redirected to alternative code that provides the virtual files.

Optionally, the data processing system is operable to redirect any file-name-based (or path-name-based) access to a virtual file to an equivalent entry within a virtual file system mount point (e.g. on Linux, a FUSE file system mounted on a temporary, limited-access directory), wherein if no such mount point currently exists, a new mount is created immediately before the modified system call is used. In this arrangement, a system call for an operation such as "open file" returns a valid file handle; read/write operations on that file handle require no further system call interception, which improves performance, and are instead handled by the mounted virtual file system.

Further optionally, in the data processing system, in (4), the mounted virtual file system is implemented in a temporary directory, wherein if no mount point already exists, the data processing system is operable to automatically create a mount point so that the virtual file system exists.
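The redirect-and-mount-on-demand behaviour described above can be sketched as follows. This is a minimal illustration only, not the disclosed implementation: the `/virtual/` prefix, the `VirtualMount` class and the `mount_fn` callback (which stands in for whatever starts the FUSE file system) are all hypothetical names introduced for the example.

```python
import os

VIRTUAL_PREFIX = "/virtual/"  # hypothetical prefix that marks virtual files


class VirtualMount:
    """Tracks one virtual file system mount and redirects paths into it."""

    def __init__(self, mount_dir, mount_fn):
        self.mount_dir = mount_dir
        self.mount_fn = mount_fn   # hypothetical callback, e.g. starts a FUSE daemon
        self.mounted = False

    def ensure(self):
        """Create the mount point and mount the file system if not done yet."""
        if not self.mounted:
            # Restricted-access directory, created only when first needed.
            os.makedirs(self.mount_dir, mode=0o700, exist_ok=True)
            self.mount_fn(self.mount_dir)
            self.mounted = True

    def redirect(self, path):
        """Return the path that the modified system call should use."""
        if not path.startswith(VIRTUAL_PREFIX):
            return path            # ordinary files are left untouched
        self.ensure()
        return os.path.join(self.mount_dir, path[len(VIRTUAL_PREFIX):])
```

Only the open operation needs this redirection; once the returned path has been opened, reads and writes on the resulting handle would be served by the mounted file system itself.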

Optionally, in the data processing system, the dynamic linker may be operable to intercept a system call of the executable sub-process by just-in-time recompiling of a portion of the binary code prior to running the binary code, wherein,

(1) virtual files that do not exist in the file system apparatus can be accessed;

(2) a virtual file is an instant conversion of one or more real files on a file system device, wherein the instant conversion is converted from one file format to another file format;

(3) the real file is a compressed genome file format, and the virtual file is another genome file format; and

(4) the system call to open the virtual file is intercepted and handled by first ensuring that the mounted virtual file system is available for use, and then redirecting the system call to the file on the virtual file system.

Further optionally, in the data processing system, the mounted virtual file system is implemented in a temporary directory, wherein a mount point is automatically created so that the virtual file system exists.

Optionally, in the data processing system, the immediate conversion of transparent access to genomic data is operable to combine content from a plurality of genomic files and display it as one genomic file, applicable to any one or combination of:

(1) wherein the merged content is quality score data;

(2) wherein the merged content is read name information;

(3) wherein the merged content is an ancillary tag of the mapped genomic read content;

(4) wherein the merged content comprises different genomic regions;

(5) wherein the merged content comprises a plurality of genomic samples/specimens; and

(6) where separate genome files represent different regions, samples, or other separable portions of a given genome.
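As an illustration of merging separately stored content into one displayed genomic file, the sketch below combines read names, sequences and quality scores into FASTQ text. FASTQ is chosen here only as a familiar target format and is an assumption; the function name `merge_layers` and the use of in-memory lists in place of real files are likewise hypothetical.

```python
def merge_layers(names, sequences, qualities):
    """Combine separately stored layers into FASTQ-formatted text.

    Each argument is a list with one entry per read; in a real system each
    list would be decoded from its own file.
    """
    if not (len(names) == len(sequences) == len(qualities)):
        raise ValueError("layers must describe the same number of reads")
    records = []
    for name, seq, qual in zip(names, sequences, qualities):
        if len(seq) != len(qual):
            raise ValueError("read %s: sequence/quality length mismatch" % name)
        # One FASTQ record: name line, bases, separator, quality string.
        records.append("@%s\n%s\n+\n%s\n" % (name, seq, qual))
    return "".join(records)
```

The same pattern would extend to the other listed cases, for example concatenating per-region or per-sample files in a defined order before presenting them as one virtual file.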

Optionally, in the data processing system, the dynamic linker is forced to load and in operation uses an interception library that intercepts file access operations of the executable software product, wherein:

(1) the creation of a new child process retains the interception library in the associated interception environment variable.

Optionally, in the data processing system, the dynamic linker is operable to employ an interception library that intercepts file access operations of the executable software product, wherein:

(1) the interception library detects whether the program is being submitted to the work submission system, and if so:

(2) a temporary shell script is created, wherein the shell script preserves the interception environment variable before calling the original program; and

(3) the new temporary script is submitted to the work submission system in place of the original program.

Further optionally, in the data processing system, prior to performing (3), the data processing system is operable to:

(4) detect whether the original program is a script containing work submission system specific metadata, and if so, copy the metadata into the new temporary shell script.

Optionally, the data processing system is operable to provide transparent access to genomic data by redirecting accesses under a virtual path (e.g., /pgs3/) to cloud storage, converting each operation into an equivalent request that is sent via the Internet to the cloud storage provider.

This raises some complications: accessing genomic data on cloud storage not only challenges the ability of tools and pipelines to stream directly from cloud storage (stream access), but also incurs considerable delay and cost because of the sheer size of genomic data.

It would be advantageous if objects in cloud storage used a better compressed file format and were instantly converted to a standard file format. In this scheme, file system access to a standard file format is converted to equivalent cloud access to a better compressed file format. Less data will need to be transferred from the cloud object storage due to better compression, thus speeding up access, but converting better compressed data into a standard file format requires some computational overhead.

Further optionally, in the data processing system:

(1) the data processing system may access corresponding virtual files that do not exist in the cloud storage;

(2) a virtual file is an immediate conversion of one or more corresponding real objects on a cloud storage device from one file format to another;

(3) where the real object is in a compressed genome file format and the virtual file is in another genome file format.

Optionally, the data processing system is operable to provide transparent access to the genomic data such that the dynamic linker is operable to provide an interception library that handles memory-mapped file access operations of the executable file on the virtual file by:

(1) registering a page fault interrupt handler;

(2) creating a virtual area, but protecting the virtual area from reading and writing, the virtual area having a size required by a memory mapped file mapping operation;

(3) upon read access to one or more particular protected pages, replacing one or more pages with corresponding translated content from the real file and allowing the one or more particular protected pages to be accessible for reading and/or writing; and

(4) maintaining a list of the pages of translated content, and, when a memory consumption limit is reached, selecting one or more of those pages, releasing their memory, and protecting those page regions again against further reads and writes; and

(5) wherein the selection of which page to release is made using LRU (Least Recently Used), LFU (Least Frequently Used), or other alternative heuristics.

Optionally, the data processing system is operable to provide transparent access to the genomic data such that the dynamic linker is operable to provide an interception library that intercepts memory-mapped file access operations of the executable file on the virtual file, wherein:

(1) a system call to memory-map a virtual file is intercepted and handled by first ensuring that the mounted virtual file system is available for use (for example in a temporary directory, where "ensuring" means that if a mount point does not already exist, the mount point is automatically created so that the virtual file system exists), and then redirecting the memory mapping operation to the file on the virtual file system.

According to another aspect, a method of using a data processing system 10 is provided, the data processing system 10 comprising a data processing apparatus 20, wherein the data processing apparatus 20 comprises computing hardware 30 for executing one or more software products 40, wherein the execution of the one or more software products 40 configures the data processing apparatus 20 to access data from a file system device,

characterised in that the method comprises operating the data processing apparatus to load a dynamic linker operable to include an interception library to intercept file access operations of the executable software product, wherein:

(1) virtual files that do not exist in the filesystem device are accessible by the executable software product;

(2) the virtual file is the result of any of the following: (a) an instant conversion of one or more real files in the file system device, wherein the instant conversion is from one file format to another file format, or (b) a conversion of an access operation into an equivalent cloud object storage access operation on a real object located on a cloud object store, or (c) a combination of the instant conversion (a) and the object access conversion (b).

Optionally, in such a method, at least one of:

(3) the virtual file is in a different file format from the real file/object, wherein the compressed file format and the virtual file are in different file formats from each other; and

(4) the file format is a genome file format.

Optionally, both (3) and (4) apply to the methods given in this disclosure.

Optionally, the compressed file format is a compressed genome file format and the another file format is another genome file format.

Referring to FIG. 2, a flowchart of the steps of a method implemented using the data processing system 10 of FIG. 1 is shown.

The method comprises a first step 200 of providing a data processing system comprising data processing means, wherein the data processing means comprises computing hardware for executing one or more software products, wherein the execution of the one or more software products configures the data processing means to access data from a file system device.

The method further comprises a second step 210 of operating the data processing apparatus 20 to load a dynamic linker comprising an interception library to intercept file access operations of the executable software product, wherein:

(1) virtual files that do not exist in the filesystem device are accessible by the executable software product;

(2) the virtual file is the result of any of the following: (a) an instant conversion of one or more real files in the file system device, wherein the instant conversion is from one file format to another file format, or (b) a conversion of an access operation into an equivalent cloud object storage access operation on a real object located on a cloud object store, or (c) a combination of the instant conversion (a) and the object access conversion (b).

Optionally, in the method of FIG. 2, at least one of:

(3) the virtual file is in a file format different from the real file/object, wherein the compressed file format and the virtual file format are different file formats; and

(4) the file format is a genome file format.

Optionally, both (3) and (4) apply to the method of FIG. 2.

According to another aspect, a computer program product is provided that includes a non-transitory computer readable storage medium having computer readable instructions stored thereon that are executable by a computing device that includes processing hardware to perform the above-described method.

Modifications may be made to the embodiments of the present disclosure described hereinabove without departing from the scope of the present disclosure as defined by the appended claims. Expressions such as "comprising," "including," "containing," "consisting of," "having," "being," and the like, used for describing and claiming the invention are intended to be interpreted in a non-exclusive manner, i.e., in a manner that also allows for items, components, or elements not expressly described. Reference to the singular is also to be construed to relate to the plural. Numerals included within parentheses in the accompanying claims are intended to assist understanding of the claims and should not be construed in any way to limit subject matter claimed by these claims.

The phrases "in an embodiment," "according to an embodiment," and the like generally mean that a particular feature, structure, or characteristic following the phrase is included in at least one embodiment of the present disclosure, and may be included in more than one embodiment of the present disclosure. Importantly, such phrases are not necessarily referring to the same embodiment.

Appendix (inventor description)

Transparent access layer

Explanation of the dynamic linker, from Wikipedia:

"The dynamic linker can be influenced into modifying its behavior during either the program's execution or the program's linking, and examples of this can be seen in the run-time linker manual pages for various Unix-like systems. A typical modification of this behavior is the use of the LD_LIBRARY_PATH and LD_PRELOAD environment variables, which adjust the runtime linking process by searching for shared libraries at alternate locations and by forcibly loading and linking libraries that would otherwise not be, respectively. An example is zlibc, also known as uncompress.so, which facilitates transparent decompression when used through the LD_PRELOAD hack; consequently, it is possible to read pre-compressed (gzipped) file data on BSD and Linux systems as if the files were not compressed, essentially allowing a user to add transparent compression to the underlying filesystem, although with some caveats. The mechanism is flexible, allowing trivial adaptation of the same code to perform additional or alternate processing of data during the file read, prior to the provision of said data to the user process that has requested it."

Transparent access to genomic data (Main method)

The method comprises the following steps:

forcibly loading an interception library through a dynamic linker, wherein the interception library intercepts file access operations of the executable file, so that:

(1) virtual files that do not exist in the file system can be accessed;

(2) the virtual file is the instant conversion of a real file in a file system from one file format to another file format; and

(3) where the real file is in a compressed genome file format and the virtual file is in another genome file format.
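The three properties listed above can be illustrated with a deliberately simple stand-in. In the sketch below, gzip plays the role of the compressed real format and plain bytes play the role of the virtual format; actual genomic formats would need their own codecs. The function name `open_virtual` and the ".gz" naming convention are assumptions made for the example, and the function stands in for what an intercepted open call would do.

```python
import gzip
import os


def open_virtual(path):
    """Open path for binary reading, synthesizing it from path + '.gz' if absent.

    Mirrors the behaviour of an intercepted open call: a file that does not
    exist on disk is still readable when a compressed real file backs it.
    """
    if os.path.exists(path):
        return open(path, "rb")         # real file: no conversion needed
    real = path + ".gz"                 # hypothetical mapping to the real file
    if os.path.exists(real):
        return gzip.open(real, "rb")    # converted on the fly while reading
    raise FileNotFoundError(path)
```

A caller reads the returned object exactly as it would an ordinary file, which is the sense in which the access is transparent.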

Alternative method 1 for transparent access to genomic data:

System call interception is used, for example by means of ptrace.

Explanation of ptrace, from Wikipedia:

"ptrace is used by debuggers (such as gdb and dbx), by tracing tools like strace and ltrace, and by code coverage tools. ptrace is also used by specialized programs to patch running programs, to avoid unfixed bugs or to overcome security features. It can further be used as a sandbox and as a run-time environment simulator (like emulating root access for non-root software).

By attaching to another process using the ptrace call, a tool has extensive control over the operation of its target. This includes manipulation of its file descriptors, memory, and registers. It can single-step through the target's code, can observe and intercept system calls and their results, and can manipulate the target's signal handlers and both receive and send signals on its behalf. The ability to write into the target's memory allows not only its data store to be changed, but also the application's own code segment, allowing the controller to install breakpoints and patch the running code of the target."

The following are prior art for this interception approach, but they do not involve format conversion or a virtual file system:

http://www.alfonsobeato.net/c/modifying-system-call-arguments-with-ptrace/

http://www.alfonsobeato.net/c/filter-and-modify-system-calls-with-seccomp-and-ptrace/

the method comprises the following steps:

intercepting file system calls by means of an executable-tracing facility provided by the kernel, thereby forcibly intercepting the system calls of a (child process) executable, such that:

(1) virtual files that do not exist in the file system can be accessed;

(2) the virtual file is the instant conversion of a real file in a file system from one file format to another file format;

(3) wherein the real file is in a compressed genome file format and the virtual file is in another genome file format; and

(4) wherein a system call to open the virtual file is intercepted and processed by first ensuring that the mounted virtual file system is available for use (for example in a temporary directory, where "ensuring" means that if a mount point does not already exist, the mount point is automatically created so that the virtual file system exists), and then redirecting the system call to the file on the virtual file system.

Alternative method 2 for transparent access to genomic data:

This is based on instrumentation, such as Intel's Pin.

Explanation of Pin, from Wikipedia:

"Pin performs instrumentation by taking control of the program just after it loads into the memory. Then just-in-time recompiles (JIT) small sections of the binary code using Pin just before it is run. New instructions to perform analysis are added to the recompiled code. These new instructions come from the Pintool. A large array of optimization techniques are used to obtain the lowest possible running time and memory use overhead. As of June 2010, Pin's average base overhead is 30 percent (without running a pintool)."

The method comprises the following steps:

forcibly intercepting the system calls of a (child process) executable by just-in-time recompiling portions of its binary code immediately before they run, such that:

(1) virtual files that do not exist in the file system can be accessed;

(2) the virtual file is the instant conversion of a real file in a file system from one file format to another file format;

(3) wherein the real file is in a compressed genome file format and the virtual file is in another genome file format; and

(4) wherein a system call to open the virtual file is intercepted and processed by first ensuring that the mounted virtual file system is available for use (for example in a temporary directory, where "ensuring" means that if a mount point does not already exist, the mount point is automatically created so that the virtual file system exists), and then redirecting the system call to the file on the virtual file system.

Note that, as with the primary interception method, the backup methods can also work by intercepting the actual file read (and write) and seek operations on virtual files, not just the opening of files. However, item (4) also allows a more efficient interception approach, in which the file-open operation is redirected to a path on the mounted virtual file system (such as FUSE on Linux), so that subsequent read and seek operations are handled at the mounted virtual file system layer rather than by intercepting the corresponding system calls.

Determining between the methods of accessing genomic data:

the method comprises the following steps:

wherein the executable to be intercepted is checked to see whether it has a dynamic library dependency on an interceptable library (e.g., glibc), and if not, a backup access method is used in place of the primary method. The backup method itself is selected according to whether backup method 1 is available to the executable (e.g., whether the user has sufficient security privileges); if it is not, backup method 2 is used.
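The selection logic above can be sketched as a small decision function. This is an illustration under stated assumptions: the dependency check parses text in the style printed by the `ldd` utility (which the source does not name), and the function names and the string labels returned are hypothetical.

```python
def depends_on(library_stem, ldd_output):
    """Return True if an ldd-style listing names a library starting with library_stem."""
    for line in ldd_output.splitlines():
        fields = line.split()
        if fields and fields[0].startswith(library_stem):
            return True
    return False


def choose_method(ldd_output, may_trace, jit_available):
    """Pick an interception method for one executable.

    may_trace:     whether the parent has sufficient rights for system call tracing
    jit_available: whether a just-in-time recompilation library can be used
    """
    if depends_on("libc.so", ldd_output):
        return "preload"   # primary method: interception library via the dynamic linker
    if may_trace:
        return "ptrace"    # backup method 1
    if jit_available:
        return "jit"       # backup method 2
    return None            # no interception possible
```

A statically linked executable yields no matching library line, so the function falls through to the backup methods in the order given in the text.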

Transparent access to layered/segregated genomic data

The method comprises the following steps:

wherein the immediate conversion of transparent access to genomic data (and its alternatives) can merge content from multiple genomic files and display it as one genomic file, suitable for any one or combination of:

(1) wherein the merged content is quality score data;

(2) wherein the merged content is read name information;

(3) wherein the merged content is an ancillary tag of the mapped genomic read content;

(4) wherein the merged content comprises different genomic regions;

(5) wherein the merged content comprises a plurality of genomic samples/specimens;

wherein the immediate conversion of transparent access to genomic data (and its alternatives) may take one genomic file and display it as multiple genomic files:

(1) wherein different genome files represent different regions of the genome, samples, or other separable portions.

Preserving interception capability across child processes

On Linux, the LD_PRELOAD environment variable can be configured to load the interception library; however, this means that if a process modifies this environment variable, or if a child process is started without it, the interception capability is lost. Similar environment variables exist on macOS and BSD-based systems. We call this the interception environment variable.

The method comprises the following steps:

forcibly loading an interception library for intercepting file access operations of the executable file through a dynamic linker, so that:

(1) creation of a new (child) process retains the interception library in the interception environment variable.
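One way to picture this step is as a function applied to the environment that a process requests for its child. The sketch below is an assumption about how such a step could look, not the disclosed code; `child_env` and the library path used in the usage note are hypothetical. It relies on the documented fact that LD_PRELOAD entries may be separated by spaces or colons.

```python
INTERCEPT_VAR = "LD_PRELOAD"  # the interception environment variable on Linux


def child_env(requested_env, library):
    """Return a copy of requested_env in which library is still preloaded.

    requested_env is the environment the parent asked for, which may have
    dropped or altered the interception variable.
    """
    env = dict(requested_env)
    # Normalise both separators to a single list of entries.
    current = env.get(INTERCEPT_VAR, "").replace(":", " ").split()
    if library not in current:
        current.insert(0, library)
    env[INTERCEPT_VAR] = ":".join(current)
    return env
```

For example, `child_env({}, "/opt/lib/intercept.so")` yields an environment whose LD_PRELOAD names the library, and an intercepted process-creation call would pass that environment on in place of the one requested.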

Preserving interception capability across a work submission system

Because job submissions are treated as a security risk, a work submission system (e.g., on an HPC system) may not preserve the interception environment variable of a submitted job.

The method comprises the following steps:

forcibly loading an interception library for intercepting file access operations of the executable file through a dynamic linker, so that:

(1) it detects whether the program is being submitted to the work submission system and, if so:

(2) creating a temporary shell script, wherein the shell script preserves the interception environment variable before calling the original program;

(3) optionally, detecting whether the program is a script containing specific metadata of the work submission system, and if so, copying the metadata information into a new temporary shell script; and

(4) the new temporary script is submitted to the work submission system in place of the original program.
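Steps (2) and (3) above can be sketched as a function that generates the text of the temporary wrapper script. The directive prefixes shown (`#SBATCH` for Slurm, `#PBS` for PBS, `#$` for Grid Engine) are examples of scheduler metadata and are an assumption, since the source does not name a scheduler; `make_wrapper` is a hypothetical name.

```python
import shlex

# Example scheduler directive prefixes (assumed, not taken from the source).
DIRECTIVE_PREFIXES = ("#SBATCH", "#PBS", "#$")


def make_wrapper(original_path, original_text, preload_value):
    """Build a shell script that restores the interception variable and then
    runs the original program, carrying over any scheduler directives."""
    lines = ["#!/bin/sh"]
    for line in original_text.splitlines():
        if line.startswith(DIRECTIVE_PREFIXES):
            lines.append(line)          # copy scheduler metadata unchanged
    lines.append("export LD_PRELOAD=" + shlex.quote(preload_value))
    lines.append("exec " + shlex.quote(original_path) + ' "$@"')
    return "\n".join(lines) + "\n"
```

The directives are placed directly after the interpreter line because schedulers generally stop scanning for them at the first executable command. The returned text would be written to a temporary file and submitted in place of the original program.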

Extension of cloud storage

This allows objects in cloud storage (e.g., AWS S3) to be accessed as virtual files. Unlike the prior art, the virtual file is provided by the interception methods described above and has a file format different from that of the underlying object.

The method comprises the following steps:

in accordance with transparent access to genomic data (and its alternatives), redirecting accesses under a virtual path (e.g., /pgs3/) to cloud storage by converting each operation into an equivalent request that is sent via the Internet to the cloud storage provider, such that:

(1) corresponding virtual files that do not exist in cloud storage can be accessed;

(2) wherein a virtual file is an immediate conversion of a corresponding real object on cloud storage from one file format to another; and

(3) where the real object is in a compressed genome file format and the virtual file is in another genome file format.

For example, an access to /pgs3/mybucket/myfile.bam is redirected to a .cram object (a different file format) at the corresponding s3: location, even though no corresponding .bam object exists at that location.
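The path translation in this example can be sketched as follows. The sketch assumes that the virtual prefix is written `/pgs3/` (treating the space in the source as an extraction artifact), that the first path component names the bucket, and that a `.bam` virtual name maps to a `.cram` object key; `to_object` and `FORMAT_MAP` are hypothetical names, and no request is actually sent to a cloud provider.

```python
VIRTUAL_ROOT = "/pgs3/"          # assumed spelling of the virtual path prefix
FORMAT_MAP = {".bam": ".cram"}   # virtual extension -> extension of the real object


def to_object(virtual_path):
    """Return (bucket, key) of the real object behind a virtual path.

    Returns None for paths outside the virtual root or without an object key.
    """
    if not virtual_path.startswith(VIRTUAL_ROOT):
        return None
    bucket, _, key = virtual_path[len(VIRTUAL_ROOT):].partition("/")
    if not bucket or not key:
        return None
    for virtual_ext, real_ext in FORMAT_MAP.items():
        if key.endswith(virtual_ext):
            key = key[: -len(virtual_ext)] + real_ext
            break
    return bucket, key
```

With these assumptions, `to_object("/pgs3/mybucket/myfile.bam")` names the object `myfile.cram` in bucket `mybucket`; reads on the virtual file would then be served by fetching that object and converting its content on the fly.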

Processing memory mapped files

Unfortunately, memory-mapped file access is more difficult to handle when the interception library that intercepts file access operations of the executable is forcibly loaded through the dynamic linker. This is because such accesses occur simply as reads and writes of memory locations within the memory-mapped file region, rather than as calls to interceptable library functions.

The method comprises the following steps:

in accordance with transparent access to genomic data (and its alternatives), intercepting memory-mapped file access operations of the executable on the virtual file by:

(1) registering a page fault interrupt handler;

(2) creating a virtual area, but protecting the virtual area from reading and writing, the virtual area having a size required by a memory mapped file mapping operation;

(3) on a read access to a protected page, replacing the page with the corresponding translated content from the real file (optionally together with surrounding pages, or optionally pre-fetching subsequent pages) and making the page accessible for reading (and/or writing);

(4) maintaining a list of the pages of translated content, and, when a memory consumption limit is reached, selecting one or more of those pages, releasing their memory, and protecting those page regions again against further reads and writes; and

(5) where the selection of which page to release is made using LRU (Least Recently Used), LFU (Least Frequently Used), or another replacement heuristic.
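The bookkeeping in steps (3) to (5) can be simulated in a few lines. This is a model of the page list and LRU choice only: a real implementation would run in native code, reserve the region with mmap, and change page protections from a fault handler, none of which is attempted here. `PageCache` and the `translate` callback are hypothetical names.

```python
from collections import OrderedDict


class PageCache:
    """Simulates the list of resident translated pages with LRU eviction."""

    def __init__(self, translate, max_pages):
        if max_pages < 1:
            raise ValueError("max_pages must be at least 1")
        self.translate = translate     # page index -> bytes of converted content
        self.max_pages = max_pages     # stands in for the memory consumption limit
        self.resident = OrderedDict()  # page index -> content, least recent first

    def access(self, page):
        """Model an access to one page of the mapped region."""
        if page in self.resident:
            self.resident.move_to_end(page)   # mark as most recently used
            return self.resident[page]
        # "Page fault": make room first, then fill the page with converted data.
        while len(self.resident) >= self.max_pages:
            # Evict the least recently used page; real code would also
            # re-protect the page so that the next access faults again.
            self.resident.popitem(last=False)
        self.resident[page] = self.translate(page)
        return self.resident[page]
```

One caveat for a real system: accesses to pages that are already unprotected do not fault, so the handler sees only first touches and its recency information is approximate, which is one reason the text allows other heuristics.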

Standby method for processing memory mapping file

In accordance with transparent access to genomic data (and its alternatives), memory-mapped file access operations of the executable on the virtual file are intercepted as follows:

(1) wherein a system call to memory-map the virtual file is intercepted and processed by first ensuring that the mounted virtual file system is available for use (for example in a temporary directory, where "ensuring" means that if a mount point does not already exist, the mount point is automatically created so that the virtual file system exists), and then redirecting the memory mapping operation to the file on the virtual file system.
