System and method for dataflow graph optimization

Document No.: 884180 Publication date: 2021-03-19

Reading note: This technology, System and method for dataflow graph optimization, was designed by G·A·迪基 on 2019-05-22. Its main content is as follows: At least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to: obtain an automatically generated initial dataflow graph that includes a first plurality of nodes representing a first plurality of data processing operations and a first plurality of links representing data flows between nodes of the first plurality of nodes; and update the initial dataflow graph by iteratively applying dataflow graph optimization rules to generate an updated dataflow graph that includes a second plurality of nodes representing a second plurality of data processing operations and a second plurality of links representing data flows between nodes of the second plurality of nodes.

1. At least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to:

obtain an automatically generated initial dataflow graph that includes a first plurality of nodes representing a first plurality of data processing operations and a first plurality of links representing data flows between nodes of the first plurality of nodes;

update the initial dataflow graph by iteratively applying dataflow graph optimization rules to generate an updated dataflow graph that includes a second plurality of nodes representing a second plurality of data processing operations and a second plurality of links representing data flows between nodes of the second plurality of nodes, wherein the second plurality of nodes includes a node representing a first data processing operation and another node representing a second data processing operation; and

execute the updated dataflow graph at least in part by performing the first data processing operation using a first computer system process and performing the second data processing operation using a second computer system process that is different from the first computer system process.

2. The at least one non-transitory computer-readable storage medium of claim 1, wherein the processor-executable instructions further cause the at least one computer hardware processor to:

assign a process layout to each of one or more nodes in the updated dataflow graph.

3. The at least one non-transitory computer-readable storage medium of claim 2, wherein the updated dataflow graph is executed according to the assigned one or more process layouts.

4. The at least one non-transitory computer-readable storage medium of claim 1, wherein the second plurality of nodes has fewer nodes than the first plurality of nodes.

5. The at least one non-transitory computer-readable storage medium of claim 4, wherein the second plurality of links has fewer links than the first plurality of links.

6. The at least one non-transitory computer-readable storage medium of claim 1, wherein generating the updated dataflow graph includes:

selecting a first optimization rule;

identifying a first portion of the initial dataflow graph to which the first optimization rule is to be applied; and

applying the first optimization rule to the first portion of the initial dataflow graph.

7. The at least one non-transitory computer-readable storage medium of claim 6, wherein generating the updated dataflow graph further includes:

selecting a second optimization rule different from the first optimization rule;

identifying a second portion of the initial dataflow graph to which the second optimization rule is to be applied; and

applying the second optimization rule to the second portion of the initial dataflow graph.

8. The at least one non-transitory computer-readable storage medium of claim 6, wherein the first portion of the initial dataflow graph is identified using a dataflow graph pattern matching language.

9. The at least one non-transitory computer-readable storage medium of claim 6, wherein identifying the first portion of the initial dataflow graph includes identifying a first node that represents a first data processing operation that commutes with a second data processing operation represented by a second node connected to the first node.

10. The at least one non-transitory computer-readable storage medium of claim 6, wherein applying the first optimization rule comprises applying an optimization selected from the group consisting of: removing redundant data processing operations, strength reduction optimization, combining operations optimization, width reduction optimization, and deduplication optimization.

11. The at least one non-transitory computer-readable storage medium of claim 1, wherein generating the updated dataflow graph includes: identifying a first node representing a redundant operation; and removing the first node from the initial dataflow graph.

12. The at least one non-transitory computer-readable storage medium of claim 1, wherein generating the updated dataflow graph includes: identifying two nodes in the initial dataflow graph that represent respective sort data processing operations; and replacing the two nodes with a single node representing a sort data processing operation.

13. The at least one non-transitory computer-readable storage medium of claim 12, wherein the two nodes are not adjacent to each other in the initial dataflow graph.

14. The at least one non-transitory computer-readable storage medium of claim 1, wherein generating the updated dataflow graph includes: replacing a first node representing a first data processing operation with a second node representing a second data processing operation, the second data processing operation being of a weaker type than the first data processing operation.

15. The at least one non-transitory computer-readable storage medium of claim 1, wherein generating the updated dataflow graph includes: identifying two nodes in the initial dataflow graph that represent respective join data processing operations; and replacing the two nodes with a single node representing a join data processing operation.

16. The at least one non-transitory computer-readable storage medium of claim 1, wherein obtaining the automatically generated initial dataflow graph includes: automatically generating the initial dataflow graph.

17. The at least one non-transitory computer-readable storage medium of claim 1, wherein automatically generating the initial dataflow graph includes:

obtaining a Structured Query Language (SQL) query;

generating a query plan for the SQL query; and

generating the initial dataflow graph using the query plan.

18. A method, comprising:

using at least one computer hardware processor:

obtaining an automatically generated initial dataflow graph that includes a first plurality of nodes representing a first plurality of data processing operations and a first plurality of links representing data flows between nodes of the first plurality of nodes;

updating the initial dataflow graph by iteratively applying dataflow graph optimization rules to generate an updated dataflow graph that includes a second plurality of nodes representing a second plurality of data processing operations and a second plurality of links representing data flows between nodes of the second plurality of nodes, wherein the second plurality of nodes includes a node representing a first data processing operation and another node representing the second data processing operation; and

executing the updated dataflow graph at least in part by performing the first data processing operation using a first computer system process and performing the second data processing operation using a second computer system process that is different from the first computer system process.

19. The method of claim 18, further comprising:

assigning a process layout to each of one or more nodes in the updated dataflow graph,

wherein the updated dataflow graph is executed according to the assigned one or more process layouts.

20. The method of claim 18, wherein generating the updated dataflow graph includes:

selecting a first optimization rule;

identifying a first portion of the initial dataflow graph to which the first optimization rule is to be applied; and

applying the first optimization rule to the first portion of the initial dataflow graph.

21. The method of claim 20, wherein identifying the first portion of the initial dataflow graph includes identifying a first node representing a first data processing operation that commutes with a second data processing operation represented by a second node connected to the first node.

22. The method of claim 20, wherein applying the first optimization rule comprises applying an optimization selected from the group consisting of: removing redundant data processing operations, strength reduction optimization, combining operations optimization, width reduction optimization, and deduplication optimization.

23. The method of claim 18, further comprising:

obtaining a Structured Query Language (SQL) query;

generating a query plan for the SQL query; and

generating the initial dataflow graph using the query plan.

24. A data processing system comprising:

at least one computer hardware processor; and

at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to:

obtain an automatically generated initial dataflow graph that includes a first plurality of nodes representing a first plurality of data processing operations and a first plurality of links representing data flows between nodes of the first plurality of nodes;

update the initial dataflow graph by iteratively applying dataflow graph optimization rules to generate an updated dataflow graph that includes a second plurality of nodes representing a second plurality of data processing operations and a second plurality of links representing data flows between nodes of the second plurality of nodes, wherein the second plurality of nodes includes a node representing a first data processing operation and another node representing a second data processing operation; and

execute the updated dataflow graph at least in part by performing the first data processing operation using a first computer system process and performing the second data processing operation using a second computer system process that is different from the first computer system process.

25. The data processing system of claim 24, wherein the processor-executable instructions further cause the at least one computer hardware processor to:

assign a process layout to each of one or more nodes in the updated dataflow graph,

wherein the updated dataflow graph is executed according to the assigned one or more process layouts.

26. The data processing system of claim 24, wherein generating the updated dataflow graph includes:

selecting a first optimization rule;

identifying a first portion of the initial dataflow graph to which the first optimization rule is to be applied; and

applying the first optimization rule to the first portion of the initial dataflow graph.

27. The data processing system of claim 26, wherein identifying the first portion of the initial dataflow graph includes identifying a first node that represents a first data processing operation that commutes with a second data processing operation represented by a second node connected to the first node.

28. The data processing system of claim 27, wherein applying the first optimization rule comprises applying an optimization selected from the group consisting of: removing redundant data processing operations, strength reduction optimization, combining operations optimization, width reduction optimization, and deduplication optimization.

29. The data processing system of claim 24, wherein the processor-executable instructions further cause the at least one computer hardware processor to:

obtain a Structured Query Language (SQL) query;

generate a query plan for the SQL query; and

generate the initial dataflow graph using the query plan.

30. At least one non-transitory computer-readable storage medium storing processor-executable instructions, the processor-executable instructions comprising:

means for obtaining an automatically generated initial dataflow graph that includes a first plurality of nodes representing a first plurality of data processing operations and a first plurality of links representing data flows between nodes of the first plurality of nodes;

means for updating the initial dataflow graph by iteratively applying dataflow graph optimization rules to generate an updated dataflow graph that includes a second plurality of nodes representing a second plurality of data processing operations and a second plurality of links representing data flows between nodes of the second plurality of nodes, wherein the second plurality of nodes includes a node representing a first data processing operation and another node representing the second data processing operation; and

means for executing the updated dataflow graph at least in part by performing the first data processing operation using a first computer system process and performing the second data processing operation using a second computer system process different from the first computer system process.

31. At least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to:

obtain an automatically generated initial dataflow graph that includes a first plurality of nodes representing a first plurality of data processing operations and a first plurality of links representing data flows between nodes of the first plurality of nodes; and

update the initial dataflow graph by iteratively applying dataflow graph optimization rules to generate an updated dataflow graph that includes a second plurality of nodes representing a second plurality of data processing operations and a second plurality of links representing data flows between nodes of the second plurality of nodes, wherein the second plurality of nodes includes a node representing a first data processing operation and another node representing a second data processing operation, the generating including:

identifying a first portion of the initial dataflow graph to which a first optimization rule is to be applied at least in part by identifying a first node representing a first data processing operation that commutes with a second data processing operation represented by a second node connected to the first node; and

applying the first optimization rule to the first portion of the initial dataflow graph.

32. The at least one non-transitory computer-readable storage medium of claim 31, wherein the processor-executable instructions further cause the at least one computer hardware processor to:

assign a process layout to each of one or more nodes in the updated dataflow graph,

wherein the updated dataflow graph is executed according to the assigned one or more process layouts.

33. The at least one non-transitory computer-readable storage medium of claim 31, wherein the first data processing operation is a sort operation.

34. The at least one non-transitory computer-readable storage medium of claim 31, wherein applying the first optimization rule comprises applying an optimization selected from the group consisting of: removing redundant data processing operations, strength reduction optimization, combining operations optimization, width reduction optimization, and deduplication optimization.

35. The at least one non-transitory computer-readable storage medium of claim 31, wherein the processor-executable instructions further cause the at least one computer hardware processor to:

obtain a Structured Query Language (SQL) query;

generate a query plan for the SQL query; and

generate the initial dataflow graph using the query plan.

36. At least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to:

obtain an automatically generated initial dataflow graph that includes a first plurality of nodes representing a first plurality of data processing operations and a first plurality of links representing data flows between nodes of the first plurality of nodes; and

update the initial dataflow graph by iteratively applying dataflow graph optimization rules to generate an updated dataflow graph that includes a second plurality of nodes representing a second plurality of data processing operations and a second plurality of links representing data flows between nodes of the second plurality of nodes, wherein the second plurality of nodes includes a node representing a first data processing operation and another node representing a second data processing operation, the generating including:

selecting a first optimization rule to be applied to a first portion of the initial dataflow graph from: removing redundant data processing operations, strength reduction optimization, width reduction optimization, and deduplication optimization; and

applying the first optimization rule to the first portion of the initial dataflow graph.

37. The at least one non-transitory computer-readable storage medium of claim 36, wherein the processor-executable instructions further cause the at least one computer hardware processor to:

assign a process layout to each of one or more nodes in the updated dataflow graph,

wherein the updated dataflow graph is executed according to the assigned one or more process layouts.

38. The at least one non-transitory computer-readable storage medium of claim 36, wherein the processor-executable instructions further cause the at least one computer hardware processor to:

obtain a Structured Query Language (SQL) query;

generate a query plan for the SQL query; and

generate the initial dataflow graph using the query plan.

Background

A data processing system may use one or more computer programs to process data. One or more of those computer programs may be developed as dataflow graphs. A dataflow graph may include components (referred to as "nodes" or "vertices") that represent data processing operations to be performed on input data, and links between the components that represent flows of data. The nodes of a dataflow graph may include: one or more input nodes representing respective input data sets; one or more output nodes representing respective output data sets; and one or more nodes representing data processing operations to be performed on the data. Techniques for executing computations encoded by dataflow graphs are described in U.S. Pat. No. 5,966,072, entitled "Executing Computations Expressed as Graphs," and U.S. Pat. No. 7,716,630, entitled "Managing Parameters for Graph-Based Computations," each of which is incorporated herein by reference in its entirety.
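By way of illustration only, the structure described above (input nodes, output nodes, operation nodes, and links representing flows of data) might be represented in memory as in the following Python sketch. The class names `Node` and `DataflowGraph`, their fields, and the example node names are assumptions introduced for this illustration and are not taken from the referenced patents.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One component of a dataflow graph (illustrative, hypothetical type)."""
    name: str
    kind: str            # "input", "output", or "operation"
    operation: str = ""  # e.g. "filter" or "sort"; empty for data sets


@dataclass
class DataflowGraph:
    nodes: dict = field(default_factory=dict)  # node name -> Node
    links: list = field(default_factory=list)  # (source name, target name)

    def add_node(self, name, kind, operation=""):
        self.nodes[name] = Node(name, kind, operation)

    def add_link(self, source, target):
        # A link represents a flow of data from one node to another.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be nodes of the graph")
        self.links.append((source, target))

    def downstream(self, name):
        # Names of the nodes that receive data from the named node.
        return [t for s, t in self.links if s == name]


# One input data set, two data processing operations, one output data set.
graph = DataflowGraph()
graph.add_node("customers", "input")
graph.add_node("keep_active", "operation", "filter")
graph.add_node("by_id", "operation", "sort")
graph.add_node("report", "output")
graph.add_link("customers", "keep_active")
graph.add_link("keep_active", "by_id")
graph.add_link("by_id", "report")
```

Storing links as explicit (source, target) pairs keeps the flows of data separate from the operations, so a rule that rewires or removes a node only has to edit the link list.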

Disclosure of Invention

Some embodiments relate to a data processing system comprising: at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to: obtain an automatically generated initial dataflow graph that includes a first plurality of nodes representing a first plurality of data processing operations and a first plurality of links representing data flows between nodes of the first plurality of nodes; update the initial dataflow graph by iteratively applying dataflow graph optimization rules to generate an updated dataflow graph that includes a second plurality of nodes representing a second plurality of data processing operations and a second plurality of links representing data flows between nodes of the second plurality of nodes, wherein the second plurality of nodes includes a node representing a first data processing operation and another node representing a second data processing operation; and execute the updated dataflow graph at least in part by performing the first data processing operation using a first computer system process and performing the second data processing operation using a second computer system process different from the first computer system process.

Some embodiments relate to a method comprising, using at least one computer hardware processor: obtaining an automatically generated initial dataflow graph that includes a first plurality of nodes representing a first plurality of data processing operations and a first plurality of links representing data flows between nodes of the first plurality of nodes; updating the initial dataflow graph by iteratively applying dataflow graph optimization rules to generate an updated dataflow graph that includes a second plurality of nodes representing a second plurality of data processing operations and a second plurality of links representing data flows between nodes of the second plurality of nodes, wherein the second plurality of nodes includes a node representing a first data processing operation and another node representing a second data processing operation; and executing the updated dataflow graph at least in part by performing the first data processing operation using a first computer system process and performing the second data processing operation using a second computer system process different from the first computer system process.

Some embodiments relate to at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to: obtain an automatically generated initial dataflow graph that includes a first plurality of nodes representing a first plurality of data processing operations and a first plurality of links representing data flows between nodes of the first plurality of nodes; update the initial dataflow graph by iteratively applying dataflow graph optimization rules to generate an updated dataflow graph that includes a second plurality of nodes representing a second plurality of data processing operations and a second plurality of links representing data flows between nodes of the second plurality of nodes, wherein the second plurality of nodes includes a node representing a first data processing operation and another node representing a second data processing operation; and execute the updated dataflow graph at least in part by performing the first data processing operation using a first computer system process and performing the second data processing operation using a second computer system process different from the first computer system process.

Some embodiments relate to at least one non-transitory computer-readable storage medium storing processor-executable instructions, the processor-executable instructions comprising: means for obtaining an automatically generated initial dataflow graph that includes a first plurality of nodes representing a first plurality of data processing operations and a first plurality of links representing data flows between nodes of the first plurality of nodes; means for updating the initial dataflow graph by iteratively applying dataflow graph optimization rules to generate an updated dataflow graph that includes a second plurality of nodes representing a second plurality of data processing operations and a second plurality of links representing data flows between nodes of the second plurality of nodes, wherein the second plurality of nodes includes a node representing a first data processing operation and another node representing a second data processing operation; and means for executing the updated dataflow graph at least in part by performing the first data processing operation using a first computer system process and performing the second data processing operation using a second computer system process different from the first computer system process.

Some embodiments relate to a data processing system comprising: at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to: obtain a Structured Query Language (SQL) query; generate a query plan for the SQL query; generate an initial dataflow graph using the query plan, the initial dataflow graph including a first plurality of nodes representing a first plurality of data processing operations; and update the initial dataflow graph by using at least one dataflow graph optimization rule to generate an updated dataflow graph that includes a second plurality of nodes representing a second plurality of data processing operations.

Some embodiments relate to a method comprising, using at least one computer hardware processor: obtaining a Structured Query Language (SQL) query; generating a query plan for the SQL query; generating an initial dataflow graph using the query plan, the initial dataflow graph including a first plurality of nodes representing a first plurality of data processing operations; and updating the initial dataflow graph by using at least one dataflow graph optimization rule to generate an updated dataflow graph that includes a second plurality of nodes representing a second plurality of data processing operations.

Some embodiments relate to at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to: obtain a Structured Query Language (SQL) query; generate a query plan for the SQL query; generate an initial dataflow graph using the query plan, the initial dataflow graph including a first plurality of nodes representing a first plurality of data processing operations; and update the initial dataflow graph by using at least one dataflow graph optimization rule to generate an updated dataflow graph that includes a second plurality of nodes representing a second plurality of data processing operations.

Some embodiments relate to at least one non-transitory computer-readable storage medium storing processor-executable instructions, the processor-executable instructions comprising: means for obtaining a Structured Query Language (SQL) query; means for generating a query plan for the SQL query; means for generating an initial dataflow graph that includes a first plurality of nodes representing a first plurality of data processing operations using the query plan; and means for updating the initial dataflow graph by using at least one dataflow graph optimization rule to generate an updated dataflow graph that includes a second plurality of nodes representing a second plurality of data processing operations.

The foregoing is a non-limiting summary of the invention, which is defined by the appended claims.
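As a minimal, non-limiting sketch of the iterative updating summarized above, the following Python example repeatedly selects a rule, identifies a portion of the graph to which it applies, and applies it until no rule matches. Several simplifications are assumptions made only for this illustration: the graph is reduced to a linear chain of (operation, argument) pairs, and the two rules shown (dropping a sort that directly repeats the preceding sort, and merging two adjacent filters) are hypothetical examples of removing a redundant operation and combining operations, not rules recited in this document.

```python
def find_redundant_sort(ops):
    """Identify a sort that directly repeats the preceding sort on the same key."""
    for i in range(1, len(ops)):
        if ops[i][0] == "sort" and ops[i] == ops[i - 1]:
            return i
    return None


def remove_operation(ops, i):
    """Remove the redundant operation at position i."""
    return ops[:i] + ops[i + 1:]


def find_adjacent_filters(ops):
    """Identify a filter that directly follows another filter."""
    for i in range(1, len(ops)):
        if ops[i][0] == "filter" and ops[i - 1][0] == "filter":
            return i
    return None


def combine_filters(ops, i):
    """Replace the two filters at positions i-1 and i with a single filter."""
    merged = ("filter", f"({ops[i - 1][1]}) and ({ops[i][1]})")
    return ops[:i - 1] + [merged] + ops[i + 1:]


# Each rule pairs a matcher (which portion?) with a rewrite (what change?).
RULES = [
    (find_redundant_sort, remove_operation),
    (find_adjacent_filters, combine_filters),
]


def optimize(ops):
    """Apply the rules repeatedly until none of them matches."""
    changed = True
    while changed:
        changed = False
        for find, rewrite in RULES:
            position = find(ops)
            if position is not None:
                ops = rewrite(ops, position)
                changed = True
    return ops


initial = [
    ("read", "orders"),
    ("sort", "id"),
    ("sort", "id"),
    ("filter", "amount > 0"),
    ("filter", "region == 'EU'"),
    ("write", "out"),
]
updated = optimize(initial)
```

In this sketch every rewrite shortens the chain by one operation, so the loop is guaranteed to terminate, and the updated chain has fewer nodes than the initial one (four instead of six).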

Drawings

Various aspects and embodiments will be described with reference to the following drawings. It should be understood that the drawings are not necessarily drawn to scale. Items appearing in multiple figures are indicated by the same or similar reference numbers in all of the figures in which they appear.

FIG. 1A is a block diagram of an illustrative computing environment in which some embodiments of the technology described herein may operate.

FIG. 1B is a flow diagram of an illustrative process for automatically generating a dataflow graph from an input Structured Query Language (SQL) query in accordance with some embodiments of the technology described herein.

FIG. 2 is a flow diagram of an illustrative process 200 for automatically generating a dataflow graph from an input SQL query in accordance with some embodiments of the technology described herein.

FIG. 3A illustrates the application of optimization rules to an illustrative dataflow graph to remove one or more redundant data processing operations in accordance with some embodiments of the technology described herein.

FIG. 3B illustrates changing the order of commuting data processing operations to facilitate applying optimization rules to another illustrative dataflow graph in accordance with some embodiments of the technology described herein.

FIG. 3C illustrates the application of optimization rules to another illustrative dataflow graph to remove one or more redundant data processing operations in accordance with some embodiments of the technology described herein.

FIG. 3D illustrates the application of optimization rules to yet another illustrative dataflow graph to remove one or more redundant data processing operations in accordance with some embodiments of the technology described herein.

FIG. 4A illustrates the application of optimization rules to an illustrative dataflow graph for strength reduction optimization in accordance with some embodiments of the technology described herein.

FIG. 4B illustrates the application of optimization rules to an illustrative dataflow graph for another strength reduction optimization in accordance with some embodiments of the technology described herein.

FIG. 5A illustrates the application of optimization rules to an illustrative dataflow graph for combining operations optimization in accordance with some embodiments of the technology described herein.

FIG. 5B illustrates the application of optimization rules to another illustrative dataflow graph for combining operations optimization in accordance with some embodiments of the technology described herein.

FIG. 5C illustrates the application of optimization rules to yet another illustrative dataflow graph for combining operations optimization in accordance with some embodiments of the technology described herein.

FIG. 5D illustrates the application of optimization rules to yet another illustrative dataflow graph for combining operations optimization in accordance with some embodiments of the technology described herein.

FIG. 6 illustrates the application of optimization rules to an illustrative dataflow graph to remove unreferenced data processing operations in accordance with some embodiments of the technology described herein.

FIG. 7 illustrates the application of optimization rules to an illustrative dataflow graph for width reduction optimization in accordance with some embodiments of the technology described herein.

FIG. 8A illustrates the application of optimization rules to an illustrative dataflow graph for deduplication optimization in accordance with some embodiments of the technology described herein.

FIG. 8B illustrates the application of optimization rules to an illustrative dataflow graph to perform deduplication optimization by zipper-wise combining in accordance with some embodiments of the technology described herein.

FIG. 9 illustrates the application of optimization rules to an illustrative dataflow graph for serial-to-parallel optimization in accordance with some embodiments of the technology described herein.

FIG. 10A illustrates an initial dataflow graph in accordance with some embodiments of the technology described herein.

FIG. 10B illustrates an updated dataflow graph obtained by iteratively applying optimization rules to the initial dataflow graph shown in FIG. 10A in accordance with some embodiments of the technology described herein.

FIG. 10C illustrates another view of the initial dataflow graph of FIG. 10A in accordance with some embodiments of the technology described herein.

FIG. 10D illustrates another view of the updated dataflow graph of FIG. 10B in accordance with some embodiments of the technology described herein.

FIG. 11 is a block diagram of an illustrative computing system environment that may be used to implement some embodiments of the described technology.

Detailed Description

Aspects of the technology described herein relate to improving the speed, throughput, and accuracy of a data processing system by improving conventional techniques for performing data processing operations using dataflow graphs.

Some data processing systems use dataflow graphs to process data. In many cases, the dataflow graph is automatically generated rather than manually specified. For example, some data processing systems may automatically generate dataflow graphs from Structured Query Language (SQL) queries. In this case, a user or computer program can provide an input SQL query to the data processing system, and the data processing system can execute the SQL query by generating a dataflow graph from the SQL query and executing the generated dataflow graph. As another example, a data processing system may receive a representation of an input query (which is not an SQL query) from a user or computer program, and may automatically generate a dataflow graph from the representation of the query. As yet another example, a data processing system may receive an input dataflow graph from another data processing system. The input dataflow graph may not be suitable for execution within the data processing system (even though the input dataflow graph may be suitable for execution within the other data processing system). Thus, the data processing system generates a new dataflow graph from the input dataflow graph that is suitable for execution on the data processing system.

The inventors have recognized that improvements can be made over conventional techniques for automatically generating dataflow graphs (e.g., from SQL queries, other query representations, or input dataflow graphs as discussed above). While an automatically generated dataflow graph may execute on a data processing system, executing a dataflow graph produced by conventional techniques may require significant computing resources (e.g., processor resources, memory resources, network resources, etc.) and significant time. For example, an automatically generated dataflow graph: (1) may include nodes representing redundant data processing operations; (2) may require performing data processing operations whose results are subsequently unused; (3) may unnecessarily require serial processing where parallel processing is possible; (4) may apply a data processing operation to more data than is needed to obtain the desired result; (5) may break a computation out over multiple nodes, which may significantly increase the computational cost of performing the computation when the data processing for each dataflow graph node is performed by a dedicated thread in a computer program, a dedicated computer program (e.g., a process in an operating system), and/or a dedicated computing device; (6) may require performing a stronger type of data processing operation that requires more computation (e.g., a sort operation, a rollup operation, etc.) when a weaker type of data processing operation that requires less computation (e.g., a sort-within-groups operation, a rollup-within-groups operation, etc.) would suffice; and/or (7) may require duplicated processing.

The inventors have further recognized that even though some conventional optimization techniques are used as part of the dataflow graph generation process, the execution of dataflow graphs produced by conventional automated techniques for generating dataflow graphs may require significant computing resources and significant time. For example, a dataflow graph may be generated from an SQL query by generating a query plan from the SQL query and generating a dataflow graph from the generated query plan. However, even if the generation of the query plan involves some optimization, the resulting dataflow graph (generated from the query plan) may require a significant amount of computational resources. Indeed, conventional techniques for generating dataflow graphs from query plans often result in inefficiencies and may not be sophisticated enough to produce dataflow graphs that can be executed in a computationally efficient manner.

The inventors have recognized that the performance of a data processing system will be improved if the automatically generated dataflow graph is further processed and optimized to reduce the amount of computing resources used to execute the generated dataflow graph. To this end, the inventors developed some of the dataflow graph optimization techniques described in this application. The dataflow graph optimization techniques described herein improve the performance (e.g., throughput, speed, accuracy, etc.) of a data processing system by reducing the amount of computing resources (e.g., processor resources, memory resources, network resources, etc.) used to execute dataflow graphs generated at least in part by using the dataflow graph optimization techniques.

Another benefit of the dataflow graph optimization techniques described herein is that the dataflow graph optimizer that exists as part of the data processing system allows developers of other data processing system components and/or users of the data processing system to rely on the dataflow graph optimizer rather than attempting to perform ad hoc optimizations as part of their own work. The dataflow graph optimizer can not only reduce the work done by these developers and/or users, but also prevent them from inadvertently introducing errors into the data processing system, which of course improves the data processing system by reducing the number of errors.

It should be understood that the dataflow graph optimization techniques described herein may, but need not, result in a dataflow graph that is "optimal" in some sense. Instead, these optimization techniques generally attempt to improve the performance of the data processing system in executing a dataflow graph by altering the dataflow graph before executing the dataflow graph to increase the computational efficiency of executing the dataflow graph.

Some embodiments described herein address all of the above-described problems that the inventors have recognized with conventional techniques for automatically generating dataflow graphs. However, not every embodiment described herein addresses each of these problems, and some embodiments may not address any of them. As such, it should be understood that embodiments of the technology described herein are not limited to addressing all or any of the above-discussed problems of conventional techniques for automatically generating dataflow graphs. For example, some embodiments of the technology described herein may be applied to optimize manually specified dataflow graphs, as such dataflow graphs may also contain inefficiencies and require more computing resources than necessary.

Accordingly, some embodiments provide novel techniques for automatically generating dataflow graphs from SQL queries and/or other inputs. Examples of such other inputs are provided herein. In some embodiments, the data processing system may: (1) obtain a Structured Query Language (SQL) query; (2) generate a query plan for the SQL query; (3) generate an initial dataflow graph using the query plan; and (4) generate an updated dataflow graph by updating the initial dataflow graph using at least one dataflow graph optimization rule.

In some embodiments, the updated dataflow graph may be saved (e.g., in non-volatile memory) for later use. Additionally or alternatively, in some embodiments, the updated dataflow graph may be executed by a data processing system. Prior to execution, the data processing system may assign a processing layout to each of one or more nodes of the updated dataflow graph.

In some embodiments, the initial dataflow graph may include a first plurality of nodes that represent a respective first plurality of data processing operations to be performed when the data processing system executes the initial dataflow graph. The updated dataflow graph may include a second plurality of nodes that represent a respective second plurality of data processing operations to be performed when the data processing system executes the updated dataflow graph. In some embodiments, the second plurality of nodes has fewer nodes than the first plurality of nodes. In such embodiments, the updated dataflow graph has fewer nodes than the initial dataflow graph. This reduction in the number of nodes may reduce the amount of computing resources needed to execute the updated dataflow graph relative to the computing resources that the data processing system would use to execute the initial dataflow graph.

In some embodiments, the data processing system may generate an updated dataflow graph from the initial dataflow graph by applying one or more dataflow graph optimization rules (examples of which are provided herein) to one or more portions of the initial dataflow graph. The optimization rules may be applied iteratively. For example, in some embodiments, the data processing system may update the initial dataflow graph by: (1) selecting a first optimization rule; (2) identifying a first portion of the initial dataflow graph to which the first optimization rule is to be applied; and (3) applying the first optimization rule to the first portion of the initial dataflow graph. Subsequently, the data processing system may continue to update the initial dataflow graph by: (1) selecting a second optimization rule different from the first optimization rule; (2) identifying a second portion of the initial dataflow graph to which the second optimization rule is to be applied; and (3) applying the second optimization rule to the second portion of the initial dataflow graph.
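By way of illustration only, the iterative application of optimization rules described above might be sketched as follows. This is a minimal sketch and not the described implementation: the list-of-pairs graph representation, the two rule functions, and the name `optimize` are all hypothetical, and a real dataflow graph is a general directed graph rather than a linear sequence.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical representation: a linear dataflow graph as an ordered list of
# (operation, argument) pairs. Real dataflow graphs are general directed graphs.
Op = Tuple[str, str]
Graph = List[Op]
Rule = Callable[[Graph], Optional[Graph]]


def drop_first_of_adjacent_sorts(graph: Graph) -> Optional[Graph]:
    """If two sorts are adjacent, the second overrides the first; drop the first."""
    for i in range(len(graph) - 1):
        if graph[i][0] == "sort" and graph[i + 1][0] == "sort":
            return graph[:i] + graph[i + 1:]
    return None  # the rule does not apply to any portion of the graph


def merge_adjacent_filters(graph: Graph) -> Optional[Graph]:
    """Replace two adjacent filters with a single filter testing both predicates."""
    for i in range(len(graph) - 1):
        if graph[i][0] == "filter" and graph[i + 1][0] == "filter":
            merged = ("filter", f"({graph[i][1]}) and ({graph[i + 1][1]})")
            return graph[:i] + [merged] + graph[i + 2:]
    return None


def optimize(graph: Graph, rules: List[Rule]) -> Graph:
    """Select each rule in turn and apply it, repeating until no rule applies."""
    changed = True
    while changed:
        changed = False
        for rule in rules:
            updated = rule(graph)
            if updated is not None:
                graph = updated
                changed = True
    return graph


initial = [("read", "t"), ("sort", "a"), ("sort", "b"),
           ("filter", "x > 0"), ("filter", "y < 5"), ("write", "out")]
updated = optimize(initial, [drop_first_of_adjacent_sorts, merge_adjacent_filters])
```

In this sketch the updated graph has four nodes where the initial graph had six. Each rule returns a new list rather than modifying its input, so the initial graph remains available for comparison.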

There are various ways in which optimization rules may be applied to the initial dataflow graph being updated. For example, in some embodiments, for each particular optimization rule, the data processing system may identify a portion of the dataflow graph to which the particular optimization rule applies and apply the optimization rule to the identified portion. As another example, in some embodiments, for each particular portion of the dataflow graph, the data processing system may identify the optimization rules that can be applied to that particular portion and apply the identified optimization rules to it. In such embodiments, the initial dataflow graph may be topologically sorted, and the topologically sorted graph may be traversed (e.g., from left to right) to identify the particular portions to which optimization rules may be applied.
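As a hedged illustration of the second approach, the sketch below topologically sorts a small graph using Python's standard `graphlib` module (Python 3.9 or later) and, for each node visited, records which rules apply there. The node names, the `ops` and `upstream` dictionaries, and the rule name are hypothetical and serve only to make the traversal concrete.

```python
from graphlib import TopologicalSorter

# Hypothetical graph: node -> operation, and node -> set of upstream nodes.
ops = {"in": "read", "s1": "sort", "s2": "sort", "out": "write"}
upstream = {"in": set(), "s1": {"in"}, "s2": {"s1"}, "out": {"s2"}}

# Topologically sort so that every node is visited after the nodes feeding it.
order = list(TopologicalSorter(upstream).static_order())


def applicable_rules(node):
    """Return the names of the (hypothetical) rules that apply at this node."""
    rules = []
    if ops[node] == "sort" and any(ops[p] == "sort" for p in upstream[node]):
        rules.append("remove_redundant_upstream_sort")
    return rules


# Traverse the sorted graph and keep only the portions where some rule applies.
matches = {}
for node in order:
    found = applicable_rules(node)
    if found:
        matches[node] = found
```

Here the only match is at node `s2`, whose upstream neighbor `s1` is also a sort.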

In some embodiments, the data processing system may employ a dataflow subgraph pattern matching language to identify one or more portions of the initial dataflow graph to which one or more optimization rules are to be applied. The dataflow subgraph pattern matching language may include one or more expressions that represent respective patterns to be identified in the dataflow graph. Examples of such expressions are provided herein.

In some embodiments, to identify the portion of the initial dataflow graph to which a particular optimization rule is to be applied, the data processing system may determine whether one or more nodes commute with one or more other nodes. In other words, the data processing system may determine whether the order in which one or more nodes appear in the dataflow graph can be changed without changing the processing results. This is valuable because, if nodes commute, an optimization rule that would not otherwise apply to a portion of the graph may become applicable once the order of at least some of the commuting nodes is changed.

For example, an optimization rule may involve identifying two adjacent nodes in the initial dataflow graph that represent respective sort operations, where the second sort operation nullifies the effect of the first, so that the first sort operation should be removed (see, e.g., the examples shown in FIGS. 3B and 3C). By definition, such an optimization rule does not apply to a dataflow graph that has no adjacent nodes representing sort operations. However, if a first node representing a first sort operation commutes with one or more other nodes, the order of the first node and at least one of those other nodes may be changed so that the first node is placed adjacent to a second node representing a second sort operation. As a result of commuting nodes in this manner, the optimization rule that removes the redundant first sort operation can be applied to the dataflow graph. Thus, in some embodiments, identifying a first portion of an initial dataflow graph may include identifying a first node that represents a first data processing operation that commutes with a second data processing operation represented by a second node connected to the first node.
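The following sketch illustrates, under stated assumptions, how commuting a node can make the redundant-sort rule applicable. It assumes that a record-wise filter commutes with a sort and that the relative order of records with equal sort keys is unspecified, so that a later sort fully overrides an earlier one. The function names and the `COMMUTES_WITH_SORT` set are hypothetical.

```python
# Assumption: a record-wise filter commutes with a sort, since filtering then
# sorting yields the same records in the same order as sorting then filtering.
COMMUTES_WITH_SORT = {"filter"}


def push_sorts_downstream(pipeline):
    """Move each sort past downstream operations that it commutes with."""
    p = list(pipeline)
    moved = True
    while moved:
        moved = False
        for i in range(len(p) - 1):
            if p[i][0] == "sort" and p[i + 1][0] in COMMUTES_WITH_SORT:
                p[i], p[i + 1] = p[i + 1], p[i]
                moved = True
    return p


def drop_overridden_sorts(pipeline):
    """Drop any sort immediately followed by another sort.

    Assumption: tie order is unspecified, so the later sort nullifies the earlier.
    """
    return [op for i, op in enumerate(pipeline)
            if not (op[0] == "sort"
                    and i + 1 < len(pipeline)
                    and pipeline[i + 1][0] == "sort")]


pipeline = [("sort", "a"), ("filter", "x > 0"), ("sort", "b")]

# Applied directly, the rule finds no adjacent sorts and changes nothing.
unchanged = drop_overridden_sorts(pipeline)

# After commuting the first sort with the filter, the rule applies.
optimized = drop_overridden_sorts(push_sorts_downstream(pipeline))
```

If a stable sort were required, the second sort would not nullify the first for records with equal keys, and the rule as sketched would not be safe.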

In some embodiments, any one or more of a variety of types of optimization rules may be applied when generating an updated dataflow graph from an initial dataflow graph. By way of example and not limitation, applying optimization rules to an initial dataflow graph may involve removing one or more redundant data processing operations, removing one or more unreferenced data processing operations, performing one or more strength reduction optimizations, performing one or more combining operations optimizations, performing one or more width reduction optimizations, and/or performing one or more deduplication optimizations.

In some embodiments, the optimization rules may be embodied in program code that, when executed, causes a corresponding optimization of the dataflow graph. For example, optimization rules for removing redundancy may be embodied in program code that: the program code, when executed, causes removal (from a dataflow graph to which the rule is to be applied) of at least one node representing a data processing operation determined to be redundant. The program code may be written in any programming language, as aspects of the techniques described herein are not limited in this respect.

As another example, optimization rules for removing one or more unreferenced data processing operations may be embodied in program code that, when executed, causes removal (from a dataflow graph to which the rule is to be applied) of at least one node representing a data processing operation whose result is unreferenced and/or unused (e.g., a sort data processing operation that is unreferenced because the order produced by the sort is not needed or relied upon in subsequent processing).
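A minimal sketch of such a rule for a linear pipeline follows. The operation names and the classification of operations as order-sensitive or order-destroying are assumptions made for illustration, as is the assumption that the data sink does not require ordered records.

```python
# Hypothetical classification of operations (assumptions for illustration only).
ORDER_SENSITIVE = {"sort_within_groups", "dedup_sorted", "merge_join"}
ORDER_DESTROYING = {"sort", "hash_rollup", "partition_by_key"}


def order_is_used(downstream):
    """Return True if some downstream operation relies on the incoming order."""
    for op, _ in downstream:
        if op in ORDER_SENSITIVE:
            return True
        if op in ORDER_DESTROYING:
            return False  # the order is discarded before anything relies on it
    return False  # assumption: the sink does not require ordered records


def remove_unreferenced_sorts(pipeline):
    """Drop every sort whose resulting order is never relied upon downstream."""
    kept = []
    for i, (op, arg) in enumerate(pipeline):
        if op == "sort" and not order_is_used(pipeline[i + 1:]):
            continue
        kept.append((op, arg))
    return kept


unused = [("read", "t"), ("sort", "id"), ("hash_rollup", "sum(amount)"), ("write", "out")]
used = [("read", "t"), ("sort", "id"), ("dedup_sorted", "id"), ("write", "out")]

pruned = remove_unreferenced_sorts(unused)
preserved = remove_unreferenced_sorts(used)
```

In the first pipeline the sort is removed because the rollup discards the order; in the second it is kept because the downstream deduplication relies on it.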

As yet another example, an optimization rule for performing a strength reduction optimization may be embodied in program code that, when executed, causes a first node representing a first data processing operation (e.g., a node representing a sort data processing operation) to be replaced (in a dataflow graph to which the rule is to be applied) by a second node representing a second data processing operation of a weaker type than the first data processing operation (e.g., a node representing a sort-within-groups data processing operation).
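As a hedged sketch of strength reduction, the code below replaces a sort with a sort-within-groups when the incoming data is already known to be sorted on a proper prefix of the requested key. The tuple-based key representation, the `ORDER_PRESERVING` set, and the function name are hypothetical.

```python
# Assumption: only these operations preserve the order of their input records.
ORDER_PRESERVING = {"filter"}


def reduce_sort_strength(pipeline):
    """Replace a sort by a weaker sort-within-groups where prior order allows."""
    out = []
    known_order = ()  # keys the records are known to be sorted by at this point
    for op, arg in pipeline:
        if op == "sort":
            n = len(known_order)
            if 0 < n < len(arg) and arg[:n] == known_order:
                # Records are already grouped by the key prefix; only the
                # remaining keys need sorting within each group.
                out.append(("sort_within_groups", arg))
            else:
                out.append((op, arg))
            known_order = arg
        else:
            out.append((op, arg))
            if op not in ORDER_PRESERVING:
                known_order = ()  # order can no longer be relied upon
    return out


pipeline = [("sort", ("a",)), ("filter", ("x > 0",)), ("sort", ("a", "b"))]
reduced = reduce_sort_strength(pipeline)
```

A second sort on exactly the same key is left unchanged by this sketch; that case falls under redundancy removal rather than strength reduction.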

As yet another example, optimization rules for performing a combining operations optimization may be embodied in program code that, when executed, causes a plurality of nodes representing a plurality of operations (in a dataflow graph to which the rule is to be applied) to be replaced with a single node representing a combination of the plurality of operations.

As yet another example, optimization rules for performing the width reduction optimization may be embodied in program code that: the program code, when executed, causes some data (e.g., one or more columns of data) to be deleted at a particular portion of the graph before subsequent operations are performed, because the data (i.e., the deleted data) is not used in subsequent operations and need not be propagated as part of the processing. As yet another example, a node in a dataflow graph may be configured to perform several calculations, and the results of some of these calculations may be unused. Thus, in some embodiments, optimization rules for performing the width reduction optimization may be embodied in program code that: the program code, when executed, causes the particular node to be replaced by another node configured to perform only those calculations whose results are used; unnecessary calculations are no longer performed.
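The first form of width reduction described above can be sketched as follows, assuming that each downstream operation declares the columns it references. The column names, the `downstream` list, and the `project` operation are hypothetical.

```python
# Hypothetical input: the columns a data source provides and, for each
# downstream operation, the columns that operation references.
source_columns = ["id", "name", "amount", "notes"]
downstream = [("filter", ["amount"]), ("sort", ["id"]), ("write", ["id", "amount"])]


def columns_to_keep(source_columns, downstream):
    """Return the source columns referenced by at least one downstream operation."""
    used = set()
    for _, cols in downstream:
        used.update(cols)
    return [c for c in source_columns if c in used]


keep = columns_to_keep(source_columns, downstream)

# Insert a projection ahead of the downstream operations so that unused
# columns ("name" and "notes" here) are not propagated through the graph.
optimized = [("project", keep)] + downstream
```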

As yet another example, optimization rules for performing deduplication optimization may be embodied in program code that: the program code, when executed, causes merging of different branches of the dataflow graph to which the rule is to be applied.
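One possible way to merge duplicate branches, sketched under assumptions, is to treat two nodes as duplicates when they have the same operation, the same parameters, and the same inputs, and to repeat the merge until no duplicates remain so that matching branches are merged node by node from their sources downstream. The dictionary-based graph encoding and the function name are hypothetical.

```python
def deduplicate(nodes):
    """Merge nodes that have identical (operation, parameters, inputs)."""
    nodes = dict(nodes)
    while True:
        seen = {}   # signature -> surviving node id
        alias = {}  # duplicate node id -> surviving node id
        for nid in sorted(nodes):
            signature = nodes[nid]
            if signature in seen:
                alias[nid] = seen[signature]
            else:
                seen[signature] = nid
        if not alias:
            return nodes
        # Remove duplicates and rewire their consumers to the surviving nodes;
        # this may expose new duplicates one step further downstream.
        nodes = {
            nid: (op, params, tuple(alias.get(i, i) for i in inputs))
            for nid, (op, params, inputs) in nodes.items()
            if nid not in alias
        }


# Hypothetical graph: node id -> (operation, parameters, input node ids).
graph = {
    "r1": ("read", "t", ()),
    "r2": ("read", "t", ()),
    "f1": ("filter", "x > 0", ("r1",)),
    "f2": ("filter", "x > 0", ("r2",)),
    "a": ("rollup", "sum", ("f1",)),
    "b": ("sort", "id", ("f2",)),
}
merged = deduplicate(graph)
```

In this example the two read nodes are merged on the first pass and the two filter nodes on the second, leaving a single shared branch that feeds both the rollup and the sort.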

It should be appreciated that the techniques introduced above and discussed in more detail below may be implemented in any of a variety of ways, as the techniques are not limited to any particular implementation. Examples of implementation details are provided herein for illustrative purposes only. Further, the techniques disclosed herein may be used alone or in any suitable combination, as aspects of the techniques described herein are not limited to use with any particular technique or combination of techniques.

FIG. 1A is a diagram of an illustrative data processing system 100 in accordance with some embodiments of the technology described herein. As illustrated in fig. 1A, the data processing system 100 includes a query input module 104, a query plan generator 106, a dataflow graph generator 108, a graph optimizer 110, a layout assignment module 112, and a graph execution engine 115.

Data processing system 100 is configured to access (e.g., read data from and/or write data to) data storage devices 102-1, 102-2, …, and 102-n. Each of the data storage devices 102-1, 102-2, …, and 102-n may store one or more data sets. A data storage device may store any suitable type of data in any suitable manner. A data storage device may store data as a flat text file, as a spreadsheet, using a database system (e.g., a relational database system), or in any other suitable manner. In some instances, a data storage device may store transaction data. For example, a data storage device may store credit card transactions, phone log data, or bank transaction data. It should be appreciated that data processing system 100 may be configured to access any suitable number of data storage devices of any suitable type, as aspects of the technology described herein are not limited in this respect. A data storage device from which data processing system 100 may be configured to read data may be referred to as a data source. A data storage device to which data processing system 100 may be configured to write data may be referred to as a data sink.

In some embodiments, the data storage devices 102-1, 102-2, …, 102-n may be of the same type (e.g., all may be relational databases) or of different types (e.g., one may be a relational database and another may be a data storage device that stores data in flat files). A data storage device may be an SQL server data store, an ORACLE data store, a TERADATA data store, a flat file data store, a multi-file data store, a HADOOP data store, a DB2 data store, a Microsoft SQL Server data store, an INFORMIX data store, an SAP data store, a MongoDB data store, a metadata data store, and/or any other suitable type of data store, as aspects of the technology described herein are not limited in this respect.

In some embodiments, the query input module 104 may be configured to receive an input SQL query. In some embodiments, the query input module 104 may be configured to receive an input SQL query from a user. For example, the query input module 104 may be configured to generate a graphical user interface through which a user may input an SQL query. As another example, the query input module 104 may be configured to receive information provided by a user through a graphical user interface (which is not necessarily generated by the query input module 104 itself). In some embodiments, the query input module 104 may be configured to receive an input SQL query from another computer program. For example, the query input module 104 may expose one or more application programming interfaces (APIs) (e.g., an Open Database Connectivity (ODBC) API and/or a Java Database Connectivity (JDBC) API) through which the input SQL query may be provided, may access the SQL query in response to a notification indicating that the SQL query is to be accessed, or may receive the input SQL query from another computer program in any other suitable manner.

The SQL queries received by the query input module 104 may involve reading data from and/or writing data to a single data storage device. Alternatively, the SQL query received by the query input module 104 may involve reading data from and/or writing data to a plurality of data storage devices. When the data stores are of different types, the SQL query may be referred to as a federated SQL query. In some embodiments, an SQL query may involve reading data from and/or writing data to a federated database.

In some embodiments, the query plan generator 106 is configured to generate a query plan from the SQL query received by the query input module 104. The generated query plan may identify one or more data processing operations to be performed if the SQL query is executed. The generated query plan may further specify an order in which the identified data processing operations are to be performed. As such, the generated query plan may represent a sequence of data processing operations to be performed in order to execute the SQL query received by the query input module 104. The query plan generator 106 may be configured to generate a query plan in any suitable manner. For example, in some embodiments, the query plan generator 106 may implement any of the techniques for generating query plans described in U.S. Patent No. 9,116,955 entitled "Managing Data Queries," which is incorporated by reference herein in its entirety.

In some embodiments, the dataflow graph generator 108 is configured to generate an initial dataflow graph from a query plan generated by the query plan generator 106. The dataflow graph generator 108 may be configured to generate the initial dataflow graph from the query plan in any suitable manner. For example, in some embodiments, the dataflow graph generator 108 may implement any of the techniques for generating a dataflow graph from a query plan described in U.S. Patent No. 9,116,955 entitled "Managing Data Queries," which is incorporated by reference herein in its entirety.

In some embodiments, a dataflow graph may include components (referred to as "nodes" or "vertices") that represent data processing operations to be performed on input data, and links between the components that represent flows of data. The nodes of the dataflow graph may include: one or more input nodes representing respective input data sets; one or more output nodes representing respective output data sets; and one or more nodes representing data processing operations to be performed on the data. In some embodiments, the input nodes may represent a federated database or any other type of database. Similarly, in some embodiments, the output nodes may represent a federated database or any other type of database.

In some embodiments, different respective computer system processes may be used to perform different data processing operations represented by different nodes in the dataflow graph. For example, a dataflow graph may include a first node representing a first data processing operation (e.g., a "sort" operation) and a second node representing a second data processing operation (e.g., a "join" operation) that is different from the first data processing operation, and in some embodiments, the first data processing operation may be performed using a first computer system process and the second data processing operation may be performed using a second computer system process that is different from the first computer system process. In some embodiments, the first computer system process and the second computer system process may execute on the same computing device and may, for example, be managed by the same operating system. In other embodiments, the first computer system process and the second computer system process may execute on different computing devices.

In some embodiments, a computer system process for performing a data processing operation represented by a node in a dataflow graph may be an instance of a computer program configured to execute processor-executable instructions encoding the data processing operation. The computer system process may be a single-threaded process or a multi-threaded process. A computer system process may be associated with one or more computer system resources including, by way of example and not limitation: processor-executable instructions encoding the data processing operation, memory (e.g., areas of physical memory and/or virtual memory that hold executable code, process-specific input data and/or output data, a call stack, a heap, and/or other data), a process identifier (e.g., an identifier used by an operating system to identify the computer system process), security attributes (e.g., indicating one or more owners of the process and/or the operations that the computer system process is permitted to perform), and/or information specifying a state of the computer system process.

In some embodiments, an initial dataflow graph may be generated from a query plan at least in part by generating the initial dataflow graph to include a node for each of at least a subset (e.g., some or all) of the data processing operations identified in the query plan. The order of data processing operations specified in the query plan may then be used to generate links connecting the nodes in the initial dataflow graph. For example, when the generated query plan indicates that a first data processing operation is to be performed before a second data processing operation, the generated initial dataflow graph may have a first node (representing the first data processing operation), a second node (representing the second data processing operation), and one or more links that specify a path from the first node to the second node.

In some embodiments, generating the initial dataflow graph from the query plan includes adding to the graph one or more nodes representing input data sources and/or output data sinks. For example, generating the initial dataflow graph may include adding an input node for each data source from which data records are to be read during execution of the SQL query. Each input node may be configured with parameter values associated with the respective data source. These values may indicate how to access the data records in the data source. As another example, generating the initial dataflow graph may include adding an output node for each data sink to which data records are to be written during execution of the SQL query. Each output node may be configured with parameter values associated with the respective data sink. These values may indicate how the data records are to be written to the data sink. In some embodiments, the initial dataflow graph may be executable by a graph execution engine. In other embodiments, the initial dataflow graph is not executable by the graph execution engine.
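As a simplified illustration, the construction of an initial dataflow graph from a query plan, including input and output nodes, might be sketched as follows. The dictionary-based plan format, the node naming scheme, and the function name are hypothetical, and the sketch assumes a plan whose steps form a single sequence fed by all data sources.

```python
def build_initial_graph(plan):
    """Build (nodes, links) from a hypothetical, strictly sequential query plan."""
    nodes = {}
    links = []

    # One input node per data source, configured with that source's parameters.
    for i, source in enumerate(plan["sources"]):
        nodes[f"in{i}"] = {"op": "input", "params": source}
    previous = list(nodes)  # assumption: every input feeds the first step

    # One node per data processing operation, linked in plan order.
    for i, step in enumerate(plan["steps"]):
        name = f"op{i}"
        nodes[name] = {"op": step, "params": {}}
        links += [(p, name) for p in previous]
        previous = [name]

    # One output node for the data sink, configured with the sink's parameters.
    nodes["out"] = {"op": "output", "params": plan["sink"]}
    links += [(p, "out") for p in previous]
    return nodes, links


plan = {
    "sources": [{"table": "orders"}, {"table": "customers"}],
    "steps": ["join", "filter", "sort"],
    "sink": {"path": "result.dat"},
}
nodes, links = build_initial_graph(plan)
```

Because the plan specifies that the join precedes the filter, the resulting links contain a path from the join node `op0` to the filter node `op1`.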

In some embodiments, the graph optimizer 110 is configured to generate an updated dataflow graph by updating the initial dataflow graph generated by the dataflow graph generator 108 using one or more dataflow graph optimization rules. The graph optimizer 110 may be configured to apply any one or more of the various types of optimization rules described herein to the initial dataflow graph. For example, the graph optimizer 110 may be configured to update the initial dataflow graph by: removing one or more redundant data processing operations, removing one or more unreferenced data processing operations, performing one or more strength reduction optimizations, performing one or more combining operations optimizations, performing one or more width reduction optimizations, and/or performing one or more deduplication optimizations. The graph optimizer 110 may be configured to operate in any suitable manner and may, for example, be configured to operate according to the illustrative process 200 described with reference to FIG. 2, or one or more variations thereof.

In some embodiments, the layout assignment module 112 may determine a processing layout for each of one or more data processing operations represented by respective nodes in the updated dataflow graph generated by the graph optimizer 110. The processing layout for a data processing operation may specify how many computing devices are to be used to perform the data processing operation and may identify the particular computing devices to be used. Thus, in some embodiments, the layout assignment module 112 may determine, for each of one or more nodes in the updated dataflow graph, whether a single device (e.g., a single processor, a single virtual machine, etc.) or multiple devices (e.g., multiple processors, multiple virtual machines, etc.) are to be used to perform the data processing operation and which devices are to be used. In some embodiments, the layout assignment module 112 may assign different degrees of parallelism to different nodes in the updated dataflow graph. As such, it should be appreciated that different processing layouts may be assigned to different data processing operations to be performed during execution of the updated dataflow graph generated by the graph optimizer 110.
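For illustration only, a processing layout and a simple assignment policy might be sketched as below. The `Layout` class, the device names, and the policy of running certain operations on every available device are assumptions; the text above does not prescribe how layouts are chosen.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Layout:
    """Hypothetical processing layout: the devices that perform an operation."""
    devices: Tuple[str, ...]

    @property
    def parallelism(self) -> int:
        # The degree of parallelism is the number of devices used.
        return len(self.devices)


# Assumption for this sketch: these operations run on every available device,
# while all other operations run on a single device.
PARALLEL_OPS = {"sort", "join", "filter"}


def assign_layouts(ops: Dict[str, str], cluster: List[str]) -> Dict[str, Layout]:
    """Assign a layout to each node based on the operation it represents."""
    layouts = {}
    for node, op in ops.items():
        if op in PARALLEL_OPS:
            layouts[node] = Layout(tuple(cluster))
        else:
            layouts[node] = Layout((cluster[0],))
    return layouts


ops = {"in": "read", "s": "sort", "out": "write"}
cluster = ["host-a", "host-b", "host-c"]
layouts = assign_layouts(ops, cluster)
```

In this sketch the sort node is assigned a degree of parallelism of three, while the read and write nodes are each assigned a single device.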

In some embodiments, the updated dataflow graph may include multiple (e.g., two or more) nodes representing different data processing operations, and different processes may be used to perform the data processing operations. For example, one or more computer system processes may be used to perform a data processing operation represented by a first node (e.g., multiple computer system processes may be used when the data processing operation is parallelized), and one or more other computer system processes may be used to perform a data processing operation represented by a second node in the updated dataflow graph that is different from the first node.

In some embodiments, the graph execution engine 115 is configured to execute one or more dataflow graphs. For example, in some embodiments in which the initial dataflow graph is executable, the graph execution engine 115 may be configured to execute any initial dataflow graph generated by the dataflow graph generator 108. As another example, the graph execution engine 115 may be configured to execute any updated dataflow graph generated by the graph optimizer 110. The graph execution engine may include a co-operating system or any other suitable execution environment for executing dataflow graphs. Aspects of environments for developing and executing dataflow graphs are described in U.S. Patent No. 5,966,072 entitled "Executing Computations Expressed as Graphs" and U.S. Patent No. 7,716,630 entitled "Managing Parameters for Graph-Based Computations," each of which is incorporated herein by reference in its entirety.

FIG. 1B is a flow diagram of an illustrative process 120 for automatically generating a dataflow graph from an input Structured Query Language (SQL) query in accordance with some embodiments of the technology described herein. Process 120 may be performed by any suitable data processing system, such as data processing system 100 described with reference to FIG. 1A.

Process 120 begins at act 122, where an SQL query is received. The query input module 104 can be used to receive SQL queries. This may be done in any suitable manner, including in any of the manners described with reference to act 202 of process 200.

Next, process 120 proceeds to act 124, where a query plan is generated from the SQL query received at act 122. Query plan generator 106 may be used to generate a query plan. This may be done in any suitable manner, including in any of the manners described with reference to act 204 of process 200.

Next, process 120 proceeds to act 126, where an initial dataflow graph is generated from the query plan obtained at act 124. An initial dataflow graph may be generated by the dataflow graph generator 108. This may be done in any suitable manner, including in any of the manners described with reference to act 206 of process 200.

Next, process 120 proceeds to act 128, where an updated dataflow graph is generated from the initial dataflow graph by applying one or more optimization rules to the initial dataflow graph. An updated dataflow graph may be generated by the graph optimizer 110. This may be done in any suitable manner, including in any of the manners described with reference to act 207 of process 200.

The updated dataflow graph may be stored for later use or may be executed by a data processing system. Prior to execution, a processing layout may be assigned to one or more data processing operations represented by nodes in the updated dataflow graph. The processing layout may be assigned by the layout assignment module 112.

FIG. 2 is a flow diagram of an illustrative process 200 for automatically generating a dataflow graph from an input SQL query in accordance with some embodiments of the technology described herein. Process 200 may be performed using any suitable data processing system, including, for example, data processing system 100 described with reference to FIG. 1A.

Process 200 begins at act 202, where an SQL query is received. In some embodiments, the SQL query may be received by the data processing system executing process 200 as a result of a user providing the SQL query as input to the data processing system. The user may enter the SQL query through a graphical user interface or any other suitable type of interface. In other embodiments, the SQL query may be provided to the data processing system by another computer program. For example, the SQL query may be provided by a computer program configured to cause a data processing system to execute one or more SQL queries, each of which may have been specified by a user or automatically generated. The SQL query may be of any suitable type and may be provided in any suitable format, as aspects of the techniques described herein are not limited in this respect.

Next, process 200 proceeds to act 204, where a query plan is generated from the SQL query received at act 202. The generated query plan may identify one or more data processing operations to be performed if the SQL query is executed. The generated query plan may further specify an order in which the identified data processing operations are to be performed. As such, the generated query plan may represent a series of data processing operations to be performed in order to execute the SQL query received at act 202. Any suitable type of query plan generator (e.g., query plan generator 106) may be used to generate the query plan. Some illustrative techniques for generating query plans are described in U.S. Pat. No. 9,116,955, entitled "Managing Data Queries," which is incorporated by reference herein in its entirety.

Next, process 200 proceeds to act 206, where an initial dataflow graph is generated from the query plan that was generated at act 204 using the SQL query received at act 202. In some embodiments, the initial dataflow graph may be generated from the query plan at least in part by generating the initial dataflow graph to include a node for each of at least a subset (e.g., some or all) of the data processing operations identified in the query plan. In some embodiments, a single node in the query plan may result in multiple nodes being included in the initial dataflow graph. The order of data processing operations specified in the query plan may then be used to generate links connecting the nodes in the initial dataflow graph. For example, when the generated query plan indicates that a first data processing operation is to be performed before a second data processing operation, the generated initial dataflow graph may have a first node (representing the first data processing operation), a second node (representing the second data processing operation), and one or more links that specify a path from the first node to the second node.

In some embodiments, generating the initial dataflow graph from the query plan includes adding to the graph one or more nodes representing input and/or output data sources. For example, generating the initial dataflow graph may include adding an input node for each data source from which data records are to be read during execution of the SQL query. Each input node may be configured with parameter values associated with the respective data source. These values may indicate how to access the data records in the data source. As another example, generating the initial dataflow graph may include adding an output node for each data sink to which data records are to be written during execution of the SQL query. Each output node may be configured with parameter values associated with the respective data sink. These values may indicate how the data records are to be written to the data sink.
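The construction described above can be sketched in Python as follows. This is a simplified illustration that assumes a linear query plan; the `DataflowGraph` class, the `graph_from_plan` function, and the `path` parameter are hypothetical names chosen for the example rather than parts of the described system.

```python
from dataclasses import dataclass, field


@dataclass
class DataflowGraph:
    nodes: dict = field(default_factory=dict)  # node id -> configuration
    links: list = field(default_factory=list)  # (upstream id, downstream id)


def graph_from_plan(plan_steps, sources, sinks):
    """Build an initial graph from operations listed in plan order.

    sources and sinks map a name to (hypothetical) access parameters.
    """
    graph = DataflowGraph()
    # One input node per data source, configured with its access parameters.
    for name, params in sources.items():
        graph.nodes[f"in:{name}"] = {"kind": "input", **params}
    previous = [f"in:{name}" for name in sources]
    # One node per plan operation; the plan order determines the links.
    for i, op in enumerate(plan_steps):
        node_id = f"op{i}:{op}"
        graph.nodes[node_id] = {"kind": op}
        graph.links.extend((up, node_id) for up in previous)
        previous = [node_id]
    # One output node per data sink.
    for name, params in sinks.items():
        graph.nodes[f"out:{name}"] = {"kind": "output", **params}
        graph.links.extend((up, f"out:{name}") for up in previous)
    return graph


g = graph_from_plan(
    ["filter", "sort"],
    sources={"orders": {"path": "orders.dat"}},
    sinks={"report": {"path": "report.dat"}},
)
```

Here the plan order "filter, then sort" becomes a path from the filter node to the sort node, with the input and output nodes attached at the ends.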

It should be appreciated that the initial dataflow graph generated at act 206 is different from the query plan generated at act 204. A dataflow graph can be executed by a graph execution engine (e.g., graph execution engine 115), whereas a query plan cannot; the query plan is an intermediate representation used to generate the dataflow graph that the graph execution engine executes in order to execute the SQL query. A query plan is not executable and, even in the context of a relational database management system, requires further processing to generate an execution strategy. In contrast, a dataflow graph may be executed by the graph execution engine to perform the SQL query. In addition, even after further processing by a relational database system, the resulting execution strategy does not allow data to be read from and/or written to other types of data sources and/or data sinks, whereas dataflow graphs are not limited in this respect.

In some embodiments, the initial dataflow graph generated at act 206 may contain nodes that represent data processing operations that are not in the query plan generated at act 204. Conversely, in some embodiments, the initial dataflow graph generated at act 206 may not contain nodes for some data processing operations that are in the query plan generated at act 204. Such situations may arise from various optimizations that may be performed during the process of generating a dataflow graph from a query plan. In some embodiments, the initial dataflow graph generated at act 206 may contain nodes that represent data processing operations other than database operations performed on a database computer system (e.g., a relational database management system).

In some embodiments, the query plan and dataflow graph may be implemented in different types of data structures. For example, in some embodiments, a query plan may be implemented in a directed graph in which each node has a single parent node (e.g., a tree such as a binary tree), while a dataflow graph may be implemented in a directed acyclic graph, possibly having at least one node with multiple parents.

Next, process 200 proceeds to act 207, where the initial dataflow graph is updated to obtain an updated dataflow graph. This may be accomplished in any of a variety of ways. For example, in the illustrated embodiment, a dataflow graph optimization rule is selected at act 208. Next, at act 210, the data processing system performing process 200 identifies a portion of the initial dataflow graph to which the selected optimization rule is to be applied. At act 212, the selected optimization rule is applied to the identified portion of the graph. Next, process 200 proceeds to decision block 214, where it is determined whether there is an optimization rule to be applied to at least one other portion of the dataflow graph. If so (e.g., the optimization rule selected at act 208 may be applied to a portion of the graph other than the portion identified at act 210, a different optimization rule may be selected, etc.), process 200 returns to act 208. Otherwise, process 200 proceeds to act 216.

In some embodiments, for each particular optimization rule selected at act 208, the data processing system may identify the portions of the dataflow graph to which the selected optimization rule applies and apply the optimization rule to each identified portion. Once all such portions have been identified, a different optimization rule may be selected. However, a previously applied optimization rule may also be selected again, such that the same optimization rule may be considered for application to the dataflow graph multiple times (which may result in a better dataflow graph than an approach in which an optimization rule is selected greedily and no longer used after being applied once). The optimization rules may be selected in any suitable order, as aspects of the techniques described herein are not limited in this respect. As one example, after a deduplication optimization is performed, any nodes representing redundant operations may be removed, and any empty nodes may be removed. After the empty nodes are removed, a width reduction optimization or the like may be performed.
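The loop of acts 208 through 214, including reconsideration of previously applied rules, can be sketched as a fixed-point iteration. The sketch below makes simplifying assumptions: the graph is a linear chain represented as a Python list of operation names, and each rule is a function that returns a rewritten chain or `None` when it matches nothing. The rule names and the `"sort:<key>"` and `"empty"` labels are illustrative only.

```python
def remove_overridden_sort(ops):
    """Drop a sort that is immediately followed by another sort."""
    for i in range(len(ops) - 1):
        if ops[i].startswith("sort:") and ops[i + 1].startswith("sort:"):
            return ops[:i] + ops[i + 1:]
    return None


def remove_empty(ops):
    """Drop a node that performs no operation."""
    if "empty" in ops:
        i = ops.index("empty")
        return ops[:i] + ops[i + 1:]
    return None


def optimize(ops, rules):
    """Apply every rule repeatedly until a full pass changes nothing."""
    changed = True
    while changed:
        changed = False
        for rule in rules:              # select a rule (act 208)
            while True:
                result = rule(ops)      # find a portion and rewrite it (acts 210, 212)
                if result is None:
                    break
                ops = result
                changed = True          # another pass is needed (block 214)
    return ops


optimized = optimize(["sort:A", "empty", "sort:B"],
                     [remove_overridden_sort, remove_empty])
```

In this run the sort rule matches nothing on the first pass, becomes applicable only after the empty node has been removed, and is then applied on the second pass. A scheme that used each rule once and discarded it would miss that rewrite.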

In some embodiments, the order of acts 208 and 210 may be changed. In such embodiments, the data processing system may first identify a portion of the dataflow graph, and then select an optimization rule that may be applied to the identified portion of the dataflow graph. In such embodiments, the initial dataflow graph may be topologically ordered, and the topologically ordered graph may be traversed (e.g., left to right) to identify the particular portion to which the optimization rule may be applied.
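One way to picture this variant is shown below, using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) to order the nodes before visiting them. The graph, the rule name, and the predicate are hypothetical, and a real optimizer would match on operation types rather than node names.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical graph: each node maps to the set of nodes that feed it.
predecessors = {
    "read": set(),
    "repartition": {"read"},
    "serialize": {"repartition"},
    "write": {"serialize"},
}


def serialize_after_repartition(node, preds):
    """True when a serialize node is fed directly by a repartition node."""
    return node == "serialize" and "repartition" in preds[node]


rules = [("remove redundant repartition", serialize_after_repartition)]


def find_applicable(preds, rules):
    """Visit nodes in topological order and record which rules apply where."""
    matches = []
    for node in TopologicalSorter(preds).static_order():
        for rule_name, applies in rules:
            if applies(node, preds):
                matches.append((node, rule_name))
    return matches


matches = find_applicable(predecessors, rules)
```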

In some embodiments, the data processing system may employ a dataflow graph pattern matching language to identify one or more portions of the initial dataflow graph to which one or more optimization rules are to be applied. The dataflow graph pattern matching language may include one or more expressions that identify particular types of subgraphs in the dataflow graph. In some embodiments, the data processing system performing process 200 may be configured to use expressions in a subgraph pattern matching language to identify portions of a dataflow graph to which one or more optimization rules are to be applied. The particular expression may facilitate identifying one or more portions to which a particular optimization rule or optimization rules are to be applied. In some embodiments, when a dataflow graph optimizer (e.g., graph optimizer 110) is configured with one or more new optimization rules, the graph optimizer may be configured with one or more new expressions written in a subgraph pattern matching language to facilitate identifying portions of a dataflow graph to which the new optimization rule(s) may be applied.

For example, the pattern matching language may include an expression for identifying a series of nodes of at least a threshold length (e.g., at least two, three, four, five, etc.) representing a corresponding series of computations that may be combined and represented by a single node in the graph using a combinatorial operation optimization rule. Identifying such patterns may facilitate application of the combinatorial operation optimization rule, which is described further below, including with reference to FIGS. 5A-5D. One non-limiting example of such an expression is "A → B → C → D", which may help identify a series of four consecutive data processing operations that may be combined.

As another example, the pattern matching language may include expressions for identifying portions of a dataflow graph in which certain types of nodes commute with other nodes. This may facilitate applying a variety of different types of optimization rules to the dataflow graph. When the data processing system determines that the order of one or more nodes in the dataflow graph can be changed without changing the processing result, it can consider changes to the structure of the dataflow graph (within the degrees of freedom afforded by the commuting operations) in order to identify portions to which optimization rules may be applied. As a result of considering such commutation-based changes, one or more optimization rules may become applicable to portions of the graph to which they would not otherwise apply.

For example, an optimization rule may involve identifying two adjacent nodes in the initial dataflow graph that represent respective sort operations, where the second sort operation negates the effect of the first, so that the first sort operation should be removed (see, e.g., the examples shown in FIGS. 3B and 3C). By definition, such an optimization rule cannot be applied to a dataflow graph that has no adjacent nodes representing sort operations. However, if a first node representing a first sort operation commutes with one or more other nodes, the order of the first node and at least one of those other nodes may be changed such that the first node is placed adjacent to a second node representing a second sort operation. As a result of commuting nodes in this manner, the optimization rule that removes the redundant first sort operation may be applied to the dataflow graph.

Thus, in some embodiments, the subgraph matching language may include one or more expressions for identifying subgraphs of the dataflow graph in which the order of nodes may be changed. As one example, the expression "A → (…) → B" (where each of A and B may be any suitable data processing operation, such as a sort, a merge, etc.) may be used to find a portion of the dataflow graph having a node A (i.e., a node representing operation A), a node B (representing operation B), and one or more nodes between node A and node B with which node A commutes (i.e., changing the order of node A and these nodes does not change the processing result). If such a portion is identified, the dataflow graph may be changed by moving node A adjacent to node B to obtain the portion "AB". As a specific example, if the dataflow graph has the nodes "ACDB", and operation A commutes with operations C and D, the dataflow graph may be changed to "CDAB". The data processing system may then consider whether an optimization rule applies to the portion "AB". For example, if operation A is a sort and operation B is a sort, the data processing system may attempt to determine whether the two sorts can be replaced with a single sort, as in the example of FIG. 5B.

As another example, the expression "A → (…) → B →" may be used to find a portion of the dataflow graph having a node A, a second node B, and one or more nodes between these two nodes with which node B commutes. As a specific example, if the dataflow graph has the nodes "ACDB", and operation B commutes with operations C and D, the dataflow graph may be changed to "ABCD". The data processing system may then consider whether an optimization rule applies to the portion "AB".

As another example, the expression "A → (…) → B ″" may be used to find a portion of the dataflow graph having a node A, a node B, and one or more nodes (e.g., C and D) between node A and node B with which node B does not commute. In this case, the system may attempt a "forced" commute in which nodes C and D are pushed to the left of node A, if possible. As a specific example, if the dataflow graph has the nodes "ACEDB", and operation B commutes with operation E but not with operations C and D, the dataflow graph may be changed to "CDABE": B is commuted with E, and C and D are pushed to the left of A.

As yet another example, the expression "A → (…) → B" may be used to find a portion of the dataflow graph having a node A, a node B, and one or more nodes (e.g., C and D) between node A and node B with which node A does not commute. In this case, the system may attempt a "forced" commute in which nodes C and D are pushed to the right of node B, if possible. As a specific example, if the dataflow graph has the nodes "ACEDB", and operation A commutes with operation E but not with operations C and D, the dataflow graph may be changed to "EABCD": A is commuted with E, and C and D are pushed to the right of B.

It should be understood that the examples of expressions of the subgraph matching language described above are illustrative. In some embodiments, one or more other expressions may be part of the subgraph matching language, in addition to or instead of the examples described above.
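The two simplest reorderings discussed above (moving node A next to node B, or node B next to node A, past nodes with which it commutes) can be sketched on a linear chain as follows. The sketch assumes each operation appears once in the chain, and it models commutativity as a lookup in a hand-written table named `COMMUTING`; that table and the function names are assumptions made for illustration, and the "forced" variants are not covered.

```python
# Hypothetical commutativity table: unordered pairs of operations that commute.
COMMUTING = {
    frozenset({"A", "C"}), frozenset({"A", "D"}),
    frozenset({"B", "C"}), frozenset({"B", "D"}),
}


def commutes(x, y):
    return frozenset({x, y}) in COMMUTING


def move_a_next_to_b(chain, a, b):
    """Move `a` to just before `b`, or return None if something blocks it."""
    i, j = chain.index(a), chain.index(b)
    between = chain[i + 1:j]
    if i >= j or not all(commutes(a, x) for x in between):
        return None
    return chain[:i] + between + [a] + chain[j:]


def move_b_next_to_a(chain, a, b):
    """Move `b` to just after `a`, or return None if something blocks it."""
    i, j = chain.index(a), chain.index(b)
    between = chain[i + 1:j]
    if i >= j or not all(commutes(b, x) for x in between):
        return None
    return chain[:i + 1] + [b] + between + chain[j + 1:]


left = move_a_next_to_b(list("ACDB"), "A", "B")     # A moves right: C D A B
right = move_b_next_to_a(list("ACDB"), "A", "B")    # B moves left:  A B C D
blocked = move_a_next_to_b(list("AEB"), "A", "B")   # A does not commute with E
```

After either move, nodes A and B are adjacent, so a rule that requires adjacency (such as removing a redundant sort) can be tested against the pair.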

In some embodiments, at act 207, any one or more of a variety of types of optimization rules may be applied when generating the updated dataflow graph from the initial dataflow graph. For example, applying optimization rules to the initial dataflow graph may involve removing one or more redundant data processing operations, removing one or more unreferenced data processing operations, performing one or more strength reduction optimizations, performing one or more combinatorial operation optimizations, performing one or more width reduction optimizations, and/or performing one or more deduplication optimizations.

In some embodiments, the optimization rules may be embodied in program code that, when executed, causes a corresponding optimization of the dataflow graph. For example, optimization rules for removing redundancy may be embodied in program code that: the program code, when executed, causes removal (from a dataflow graph to which the rule is to be applied) of at least one node representing a data processing operation determined to be redundant. An example of applying optimization rules to a dataflow graph to remove one or more redundant data processing operations is illustrated in fig. 3A-3C, as described in more detail below.

As another example, an optimization rule for performing strength reduction may be embodied in program code that, when executed, causes a first node representing a first data processing operation (e.g., a node representing a sort data processing operation) to be replaced (in the dataflow graph to which the rule is applied) by a second node representing a second data processing operation of a weaker type than the first data processing operation (e.g., a node representing a sort-within-groups data processing operation). An example of applying optimization rules to a dataflow graph for strength reduction optimization is illustrated in FIGS. 4A and 4B, as described in more detail below.

As another example, optimization rules for performing combinatorial operation optimization may be embodied in program code that: the program code, when executed, causes a plurality of nodes representing a plurality of operations (in a dataflow graph to which the rule is to be applied) to be replaced with a single node representing a combination of the plurality of operations. An example of applying optimization rules to a dataflow graph for combinatorial operation optimization is illustrated in fig. 5A-5D, as described in more detail below.

As yet another example, optimization rules for removing one or more unreferenced data processing operations may be embodied in program code that, when executed, causes removal (from the dataflow graph to which the rule is applied) of at least one node representing a data processing operation whose result is unreferenced and/or unused (e.g., a sort data processing operation that is unreferenced because the order produced by the sort is not needed or relied upon in subsequent processing). An example of applying such optimization rules to a dataflow graph is illustrated in FIG. 6, as described in more detail below.

As another example, optimization rules for performing the width reduction optimization may be embodied in program code that: the program code, when executed, causes some data (e.g., one or more columns of data, rows of data, etc.) to be deleted at a particular portion of the graph before subsequent operations are performed, because the data (i.e., the deleted data) is not used in subsequent operations and need not be propagated as part of the processing. An example of applying such optimization rules to a dataflow graph is illustrated in fig. 7, as described in more detail below.
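One way to decide which data can be dropped is a backward pass over the operations that records, at each point, the columns still needed downstream. The sketch below assumes a linear pipeline in which each step is described only by the columns it reads; the step names and column names are hypothetical, and columns produced by a step are not modeled.

```python
def live_columns(steps, output_columns):
    """For each step, the columns that must be present on that step's input.

    steps: list of (name, columns_read) pairs in execution order.
    output_columns: columns required in the final output.
    """
    needed = set(output_columns)
    live = []
    for name, reads in reversed(steps):
        needed = needed | set(reads)   # a new set, so earlier entries are unaffected
        live.append((name, needed))
    live.reverse()
    return live


steps = [("filter", {"status"}), ("rollup", {"region", "amount"})]
live = live_columns(steps, output_columns={"region", "amount"})
```

Any column not in the first set (for example, a free-text comment field) can be dropped before the filter, and `status` can be dropped right after the filter because nothing downstream reads it.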

As another example, optimization rules for performing deduplication optimization may be embodied in program code that: the program code, when executed, causes merging of different branches of the dataflow graph to which the rule is to be applied. An example of applying such optimization rules to a dataflow graph is illustrated in fig. 8A and 8B, as described in more detail below.

As another example, optimization rules for serial-to-parallel optimization may be embodied in program code that: the program code, when executed, causes the serially performed processes to be performed in parallel. An example of applying such optimization rules to a dataflow graph is illustrated in fig. 9, as described in more detail below.

It should be understood that the optimization rules and optimizations described above are illustrative, non-limiting examples. As part of process 200, one or more other optimization rules and/or optimizations may be applied to the initial dataflow graph in lieu of or in addition to the optimization rules and/or optimizations described above.

Next, process 200 proceeds to act 216, where the updated dataflow graph is output. In some embodiments, at act 216, the updated dataflow graph may be saved (e.g., stored in non-volatile memory) for later use.

In addition to or instead of being stored, the updated dataflow graph may be executed. In some embodiments in which the updated dataflow graph is executed, a processing layout is assigned to one or more nodes of the updated dataflow graph at act 218 of process 200. A processing layout for a node representing a data processing operation may specify how many computing devices are to be used to perform the data processing operation, and may identify the particular computing device(s) to be used to perform it. This may be accomplished in any suitable manner, including by using any of the layout assignment techniques described in U.S. patent application Ser. No. 15/939,829, entitled "Systems and Methods for Performing Data Processing Operations Using Variable Level Parallelism," filed on March 29, 2018, which is hereby incorporated by reference in its entirety. In some embodiments, whether a node is to be processed using a single device or multiple computing devices (e.g., whether parallel processing is to be applied and what degree of parallelism is to be employed) may be determined earlier (e.g., during act 207), with the particular computing devices to be used for the computation being assigned at act 218.

After the processing layout is assigned at act 218, the updated dataflow graph may be executed at act 220. For example, when process 200 is performed by data processing system 100, the data processing system may use graph execution engine 115 to execute the updated dataflow graph. In some embodiments, the updated dataflow graph generated at act 207 may be executed immediately upon generation without any user input. In other embodiments, the updated dataflow graph may be generated at act 207, but its execution may begin only in response to a command to do so, which may be provided by a user through an interface or by another computer program (e.g., through an API call).

It should be understood that process 200 is illustrative and that variations exist. For example, in some embodiments, optional acts 218 and 220 may be omitted, and process 200 may be completed after the updated dataflow graph is generated and stored. As another example, process 200 may be used to optimize a dataflow graph that is provided from another source (e.g., another data processing system), rather than optimizing a dataflow graph that is generated from an input SQL query as is the case in the illustrated embodiment. In such embodiments, acts 202 through 204 may be omitted, and the initial dataflow graph may be generated from a dataflow graph provided by another source at act 206. Such generation may involve transforming the received dataflow graph into a dataflow graph that is configured for use with a data processing system that performs the process 200.

Illustrative examples of applying optimization rules to a data flow graph are provided below with reference to fig. 3A-9. Each of the dataflow graphs shown in these figures may be a sub-graph of a larger dataflow graph that is being optimized (e.g., as part of act 207 of process 200). For example, each of the one or more dataflow graphs shown in these figures may be a subgraph of the initial dataflow graph generated at act 206 of process 200, and/or a subgraph of one or more dataflow graphs that is obtained by transforming the initial dataflow graph as part of act 207.

FIG. 3A shows the application of an optimization rule to an illustrative dataflow graph 300 to remove a redundant data processing operation. As shown in FIG. 3A, the dataflow graph 300 includes a node 302 representing a repartition data processing operation (which partitions data for parallel processing on different computing devices), followed by a node 304 representing a serialize operation (which combines all of the data for serial processing by a single computing device). The repartition operation is unnecessary because the subsequent serialize operation negates its effect. Thus, a dataflow graph optimizer (e.g., graph optimizer 110 in data processing system 100) may remove node 302 representing the repartition operation. As a result, dataflow graph 300 is transformed into dataflow graph 305.

FIG. 3B illustrates changing the order of commuting data processing operations to facilitate applying optimization rules to another illustrative dataflow graph, in accordance with some embodiments of the technology described herein. As shown in FIG. 3B, the dataflow graph 310 includes a node 311 representing a sort data processing operation (the sort being performed according to key A), followed by one or more nodes (not shown), followed by a node 312 representing another sort data processing operation (the sort being performed according to key B). In this example, determining whether the sort operation represented by node 311 commutes with the data processing operation(s) represented by the node(s) between nodes 311 and 312 may facilitate applying one or more optimization rules to dataflow graph 310. For example, if the sort operation represented by node 311 commutes with the data processing operation(s) represented by the node(s) between nodes 311 and 312, changing the order of the nodes allows nodes 311 and 312 to be placed adjacent to each other (as shown in dataflow graph 313). In turn, this allows consideration of whether one or more optimization rules may be applied to the resulting graph 313 (rules that may not have been applicable to dataflow graph 310 before the position of the sort operation represented by node 311 was changed). In this case, the subsequent sort according to key B (node 312) negates the effect of the sort according to key A (node 311). Therefore, the sort operation represented by node 311 is unnecessary, and the node may be removed (as shown in FIG. 3C), resulting in dataflow graph 314.

FIG. 3D shows the application of an optimization rule to another illustrative dataflow graph 320 to remove a redundant data processing operation. As shown in FIG. 3D, the dataflow graph 320 includes a node 322 representing a sort operation according to key A. However, when the data arriving at node 322 has already been sorted according to the same key, the dataflow graph optimizer may remove node 322, resulting in the dataflow graph 323. In all of the examples shown in FIGS. 3A-3D, removing the redundant node results in a dataflow graph that consumes fewer computing resources when executed than it would if the node were not removed.

FIG. 4A shows the application of an optimization rule to an illustrative dataflow graph 400 to perform a strength reduction optimization and obtain a dataflow graph 402. As shown in FIG. 4A, dataflow graph 400 contains a node 401 representing a sort operation that sorts incoming data according to a primary key A (e.g., by last name) and then according to a secondary key B (e.g., by first name among people having the same last name). However, when the graph optimizer detects that the data entering node 401 has already been sorted according to key A (e.g., by last name), the graph optimizer may perform a strength reduction optimization by replacing the sort operation with a sort-within-groups operation (represented by node 403 in graph 402), thereby reducing the strength of the sort operation while obtaining the same result and avoiding unnecessary computation.
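The equivalence behind this strength reduction can be checked with a short Python sketch. It assumes records are (last name, first name) tuples that are already ordered by last name; the function name `sort_within_groups` and the sample data are illustrative and do not describe how node 403 is actually implemented.

```python
from itertools import groupby
from operator import itemgetter


def sort_within_groups(records, group_key, sort_key):
    """Sort each run of records sharing group_key by sort_key.

    Assumes the input is already ordered by group_key, so each group is
    one contiguous run and only small per-group sorts are needed.
    """
    out = []
    for _, run in groupby(records, key=group_key):
        out.extend(sorted(run, key=sort_key))
    return out


last_name, first_name = itemgetter(0), itemgetter(1)

# Already sorted by last name (key A), but not by first name (key B).
people = [("Adams", "Zoe"), ("Adams", "Al"), ("Baker", "Max"), ("Baker", "Ann")]

full_sort = sorted(people, key=lambda r: (last_name(r), first_name(r)))
grouped_sort = sort_within_groups(people, last_name, first_name)
```

Both approaches yield the same ordering, but the grouped version never compares records from different last-name groups.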

FIG. 4B shows the application of an optimization rule to an illustrative dataflow graph 410 to perform another strength reduction optimization and obtain a dataflow graph 412. As shown in FIG. 4B, the dataflow graph 410 contains a node 411 representing a rollup operation to be performed according to a primary key A and a secondary key B. However, when the graph optimizer detects that the data entering node 411 has already been sorted according to key A, the graph optimizer may perform a strength reduction optimization by replacing the rollup operation with a grouped rollup operation (represented by node 413 in graph 412), thereby reducing the strength of the rollup operation while obtaining the same result and avoiding unnecessary computation.

FIG. 5A shows the application of an optimization rule to an illustrative dataflow graph 500 for combinatorial operation optimization. As shown in FIG. 5A, the dataflow graph 500 includes a sequence of nodes 502, 504, and 506, each representing a respective computation. In some embodiments, during execution of the dataflow graph, data processing operations represented by separate nodes may be performed by different processes running on one or more computing devices. The graph optimizer may perform the combinatorial operation optimization by replacing the sequence of three nodes with a single node (e.g., node 508 in dataflow graph 505) so that all of the operations are performed by a single process executing on a single computing device, which reduces the overhead of inter-process (and potentially inter-device) communication.

FIG. 5B shows the application of optimization rules to another illustrative dataflow graph 510 to perform a combining operations optimization. As shown in FIG. 5B, the dataflow graph 510 includes a node 512 representing a join operation on data set A with data set B using keys A1 and B1, followed by a node 514 representing a join operation on the output of the join operation represented by node 512 with data set C using keys A1 and C1. In this example, the graph optimizer may perform the combining operations optimization by replacing the two separate nodes representing the respective join operations with a single node representing a join operation on data sets A, B, and C using keys A1, B1, and C1 (as shown by node 516 in dataflow graph 515). In this manner, the join is performed by a single process executing on a single computing device, which reduces the overhead of inter-process (and potentially inter-device) communication.

FIG. 5C shows the application of optimization rules to another illustrative dataflow graph 520 to perform a combining operations optimization. As shown in FIG. 5C, the dataflow graph 520 includes a node 522 that represents a filter operation performed according to key A, followed by a node 524 representing another filter operation performed according to key B. In this example, the graph optimizer may perform the combining operations optimization by replacing the two separate nodes 522 and 524 representing the respective filter operations with a single node representing a filter operation that filters according to both keys A and B (as shown by node 526 in dataflow graph 525). In this manner, filtering is performed by a single process executing on a single computing device, which reduces the overhead of inter-process (and potentially inter-device) communication.
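A minimal sketch of fusing two filters into one, under the assumption that each filter is a side-effect-free predicate on a record, is given below. The predicates and record fields are hypothetical.

```python
def fuse_filters(*predicates):
    """Return a single predicate that accepts a record only if every given predicate does."""
    def fused(record):
        return all(p(record) for p in predicates)
    return fused

# Hypothetical predicates standing in for the filters on key A and key B.
by_key_a = lambda r: r["A"] > 0
by_key_b = lambda r: r["B"] == "ok"

records = [{"A": 1, "B": "ok"}, {"A": -1, "B": "ok"}, {"A": 2, "B": "no"}]

# Two passes (two nodes) versus one pass (one combined node).
two_pass = list(filter(by_key_b, filter(by_key_a, records)))
one_pass = list(filter(fuse_filters(by_key_a, by_key_b), records))
```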

FIG. 5D shows the application of optimization rules to another illustrative dataflow graph 530 to perform a combining operations optimization. As shown in FIG. 5D, the dataflow graph 530 includes a node 532 representing a filter operation performed according to key A, followed by a node 534 representing a rollup operation. In this example, the graph optimizer may perform the combining operations optimization by replacing the two separate nodes 532 and 534 with a single node (as shown by node 536 in dataflow graph 535) representing a rollup operation that processes only the input selected according to key A. In this manner, a single process executing on a single device may perform the equivalent of the filter operation and the rollup operation, which reduces the overhead of inter-process (and potentially inter-device) communication.

FIG. 6 shows the application of optimization rules to an illustrative dataflow graph 600 to remove an unnecessary data processing operation. As shown in FIG. 6, the dataflow graph 600 includes a node 602 that represents a sort operation by key A, followed by a node 604 that represents a reformat operation, followed by a node 606 that represents an unordered write operation. Because the write is unordered, the order that the sort operation represented by node 602 imposes on the data is not preserved by the write operation. The graph optimizer may therefore remove the node representing the sort operation, which results in dataflow graph 605.
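One way such a rule might be sketched, for a linear pipeline only, is shown below. The operation names and the set of operations assumed to neither depend on nor preserve-as-required the input order are hypothetical labels chosen for this illustration.

```python
# Assumed for this sketch: these operations produce the same multiset of
# records regardless of the order in which their input arrives.
ORDER_INSENSITIVE = {"reformat", "filter"}

def drop_unobserved_sorts(pipeline):
    """Remove each sort whose ordering can never be observed downstream.

    pipeline: list of operation names, source first, sink last. A sort is
    dropped when every operation after it, up to an unordered write, is
    order-insensitive.
    """
    result = []
    for i, op in enumerate(pipeline):
        if op == "sort":
            downstream = pipeline[i + 1:]
            j = 0
            while j < len(downstream) and downstream[j] in ORDER_INSENSITIVE:
                j += 1
            if j < len(downstream) and downstream[j] == "write_unordered":
                continue  # nothing downstream depends on the sort order
        result.append(op)
    return result

graph_600 = ["sort", "reformat", "write_unordered"]
```

A sort followed by an order-dependent operation (here any name outside the assumed set), or by an ordered write, is left in place.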

FIG. 7 shows the application of optimization rules to an illustrative dataflow graph 700 to perform a width reduction optimization in accordance with some embodiments of the techniques described herein. As shown in FIG. 7, the dataflow graph 700 includes a node 702 that sets the value of data column A to the logical OR of the data stored in columns B and C, followed by a node 704 that represents a filter operation on data column A provided by node 702 and a node 706 that represents a sort operation using key D. After the filter operation, data column A is not used in downstream computations. Thus, in some embodiments, the graph optimizer may perform the width reduction optimization by removing the data in data column A so that it is not propagated further (e.g., by introducing a node that deletes the data from the data moving along the links between nodes, as shown by node 708 in dataflow graph 705). This reduces the amount of computing resources required to carry data through subsequent computations downstream of node 704 in dataflow graph 705 (e.g., by reducing the network, memory, and processing resources utilized).
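The detection step can be sketched as a simple liveness analysis over a linear pipeline. This is an illustrative sketch under stated assumptions, not the optimizer's actual method: each node is described by the hypothetical triple (name, columns read, columns written), and the final outputs are assumed to be columns B, C, and D.

```python
def columns_to_drop(input_columns, pipeline, final_outputs):
    """Return {node name: columns that become unused immediately after that node}."""
    # Backward pass: which columns are still needed after each node.
    needed = set(final_outputs)
    needed_after = [None] * len(pipeline)
    for i in range(len(pipeline) - 1, -1, -1):
        _, reads, writes = pipeline[i]
        needed_after[i] = set(needed)
        needed = (needed - set(writes)) | set(reads)
    # Forward pass: drop each column at the first point it is no longer needed.
    available = set(input_columns)
    drops = {}
    for i, (name, _, writes) in enumerate(pipeline):
        available |= set(writes)
        dead = available - needed_after[i]
        if dead:
            drops[name] = dead
            available -= dead
    return drops

# Hypothetical description of dataflow graph 700.
graph_700 = [
    ("node_702", {"B", "C"}, {"A"}),  # A = B OR C
    ("node_704", {"A"}, set()),       # filter on A
    ("node_706", {"D"}, set()),       # sort by D
]
drops = columns_to_drop({"B", "C", "D"}, graph_700, {"B", "C", "D"})
```

In this sketch the result indicates that column A can be deleted directly after the filter, which corresponds to where node 708 is introduced.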

FIG. 8A shows the application of optimization rules to an illustrative dataflow graph 800 to perform a deduplication optimization. As shown in FIG. 8A, the dataflow graph 800 includes nodes 802 and 804, both of which represent read operations from the same underlying file (a.dat). The graph optimizer may perform the deduplication optimization by replacing both nodes with a single node (e.g., as shown by node 806 in dataflow graph 805) so that the data is read only once. Not reading the same data twice reduces the amount of computing resources used to access the data, even if the data is subsequently used by different parts of the dataflow graph 805.
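A minimal sketch of this rule, assuming a hypothetical graph representation in which each node is an (operation, argument) pair and links are (source, destination) pairs, is given below. The consumer nodes 1 and 2 are invented for the example.

```python
def dedup_reads(nodes, links):
    """Merge read nodes that read the same file and rewire their outgoing links.

    nodes: {node id: (operation, argument)}; links: list of (source, destination).
    """
    canonical = {}  # (operation, argument) -> first read node seen
    replace = {}    # duplicate node id -> surviving node id
    for node_id, (op, arg) in nodes.items():
        if op == "read":
            first = canonical.setdefault((op, arg), node_id)
            if first != node_id:
                replace[node_id] = first
    new_nodes = {n: spec for n, spec in nodes.items() if n not in replace}
    new_links = [(replace.get(s, s), replace.get(d, d)) for s, d in links]
    return new_nodes, new_links

nodes = {
    802: ("read", "a.dat"),
    804: ("read", "a.dat"),
    1: ("filter", "x"),  # hypothetical consumer of node 802
    2: ("sort", "k"),    # hypothetical consumer of node 804
}
links = [(802, 1), (804, 2)]
new_nodes, new_links = dedup_reads(nodes, links)
```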

FIG. 8B shows the application of optimization rules to an illustrative dataflow graph 811 to perform a deduplication optimization by zippering. As shown in FIG. 8B, the dataflow graph 811 includes a node 810 that represents a data processing operation that reads data from a file called "a.dat". The data is then processed in two different branches. The first branch contains node 812 (extract data column "A.f" and add 1 to the data in that column) and node 814 (sort the data according to column "A.k"). The second branch contains node 816 (extract data column "A.g" and multiply the data in that column by 3) and node 818 (sort the data according to column "A.k"). In the example of FIG. 8A, deduplication involves deleting nodes that perform the same processing (reading the same data from the same file). In the example of FIG. 8B, by contrast, the processing performed by the different branches of graph 811 differs. Nevertheless, the processing performed in the different branches is similar enough that the branches can be folded together into a single path. This folding can be done by joining the paths together from left to right, similar to the way the two sides of a zipper are joined.

In this example, the graph optimizer may modify graph 811 such that the computations represented by nodes 812 and 816 are performed serially (rather than in parallel). The result is shown in dataflow graph 821. Next, the graph optimizer may modify graph 821 such that the nodes 826 and 828 representing the respective sort operations are combined into a single node 836. The result is shown in dataflow graph 831. As can be seen, the number of nodes in graph 831 is reduced relative to graph 811, and the processing of the same data can be done in the same place, thereby reducing the required computing resources. In addition, further optimizations may be applied to the resulting graph 831, for example, by combining the operations represented by nodes 832 and 834 into a single node.
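The effect of the zippering can be sketched by comparing the two-branch plan with a single-path plan that performs both column computations and then sorts once. The row layout and the output column names "f1" and "g3" are hypothetical, and the equivalence shown relies on the sort being stable, as Python's built-in sort is.

```python
from operator import itemgetter

# Hypothetical rows read from the file, with columns k, f, and g.
rows = [
    {"k": 2, "f": 10, "g": 1},
    {"k": 1, "f": 20, "g": 2},
    {"k": 2, "f": 30, "g": 3},
]

def branches_separately(rows):
    """Two branches: each computes its own column, then each sorts by k."""
    first = sorted(({"k": r["k"], "f1": r["f"] + 1} for r in rows), key=itemgetter("k"))
    second = sorted(({"k": r["k"], "g3": r["g"] * 3} for r in rows), key=itemgetter("k"))
    return first, second

def zipped(rows):
    """One path: both computations run serially, followed by a single sort by k."""
    both = ({"k": r["k"], "f1": r["f"] + 1, "g3": r["g"] * 3} for r in rows)
    ordered = sorted(both, key=itemgetter("k"))  # one sort instead of two
    first = [{"k": r["k"], "f1": r["f1"]} for r in ordered]
    second = [{"k": r["k"], "g3": r["g3"]} for r in ordered]
    return first, second
```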

FIG. 9 shows the application of optimization rules to an illustrative dataflow graph 900 to perform a serial-to-parallel optimization. As shown in FIG. 9, the dataflow graph 900 includes a node 902 representing a serialization operation applied to data previously processed on k computing devices, and a node 904 representing a sort operation (according to key A). In some embodiments, the dataflow graph optimizer may alter the dataflow graph such that one or more operations applied to serialized data are instead applied in a parallelized manner. In this example, the dataflow graph optimizer may modify graph 900 to remove the serialization operation and allow the sort to be applied k ways in parallel (as shown by node 906 in dataflow graph 905). The results of the k parallel sorts may then be combined using a merge operation, as shown by node 908 in dataflow graph 905.
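A minimal sketch of the two plans is given below. Worker threads are used only as a stand-in for the k computing devices, and the function names and sample partitions are hypothetical; the merge uses the standard library's heapq.merge, which combines already-sorted inputs.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def serial_plan(partitions):
    """Serialize all partitions into one stream, then sort the whole stream."""
    combined = [x for part in partitions for x in part]
    return sorted(combined)

def parallel_plan(partitions):
    """Sort each partition independently, then merge the k sorted results.

    Assumes at least one partition.
    """
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        sorted_parts = list(pool.map(sorted, partitions))
    return list(heapq.merge(*sorted_parts))

# Hypothetical data held on k = 3 devices.
partitions = [[9, 1, 5], [4, 8], [7, 2, 6, 3]]
```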

FIG. 10A illustrates an initial dataflow graph 1000 in accordance with some embodiments of the technology described herein. FIG. 10B illustrates an updated dataflow graph 1050 obtained by iteratively applying optimization rules to the initial dataflow graph shown in FIG. 10A in accordance with some embodiments of the techniques described herein. As can be seen by comparing the initial dataflow graph 1000 of FIG. 10A with the updated dataflow graph 1050 of FIG. 10B, the updated dataflow graph has fewer nodes and links than the initial dataflow graph and can be executed more efficiently than the initial dataflow graph. A number of the optimizations applied to the initial dataflow graph 1000 in accordance with some embodiments of the techniques described herein are described in detail below.
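The iterative application of rules can be sketched as a loop that reapplies every rule until none of them changes the graph. The sketch below is illustrative only: the graph is reduced to a linear list of hypothetical operation names, and the two toy rules and the round limit are assumptions made for the example.

```python
def optimize(graph, rules, max_rounds=100):
    """Apply every rule repeatedly until a full round leaves the graph unchanged."""
    for _ in range(max_rounds):
        changed = False
        for rule in rules:
            new_graph = rule(graph)
            if new_graph != graph:
                graph, changed = new_graph, True
        if not changed:
            break
    return graph

def collapse_adjacent(kind):
    """Toy rule: two adjacent operations of the given kind become one."""
    def rule(pipeline):
        out = []
        for op in pipeline:
            if out and out[-1] == kind and op == kind:
                continue
            out.append(op)
        return out
    return rule

def remove_noops(pipeline):
    """Toy rule: delete operations that do nothing."""
    return [op for op in pipeline if op != "noop"]

initial = ["read", "sort", "noop", "sort", "write"]
updated = optimize(initial, [collapse_adjacent("sort"), remove_noops])
```

In this example the first rule finds nothing to do in the first round; only after the second rule removes the no-op do the two sorts become adjacent, so that a second round is needed to collapse them. This is the reason for iterating rather than applying each rule once.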

In the illustrative example of FIGS. 10A and 10B, the combining operations optimization is applied multiple times to the initial dataflow graph 1000 of FIG. 10A to generate the updated dataflow graph 1050 of FIG. 10B. For example, during generation of the updated dataflow graph 1050, the data processing operation represented by node 1002 ("expr," which may be an expression for performing any suitable computation on data flowing through node 1002 and which, in this case, formats the data for output) is combined with the write data processing operation represented by node 1004 ("write file") using the combining operations optimization, so that the updated dataflow graph 1050 contains a node 1054 configured to perform both data processing operations. As another example, the data processing operation represented by node 1006 ("expr") and the filter data processing operation represented by node 1008 ("filter") are combined using the combining operations optimization, so that the updated dataflow graph 1050 contains a node 1056 configured to perform both data processing operations. As yet another example, the data processing operation represented by node 1010 (labeled "write files") and the write data processing operation represented by node 1012 ("write file") are combined using the combining operations optimization, so that the updated dataflow graph 1050 contains a node 1058 configured to perform both data processing operations. As discussed herein, in some embodiments, because the data processing operations associated with a single dataflow graph node are performed by a single process executing on a single computing device, combining nodes reduces the overhead of inter-process (and potentially inter-device) communication, thereby improving the performance of the data processing system.

In the illustrative example of FIGS. 10A and 10B, an optimization for removing redundancy is applied multiple times to the initial dataflow graph 1000 of FIG. 10A to generate the updated dataflow graph 1050 of FIG. 10B. For example, the sort operations represented by nodes 1030, 1032, and 1034 in the initial dataflow graph 1000 are removed and have no corresponding nodes in the updated dataflow graph 1050. These sort operations are removed because the data passed into these nodes has already been sorted by the sort operations applied via nodes 1022, 1026, and 1028, respectively. The sort order imposed via processing at nodes 1022, 1026, and 1028 is preserved by the subsequent data processing operations (e.g., sorted rollup operations), so that no further sorting at nodes 1030, 1032, and 1034 is required. Thus, nodes 1030, 1032, and 1034 are removed during generation of the updated dataflow graph 1050. On the other hand, the sort operations represented by nodes 1022, 1026, and 1028 are retained. For example, nodes 1022 and 1026 in the initial dataflow graph 1000 correspond to nodes 1070 and 1072 in the updated dataflow graph 1050. In contrast, the sort operation represented by node 1036 is not redundant because the sort order is not preserved by the preceding full outer merge join operation (node 1074 is the corresponding node in the updated dataflow graph 1050).

FIGS. 10A and 10B also show another example of removing an unnecessary data processing operation. As shown in FIG. 10A, the initial dataflow graph 1000 includes a node 1024 representing a sort operation, eventually followed by a node 1004 representing an unordered write operation. Because the write is unordered, the order that the sort operation represented by node 1024 imposes on the data is not preserved by the write operation, and node 1024 representing the sort operation may be removed. As can be seen from FIG. 10B, there is no sort operation corresponding to node 1024; this operation has been removed.

As yet another example of removing unnecessary data processing operations, the initial dataflow graph 1000 includes layout and partitioning operations represented by nodes 1040, 1045, 1046, and 1047, which are redundant in that each of them is preceded by a corresponding layout and partitioning node and the layout is not otherwise changed. Thus, these nodes are removed from the initial dataflow graph 1000, and there are no corresponding nodes in the dataflow graph 1050. In a related optimization applied to the initial dataflow graph 1000, the layout and partitioning operation represented by node 1048 is replaced with the aggregation operation represented by node 1088. In contrast, the layout and partitioning operation represented by node 1044 is retained because that node is not redundant; it is responsible for establishing the final partitioning and layout, and it gives rise to node 1080 ("partition by key") in the final graph 1050.

FIGS. 10A and 10B also illustrate an example of the width reduction optimization. As can be seen, by introducing the "reformat_implicit" node 1064 shown in the updated dataflow graph 1050 of FIG. 10B, the number of columns of data processed after the read data processing operation represented by node 1014 is reduced. In this example, the graph optimizer detects which data columns in the read data will not be used subsequently by determining which columns the write operation represented by node 1012 is to write out.

FIG. 10C illustrates another view of the initial dataflow graph 1000 of FIG. 10A, in accordance with some embodiments of the technology described herein. In this view, the labels of the dataflow graph nodes have been replaced by their abbreviations. Similarly, FIG. 10D illustrates another view of the updated dataflow graph of FIG. 10B in which the labels of the dataflow graph nodes have been replaced with their abbreviations.

FIG. 11 illustrates an example of a suitable computing system environment 1100 on which the techniques described herein may be implemented. The computing system environment 1100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the techniques described herein. Neither should the computing environment 1100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 1100.

The technology described herein is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the techniques described herein include, but are not limited to: personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

The computing environment may execute computer-executable instructions, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The techniques described herein may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

With reference to FIG. 11, an exemplary system for implementing the techniques described herein includes a general purpose computing device in the form of a computer 1110. Components of computer 1110 may include, but are not limited to, a processing unit 1120, a system memory 1130, and a system bus 1121 that couples various system components including the system memory to the processing unit 1120. The system bus 1121 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus also known as Mezzanine bus.

Computer 1110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 1110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 1110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.

The system memory 1130 includes computer storage media in the form of volatile and/or nonvolatile memory such as Read Only Memory (ROM) 1131 and Random Access Memory (RAM) 1132. A basic input/output system (BIOS) 1133, containing the basic routines that help to transfer information between elements within the computer 1110, such as during start-up, is typically stored in ROM 1131. RAM 1132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 1120. By way of example, and not limitation, FIG. 11 illustrates operating system 1134, application programs 1135, other program modules 1136, and program data 1137.

The computer 1110 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, FIG. 11 illustrates a hard disk drive 1141 that reads from or writes to non-removable, nonvolatile magnetic media, a flash memory drive 1151 that reads from or writes to removable, nonvolatile memory 1152 (such as flash memory), and an optical disk drive 1155 that reads from or writes to a removable, nonvolatile optical disk 1156 (such as a CD ROM or other optical media). Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 1141 is typically connected to the system bus 1121 through a non-removable memory interface such as interface 1140, and the flash memory drive 1151 and optical disk drive 1155 are typically connected to the system bus 1121 by a removable memory interface, such as interface 1150.

The drives and their associated computer storage media discussed above and illustrated in FIG. 11, provide storage of computer readable instructions, data structures, program modules and other data for the computer 1110. In FIG. 11, for example, hard disk drive 1141 is illustrated as storing operating system 1144, application programs 1145, other program modules 1146, and program data 1147. Note that these components can either be the same as or different from operating system 1134, application programs 1135, other program modules 1136, and program data 1137. Operating system 1144, application programs 1145, other program modules 1146, and program data 1147 are given different numbers here to illustrate that, at a minimum, they are different copies. A user may enter commands and information into the computer 1110 through input devices such as a keyboard 1162 and pointing device 1161, commonly referred to as a mouse, trackball or touch pad. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 1120 through a user input interface 1160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a Universal Serial Bus (USB). A monitor 1191 or other type of display device is also connected to the system bus 1121 via an interface, such as a video interface 1190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 1197 and printer 1196, which may be connected through an output peripheral interface 1195.

The computer 1110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 1180. The remote computer 1180 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 1110, although only a memory storage device 1181 has been illustrated in FIG. 11. The logical connections depicted in FIG. 11 include a Local Area Network (LAN) 1171 and a Wide Area Network (WAN) 1173, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.

When used in a LAN networking environment, the computer 1110 is connected to the LAN 1171 through a network interface or adapter 1170. When used in a WAN networking environment, the computer 1110 typically includes a modem 1172 or other means for establishing communications over the WAN 1173, such as the Internet. The modem 1172, which may be internal or external, may be connected to the system bus 1121 via the user input interface 1160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 1110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, fig. 11 illustrates remote application programs 1185 as residing on memory device 1181. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.

Having thus described several aspects of at least one embodiment of this invention, it is to be appreciated that various alterations, modifications, and improvements will readily occur to those skilled in the art.

Such alterations, modifications, and improvements are intended to be part of this disclosure, and are intended to be within the spirit and scope of the invention. Moreover, while advantages of the invention are indicated, it should be understood that not every embodiment of the technology described herein will include every described advantage. Some embodiments may not implement any features described as advantageous herein, and in some cases, one or more of the described features may be implemented to implement further embodiments. Accordingly, the foregoing description and drawings are by way of example only.

The above-described embodiments of the techniques described herein may be implemented in any of a variety of ways. For example, the embodiments may be implemented using hardware, software, or a combination thereof. When implemented in software, the software code can be executed on any suitable processor or collection of processors, whether disposed in a single computer or distributed among multiple computers. Such processors may be implemented as integrated circuits (with one or more processors in one integrated circuit component), including commercially available integrated circuit components known in the art as CPUs, GPU chips, microprocessors, microcontrollers or co-processors. Alternatively, the processor may be implemented in custom circuitry (such as an ASIC) or semi-custom circuitry produced by configuring a programmable logic device. As yet another alternative, the processor may be part of a larger circuit or semiconductor device, whether commercially available, semi-custom, or custom. As a specific example, some commercial microprocessors have multiple cores, such that one or a subset of the cores may make up the processor. However, the processor may be implemented using circuitry in any suitable format.

Further, it should be appreciated that a computer may be implemented in any of a number of forms, such as a rack-mounted computer, a desktop computer, a laptop computer, or a tablet computer. Additionally, a computer may be embedded in a device not normally considered a computer but with suitable processing capabilities, including a Personal Digital Assistant (PDA), a smart phone, or any other suitable portable or stationary electronic device.

Also, a computer may have one or more input devices and output devices. These devices may be used, inter alia, to present a user interface. Examples of output devices that can be used to provide a user interface include printers or display screens for visual presentation of output and speakers or other sound generating devices for audible presentation of output. Examples of input devices that may be used for a user interface include keyboards, and pointing devices, such as mice, touch pads, and digitizing tablets. As another example, a computer may receive input information through speech recognition or in other audible format.

Such computers may be interconnected by one or more networks in any suitable form, including as a local area network or a wide area network, such as an enterprise network or the internet. Such networks may be based on any suitable technology and may operate according to any suitable protocol, and may include wireless networks, wired networks, or fiber optic networks.

Also, the various methods or processes outlined herein may be coded as software that is executable on one or more processors that employ any one of a variety of operating systems or platforms. Additionally, such software may be written using any of a number of suitable programming languages and/or programming or scripting tools, and also may be compiled as executable machine language code or intermediate code that is executed on a framework or virtual machine.

In this regard, the invention may be implemented as a computer-readable storage medium (or multiple computer-readable media) (e.g., a computer memory, one or more floppy disks, Compact Disks (CDs), optical disks, Digital Video Disks (DVDs), tapes, flash memories, circuit configurations in field programmable gate arrays or other semiconductor devices, or other tangible computer storage medium) encoded with one or more programs that, when executed on one or more computers or other processors, perform methods that implement the various embodiments of the invention discussed above. As will be apparent from the foregoing examples, a computer-readable storage medium may retain information for a sufficient time to provide computer-executable instructions in a non-transitory form. Such computer-readable storage media may be transportable, such that the program or programs stored thereon can be loaded onto one or more different computers or other processors to implement various aspects of the present invention as discussed above. As used herein, the term "computer-readable storage medium" encompasses only a non-transitory computer-readable medium that can be considered a manufacture (i.e., an article of manufacture) or a machine. Alternatively or additionally, the invention may be embodied as a computer readable medium other than a computer readable storage medium, such as a propagated signal.

The terms "program" or "software" are used herein in a generic sense to refer to any type of computer code or set of computer-executable instructions that can be employed to program a computer or other processor to implement various aspects of the present invention as discussed above. In addition, it should be appreciated that according to one aspect of this embodiment, one or more computer programs that when executed perform methods of the present invention need not reside on a single computer or processor, but may be distributed in a modular fashion amongst a number of different computers or processors to implement various aspects of the present invention.

Computer-executable instructions may be in many forms, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.

Furthermore, the data structures may be stored in any suitable form on a computer readable medium. For simplicity of illustration, the data structure may be shown with fields that are related by location in the data structure. Such relationships may likewise be implemented by assigning a storage for the fields a location in a computer-readable medium that conveys the relationship between the fields. However, any suitable mechanism may be used to establish relationships between information in fields of a data structure, including through the use of pointers, tags, or other mechanisms that establish relationships between data elements.

The various aspects of the present invention may be used alone, in combination, or in a variety of arrangements not specifically discussed in the embodiments described in the foregoing, and are therefore not limited in their application to the details and arrangement of components set forth in the foregoing description or illustrated in the drawings. For example, aspects described in one embodiment may be combined in any manner with aspects described in other embodiments.

Furthermore, the present invention may be embodied as a method, an example of which has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than presented, which may include performing some acts concurrently, even though shown as sequential acts in illustrative embodiments.

Further, some actions are described as being made by a "user". It should be understood that a "user" need not be a single individual, and in some embodiments, actions attributable to a "user" may be performed by a team of individuals and/or a combination of individuals and computer-aided tools or other mechanisms.

Use of ordinal terms such as "first," "second," "third," etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed, but are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term) to distinguish the claim elements.

Also, the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of "including," "comprising," or "having," "containing," "involving," and variations thereof herein, is meant to encompass the items listed thereafter and equivalents thereof as well as additional items.
