Using object models of heterogeneous data to facilitate building data visualizations

Document No.: 1382593. Publication date: 2020-08-14.

Abstract: Using object models of heterogeneous data to facilitate building data visualizations, by Justin Talbot, Roger Hau, Daniel Cory, Jiyoung Oh, and Teresa Roberts, created 2018-08-01. A process receives a visual specification that specifies a data source, visual variables, and data fields from the data source. Each visual variable is associated with one or more of the data fields, and each data field is a dimension or a metric. For each metric m, the process identifies a set R(m) consisting of the dimensions that are reachable from the metric according to a sequence of many-to-one relationships in an object model of the data source. For each distinct set R, the process forms a set S of data fields consisting of each dimension in R and each metric m that satisfies R(m) = R. For each set S and for each metric m in S, the process aggregates the values of the metric according to the dimensions in S. The process builds a data visualization from the data fields in the set S and the visual variables with which they are associated.

1. A method for generating a data visualization, comprising:

at a computer having one or more processors and memory storing one or more programs configured for execution by the one or more processors:

receiving a visual specification specifying one or more data sources, a plurality of visual variables, and a plurality of data fields from the one or more data sources, wherein each visual variable is associated with a respective one or more of the data fields, and each of the data fields is identified as a dimension d or a metric m;

for each metric m among the data fields, identifying a respective set of reachable dimensions R(m) consisting of all dimensions d among the data fields that are reachable from the respective metric m according to a sequence of many-to-one relationships in a predefined object model of the one or more data sources;

for each distinct set of reachable dimensions R, forming a respective set S of data fields, wherein S consists of each dimension in R and each metric m among the data fields that satisfies R(m) = R; and

for each set of data fields S:

for each metric m in the respective set of data fields S, rolling up the value of the metric m to a level of detail specified by the respective dimension in the respective set of data fields S; and

establishing respective data visualizations according to data fields in the respective sets of data fields S and according to the respective visual variable to which each of the data fields in S is associated.

2. The method of claim 1, wherein the visual specification further comprises one or more additional visual variables not associated with any data fields from the one or more data sources.

3. The method of claim 1, wherein establishing the respective data visualizations further comprises retrieving tuples of data from the one or more data sources using one or more database queries generated from the visual specification.

4. The method of claim 3, wherein the tuple comprises data aggregated according to the respective dimension in the respective set S of data fields.

5. The method of claim 3, further comprising displaying the respective data visualization in a graphical user interface of the computer.

6. The method of claim 5, wherein displaying the data visualization includes generating a plurality of visual markers, each marker corresponding to a respective tuple retrieved from the one or more data sources.

7. The method of claim 5, wherein the graphical user interface includes a data visualization area, the method further comprising displaying the data visualization in the data visualization area.

8. The method of claim 1, wherein each of the visual variables is selected from the group consisting of: row attributes, column attributes, filter attributes, color coding, size coding, shape coding, and label coding.

9. The method of claim 1, wherein the dimension d is reachable from the metric m when the dimension d and the metric m are in the same class in the predefined object model, or when the metric m is an attribute of a first class C_1 in the predefined object model, the dimension d is an attribute of an nth class C_n in the object model, n ≥ 2, and zero or more intermediate classes C_2, …, C_{n-1} exist in the predefined object model such that, for each i = 1, 2, …, n-1, there is a many-to-one relationship between classes C_i and C_{i+1}.

10. The method of claim 1, wherein rolling up the value of the metric m to a level of detail specified by the respective dimensions in the respective set of data fields S comprises dividing rows of a data table containing the metric m into groups according to the respective dimensions in the respective set of data fields S, and calculating a single aggregate value for each group.

11. The method of claim 10, wherein calculating the single aggregate value comprises applying an aggregation function selected from the group consisting of: SUM, COUNT, COUNTD, MIN, MAX, AVG, MEDIAN, ATTR, PERCENTILE, STDEV, STDEVP, VAR, and VARP.

12. A computer system for generating a data visualization, comprising:

one or more processors; and

a memory;

wherein the memory stores one or more programs configured for execution by the one or more processors, and the one or more programs include instructions for:

receiving a visual specification specifying one or more data sources, a plurality of visual variables, and a plurality of data fields from the one or more data sources, wherein each visual variable is associated with a respective one or more of the data fields, and each of the data fields is identified as a dimension d or a metric m;

for each metric m among the data fields, identifying a respective set of reachable dimensions R(m) consisting of all dimensions d among the data fields that are reachable from the respective metric m according to a sequence of many-to-one relationships in a predefined object model of the one or more data sources;

for each distinct set of reachable dimensions R, forming a respective set S of data fields, wherein S consists of each dimension in R and each metric m among the data fields that satisfies R(m) = R; and

for each set of data fields S:

for each metric m in the respective set of data fields S, rolling up the value of the metric m to a level of detail specified by the respective dimension in the respective set of data fields S; and

establishing respective data visualizations according to data fields in said respective sets of data fields S and according to respective visual variables to which each of said data fields in S is associated.

13. The computer system of claim 12, wherein establishing the respective data visualizations further comprises retrieving tuples of data from the one or more data sources using one or more database queries generated from the visual specification.

14. The computer system of claim 13, wherein the tuples comprise data aggregated according to the respective dimensions in the respective set S of data fields.

15. The computer system of claim 13, wherein the one or more programs further comprise instructions for displaying the respective data visualizations in a graphical user interface of the computer.

16. The computer system of claim 15, wherein displaying the data visualization includes generating a plurality of visual markers, each marker corresponding to a respective tuple retrieved from the one or more data sources.

17. The computer system of claim 12, wherein each of the visual variables is selected from the group consisting of: row attributes, column attributes, filter attributes, color coding, size coding, shape coding, and label coding.

18. The computer system of claim 12, wherein the dimension d is reachable from the metric m when the dimension d and the metric m are in the same class in the predefined object model, or when the metric m is an attribute of a first class C_1 in the predefined object model, the dimension d is an attribute of an nth class C_n in the object model, n ≥ 2, and zero or more intermediate classes C_2, …, C_{n-1} exist in the predefined object model such that, for each i = 1, 2, …, n-1, there is a many-to-one relationship between classes C_i and C_{i+1}.

19. The computer system of claim 12, wherein rolling up the value of the metric m to a level of detail specified by the respective dimensions in the respective set of data fields S comprises dividing rows of a data table containing the metric m into groups according to the respective dimensions in the respective set of data fields S, and calculating a single aggregate value for each group.

20. A non-transitory computer readable storage medium storing one or more programs configured for execution by a computer system having a display, one or more processors, and memory, the one or more programs including instructions for:

receiving a visual specification specifying one or more data sources, a plurality of visual variables, and a plurality of data fields from the one or more data sources, wherein each visual variable is associated with a respective one or more of the data fields, and each of the data fields is identified as a dimension d or a metric m;

for each metric m among the data fields, identifying a respective set of reachable dimensions R(m) consisting of all dimensions d among the data fields that are reachable from the respective metric m according to a sequence of many-to-one relationships in a predefined object model of the one or more data sources;

for each distinct set of reachable dimensions R, forming a respective set S of data fields, wherein S consists of each dimension in R and each metric m among the data fields that satisfies R(m) = R; and

for each set of data fields S:

for each metric m in the respective set of data fields S, rolling up the value of the metric m to a level of detail specified by the respective dimension in the respective set of data fields S; and

establishing respective data visualizations according to the data fields in the respective sets of data fields S and according to the respective visual variable to which each of the data fields in S is associated.

Technical Field

The disclosed implementations relate generally to data visualization and, more particularly, to interactive visual analysis of a data set using an object model of the data set.

Background

Data visualization applications enable users to visually understand data sets, including distributions, trends, outliers, and other factors important to making business decisions. Some data elements must be calculated based on data from the selected data set. For example, data visualization often uses sums to aggregate data. Some data visualization applications enable a user to specify a "level of detail" (LOD) that is available for aggregation calculations. However, specifying a single level of detail for data visualization is not sufficient to build some computations.

Some data visualization applications provide a user interface that enables a user to build visualizations from a data source by selecting data fields and placing them in specific user interface regions, thereby defining the data visualization indirectly. See, for example, U.S. patent application Ser. No. 10/453,834, entitled "Computer Systems and Methods for the Query and Visualization of Multidimensional Databases," filed June 2, 2003, now U.S. Patent No. 7,089,266, which is incorporated herein by reference in its entirety. However, when there are complex data sources and/or multiple data sources, it may not be clear what type of data visualization, if any, will be generated from the user's selections.

Summary

Generating data visualizations that combine data from multiple tables can be challenging, especially when there are multiple fact tables. In some cases, it helps to build an object model of the data before generating a visualization of the data. In some instances, a person with particular expertise in the data creates the object model. Because the relationships are stored in the object model, the data visualization application can use this information to assist all users who access the data, even users who are not experts.

An object is a collection of named attributes. An object often corresponds to a real-world object, event, or concept, such as a store. An attribute is a description of an object that is conceptually in a 1:1 relationship with the object. Thus, a store object may have a single [Manager Name] or [Employee Count] associated with it. At the physical level, objects are often stored as rows in a relational table or as objects in JSON.

A class is a collection of objects that share the same attributes. It must be analytically meaningful to compare the objects in a class and to aggregate them. At the physical level, classes are often stored as relational tables or as arrays of objects in JSON.

An object model is a collection of classes and a collection of many-to-one relationships between them. Classes that are related by a 1-to-1 relationship are conceptually treated as a single class, even if the user thinks of them as distinct. In addition, classes that are related by a 1-to-1 relationship may still be presented as distinct classes in a data visualization user interface. A many-to-many relationship is conceptually split into two many-to-one relationships by adding an association table that captures the relationship.
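The definitions above can be sketched as a small data structure. This is only an illustrative sketch: the names `ManyToOne`, `ObjectModel`, `add_many_to_many`, and the example classes (`LineItem`, `Order`, `Product`, `Tag`, `ProductTag`) are hypothetical and do not come from the source.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ManyToOne:
    """Many `source` objects refer to one `target` object."""
    source: str
    target: str


@dataclass
class ObjectModel:
    # class name -> names of the attributes shared by its objects
    classes: dict = field(default_factory=dict)
    relationships: set = field(default_factory=set)

    def add_class(self, name, attributes):
        self.classes[name] = set(attributes)

    def add_many_to_one(self, source, target):
        self.relationships.add(ManyToOne(source, target))

    def add_many_to_many(self, left, right, association):
        """Split a many-to-many relationship into two many-to-one
        relationships through an added association class."""
        self.add_class(association, [])
        self.add_many_to_one(association, left)
        self.add_many_to_one(association, right)


# Hypothetical example model.
model = ObjectModel()
model.add_class("LineItem", ["Quantity", "Price"])
model.add_class("Order", ["Order Date"])
model.add_class("Product", ["Product Name"])
model.add_class("Tag", ["Tag Name"])
model.add_many_to_one("LineItem", "Order")
model.add_many_to_one("LineItem", "Product")
model.add_many_to_many("Product", "Tag", association="ProductTag")
```

In this sketch only many-to-one edges are stored, so every many-to-many relationship has to pass through an explicit association class, mirroring the conceptual split described above.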

Once the object model is built, the data visualization application can assist the user in various ways. In some implementations, based on the data fields that have already been selected and placed on shelves in the user interface, the data visualization application can recommend additional fields or limit which actions can be taken, in order to prevent unusable combinations. In some implementations, the data visualization application allows the user considerable freedom in selecting fields, and uses the object model to build one or more data visualizations according to what the user has selected.

According to some implementations, a process generates a data visualization. The process is performed at a computer having one or more processors and memory storing one or more programs configured for execution by the one or more processors. The process receives a visual specification specifying one or more data sources, a plurality of visual variables, and a plurality of data fields from the one or more data sources. Each visual variable is associated with one or more data fields, and each data field is identified as a dimension d or a metric m. In some implementations, the visual specification is a data structure that is populated based on user selections in the user interface. For example, a user may drag a field from a palette of data fields to a rows shelf, a columns shelf, or an encoding shelf (e.g., color or size encoding). Each shelf corresponds to a visual variable in the visual specification, and the data fields on the shelves are stored as part of the visual specification. In some instances, two or more data fields are associated with the same shelf, so the corresponding visual variable has two or more associated data fields. When two or more data fields are associated with a visual variable, there is typically a specified order. In some instances, the same data field is associated with two or more different visual variables. Typically, an individual data visualization does not use all of the available visual variables. That is, the visual specification typically includes one or more additional visual variables that are not associated with any data fields from the one or more data sources. In some implementations, each visual variable is one of: a row attribute, a column attribute, a filter attribute, a color encoding, a size encoding, a shape encoding, or a label encoding.
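One way to picture such a visual specification is as a mapping from visual variables to ordered lists of data fields. The following is a minimal sketch under that assumption; `VisualSpecification`, `DataField`, `place`, and the example field names are hypothetical and not taken from the source.

```python
from dataclasses import dataclass, field

# The seven visual variables listed in the text, as hypothetical keys.
VISUAL_VARIABLES = ("rows", "columns", "filters", "color", "size", "shape", "label")


@dataclass(frozen=True)
class DataField:
    name: str
    role: str  # "dimension" or "metric"


@dataclass
class VisualSpecification:
    data_sources: list
    # visual variable -> ordered list of data fields placed on that shelf
    shelves: dict = field(
        default_factory=lambda: {v: [] for v in VISUAL_VARIABLES}
    )

    def place(self, visual_variable, data_field):
        """Record that the user dropped `data_field` on a shelf."""
        self.shelves[visual_variable].append(data_field)

    def unused_variables(self):
        return [v for v, placed in self.shelves.items() if not placed]


region = DataField("Region", "dimension")
sales = DataField("Sales", "metric")

spec = VisualSpecification(data_sources=["orders"])
spec.place("columns", region)
spec.place("rows", sales)
spec.place("color", region)  # the same field on two visual variables
```

Lists preserve the order in which fields were placed, which matches the remark that multiple fields on one visual variable typically have a specified order.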

In many cases, a metric is a numeric field and a dimension is a data field having a string data type. More importantly, the labels "metric" and "dimension" indicate how a data field is used.

For each metric m among the data fields, the process identifies a respective set of reachable dimensions R(m) consisting of all dimensions d among the data fields that are reachable from the respective metric m according to a sequence of many-to-one relationships in a predefined object model for the one or more data sources. Note that the sequence may have length zero, representing the case where the dimension d and the metric m are in the same class. In some implementations, the dimension d is reachable from the metric m when d and m are in the same class in the predefined object model, or when m is an attribute of a first class C_1 in the predefined object model, d is an attribute of an nth class C_n in the object model, n ≥ 2, and zero or more intermediate classes C_2, …, C_{n-1} exist in the predefined object model such that, for each i = 1, 2, …, n-1, there is a many-to-one relationship between classes C_i and C_{i+1}.
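The reachability rule can be read as a graph traversal over many-to-one edges. The sketch below assumes the object model is given as a dictionary from each class to the classes it points to through a many-to-one relationship; the function names and the example classes and fields are hypothetical.

```python
def reachable_classes(start, many_to_one):
    """Classes reachable from `start` by following zero or more
    many-to-one relationships (so `start` itself is included)."""
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for target in many_to_one.get(current, ()):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def reachable_dimensions(metric, dimensions, field_class, many_to_one):
    """Compute R(m): the dimensions whose class is reachable from the
    class that contains the metric."""
    classes = reachable_classes(field_class[metric], many_to_one)
    return frozenset(d for d in dimensions if field_class[d] in classes)


# Hypothetical model: LineItem -> Order -> Customer, LineItem -> Product.
many_to_one = {"LineItem": ["Order", "Product"], "Order": ["Customer"]}
field_class = {
    "Sales": "LineItem",
    "Credit Limit": "Customer",
    "Customer Name": "Customer",
    "Product Name": "Product",
}
dimensions = ["Customer Name", "Product Name"]

r_sales = reachable_dimensions("Sales", dimensions, field_class, many_to_one)
r_credit = reachable_dimensions("Credit Limit", dimensions, field_class, many_to_one)
```

Here `Sales` reaches both dimensions, while `Credit Limit` reaches only `Customer Name`, because no many-to-one path leads from `Customer` to `Product`.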

Note that there is also the trivial case where R(m) = ∅, either because no dimensions are associated with the visual variables or because a metric cannot reach any of the dimensions. The empty set is a valid set of reachable dimensions.

Establishing the sets of reachable dimensions results in a partitioning of the metrics. In particular, the relation ~ defined by m_1 ~ m_2 if and only if R(m_1) = R(m_2) is an equivalence relation. In most cases there is only one partition (i.e., R(m) is the same for all metrics), but in some instances there is more than one partition.

For each distinct set of reachable dimensions R, the process forms a corresponding set of data fields S. The set S consists of each dimension in R and each metric m among the data fields that satisfies R(m) = R. Typically, each set of data fields includes at least one metric. In some implementations, any set of data fields without a metric is ignored. In some implementations, the data visualization application raises an error when a set S of data fields without a metric is identified. In some implementations, in addition to the data visualization created for each set of data fields S that includes one or more metrics, the data visualization application builds an additional data visualization for each set of data fields S that has no metric.
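Forming the sets S amounts to grouping metrics by their value of R(m). The following sketch assumes R(m) has already been computed and is supplied as a dictionary; the function name `form_field_sets` and the example fields are hypothetical.

```python
from collections import defaultdict


def form_field_sets(reachable):
    """Group metrics that share the same reachable-dimension set.

    `reachable` maps each metric m to R(m) as a frozenset of dimension
    names. The result maps each distinct R to its set S, split into the
    dimensions in R and the metrics m with R(m) = R.
    """
    metrics_by_r = defaultdict(list)
    for metric, dims in reachable.items():
        metrics_by_r[dims].append(metric)
    return {
        dims: {"dimensions": sorted(dims), "metrics": sorted(metrics)}
        for dims, metrics in metrics_by_r.items()
    }


# Hypothetical reachable sets for three metrics.
reachable = {
    "Sales": frozenset({"Customer Name", "Product Name"}),
    "Quantity": frozenset({"Customer Name", "Product Name"}),
    "Credit Limit": frozenset({"Customer Name"}),
}
field_sets = form_field_sets(reachable)
```

In this example there are two distinct sets R, so two sets S are formed and, following the text, two data visualizations would be built.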

For each set S of data fields and for each metric m in the respective set S of data fields, the process rolls up the value of metric m to the level of detail specified by the respective dimension in the respective set S of data fields. The process then builds a respective visualization of the data from the data fields in the respective set of data fields S and from the respective visual variable with which each data field in S is associated.

In some implementations, establishing respective data visualizations includes retrieving tuples of data from one or more data sources using one or more database queries generated according to a visual specification. For example, for an SQL data source, the process builds an SQL query and sends the query to the appropriate SQL database engine. In some examples, the tuples include data aggregated according to respective dimensions in respective sets S of data fields. That is, aggregation is performed by the data sources.
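As a rough illustration of having the data source perform the aggregation, the sketch below builds a GROUP BY query for one set S and runs it against an in-memory SQLite database. It assumes, for simplicity, that all fields in S live in a single table; `build_query`, `quote`, the allowlist, and the `orders` table are hypothetical.

```python
import sqlite3

# Aggregate functions this sketch allows (all are built into SQLite).
ALLOWED_AGGREGATES = {"SUM", "COUNT", "MIN", "MAX", "AVG"}


def quote(identifier):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + identifier.replace('"', '""') + '"'


def build_query(table, dimensions, metrics):
    """Build an aggregate query; `metrics` is a list of
    (aggregate function name, column name) pairs."""
    for func, _ in metrics:
        if func not in ALLOWED_AGGREGATES:
            raise ValueError(f"unsupported aggregate: {func}")
    dim_sql = [quote(d) for d in dimensions]
    metric_sql = [f"{func}({quote(column)})" for func, column in metrics]
    sql = f"SELECT {', '.join(dim_sql + metric_sql)} FROM {quote(table)}"
    if dim_sql:
        sql += f" GROUP BY {', '.join(dim_sql)}"
    return sql


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "orders" ("Region" TEXT, "Sales" INTEGER)')
conn.executemany(
    'INSERT INTO "orders" VALUES (?, ?)',
    [("East", 10), ("East", 20), ("West", 5)],
)

query = build_query("orders", ["Region"], [("SUM", "Sales")])
tuples = sorted(conn.execute(query).fetchall())
```

Each returned tuple is already aggregated to the level of detail given by the dimensions, so each tuple could map directly to one visual mark.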

Typically, the generated data visualization is displayed in a graphical user interface on the computer (e.g., a user interface for a data visualization application). In some implementations, displaying the data visualization includes generating a plurality of visual markers, where each marker corresponds to a respective tuple retrieved from one or more data sources. In some implementations, the graphical user interface includes a data visualization area, and the process displays the data visualization in the data visualization area.

In some implementations, rolling up the value of the metric m to the level of detail specified by the respective dimensions in the respective set S of data fields includes dividing the rows of the data table containing the metric m into groups according to the respective dimensions in the respective set S of data fields, and calculating a single aggregate value for each group.
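The roll-up just described is a group-then-aggregate step. A minimal in-memory sketch, assuming rows are dictionaries and using a hypothetical `roll_up` helper with made-up data:

```python
from collections import defaultdict


def roll_up(rows, dimensions, metric, aggregate=sum):
    """Divide rows into groups keyed by their dimension values, then
    compute a single aggregate value of the metric for each group."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        groups[key].append(row[metric])
    return {key: aggregate(values) for key, values in groups.items()}


rows = [
    {"Region": "East", "Category": "Toys", "Sales": 10},
    {"Region": "East", "Category": "Books", "Sales": 20},
    {"Region": "West", "Category": "Toys", "Sales": 5},
]

by_region = roll_up(rows, ["Region"], "Sales")
by_region_and_category = roll_up(rows, ["Region", "Category"], "Sales", aggregate=max)
```

Changing the list of dimensions changes the level of detail: fewer dimensions give coarser groups and fewer aggregate values.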

In some implementations, the single aggregate value is calculated using one of the following aggregation functions: SUM, COUNT, COUNTD (count of distinct elements), MIN, MAX, AVG (mean average), MEDIAN, STDEV (standard deviation), VAR (variance), PERCENTILE (e.g., quartile), ATTR, STDEVP, and VARP. In some implementations, the ATTR() aggregation operator returns the value of the expression when it has a single value for all rows, and an asterisk otherwise. In some implementations, the STDEVP and VARP aggregation operators return values based on a biased population or the entire population. Some implementations include more or different aggregation operators than those listed here. Some implementations use alternate names for the aggregation operators.
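Several of these operators can be approximated with the Python standard library. The mapping below is an assumption made for illustration: in particular, treating STDEV and VAR as sample statistics and STDEVP and VARP as population statistics is one common reading and is not stated explicitly in the source. PERCENTILE is omitted because it takes an extra parameter.

```python
import statistics


def attr(values):
    """Return the single shared value if all rows agree, otherwise '*'."""
    distinct = set(values)
    return distinct.pop() if len(distinct) == 1 else "*"


# Hypothetical operator table; names follow the list in the text.
AGGREGATES = {
    "SUM": sum,
    "COUNT": len,
    "COUNTD": lambda values: len(set(values)),
    "MIN": min,
    "MAX": max,
    "AVG": statistics.mean,
    "MEDIAN": statistics.median,
    "ATTR": attr,
    "STDEV": statistics.stdev,     # sample standard deviation (assumed)
    "STDEVP": statistics.pstdev,   # population standard deviation
    "VAR": statistics.variance,    # sample variance (assumed)
    "VARP": statistics.pvariance,  # population variance
}

values = [2, 4, 4, 4, 5, 5, 7, 9]
population_sd = AGGREGATES["STDEVP"](values)
distinct_count = AGGREGATES["COUNTD"](values)
```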

In some implementations, data fields are classified as "dimensions" or "metrics" based on how they are used. Dimensions partition the data set, while metrics aggregate the data in each partition. In SQL terms, the dimensions are the elements in the GROUP BY clause and the metrics are the elements in the SELECT clause. Typically, discrete categorical data (e.g., a field containing a state, region, or product name) is used for partitioning, while continuous numeric data (e.g., profit or sales) is used for aggregation (e.g., calculating a sum). However, any type of data field may be used as either a dimension or a metric. For example, a discrete categorical field containing product names may be used as a metric by applying the aggregation function COUNTD (count distinct). Conversely, numeric data representing a person's height may be used as a dimension, partitioning people by height or by ranges of heights. Some aggregation functions (e.g., SUM) can be applied only to numeric data. In some implementations, the application assigns a default role (dimension or metric) to each field based on the raw data type of the field, but allows the user to override the role. For example, some applications assign a default role of "dimension" to categorical (string) data fields and a default role of "metric" to numeric fields. In some implementations, date fields are used as dimensions by default, because they are typically used to partition data into date ranges.
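The default-role rule with a user override can be sketched as follows. The helper names and the sample row are hypothetical, and treating booleans as dimensions is an assumption of this sketch rather than something the source states.

```python
import datetime
from numbers import Number


def default_role(sample_value):
    """Guess a default role from the raw data type of a sample value."""
    if isinstance(sample_value, bool):
        return "dimension"  # assumption: booleans partition the data
    if isinstance(sample_value, Number):
        return "metric"
    # Strings, dates, and anything else default to dimensions.
    return "dimension"


def assign_roles(sample_row, overrides=None):
    """Assign default roles to every field, then apply user overrides."""
    roles = {name: default_role(value) for name, value in sample_row.items()}
    roles.update(overrides or {})
    return roles


sample = {
    "State": "Ohio",
    "Profit": 12.5,
    "Order Date": datetime.date(2018, 8, 1),
    "Height": 180,
}
# The user chooses to partition people by height instead of aggregating it.
roles = assign_roles(sample, overrides={"Height": "dimension"})
```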

The classification as a dimension or a metric also applies to computed expressions. For example, an expression such as YEAR([Purchase Date]) is typically used as a dimension, partitioning the underlying data into years. As another example, consider a data source that includes a product code field (as a character string). If the first three characters of the product code encode the product type, then the expression LEFT([Product Code], 3) may be used as a dimension to partition the data into product types.
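To make the two computed dimensions concrete, the sketch below mimics YEAR and LEFT with small Python helpers and counts rows per partition. The helper names and the purchase records are made up for illustration.

```python
import datetime
from collections import Counter


def year(date_value):
    """Analogue of YEAR([Purchase Date])."""
    return date_value.year


def left(text, count):
    """Analogue of LEFT([Product Code], count)."""
    return text[:count]


purchases = [
    {"purchase_date": datetime.date(2017, 3, 14), "product_code": "TOY-001"},
    {"purchase_date": datetime.date(2018, 1, 2), "product_code": "TOY-002"},
    {"purchase_date": datetime.date(2018, 8, 1), "product_code": "BKS-101"},
]

# Each computed expression acts as a dimension that partitions the rows.
rows_per_year = Counter(year(p["purchase_date"]) for p in purchases)
rows_per_type = Counter(left(p["product_code"], 3) for p in purchases)
```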

Some implementations enable a user to specify multiple levels of detail using an interactive graphical user interface. Some examples use two levels of detail, but implementations typically allow an unlimited number of levels of detail. In some instances, data computed from aggregations at one detail level is used in a second aggregation at a second detail level. In some implementations, the data visualization includes a "visualization level of detail" that is used by default to compute the aggregation. This is the level of detail that is visible in the final data visualization. Implementations also provide level of detail expressions that allow a user to specify particular levels of detail in particular contexts.

Some implementations provide shelf regions that determine the characteristics of a desired data visualization. For example, some implementations include a row shelf region and a column shelf region. The user places field names into these shelf regions (e.g., by dragging fields from a schema region), and the field names define the characteristics of the data visualization. For example, the user may select a bar chart with a bar for each distinct value of the field placed in the column shelf region. The height of each bar is defined by another field placed in the row shelf region.

According to some implementations, a method of generating and displaying a data visualization is performed at a computer. The computer has a display, one or more processors, and memory storing one or more programs configured for execution by the one or more processors. The process displays a graphical user interface on the display. The graphical user interface includes a schema information region that includes a plurality of fields from a database. The process receives user input in the graphical user interface to specify a first aggregation. The specification of the first aggregation groups the data by a first set of one or more of the plurality of fields and identifies a first aggregated output field that is created by the first aggregation. The process also receives user input in the graphical user interface to specify a second aggregation. In some instances, the specification of the second aggregation references the first aggregation. The second aggregation groups the data by a second set of one or more fields. The second set of fields is selected from the plurality of fields and the first aggregated output field. The second set of fields is different from the first set of fields. The process builds a visual specification based on the specifications of the first and second aggregations.

In some implementations, the process includes retrieving tuples of data from the database using one or more database queries generated from the visual specification. In some implementations, the tuples include data calculated based on the second aggregation. In some implementations, the process includes displaying a data visualization corresponding to the visual specification, where the data visualization includes the data calculated based on the second aggregation. In some implementations, the displayed data visualization includes a plurality of visual marks, each mark corresponding to a respective tuple retrieved from the database. In some implementations, the graphical user interface includes a data visualization region, and the process displays the data visualization in the data visualization region.

In some implementations, the graphical user interface includes a columns shelf and a rows shelf. In some implementations, the process detects user actions that associate one or more first fields of the plurality of fields with the columns shelf and one or more second fields of the plurality of fields with the rows shelf. The process then generates a visual table in the data visualization region according to the user actions. The visual table includes one or more panes, where each pane has an x-axis defined based on data for the one or more first fields associated with the columns shelf, and each pane has a y-axis defined based on data for the one or more second fields associated with the rows shelf. In some implementations, the process receives user input to associate the second aggregation with the columns shelf or the rows shelf.

In some implementations, the process retrieves tuples from the database according to the fields associated with the rows shelf and the columns shelf, and displays the retrieved tuples as visual marks in the visual table. In some implementations, the operator of each of the first aggregation and the second aggregation is one of: SUM, COUNT, COUNTD, MIN, MAX, AVG, MEDIAN, ATTR, PERCENTILE, STDEV, STDEVP, VAR, or VARP.

In some examples, the first aggregate output field is used as a dimension and is included in the second set.
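As an illustration of a first aggregated output field serving as a dimension, the sketch below counts orders per customer and then groups customers by that count. The scenario and data are hypothetical.

```python
from collections import Counter

# One entry per order, identified by the customer who placed it.
order_customers = ["A", "A", "B", "C", "C", "C", "D"]

# First aggregation: COUNT of orders, grouped by customer.
orders_per_customer = Counter(order_customers)

# Second aggregation: COUNT of customers, grouped by the output of the
# first aggregation (the order count now acts as a dimension).
customers_per_order_count = Counter(orders_per_customer.values())
```

The result answers a question such as "how many customers placed exactly two orders", which a single level of detail cannot express.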

In some implementations, the first aggregated output field is used as a metric and the second aggregation applies one of the aggregation operators to the first aggregated output field. For example, in some examples, the second aggregation calculates an average of the values of the first aggregated output field.
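The averaging case can be sketched as two chained aggregations at different levels of detail: sales are first summed per customer, and those sums are then averaged per region. The data and variable names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

orders = [
    {"region": "East", "customer": "A", "sales": 10},
    {"region": "East", "customer": "A", "sales": 20},
    {"region": "East", "customer": "B", "sales": 10},
    {"region": "West", "customer": "C", "sales": 5},
]

# First aggregation: SUM(sales), grouped by (region, customer).
sales_per_customer = defaultdict(int)
for order in orders:
    sales_per_customer[(order["region"], order["customer"])] += order["sales"]

# Second aggregation: AVG of the first aggregated output, grouped by region.
totals_by_region = defaultdict(list)
for (region, _customer), total in sales_per_customer.items():
    totals_by_region[region].append(total)

avg_customer_sales = {
    region: mean(totals) for region, totals in totals_by_region.items()
}
```

Note that this differs from averaging the raw order rows: East has three orders averaging about 13.3, but two customers whose totals average 20.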

In some implementations, the process displays a graphical user interface on a computer display. The graphical user interface includes a schema information region and a data visualization region. The schema information region includes a plurality of field names, where each field name is associated with a data field from a specified database. The data visualization region includes a plurality of shelf regions that determine the characteristics of the data visualization. Each shelf region is configured to receive user placement of one or more field names from the schema information region. The process builds the visual specification according to user selection of one or more field names and user placement of each user-selected field name in a respective shelf region in the data visualization region.

In some implementations, the data visualization is a dashboard that includes a plurality of distinct component data visualizations. The visual specification includes a plurality of component visual specifications, and each component data visualization is based on a respective one of the component visual specifications.

In some implementations, the data visualization characteristics defined by the visual specification include a mark type and zero or more encodings of the marks. In some implementations, the mark type is one of: a bar chart, a line chart, a scatter plot, a text table, or a map. In some implementations, the encodings are selected from the group consisting of mark size, mark color, and mark label.

According to some implementations, a system for generating a data visualization includes one or more processors, memory, and one or more programs stored in the memory. The programs are configured to be executed by one or more processors. The program includes instructions for performing any of the methods described herein.

According to some implementations, a non-transitory computer-readable storage medium stores one or more programs configured for execution by a computer system having one or more processors and memory. The one or more programs include instructions for performing any of the methods described herein.

Accordingly, methods, systems, and graphical user interfaces for interactive visual analysis of data sets are provided.
