System and method for organizing and locating data

Document No.: 53616. Publication date: 2021-09-28.

Reading note: The present technology, "System and method for organizing and locating data," was designed by A·布莱 on 2020-01-30. Its main content: A system and associated method for organizing, representing, finding, discovering, and accessing data. Embodiments represent information and data in the form of a data structure referred to as a "feature graph." A feature graph includes nodes and edges, where an edge is used to "connect" a node to one or more other nodes. Nodes in the feature graph may represent variables, i.e., measurable objects, features, or factors. Edges in the feature graph may represent a measure of statistical association, retrieved from one or more sources, between a node and one or more other nodes. Data sets that demonstrate or support a statistical association, or that measure the associated variables, are "linked to" from the feature graph.

1. A computer-implemented method for identifying relevant datasets for training models related to a topic of interest, comprising:

accessing one or more sources, each source including information regarding statistical associations between study topics described in the source and one or more variables considered in the study;

processing the information obtained from each source to identify one or more variables considered in the study described in the source and, for each variable, to identify information about statistical associations between the variable and the study topic;

for at least one source, associating a data set with at least one of the one or more variables or with a study topic described in the source, the data set including one or more of data used by the study to demonstrate a statistical association or data representing a measure of the one or more variables with which the data set is associated;

storing the accessed processing results of the one or more sources in a database, the stored results including, for each source, a reference to each of one or more variables, a reference to a topic of study described in the source, information about statistical associations, and, if applicable, links or other elements that enable access to an associated data set;

constructing a feature graph based on processing the stored results of the accessed one or more sources, the feature graph comprising a set of nodes and a set of edges, wherein each edge in the set of edges connects one node in the set of nodes to one or more other nodes, and further wherein each node represents a variable found to be statistically associated with a topic described in a source, each edge represents a statistical association between a node and a topic described in a source or between a first node and a second node;

receiving a search request from a user, the search request specifying a topic of interest;

traversing a feature graph to identify one or more datasets associated with one or more variables, the one or more variables being statistically associated with the topic of interest or semantically associated with one or more variables, the one or more variables being statistically associated with the topic of interest;

filtering and sorting the identified one or more data sets; and

presenting to the user the results of the filtering and sorting of the identified one or more data sets.

2. The method of claim 1, wherein the source comprises one or more descriptions of an experiment, a study, a machine learning model, or an anecdotal observation.

3. The method of claim 2, wherein processing the one or more sources further comprises applying one or more of optical character recognition, image processing, natural language processing, or natural language understanding techniques to one or more accessed sources.

4. The method of claim 1, wherein storing results of processing the accessed sources in a database further comprises storing results in a representation of a graph, the graph comprising a plurality of nodes and a plurality of edges, each edge connecting one node to another node.

5. The method of claim 4, wherein one or more of the plurality of edges is associated with a statistically relevant metric.

6. The method of claim 1, wherein filtering and sorting the identified one or more data sets further comprises filtering or sorting based on one or more of (a) population and key, (b) compliance, (c) interpretability, or (d) operability.

7. The method of claim 1, further comprising training a model using the one or more presented data sets, wherein the model implements a machine learning technique.

8. The method of claim 7, further comprising using the trained model to make decisions or classifications regarding inputs to the model.

9. The method of claim 1, wherein processing the accessed one or more sources further comprises accessing an ontology or a reference to obtain concept tags for one or more potential topics or one or more variables.

10. The method of claim 1, wherein the information about statistical associations is one of observed associations, relationships of metrics, or causal relationships.

11. The method of claim 1, wherein receiving a search request from a user further comprises receiving one or more control parameters for searching from the user, wherein the control parameters comprise one or more of data, population, quality, method, or author.

12. The method of claim 2, wherein accessing one or more sources further comprises accessing information in which the study topic is the topic of interest.

13. The method of claim 1, further comprising providing a user with a subset of the stored results of processing the accessed one or more sources, and constructing a feature graph for the user based on the subset of stored results.

14. The method of claim 13, wherein the subset of stored results is determined by one or more parameters provided by a user.

15. An electronic form representing information, comprising:

a data structure representing a graph, the graph including a plurality of nodes and a plurality of edges, each edge connecting a first node to a second node;

a set of values associated with one or more edges; and

at least one link or other element to allow access to a data set, the link or other element being associated with the first node or the second node;

wherein each node represents a variable found to have a statistical association with the topic of interest, and each value associated with an edge represents a measure of the statistical association between the node and the topic of interest, a measure of the statistical association between the first node and the second node, or a measure of confidence in the statistical association.

16. The electronic form representing information as claimed in claim 15, wherein the data set includes one or more of data used to establish a statistical association between a first variable represented by the first node and a second variable represented by the second node, data representing a measure of the first variable, or data representing a measure of the second variable.

17. The electronic form representing information as claimed in claim 15, wherein the statistical association is one of an observation association, a metric association, or a causal relationship.

18. The electronic form representing information as claimed in claim 15, wherein the data set is associated with one of a description of an experiment, a study, a machine learning model, or an anecdotal observation.

19. A data processing system comprising:

an electronic processor programmed with a set of computer-executable instructions;

a non-transitory electronic storage element storing the set of computer-executable instructions, wherein the set of computer-executable instructions further comprises:

computer-executable instructions that, when executed, cause the system to access one or more sources, wherein each source comprises information about statistical associations between study topics described in the source and one or more variables considered in the study;

computer-executable instructions that, when executed, cause the system to process the accessed one or more sources and identify, for each source, one or more variables in the study that are considered in the source and information for each variable identifying statistical associations between the variable and study topics;

computer-executable instructions that, when executed, cause the system to associate, for at least one source, a data set with at least one of the one or more variables or with a study topic described in the source, the data set including one or more of data used by the study to demonstrate a statistical association or data representing a measure of the one or more variables with which the data set is associated; and

computer-executable instructions that, when executed, cause the system to store results of processing the accessed one or more sources in a database, the stored results including, for each source, a reference to one or more variables, a reference to a subject matter described in the study, information about statistical associations, and, if applicable, links or other elements that can be used to access the data set.

20. The data processing system of claim 19, further comprising:

computer-executable instructions that, when executed, cause the system to construct a feature graph based on processing the stored results of the accessed one or more sources, the feature graph comprising a set of nodes and a set of edges, wherein each edge in the set of edges connects one node in the set of nodes to one or more other nodes, and further wherein each node represents a variable found to be statistically associated with a study topic described in a source, and each edge represents a statistical association between a node and a study topic described in a source or between a first node and a second node;

computer-executable instructions that, when executed, cause a system to receive a search request from a user, the search request specifying a topic of interest;

computer-executable instructions that, when executed, cause a system to traverse a feature graph to identify one or more datasets associated with one or more variables, the one or more variables being statistically associated with a topic of interest or semantically associated with one or more variables, the one or more variables being statistically associated with a topic of interest;

computer-executable instructions that, when executed, cause a system to filter and sort the identified one or more data sets; and

computer-executable instructions that, when executed, cause the system to display to a user results of filtering and sorting the identified one or more data sets.

21. The data processing system of claim 19, wherein the one or more sources include one or more descriptions of an experiment, a study, a machine learning model, or anecdotal observations.

22. The data processing system of claim 19, wherein processing the accessed one or more sources further comprises applying one or more of optical character recognition, image processing, natural language processing, or natural language understanding techniques to one or more of the accessed sources.

23. The data processing system of claim 19, wherein storing results of processing the accessed one or more sources in a database further comprises storing results in a representation of a graph, the graph including a plurality of nodes and a plurality of edges, each edge connecting one node to another node.

24. The data processing system of claim 20, further comprising computer-executable instructions that, when executed, cause the system to train a model using one or more of the identified data sets.

25. The data processing system of claim 24, further comprising computer-executable instructions that, when executed, cause the system to receive an input data set for a model and, in response, generate an output from the model.

26. The data processing system of claim 25, wherein the output is one or more of a classification or a decision.

27. The data processing system of claim 20, further comprising computer-executable instructions that, when executed, cause the system to provide a user with a subset of the stored results of processing the accessed one or more sources, and to construct a feature graph for the user based on the subset of stored results.

28. The data processing system of claim 27, wherein the subset of stored results is determined by one or more parameters provided by a user.

Background

Data is used as part of many learning and decision-making processes. Such data may relate to topics, entities, concepts, etc. However, to be useful, such data must be efficiently discoverable, accessible, and processable, or otherwise usable. Further, it is desirable that the data be relevant (or, in some cases, sufficiently relevant) to the task being performed or the decision being made. Making a reliable data-driven decision or prediction requires data not only about the expected outcome of the decision or the objective of the prediction, but also about the variables that are statistically associated with that outcome or objective (ideally all such variables, but at least the most significant ones). Unfortunately, with conventional methods it is difficult today to find out which variables have been shown to be statistically associated with an outcome or objective, and to obtain data about those variables.

This problem also exists in machine learning, where it is important to identify and construct a suitable training set for the learning process. However, as recognized by the inventors, obtaining reliable training data is today very difficult due in large part to the traditional way of organizing information and data.

In many cases, data may be more efficiently discovered and accessed by representing the data in a particular format or structure. The format or structure may include a tag for one or more columns, rows, or fields in the data record. Conventional methods of identifying and discovering data of interest are typically based on semantically matching words with tags in (or referring to or about) a dataset. While this approach is useful for discovering and accessing data about topics (e.g., goals or outcomes) that may be relevant, it does not address the problem of discovering and accessing data about topics (variables) that cause, affect, predict, or otherwise statistically correlate to the topic of interest.

Embodiments of the system, apparatus and method of the present invention aim to address and solve these and other problems or disadvantages of conventional solutions for organizing, representing, looking up, discovering and accessing data individually and collectively.

Disclosure of Invention

The terms "invention," "the invention," "this invention," and "the present invention" as used herein are intended to refer broadly to all subject matter described in this document and to the claims. Statements containing these terms should be understood not to limit the subject matter described herein or to limit the meaning or scope of the claims. Embodiments of the invention covered by this patent are defined by the claims, rather than by this summary. This summary is a high-level overview of various aspects of the invention and is intended to introduce a selection of concepts that are further described below in the Detailed Description section. This summary is not intended to identify key, required, or essential features of the claimed subject matter, nor is it intended to be used in isolation to determine the scope of the claimed subject matter. The subject matter should be understood by reference to the appropriate portions of the entire specification of this patent, any or all of the drawings, and each claim.

Embodiments of systems and related methods for organizing, representing, finding, discovering, and accessing data are described herein. In some embodiments, the information and data are represented in the form of a novel data structure referred to herein as a "feature graph" (the subject matter of a pending trademark application; note that "system" is also the subject matter of a pending trademark application). A feature graph is a graph that contains nodes and edges, where an edge is used to "connect" a node to one or more other nodes. The nodes in the feature graph may represent variables, i.e., measurable quantities, objects, characteristics, features, or factors. Edges in the feature graph may represent a measure of statistical association between a node and one or more other nodes.
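By way of illustration only, the node-and-edge structure described above may be sketched as follows. This is a minimal sketch and not the implementation of any embodiment; the class names, the field names, the example variables, the example values, and the example.org link are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A variable: a measurable quantity, object, characteristic, feature, or factor."""
    name: str
    dataset_links: list = field(default_factory=list)  # hypothetical access links


@dataclass
class Edge:
    """A measure of statistical association between two nodes."""
    source: str
    target: str
    measure: str  # e.g. "correlation"; this label vocabulary is an assumption
    value: float


@dataclass
class FeatureGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, name, dataset_links=()):
        # Create the node on first use, then attach any data set links.
        node = self.nodes.setdefault(name, Node(name))
        node.dataset_links.extend(dataset_links)
        return node

    def add_edge(self, source, target, measure, value):
        # Make sure both endpoints exist before connecting them.
        self.add_node(source)
        self.add_node(target)
        self.edges.append(Edge(source, target, measure, value))

    def neighbors(self, name):
        """Return (other node name, edge) pairs for every edge touching `name`."""
        result = []
        for edge in self.edges:
            if edge.source == name:
                result.append((edge.target, edge))
            elif edge.target == name:
                result.append((edge.source, edge))
        return result


# Illustrative values only; they are not taken from any study.
graph = FeatureGraph()
graph.add_node("rainfall", ["https://example.org/datasets/rainfall"])
graph.add_edge("rainfall", "crop_yield", "correlation", 0.62)
graph.add_edge("fertilizer_use", "crop_yield", "correlation", 0.48)
```

In this sketch, calling graph.neighbors("crop_yield") returns the two variables connected to that node together with the edges that carry the association values.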

Statistical associations typically (although in some embodiments, not exclusively) result from performing one or more steps of the scientific method of inquiry, which is typically described as including steps or stages such as (1) making observations, (2) forming hypotheses, (3) deriving predictions as logical consequences of those hypotheses, and then (4) conducting experiments based on those predictions to determine whether the initial hypotheses are correct.

Because of the wide range of types of statistical associations represented in feature graphs, and the wide range of sources of information and/or data used to construct feature graphs, embodiments of the systems and methods described herein employ mathematical, language-based, and visual methods to represent the quality, rigor, trustworthiness, reproducibility, and/or completeness of the information and/or data that supports a given statistical association.

In one embodiment, the invention relates to a computer-implemented method for identifying relevant datasets for training a model related to a topic of interest. The embodiment includes a set of instructions (e.g., software modules or routines) to be executed by a programmed processing element. The method includes accessing a set of sources that include information regarding statistical associations between a study topic and one or more variables considered in the study. The information contained in the source is used to construct a data structure or representation that contains nodes and edges connecting the nodes. An edge may be associated with information about a statistical association between two nodes. One or more nodes may have a data set associated with them that can be accessed using a link or other form of address or access element. Embodiments may include functionality that allows a user to describe and perform searches of data structures to identify data sets that may be relevant to training a machine learning model for making a particular decision or classification.
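The search described above (identifying variables associated with a topic of interest, then collecting, filtering, and ranking the data sets linked to those variables) may be sketched as follows. This is a simplified illustration under stated assumptions: the dictionaries, the function name find_datasets, the min_strength threshold, the association values, and the example.org links are hypothetical placeholders, and only a one-hop traversal is shown.

```python
# Hypothetical feature graph: topic -> list of (associated variable, strength).
ASSOCIATIONS = {
    "heart_disease": [("smoking", 0.45), ("exercise", -0.30), ("shoe_size", 0.02)],
    "smoking": [("income", -0.20)],
}

# Hypothetical links from variables to data sets that measure or support them.
DATASETS = {
    "smoking": ["https://example.org/data/smoking-survey"],
    "exercise": ["https://example.org/data/activity-log"],
    "shoe_size": ["https://example.org/data/footwear"],
}


def find_datasets(topic, min_strength=0.1):
    """Return (variable, link) pairs for variables associated with `topic`,
    strongest association first."""
    hits = []
    for variable, strength in ASSOCIATIONS.get(topic, []):
        if abs(strength) < min_strength:
            continue  # filter out weak associations
        for link in DATASETS.get(variable, []):
            hits.append((abs(strength), variable, link))
    hits.sort(key=lambda hit: hit[0], reverse=True)  # rank by association strength
    return [(variable, link) for _, variable, link in hits]


results = find_datasets("heart_disease")
```

Note that the search term names the outcome ("heart_disease"), while the returned data sets concern the variables associated with it; the weakly associated "shoe_size" data set is filtered out.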

Other embodiments may be represented by a data structure that includes nodes, edges, and links to data sets. Nodes represent concepts, topics of interest, or topics previously studied. Edges represent information about statistical associations between nodes. Links (or other forms of addresses or access elements) provide access to data sets that establish (or support, prove, etc.) statistical associations between one or more variables that are part of a study, or between variables and concepts or topics.

Other embodiments may include training a particular machine learning model using one or more datasets identified using the methods and data structures described herein. The trained model may then be used to make a decision or "prediction," or to classify a set of input data. The trained models can be used for signal or image processing, adaptive control systems, sensor systems, and the like.

Other objects and advantages of the present invention will become apparent to those skilled in the art upon a reading of the detailed description of the invention and the included drawings.

Drawings

Embodiments according to the invention will be described with reference to the accompanying drawings, in which:

FIG. 1(a) is a block diagram illustrating an architecture that may be used to implement embodiments of the systems and methods described herein;

FIG. 1(b) is a screen shot illustrating user interface icons that may be used in implementations of embodiments of the systems and methods described herein to more easily enable a user to control a search and identify a location to insert a search query;

FIG. 1(c) is a diagram showing user interface icons that may be used for standard or conventional semantic searches;

FIG. 1(d) is a diagram illustrating user interface icons that may be used to perform a statistical search for the same search inputs as shown in FIG. 1(c);

FIG. 2(a) is a flow chart or flow diagram illustrating a process, method, function, or operation for building a feature graph using embodiments of the systems and methods described herein (from data contained in a central database or "system database," which may provide data for use in multiple feature graphs and is a central instance of a feature graph);

FIG. 2(b) is a flow chart or flow diagram illustrating a process, method, function, or operation of an example use case in which a feature graph is traversed to identify potentially relevant data sets, and which may be implemented in embodiments of the systems and methods described herein;

FIG. 3 is a diagram illustrating an example of a portion of a feature graph data structure that may be used to organize and access data and information and that may be created using an implementation of an embodiment of the systems and methods described herein;

FIG. 4 is a diagram illustrating elements or components that may be present in a computer device or system configured to implement methods, processes, functions or operations according to embodiments of the invention; and

FIG. 5 is a diagram illustrating an example system architecture of a service platform that may be used to implement embodiments of the systems and methods described herein.

Note that the same numbers are used throughout the disclosure and figures to reference like components and features.

Detailed Description

The subject matter of embodiments of the present invention is described with specificity herein to meet statutory requirements, but such description is not necessarily intended to limit the scope of the claims. The claimed subject matter may be embodied in other ways, may include different elements or steps, and may be used in conjunction with other present or future technologies. This description should not be construed as implying any particular order or arrangement among or between various steps or elements, except when the order of individual steps or the arrangement of elements is explicitly described.

Embodiments of the present invention now will be described more fully hereinafter with reference to the accompanying drawings, which form a part hereof, and which show, by way of illustration, exemplary embodiments by which the invention may be practiced. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements and will convey the scope of the invention to those skilled in the art.

The invention may be embodied in whole or in part as a system, one or more methods, or one or more apparatuses, among others. Embodiments of the invention may take the form of a hardware-implemented embodiment, a software-implemented embodiment or an embodiment combining software and hardware aspects. For example, in some embodiments, one or more of the operations, functions, processes, or methods described herein may be implemented by one or more suitable processing elements (e.g., processors, microprocessors, CPUs, GPUs, controllers, etc.) as part of a client device, server, network element, or other form of computing or data processing device/platform. The processing elements are programmed with a set of executable instructions (e.g., software instructions), which may be stored in suitable data storage elements.

In some embodiments, one or more of the operations, functions, processes, or methods described herein may be implemented by hardware in a dedicated form, such as a programmable gate array (PGA or FPGA), an Application Specific Integrated Circuit (ASIC), or the like. Note that embodiments of the inventive method may be implemented in an application, a subroutine that is part of a larger application, a "plug-in," an extension of the functionality of a data processing system or platform, or other suitable form. The following detailed description is, therefore, not to be taken in a limiting sense.

As noted above, machine learning represents a general case that benefits from using embodiments of the described systems and methods. A useful machine learning model generates output on which a user can base a decision with sufficient confidence. In order to build a successful model, an appropriate data set needs to be identified and constructed to train the learning process represented by the model. However, as recognized by the inventors, identifying and accessing training data (sometimes referred to as "source signatures") is difficult, in large part because of the conventional manner in which information and data are organized.

Furthermore, the inventors have also recognized that the most relevant, accurate, and effective training data will be data that empirical (or other reliable) studies indicate is relevant to the decisions made using the model. For example, if a data set shows a demonstrable statistical association between one or more variables and a result, a model for determining whether the result will occur may be properly trained from that data set. Conversely, if the data set used in a study of a topic does not support a sufficient statistical association, or does not include or take certain variables into account, it may not be useful for training the model.

Embodiments of the systems and methods described herein may include the construction or creation of a graph database. In the context of this description, a graph is a set of objects that are paired with one another if they have some kind of close or related relationship. An example is two pieces of data, each represented by a node, that are connected by a path. One node may be connected to many nodes, and many nodes may be connected to a particular node. A path or line connecting a first node to a second node (or to several nodes) is called an "edge." An edge may be associated with one or more values; these values may represent characteristics of the connected nodes, metrics or measures (e.g., statistical parameters) of the relationship between the nodes, and the like. The graph format may make it easier to identify certain types of relationships, such as those that are more central to a set of variables or relationships, those that are less important, and so forth. Graphs are generally of two main types: "undirected," which means that the relationships represented by the graph are symmetric, and "directed," which means that the relationships are not symmetric (in a directed graph, the direction of a relationship between nodes may be represented by an arrow rather than a line).
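The difference between the two graph types can be illustrated with a short sketch that builds an adjacency mapping from a list of edges. The function name build_adjacency, the node labels, and the edge values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_adjacency(edges, directed=False):
    """Map each node to {neighbor: edge value}.

    In an undirected graph every edge is stored in both directions, because
    the relationship is symmetric. In a directed graph the edge is stored
    only from the first node to the second.
    """
    adjacency = defaultdict(dict)
    for first, second, value in edges:
        adjacency[first][second] = value
        if not directed:
            adjacency[second][first] = value
    return {node: dict(neighbors) for node, neighbors in adjacency.items()}


# Illustrative edges: (first node, second node, value associated with the edge).
edges = [("A", "B", 0.7), ("B", "C", 0.4)]
undirected = build_adjacency(edges)
directed = build_adjacency(edges, directed=True)
```

In the undirected result, node "B" lists both "A" and "C" as neighbors. In the directed result, "B" lists only "C", and "C" has no outgoing edges at all.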

In some embodiments, one or more of the operations, functions, processes, or methods described herein may be implemented in whole or in part by a system that retrieves information about varying degrees of statistical association between variables from structured and unstructured sources (along with the data or data sets that confirm or support those associations), and that structures and stores the retrieved information in a data structure that may be used to generate what is referred to herein as a "feature graph." The feature graph represents the topic of a study, the variables examined in the study, and the statistical associations between a variable and one or more other variables and/or between a variable and the topic, and includes links or other forms of access to a set of data (referred to as a data set) or measurable quantities that provide support for the statistical associations. Such a link may also or instead lead to a data set that measures the variables in different populations (e.g., women aged 18 and older; Japan).
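As a rough sketch of the step from stored information to a feature graph, the following example turns a list of stored records into nodes and edges. The record layout, the key names, the function name build_feature_graph, the example values, and the example.org link are assumptions made for illustration; the description above does not prescribe a storage schema.

```python
# Hypothetical stored results, one record per (source, variable) pair.
stored_results = [
    {
        "source": "study-1",
        "topic": "sleep_quality",
        "variable": "caffeine_intake",
        "association": {"type": "correlation", "value": -0.35},
        "dataset_link": "https://example.org/data/study-1",
    },
    {
        "source": "study-2",
        "topic": "sleep_quality",
        "variable": "screen_time",
        "association": {"type": "correlation", "value": -0.22},
        "dataset_link": None,  # no supporting data set is available for this record
    },
]


def build_feature_graph(records):
    """Build nodes (with data set links) and edges (with association info)."""
    nodes = {}
    edges = []
    for record in records:
        # Both the study topic and the variable become nodes.
        for name in (record["topic"], record["variable"]):
            nodes.setdefault(name, {"datasets": []})
        # Attach the data set link, if any, to the variable it supports.
        if record["dataset_link"]:
            nodes[record["variable"]]["datasets"].append(record["dataset_link"])
        # The edge carries the statistical association and its provenance.
        edges.append({
            "from": record["variable"],
            "to": record["topic"],
            "source": record["source"],
            **record["association"],
        })
    return {"nodes": nodes, "edges": edges}


feature_graph = build_feature_graph(stored_results)
```

Keeping the source identifier on each edge is a design choice of this sketch; it lets a later step trace an association back to the study that reported it.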

In some embodiments, statistical associations are expressed in numerical and/or statistical terms, and may vary in significance from observed associations to metric relationships to causal relationships. Some embodiments of the system employ mathematical, language-based, and visual methods to represent the quality, rigor, trustworthiness, reproducibility, and/or completeness of the information and/or data that supports a given statistical or observed association.

For example, a given statistical association may be associated with a particular score, label, and/or icon in a user interface based on its scientific "quality" or reliability (both in general and with respect to a particular parameter, such as "has been peer reviewed") to indicate to the user whether the association merits further investigation. In other embodiments, the statistical associations retrieved by searching the feature graph may be filtered based on their scientific quality scores. In some embodiments, the calculation of a quality score may combine data stored within the feature graph (e.g., the statistical significance of a given association or the degree to which the association is documented) with data stored outside the feature graph (e.g., the number of citations received by the journal article from which the association was retrieved, or the h-index of the article's author). Note that the feature graph is used to represent and access statistically relevant data or information, and thus such quality metrics are more relevant to the use cases described herein than the metrics used in conventional knowledge graphs or semantic search results.
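One possible way to combine such signals into a single score, and to filter associations by that score, is sketched below. The description does not specify a formula, so everything here is an assumption: the weighted sum, the weights, the use of a p-value as the significance signal, the citation scale of 100, and the record keys are hypothetical choices made for illustration.

```python
import math


def quality_score(p_value, peer_reviewed, citations, weights=(0.5, 0.3, 0.2)):
    """Combine signals into a score in [0, 1]; higher means more reliable."""
    # Data that might be stored in the feature graph: statistical significance.
    significance = 1.0 - min(max(p_value, 0.0), 1.0)
    # A specific parameter such as "has been peer reviewed".
    review = 1.0 if peer_reviewed else 0.0
    # Data stored outside the graph: citation count, squashed into [0, 1).
    # The scale of 100 citations is an arbitrary choice.
    citation_signal = 1.0 - math.exp(-citations / 100.0)
    w_sig, w_rev, w_cit = weights
    return w_sig * significance + w_rev * review + w_cit * citation_signal


def filter_by_quality(associations, threshold):
    """Keep only the associations whose quality score meets the threshold."""
    return [
        a for a in associations
        if quality_score(a["p_value"], a["peer_reviewed"], a["citations"]) >= threshold
    ]


# Illustrative records only; the values are not taken from any study.
associations = [
    {"name": "strong", "p_value": 0.01, "peer_reviewed": True, "citations": 250},
    {"name": "weak", "p_value": 0.20, "peer_reviewed": False, "citations": 3},
]
kept = filter_by_quality(associations, threshold=0.6)
```

With these placeholder inputs, only the first record passes the threshold. A user interface could map the same score to a label or icon rather than using it as a filter.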

As previously mentioned, conventional methods organize data primarily based on language. For example, such an organizational form may be based on metadata about the data set (e.g., author name), tags of columns, rows, or fields in the data set, or semantic relationships between the user's search input and those data tags (e.g., equivalence, sufficient similarity, being a common synonym, etc.). The latter approach is the core premise of a "knowledge graph," which represents facts related to a topic and the semantic relationships between them. For example, an apple "is a" fruit "produced in" New York. Using a knowledge graph, a search for data sets on "apple" could, in theory, retrieve data sets for other fruits (e.g., oranges) or for other fruits produced in New York (e.g., pumpkins). Data in the public domain and within companies is organized primarily based on linguistic and semantic relationships between tags or terms.

As an example of a knowledge graph-based search, assume that two data sets generated by the California state judiciary contain data about crime in the state of California in 2017, where one data set contains data about intentional corruption and the other contains data about theft. A conventional data (or "feature," in machine learning terms) search or management platform based on a knowledge graph will retrieve both data sets in response to one or more of the search terms "California," "California state judiciary," and/or "2017." Furthermore, a data/feature search or management platform employing a knowledge graph may retrieve both data sets in response to a search for "corruption" or "theft," because both terms are expected to be semantically related to the common category or label "crime" in the knowledge graph.

Thus, using conventional methods, a dataset may be looked up based on the language in or about the dataset (i.e., the search terms that "match" the tags or metadata), as well as based on semantic relationships between words in the dataset and the search terms (e.g., by referencing general categories or tags to which other words are semantically associated or linked). Thus, if the data scientist knows the topic (or variable) to search, she can, at least in theory, find potentially relevant data (although this depends on the assumed completeness of semantic association in the knowledge graph).
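The conventional, language-based lookup described in the preceding example can be sketched as follows. The tag sets, the category mapping, and the function name semantic_search are hypothetical, and the sketch is deliberately minimal: it matches a search term either directly against a data set's tags or indirectly through a shared semantic category.

```python
# Hypothetical semantic links of the kind a knowledge graph would hold.
CATEGORY_OF = {"corruption": "crime", "theft": "crime"}

# Hypothetical tags describing two data sets.
DATASET_TAGS = {
    "dataset_a": {"california", "2017", "corruption"},
    "dataset_b": {"california", "2017", "theft"},
}


def semantic_search(term):
    """Return the data sets whose tags match `term` directly or share its category."""
    term = term.lower()
    category = CATEGORY_OF.get(term)
    matches = []
    for name, tags in DATASET_TAGS.items():
        direct = term in tags
        related = category is not None and any(
            CATEGORY_OF.get(tag) == category for tag in tags
        )
        if direct or related:
            matches.append(name)
    return sorted(matches)
```

Here a search for "theft" retrieves both data sets through the shared category "crime". A search for a variable that merely predicts theft would return nothing unless one of the tags happened to mention it, which is the limitation discussed next.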

However, knowledge graph structures or methods of organizing and finding data are not suitable for certain applications, such as predictive modeling and machine learning. This is because in a typical predictive analysis or machine learning task, a data scientist or researcher knows his or her own topic or goal (i.e., the ultimate goal, result, or object of the study), but does not know which data (such as factors, variables, or characteristics) is most helpful in predicting it or its value (e.g., the presence or absence of a certain condition). Thus, the data scientist does not know what topics or influencing factors (i.e., those topics or factors that may be relevant to or most likely to predict the subject of the study) to search for. This situation makes identifying and accessing relevant data using traditional data management platforms or knowledge graph approaches inefficient and potentially unreliable. Indeed, it is widely recognized that one of the most challenging parts of machine learning today is to find suitable training data for machine learning models.

The traditional method of organizing data and some of its disadvantages are shown in the following table:

FIG. 1 is a block diagram illustrating an architecture 100 that may be used to implement embodiments of the systems and methods described herein. A brief description of an example architecture follows:

Architecture

● in some embodiments, the architectural elements or components shown in FIG. 1 may be differentiated based on their functionality and/or based on how access to the elements or components is provided. Functionally, the architecture 100 of the system distinguishes between:

○ information/data access and retrieval (shown as application 112, add/edit function 118, and open science 103) - these are sources of information and descriptions that provide the data, variables, topics, concepts, and statistics underlying the generation of feature maps or similar data structures for experiments, research, machine learning models, etc.;

○ databases (shown as system database (SystemDB) 108) - electronic data storage media or components that use appropriate data structures or schemas and data retrieval protocols/methods; and

○ applications (shown as application 112 and website 116) - these applications are executed in response to instructions or commands received from public users (the public 102), clients 104, and/or administrators 106. An application may perform one or more useful operations or functions, such as:

■ searching the system database 108 or the feature map 110 and retrieving variables, datasets, and other information relevant to the user query;

■ identifying a particular node or relationship of the feature map;

■ to make data accessible to the public 102 or to customers owning or controlling access to the data or to others outside of the enterprise 104 (note that in this sense, the customers 104 act as elements of an information/data retrieval architecture/source);

■ generating a feature map from the specified dataset;

■ characterizing a particular feature map in terms of one or more metrics or measures of complexity, relative degree of statistical significance, etc.; and/or

■ obtaining recommendations of datasets for training a machine learning model.

● from the perspective of accessing the system and its functionality, the architecture of the system distinguishes between elements or components accessible to the public 102, defined clients, businesses, organizations or a group of businesses or organizations (e.g., industry alliances or "data collaborations" in the social sector) 104, and elements or components accessible to administrators of the system 106;

● information/data about or demonstrating statistical associations between topics, factors, or variables may be retrieved (i.e., accessed and obtained) from a number of sources. These may include, but are not limited to, journal articles, technical and scientific publications and databases, digital "notebooks" for research and data science, experimentation platforms (e.g., for A/B testing), data science and machine learning platforms, and/or a public website (element/website 116) where a user may enter observed statistical (or anecdotal) relationships between observed variables and topics, concepts, or goals;

for example, using Natural Language Processing (NLP), Natural Language Understanding (NLU), and/or computer vision for image processing (shown as input/source processing element 120), components of the information/data retrieval architecture can scan (e.g., using Optical Character Recognition (OCR)) or "read" published or otherwise accessible scientific journal articles, identify words and/or images indicating that a statistical association has been measured (e.g., by recognizing the term "add" or another related term or description), and, in response, retrieve information/data about the association and about the dataset that measures (e.g., provides support for) the association (shown as the element labeled "open science" 103 in the figure and as step or stage 202 of FIG. 2(a));

other components of the information/data retrieval architecture (not shown) may provide a way for users to enter code into their digital "notebook" (e.g., a Jupyter notebook) to retrieve the metadata output of a machine learning experiment (e.g., the "feature importance" measures of the features used in a given model) as well as information about the datasets used in the experiment;

note that in some embodiments, information/data retrieval is performed periodically or continuously, providing new information to the system for storage, structuring, and presentation to the user;

● in some embodiments, algorithm/model types (e.g., logistic regression), model parameters, numerical values (e.g., 0.725), units (e.g., log loss), statistical properties (e.g., a p-value of 0.03), feature importance, feature rankings, model performance (e.g., AUC scores), and other statistical values about associations are identified and stored as retrieved;

given that researchers and data scientists may use different words to describe the same or very similar concepts, variable names (e.g., "aerobic exercise") are stored as retrieved and can then be semantically grounded in (i.e., linked to or associated with) a public domain ontology (e.g., Wikidata) to facilitate clustering of variables (and of the related statistical associations) based on common or typical synonyms or closely related terms and concepts;

■ for example, a variable labeled "Log_House_sell_price" by a given user might be semantically associated by the system (and further confirmed by the user) with "real estate price," which is a topic in Wikidata with the unique ID Q58081362;

● As described herein, a central database ("System database" 108) stores the retrieved information/data and its associated data structures (i.e., nodes, edges, values). An instance or projection of a central database containing all or a subset of the information/data stored in the system database may be made available to defined customers, businesses, or organizations 104 (or groups thereof) for their own use (e.g., in the form of a "feature map" 110);

because access to a particular feature graph may be limited to certain individuals associated with a given enterprise or organization, the feature graph may be used to represent information/data about variables and statistical associations that may be considered private or proprietary to the given enterprise or organization 104 (e.g., employment data, financial data, product development data, research and development (R&D) data, etc.);

each client/user has its own instance of the system database in the form of a feature graph. All feature graphs read data from the system database periodically, and in most cases frequently, to ensure that their users have the up-to-date knowledge stored in the system database;

● applications 112 can be developed ("built") on top of the feature map 110; some applications may read data from it, some may write data to it, and some may do both. An example of an application is a recommender system for datasets (referred to herein as a "Data Recommender"), which is described in more detail below. Clients 104 using the feature map 110 may "write" information/data to the system database 108 using the appropriate application 112 if they wish to share that information/data with a broader group of users or with the public outside of their organization;

the application 112 may be integrated with the data platform and/or Machine Learning (ML) platform 114 of the client 104. An example of a data platform is Google Cloud Storage. The ML (or data science) platform may include software such as Jupyter notebooks;

■ for example, such a data platform integration would allow a user to access, in the customer's data store or another data store, the datasets recommended by the data recommender application. As another example, a data science/ML platform integration would allow users to query the feature map from within a notebook;

note that in addition to or instead of such integration with the customer's data platform and/or Machine Learning (ML) platform, an administrator may give the customer access to applications using a suitable service platform architecture, such as software as a service (SaaS) or a similar multi-tenant architecture. The main elements or features of such an architecture are further described herein with reference to FIG. 5.

● in some embodiments, a web-based application is accessible to the public 102. Through the website 116, a user can read from and write to the system database 108 in a manner similar to a website such as Wikipedia (as suggested by the add/edit function 118 in the figure); and

● in some embodiments, data stored in the system database 108 and made available on the website 116 may be provided to the public for free, in a manner similar to Wikipedia or similar websites.

FIG. 1(b) is a screen shot illustrating a user interface icon 150 (also shown in FIG. 1 (d)) that may be used in implementations of embodiments of the systems and methods described herein to distinguish statistical searches (names or labels given by the inventors to the types of searches described herein), more easily enable a user to trigger and control statistical searches, and identify locations (outlined query input "boxes") into which to insert statistical search queries 160.

Note that in contrast to search bar plus magnifying glass icons (e.g., which Google and other popular search engines use to visually represent the search depth they provide), as shown in fig. 1(c), embodiments may instead use a "minimap" 150 that includes two nodes and one edge connecting the nodes, signaling the user that the statistical search is being implemented in a broader sense than standard semantic searches (i.e., looking for statistical associations), and letting the user control various aspects of the search. By selecting source node 151, target node 152, or both, the user may specify their intent with respect to traversal of the feature graph. For example:

● by selecting the lower node 151 (the source node), the user can specify her interest in knowing what the search input is associated with, what it predicts, and what it causes;

● by selecting the higher node 152 (the target node), the user may specify her interest in knowing what predicts or causes the search input; or

● by selecting both nodes 151 and 152, the user may specify her interest in knowing how more than one search input is relevant.

In operation, user selection of one or both nodes in the user interface element filters statistical search results to find paths (and related variables) upstream of the search input (input is the target), downstream of the search input (input is the source), or linking both inputs.
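The direction filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiments: the edge list, the variable names, and the function names `downstream`, `upstream`, and `linking` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical directed edges (source variable -> target variable).
EDGES = [("smoking", "hypertension"), ("income", "smoking"),
         ("hypertension", "stroke")]

succ, pred = defaultdict(set), defaultdict(set)
for s, t in EDGES:
    succ[s].add(t)
    pred[t].add(s)

def reach(start, nbrs):
    """Return every node reachable from start by following nbrs."""
    seen, stack = set(), [start]
    while stack:
        for n in nbrs[stack.pop()]:
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return seen

def downstream(x):
    """Search input is the source: what it predicts or causes."""
    return reach(x, succ)

def upstream(x):
    """Search input is the target: what predicts or causes it."""
    return reach(x, pred)

def linking(a, b):
    """Variables lying on some directed path from a to b."""
    return downstream(a) & upstream(b)
```

Selecting only the source node would correspond to `downstream`, selecting only the target node to `upstream`, and selecting both to `linking`.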

As illustrated by the description of FIG. 1(b) and other information in this application, there is a fundamental difference between standard semantic searches and "statistical searches" as described herein. The ability to perform statistical searches and present their results is one of the advantages and benefits of the systems and methods described herein, enabling a user to retrieve one or more variables that are statistically associated with their input. Such a search process is possible only with the feature map data structure.

● A traditional search that employs semantic relationships will have the following characteristics:

Input: a variable or concept.

Output: all nodes that match or are semantically related to the input, optionally filtered by a user-specified type (e.g., dataset).

Example:

Input: "smoker"

Output: "smoking," "smoker," "cigarette," etc.

The search bar or user input is shown in FIG. 1(c).

● in contrast, the statistical search implemented by embodiments of the systems and methods described herein has the following features:

Input: a variable or concept.

Output: variables and/or concepts that are statistically related to the input, optionally filtered by a user-specified type (e.g., dataset).

Example:

Input: "smoker"

Output: "high blood pressure," "income per week," "male sex," etc.

The search bar or user input is shown in FIG. 1(d).

Further, the ranking of the output results may take into account the value and quality of the association.
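The contrast between the two kinds of search, and a ranking that weighs the value and quality of each association, can be illustrated with a small sketch. All data below is invented for illustration, and the scoring rule (absolute value multiplied by a quality score) is only one assumed way of combining the two factors; the source does not specify a formula.

```python
# Hypothetical data: semantic links (related labels) and statistical edges.
SEMANTIC = {"smoker": ["smoking", "cigarette"]}
# (variable_a, variable_b, association value, quality score in [0, 1])
STATISTICAL = [("smoker", "high blood pressure", 0.40, 0.9),
               ("smoker", "income per week", -0.25, 0.5),
               ("smoker", "male sex", 0.10, 0.8)]

def semantic_search(term):
    """Return the term plus the labels semantically related to it."""
    return [term] + SEMANTIC.get(term, [])

def statistical_search(term):
    """Return variables statistically associated with the term, best first."""
    hits = [(b if a == term else a, value, quality)
            for a, b, value, quality in STATISTICAL if term in (a, b)]
    # Assumed ranking rule: |value| * quality, highest first.
    hits.sort(key=lambda h: abs(h[1]) * h[2], reverse=True)
    return [name for name, _, _ in hits]
```

With this toy data, `semantic_search("smoker")` stays within the vocabulary around smoking, while `statistical_search("smoker")` returns variables from unrelated vocabularies that have a recorded association with the input.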

FIG. 2(a) is a flow chart or flow diagram illustrating a process, method, function, or operation 200 for constructing a feature graph using an implementation of an embodiment of the systems and methods described herein. FIG. 2(b) is a flow chart or flow diagram illustrating a process, method, function, or operation 220 of an example use case in which a feature graph is traversed to identify potentially relevant datasets, and which may be implemented in embodiments of the systems and methods described herein.

As shown in the figures (in particular, FIG. 2(a)), a feature graph is constructed or created by identifying and accessing a set of sources that contain information and data regarding statistical associations between variables or factors used in a study (as suggested by step or stage 202). Such information may be retrieved periodically or continuously to provide information about variables, statistical associations, and the data used to support these associations (as suggested by 204). As previously described, this information and data is processed to identify the variables used or described in these sources and the statistical associations between one or more of these variables and one or more other variables.

Continuing with FIG. 2(a), a data/information source is accessed at 202. The accessed data/information is processed to identify variables and statistical associations found in one or more sources 204. As described above, such processing may include image processing (e.g., OCR), Natural Language Processing (NLP), Natural Language Understanding (NLU), or other forms of analysis that are useful in understanding the contents of journal papers, study notebooks, lab logs, or other records of a study.

Further processing may include linking certain variables to ontologies (e.g., the International Classification of Diseases) or other datasets that provide terms semantically equivalent or semantically similar to those used for the variables (as suggested by step or stage 206). This helps to extend the variable names used in a particular study to a larger set of substantially equivalent or similar entities or concepts that might be used in other studies. Once identified, the variables (which, as noted, may be known by different names or labels) and statistical associations are stored in a database (208), such as system database 108 of FIG. 1. The results of processing the accessed information and data are then structured or represented according to a particular data model (as suggested by step or stage 210); the model is described in more detail herein, but generally includes the elements used to construct a feature graph (i.e., nodes representing topics or variables, edges representing statistical associations, and measures or evaluated metrics of the statistical associations). The data model is then stored in a database (212), where it can be accessed to construct or create a feature graph for a particular user or group of users.

As previously described, the process or operation described with reference to FIG. 2(a) enables the construction of a graph (an example of which is shown in FIG. 3) that contains nodes and edges connecting certain nodes. A node represents a topic, goal, or variable of a study or observation, and an edge represents a statistical association between the node and one or more other nodes. Each statistical association may be associated with one or more numerical values, model types or algorithms, and statistical characteristics describing the strength, confidence or reliability of the statistical association between nodes (variables, factors or topics) connected by edges. Note that the numerical values, model types or algorithms, and statistical properties associated with the edges may indicate relevance, predictive relationships, causal relationships, anecdotal observations, and the like.
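The node-and-edge structure described above can be sketched as a small multigraph in which each edge carries a value, a model type, and a statistical property. This is a minimal sketch: the class names `Association` and `FeatureGraph`, the field names, and the sample records are assumptions made for illustration and are not taken from the source.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Association:
    """One edge: a single piece of evidence linking two variables."""
    source: str
    target: str
    value: float                # e.g. a coefficient or feature importance
    model_type: str             # e.g. "linear regression"
    p_value: float
    kind: str = "correlation"   # or "predictive", "causal", "anecdotal"

class FeatureGraph:
    def __init__(self):
        self.nodes = set()
        # (source, target) -> list of Association, so that several pieces
        # of evidence can connect the same pair of nodes.
        self.edges = defaultdict(list)

    def add(self, assoc):
        self.nodes.update((assoc.source, assoc.target))
        self.edges[(assoc.source, assoc.target)].append(assoc)

# Hypothetical records, as might be extracted from two different sources.
graph = FeatureGraph()
graph.add(Association("exercise", "blood pressure", -0.31, "linear regression", 0.03))
graph.add(Association("exercise", "blood pressure", -0.22, "random forest", 0.04, "predictive"))
graph.add(Association("age", "blood pressure", 0.45, "linear regression", 0.01))
```

Keeping a list of associations per node pair, rather than a single edge, leaves room for the case in which several sources report evidence about the same pair of variables.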

Once the information and data have been accessed and processed for storage in a database (e.g., a system database, which may contain unprocessed data and information, processed data and information, and data and information stored in the form of a data model), a feature graph containing a specified set of variables, topics, goals, or factors may be constructed. The feature graph for a particular user may include all of the data and information in the system database or a subset thereof. For example, the feature graph (110 in FIG. 1) for a particular customer 104 may be constructed by selecting from the system database 108 the data and information that satisfy a condition, such as the applicability of a given domain in the system database (e.g., public health) to the customer's domain of interest (e.g., media).

● it is noted that in deploying/generating/constructing a feature graph for a particular customer or user, the data in the system database may be filtered to improve performance by removing data unrelated to the question or concept/topic being investigated.

The following table summarizes some of the differences between feature graphs and knowledge graphs:

After a feature graph is constructed for a particular user or group of users, the graph can be traversed to identify variables of interest for the topic or goal of a study, model, or survey and, if desired, to retrieve datasets that support or confirm the relevance of these variables or that measure the variables of interest. Note that the process of traversing the feature graph can be controlled by one of two methods: (a) explicit user adjustment of search parameters or (b) algorithm-based adjustment of variable/data retrieval parameters. For example, the use cases described in the section entitled "Other use cases or environments in which the inventive method would be of value" will typically rely on user adjustment, while the use case in which a data recommender application is used will typically rely on algorithmic adjustment.

For example, as shown in FIG. 2(b), the constructed or created feature graph (222) may be traversed to identify datasets that are of potential value for a particular survey, topic, study, or analysis. In the example process shown in the figure, a user may enter factors to be used as part of defining a search query (step or stage 224). These factors may include the target/topic, variable, or factor of interest (e.g., "housing prices") and parameters of the model being built (e.g., the data must be joinable on the key "census block" and measured in the population "Chicago, 2017"). The data recommender application (e.g., 112 in FIG. 1) then traverses the feature graph to identify a set of datasets expected to be relevant and useful for model training (step or stage 226). The identified datasets may then be filtered, ranked, or otherwise ordered (step or stage 228, described in more detail herein) before being presented to the user (step or stage 230).

FIG. 3 is a diagram illustrating an example of a portion of a feature map data structure 300, which feature map data structure 300 may be used to organize and access data and information, and may be created using an implementation of an embodiment of the systems and methods described herein. A description of the elements or components of feature map 300 and the associated data model of implementation is provided below.

Feature graph

● As previously mentioned, a feature graph is a way to structure, represent, and store statistical associations between topics and their associated variables, factors, categories, etc. The core elements or components (i.e., "building blocks") of the feature graph are variables (identified in FIG. 3 as V1, V2, etc.) and statistical associations (shown as connecting lines or edges between the variables). A variable may be associated with or linked to a "concept" (identified in the figure as C1), which is a semantic concept or topic that is not necessarily measurable by itself (e.g., the variable "number of robberies" may be associated with the concept "crime"). Variables are measurable empirical objects or factors. In statistics, an association is defined as "any statistical relationship between two random variables, whether causal or not." A statistical association is produced by one or more steps or stages of what is known as the scientific method and may be described as, e.g., weak, strong, observed, measured, correlative, causal, predictive, etc.;

as an example, referring to FIG. 3, a statistical search on the input variable V1 retrieves: (i) variables statistically associated with V1 (e.g., V6, V2) (in some embodiments, a variable may be retrieved only when the value of the statistical association is above a defined threshold); (ii) variables statistically associated with those variables (e.g., V5, V3, V4) (in some embodiments, subject to the same threshold condition); (iii) variables (e.g., V7) that are semantically related, through a common concept (e.g., C1), to one or more variables (e.g., V2) that are statistically associated with the input variable V1; (iv) variables statistically associated with those variables (e.g., V8); and (v) datasets that measure the retrieved variables or demonstrate their statistical associations (e.g., D6, D2, D5, D3, D4, D7, D8);

■ Note that, by contrast, a semantic search on the input variable V1 retrieves only: (1) the variable V1, and (2) a dataset (e.g., D1) that measures the variable;

● the feature map is populated with information/data about statistical associations retrieved from, for example, journal articles, scientific and technical databases, digital "notebooks" for research and data science, experiment logs, data science and machine learning platforms, public websites where users can enter observed or perceived statistical associations, and other possible sources;

as previously described, components of the information/data retrieval architecture may scan or "read" published scientific journal articles, identify words or images (e.g., "adds") that indicate statistical associations have been measured, and retrieve information/data about the associations and about the datasets that measure/confirm the associations, using Natural Language Processing (NLP), Natural Language Understanding (NLU), and/or image processing (OCR, visual recognition) techniques;

other components of the information/data retrieval architecture provide data scientists and researchers with a way to enter code into their digital "notebook" (e.g., Jupyter notebook) to retrieve the metadata output of machine learning experiments (e.g., the "feature importance" measure of the features used in a given model) and information about the data sets used in the experiments. Note that information/data retrieval is performed periodically, and in some cases continuously, to provide new information to the system for storage, construction, and disclosure to the user;

● in one embodiment, a dataset is associated with a variable in the feature graph by a link to a URI of the relevant dataset/bucket/pipeline (e.g., the UCI Census Income dataset is located at https://archive.ics.uci.edu/ml/machine-learning-databases/adult/) or by another form of access or address;

this allows the user of the feature graph to retrieve datasets based on the previously demonstrated or determined predictive power of their data with respect to the specified target/topic (rather than potentially less relevant or irrelevant datasets about topics that are merely semantically related to the specified target/topic, as in a knowledge graph);

for example, using embodiments of the systems and methods described herein, a data scientist who searches for "vandalism" as the target topic or goal of a study will retrieve datasets for variables that have been shown to predict that target/topic, e.g., "family income," "brightness," and "traffic density" (along with the statistical evidence relating them to the target), rather than only datasets that measure vandalism events;

● the numerical value of an association (e.g., 0.725) and its statistical properties (e.g., a p-value of 0.03) are stored in the system database (or the constructed feature graph) as retrieved. As previously mentioned, given that researchers and data scientists may use different words to describe the same concept, variable names (e.g., "aerobic exercise") are stored as retrieved and are semantically grounded in a public domain ontology (e.g., Wikidata) to facilitate clustering of variables (and statistical associations) based on common or similar concepts (e.g., synonyms);

● the system employs mathematical, linguistic, and visual methods to express the epistemic properties of the recorded evidence, such as the quality, rigor, trustworthiness, reproducibility, and completeness of the information and/or data that support a given statistical association;

for example, a given statistical association may carry a particular score, label, and/or icon in the user interface based on its scientific quality (overall and with respect to particular parameters such as "peer review") to indicate to the user at a glance whether the association merits further investigation. In some embodiments, the statistical associations retrieved by searching the feature graph are filtered based on their scientific quality scores. In some embodiments, the calculation of the quality score may combine data stored within the feature graph (e.g., the statistical significance of a given association, or the degree to which the association is documented) with data stored outside the feature graph (e.g., the number of citations received by the journal article from which the association was retrieved, or the h-index of the article's author);

for example, a statistical association with a high and significant "feature importance" score, measured in a model with a high area under the curve (AUC) score, accompanied by a partial dependence plot (PDP) and by information recorded for reproducibility, may be considered a "strong" statistical association in the feature graph and given an identifying color or icon in the graphical user interface;

note that in addition to retrieving variables and statistical associations, embodiments may also retrieve the other variables used in an experiment in order to contextualize the statistical associations for the user. This may be helpful, for example, if the user wants to know whether certain variables were controlled for in the experiment, or which other variables (or features) were included in the model.
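One way the quality score described above might combine evidence stored inside the feature graph with evidence stored outside it is sketched below. The source does not give a formula, so the function `quality_score`, its weights, the 0.05 significance cut-off, and the 0.7 threshold for a "strong" label are all illustrative assumptions.

```python
import math

def quality_score(p_value, auc, citations, peer_reviewed):
    """Toy score in [0, 1]; the weights and formula are assumptions."""
    # Evidence stored in the graph: significance and model performance.
    significance = 1.0 if p_value < 0.05 else 0.0
    performance = max(0.0, (auc - 0.5) * 2)   # an AUC of 0.5 is chance level
    # Evidence stored outside the graph: citations, saturating near 1000.
    attention = min(1.0, math.log10(1 + citations) / 3)
    review = 1.0 if peer_reviewed else 0.0
    return (0.3 * significance + 0.3 * performance
            + 0.2 * attention + 0.2 * review)

def label(score, strong_at=0.7):
    """Map a score to the label that a user interface could display."""
    return "strong" if score >= strong_at else "weak"

def filter_by_quality(associations, minimum):
    """Keep only (name, score) pairs whose score meets the minimum."""
    return [name for name, score in associations if score >= minimum]
```

A user interface could then attach a color or icon to each association according to `label`, and a search could drop low-scoring associations with `filter_by_quality`.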

Data model

The primary objects in the feature map (or system database) typically include one or more of the following:

● variable (or feature) - what are you measuring, and in what population?

● concept - what is the topic or concept you are studying?

● neighborhood - what are the objects you are measuring?

● statistical association - what is the mathematical basis and value of the relationship?

● model (or experiment) - what is the source of the measurement?

● dataset - what dataset measures the relationship (e.g., a training set) or measures the variables?

The association of these objects in the feature map is as follows (as shown in fig. 3):

● variables are linked to other variables by statistical associations;

● statistical association comes from the model and is supported by the data set; and

● variables are associated with concepts and concepts are associated with neighborhoods.

For example, the variable "skin problems in grades 7-12" may be linked in a feature graph (and in the system database, i.e., the central database) to the variable "personal income" by a statistical association based on a linear probability model, with a value of 0.126, a standard error of 0.047, and a significance level of 0.1, as reported by Mialon, Hugo M. and Nesson, Erik in "Do Pimples Pay? Acne, Human Capital, and the Labor Market" (DOI:10.2139/ssrn.2964045), in a sample of American women with the first variable measured in 1994 and the second variable measured in 2007 and 2008, and with the supporting dataset located at https://www.cpc.unc.edu/projects/addhealth/documentation/publicdata. The variable "skin problems in grades 7-12" may also be semantically linked to the concept "acne vulgaris," and the variable "personal income" may be semantically linked to the concept "personal income," both concepts taking their names from an ontology (e.g., Wikidata).
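The objects and links of the data model can be written down as a handful of record types, populated here with the acne and income example from the text. This is a sketch only: the class and field names are assumptions, and the neighborhood is left unset because the source does not state one for this example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Concept:
    name: str
    neighborhood: Optional[str] = None   # concepts are associated with neighborhoods

@dataclass
class Variable:
    name: str
    population: str
    concept: Concept                     # variables are associated with concepts

@dataclass
class Model:
    model_type: str
    source: str                          # e.g. a DOI

@dataclass
class Dataset:
    uri: str

@dataclass
class StatisticalAssociation:
    source: Variable                     # variables are linked by associations
    target: Variable
    value: float
    standard_error: float
    significance_level: float
    model: Model                         # the association comes from a model
    dataset: Dataset                     # and is supported by a dataset

acne = Variable("skin problems in grades 7-12", "American women, 1994",
                Concept("acne vulgaris"))
income = Variable("personal income", "American women, 2007-2008",
                  Concept("personal income"))
edge = StatisticalAssociation(
    acne, income, value=0.126, standard_error=0.047, significance_level=0.1,
    model=Model("linear probability model", "DOI:10.2139/ssrn.2964045"),
    dataset=Dataset("https://www.cpc.unc.edu/projects/addhealth/documentation/publicdata"))
```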

Referring to FIGS. 2(b) and 3, as previously described, one use of the feature graph is to enable a user to search it for one or more datasets containing variables that have been shown to be statistically associated with a target topic, variable, or concept of a study. For example:

● the user enters a target variable and wishes to retrieve all of the datasets that can be used to train a model to predict the target variable, i.e., those linked to variables that are statistically associated with the target variable (as shown at 224 in FIG. 2(b));

for example, referring to FIG. 3, the statistical search input V1 causes an algorithm (e.g., Breadth First Search (BFS)) to traverse the graph and return: (i) variables statistically associated with V1 (e.g., V6, V2) (in some embodiments, a variable may be retrieved only when the value of the statistical association is above a defined threshold); (ii) variables statistically associated with those variables (e.g., V5, V3, V4) (in some embodiments, subject to the same threshold condition); (iii) variables (e.g., V7) that are semantically related, through a common concept (e.g., C1), to one or more variables (e.g., V2) that are statistically associated with the input variable V1; (iv) variables statistically associated with those variables (e.g., V8); and (v) datasets that measure the retrieved variables or demonstrate their statistical associations (e.g., D6, D2, D5, D3, D4, D7, D8);

● after traversing the feature graph and retrieving potentially relevant datasets, the datasets may be "filtered", ranked or otherwise ordered according to the application or use case:

the data sets retrieved through the traversal process described above may then be filtered based on criteria entered by the user at the time of the search and/or criteria entered by an administrator of the software instance. An example search data set filter may include one or more of:

■ population and key: is the variable of interest measured in the population, and with the key, of interest to the user (e.g., unique identifiers of users, species, cities, companies, etc.)? This affects the user's ability to join the data to a machine learning training set;

■ compliance: is the dataset compliant with applicable regulatory considerations (e.g., GDPR compliance)?

■ interpretability/explainability: can a human interpret the variable?

■ actionability: can a user of the model act on the variable?
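The traversal and filtering steps above can be sketched together: a breadth-first search that follows only sufficiently strong associations, followed by dataset filters on join key and compliance. The graph, the dataset metadata, the threshold of 0.1, and the two-hop depth limit are hypothetical values chosen for illustration; the semantic expansion through concepts (item (iii)) is omitted for brevity.

```python
from collections import deque

# Hypothetical graph loosely following FIG. 3: variable -> [(neighbor, value)].
ASSOCIATIONS = {"V1": [("V6", 0.6), ("V2", 0.5)],
                "V6": [("V5", 0.4)],
                "V2": [("V3", 0.3), ("V4", 0.05)]}
# Hypothetical dataset metadata, keyed by the variable each dataset measures.
DATASETS = {"V6": {"id": "D6", "key": "city", "compliant": True},
            "V2": {"id": "D2", "key": "user_id", "compliant": True},
            "V5": {"id": "D5", "key": "city", "compliant": False},
            "V3": {"id": "D3", "key": "city", "compliant": True}}

def statistical_search(start, threshold=0.1, max_depth=2):
    """Breadth-first traversal following only edges with |value| >= threshold."""
    found, queue, seen = [], deque([(start, 0)]), {start}
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for nbr, value in ASSOCIATIONS.get(node, []):
            if nbr not in seen and abs(value) >= threshold:
                seen.add(nbr)
                found.append(nbr)
                queue.append((nbr, depth + 1))
    return found

def recommend(start, key, require_compliance=True):
    """Datasets for the retrieved variables that pass the key/compliance filters."""
    datasets = (DATASETS[v] for v in statistical_search(start) if v in DATASETS)
    return [d["id"] for d in datasets
            if d["key"] == key and (d["compliant"] or not require_compliance)]
```

Here V4 is never reached because its association value falls below the threshold, and D2 and D5 are dropped by the key and compliance filters respectively.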

In one embodiment, the user may enter concepts such as "crime," "wealth," "hypertension," and so on (represented by C1 in FIG. 3). In response, the systems and methods described herein may identify the following by using a combination of semantic and/or statistical search techniques:

● concepts (C2) semantically associated with C1 (note that this step may be optional);

● variables (Vx) semantically associated with C1 and/or C2;

● variables statistically associated with each variable Vx;

● a measure or measures of the identified statistical associations; and

● data sets that measure each variable Vx and/or that demonstrate or support the statistical associations of the variables statistically associated with each variable Vx.
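
The concept-driven search listed above can be sketched as a short pipeline. The lookup tables, variable names, association values, and the function `concept_search` below are hypothetical and stand in for the semantic and statistical portions of the feature graph.

```python
# Hypothetical lookups standing in for the feature graph.
RELATED_CONCEPTS = {"hypertension": ["blood pressure"]}
CONCEPT_VARIABLES = {
    "hypertension": ["systolic_bp"],
    "blood pressure": ["diastolic_bp"],
}
# variable -> list of (associated variable, metric name, value, evidence data set)
STAT_ASSOCIATIONS = {
    "systolic_bp": [("sodium_intake", "pearson_r", 0.31, "D_sodium")],
    "diastolic_bp": [("age", "pearson_r", 0.27, "D_age")],
}
# variable -> data sets that measure it
VARIABLE_DATASETS = {"systolic_bp": ["D_bp"], "diastolic_bp": ["D_bp"]}


def concept_search(c1, expand_concepts=True):
    """Concept -> related concepts -> variables -> associations -> data sets."""
    # Step 1 (optional): concepts C2 semantically associated with C1.
    concepts = [c1]
    if expand_concepts:
        concepts += RELATED_CONCEPTS.get(c1, [])

    # Step 2: variables Vx semantically associated with C1 and/or C2.
    variables = []
    for concept in concepts:
        for var in CONCEPT_VARIABLES.get(concept, []):
            if var not in variables:
                variables.append(var)

    # Steps 3-5: associated variables, their measures, and linked data sets.
    associations = []
    datasets = set()
    for var in variables:
        datasets.update(VARIABLE_DATASETS.get(var, []))
        for other, metric, value, evidence in STAT_ASSOCIATIONS.get(var, []):
            associations.append(
                {"variable": var, "associated": other,
                 "metric": metric, "value": value}
            )
            datasets.add(evidence)

    return {
        "concepts": concepts,
        "variables": variables,
        "associations": associations,
        "datasets": sorted(datasets),
    }
```

Passing `expand_concepts=False` skips the optional first step and restricts the search to variables linked directly to C1.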

In some embodiments of the systems and methods described herein, a plurality of edges (statistical associations) will link a given pair of nodes (variables, factors, or concepts), indicating a plurality of pieces of evidence about the statistical association between the given pair of nodes. Given the breadth of sources from which the system may retrieve information and the evolving nature of science and technology, it is also envisioned that the set of edges will contain or represent a range of association values (and/or relationships).

● in this case, the system will "read" the relevant information in the database and generate additional edges (called "summary associations") that summarize the statistics and the state of knowledge (e.g., the distribution of values, the degree of consensus on the nature and strength of the association, the populations in which the association has been measured, etc.). Note that an application may retrieve summary association edges, e.g., to provide the user with a "bird's eye view" of a given area of interest, and to answer questions about the consensus around a particular set of statistical associations, how that set changes over time, and what has or has not been studied in which populations.
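
One way to picture a summary association is as an aggregate over all edges that link the same pair of nodes. The sketch below is an assumption about what such a summary might contain: the field names (`value`, `population`, `year`) and the sign-agreement measure of consensus are illustrative choices, not taken from the description.

```python
from statistics import mean


def summarize_associations(edges):
    """Collapse several evidence edges for one node pair into one summary.

    Each edge is a dict with hypothetical fields: 'value' (association
    strength), 'population' (who was studied) and 'year' (when).
    """
    if not edges:
        raise ValueError("at least one edge is required")
    values = [e["value"] for e in edges]
    positive = sum(1 for v in values if v > 0)
    negative = sum(1 for v in values if v < 0)
    return {
        "n_edges": len(edges),
        "mean_value": mean(values),
        "min_value": min(values),
        "max_value": max(values),
        # Share of edges agreeing with the majority sign: a crude consensus.
        "sign_consensus": max(positive, negative) / len(values),
        "populations": sorted({e["population"] for e in edges}),
        "years": (min(e["year"] for e in edges),
                  max(e["year"] for e in edges)),
    }


# Made-up evidence edges between one pair of variables.
EXAMPLE_EDGES = [
    {"value": 0.4, "population": "adults", "year": 2012},
    {"value": 0.2, "population": "adolescents", "year": 2016},
    {"value": -0.1, "population": "adults", "year": 2019},
]
EXAMPLE_SUMMARY = summarize_associations(EXAMPLE_EDGES)
```

The `populations` and `years` fields correspond to the questions of which populations have been studied and how the evidence changes over time.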

Data set recommendation

In some embodiments, a data recommender application may be used to take advantage of the benefits of the feature graph. In a typical use case, a user (e.g., a data scientist) enters a desired target or topic (the "target") and a model purpose, and the data recommender retrieves the "best" data sets for her to use in training the model. In one embodiment, the data recommender algorithm/process traverses the feature graph and ranks the most predictive relationships based on the statistics and metadata stored in the feature graph, on certain data availability factors (e.g., the keys needed for joining data), and/or on the specified use of the model (e.g., the model requires interpretable/explainable features, or the model cannot use protected class information, etc.), and then returns one or more data sets (and variables for which no data sets are available or accessible) to the user.

Unlike a statistical search of the feature graph, in which the user controls key parameters for retrieving variables and data sets (e.g., minimum strength of association or metadata quality), the data recommender application may perform the parameter tuning for the user and return the variables and data sets that are expected to have the highest relevance to the user. To generate a data set recommendation, the application may consider a number of features or signals, including, for example:

● number of hops to the target: evidence of a direct association between a variable and the target is weighted more heavily than evidence of an indirect association, i.e., an association between a variable and another variable that is itself directly associated with the target;

● semantic association: variables retrieved by traversing a concept should be semantically related to the concept. Strong associations are weighted higher than weak associations;

● causal relationship: a variable associated by causal relationship with the target has a greater weight than a variable associated by non-causal relationship;

● model accuracy: variables associated with more accurate models have more weight than variables associated with less accurate models; and/or

● feature importance: in the model of the associated source, variables with relatively higher and/or significant feature importance have greater weight than variables with lower and/or insignificant feature importance.
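
The signals listed above could be combined in many ways; the description does not specify one. As a minimal sketch, assuming a simple weighted sum with hypothetical weights and with every signal scaled to the range 0 to 1, a ranking step might look like this:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    variable: str
    dataset: Optional[str]     # None when no data set is available
    hops: int                  # hops from the target (1 = direct association)
    semantic_strength: float   # 0..1
    causal: bool               # association reported as causal
    model_accuracy: float      # 0..1, accuracy of the source model
    feature_importance: float  # 0..1, importance in the source model


# Hypothetical weights; a real recommender would tune or learn these.
WEIGHTS = {
    "hops": 0.3,
    "semantic": 0.1,
    "causal": 0.2,
    "accuracy": 0.2,
    "importance": 0.2,
}


def score(c):
    """Weighted sum of the signals; direct associations score highest."""
    return (
        WEIGHTS["hops"] * (1.0 / c.hops)
        + WEIGHTS["semantic"] * c.semantic_strength
        + WEIGHTS["causal"] * (1.0 if c.causal else 0.0)
        + WEIGHTS["accuracy"] * c.model_accuracy
        + WEIGHTS["importance"] * c.feature_importance
    )


def recommend(candidates, top_k=3):
    """Return (ranked candidates with data sets, variables lacking data)."""
    ranked = sorted(candidates, key=score, reverse=True)
    with_data = [c for c in ranked if c.dataset is not None][:top_k]
    without_data = [c.variable for c in ranked if c.dataset is None]
    return with_data, without_data
```

Variables without a linked data set are reported separately rather than dropped, mirroring the behavior in which such variables are also returned to the user.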

Other potential uses of the system and process embodiments of the present invention

The inventors envision that the user further utilizes the system database to provide context to the readers and viewers of content on the internet. For example, a news website may link concepts or variables referenced in an article to associated objects in a system database and retrieve (through an API) a graph that may be embedded in the news article, thereby providing the reader with context for known statistical associations of concepts or variables referenced in the article.

The inventors also contemplate that users utilize feature graphs in an organization to facilitate knowledge sharing and collaboration among data scientists regarding the performance of various ML (machine learning) models and features. The inventors also contemplate that users record ML experiments and models using feature graphs in the organization.

The inventors also contemplate that users utilize feature maps in an organization to maintain a central dictionary of variable terms (or labels), subject terms, concept terms, key terms, and other concepts required by data science. For example, when a user creates a new variable, the dictionary will be referenced by the feature map to encourage universal naming of the universal entities/objects.

The inventors also contemplate that users utilize a feature graph in an organization to encourage non-technical employees to share their observations and assumptions about statistical associations that affect their systems. For example, a manager may have anecdotal evidence that a variable external to the company affects the price of a commodity in its supply chain, and may submit that observation to the feature graph as an "unverified" statistical association for study by the company's data scientists.

The inventors also contemplate that users in large government and non-government organizations further utilize feature maps to inform them how to organize teams and resources and to make strategic plans. For example, by referencing their feature maps, an organization can identify certain relationships between key business variables or metrics and coordinate teams or projects to improve the metrics in a more systematic manner.

The inventors also contemplate that users utilize the system database to understand, model, and visualize the world, or a portion of the world, as a complex system. For example, through a data visualization application, a virtual reality or augmented reality application, or an immersive installation, a typical user may browse the complex interdependencies in a particular neighborhood of the system database. Alternatively, for example, by utilizing a large number of statistical associations in a given domain, a technical user may study and model the dynamics of a particular system and compare those dynamics across different populations.

The inventors also contemplate users performing network science and link prediction for a given sub-graph using a system database or feature map. For example, an application can be created that allows a technical user to select a particular form of statistical association, generate a sub-graph that contains these associations in a particular domain, and then measure network science attributes, such as centrality (e.g., understanding the centrality of variables in a public health system). As another example, a user may utilize information and data in the feature graph about edges linked to a given node to predict edges of similar nodes:

● in this use case, the user can utilize the knowledge contained in the feature graph about the associations between variable A and other variables in a given population to predict the associations of a particular variable B that is substantially similar to variable A (where such similarity can be determined from a priori knowledge about the nature of the variables in question, e.g., the shape of a molecule and its relevance to its physical effects).
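
Both ideas above, measuring a network attribute on a sub-graph and predicting edges for a similar node, can be illustrated on a small edge list. The sketch below makes two assumptions that are not from the description: the sub-graph is a simple undirected graph, and a predicted association value is the known value scaled by a similarity score between 0 and 1.

```python
def degree_centrality(edges):
    """Degree of each node divided by (number of nodes - 1)."""
    nodes = {n for a, b, _ in edges for n in (a, b)}
    degree = {n: 0 for n in nodes}
    for a, b, _ in edges:
        degree[a] += 1
        degree[b] += 1
    denom = max(len(nodes) - 1, 1)
    return {n: d / denom for n, d in degree.items()}


def predict_edges(edges, known, target, similarity):
    """Copy the edges of `known` onto `target`, scaled by similarity.

    Assumption: the more similar the two variables, the closer the
    predicted association is to the one observed for `known`.
    """
    predicted = []
    for a, b, value in edges:
        if a == known:
            other = b
        elif b == known:
            other = a
        else:
            continue
        if other != target:
            predicted.append((target, other, value * similarity))
    return predicted


# Made-up sub-graph: (variable, variable, association value).
SUBGRAPH = [
    ("A", "X", 0.5),
    ("A", "Y", -0.2),
    ("X", "Y", 0.3),
    ("Y", "Z", 0.1),
]
```

With this sub-graph, `degree_centrality(SUBGRAPH)` ranks Y as the most central variable, and `predict_edges(SUBGRAPH, "A", "B", 0.8)` proposes candidate edges from B to X and Y that would still need to be verified against data.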

The inventors also contemplate that users utilize system databases or feature maps to infer causal relationships, where a key challenge is identifying potential confounding factors. The inventors believe that the technical process of large-scale causal reasoning will be significantly improved by collective intelligence, in particular by exploiting the unprecedented number, richness and diversity of associations contained in the system database, which are derived from a wide variety of experiments and studies in different populations and contributed by different users.

The inventors also contemplate that users utilize system databases and feature graphs to simulate possible consequences of particular events, decisions, and operations. For example, an application may be built on top of a system database, allowing a user to define a particular set of conditions for a set of variables and to simulate possible effects on other variables.

The inventors also contemplate that the user utilizes the system database and the feature map to guide investment decisions. For example, a user may use the system database to consider the unexpected consequences of a particular financial event (e.g., a change in price of a given good) to hedge an investment.

The inventors also contemplate users utilizing the system database and feature graph as training data for Artificial General Intelligence (AGI). For example, the system database may be used to train an AI on the known statistical associations about the world.

Fig. 4 is a diagram illustrating elements or components that may be present in a computer device or system configured to implement methods, processes, functions or operations according to embodiments of the invention. As previously mentioned, in some embodiments, the systems and methods of the present invention may be embodied in the form of an apparatus, system or device that includes a processing element and a set of executable instructions. The executable instructions may be part of a software application and arranged into a software architecture.

In general, embodiments of the invention may be implemented using a set of software instructions designed for execution by a suitably programmed processing element (e.g., CPU, microprocessor, processor, GPU, controller, computing device, etc.). In a complex application or system, such instructions are typically arranged in "modules," each of which typically performs a particular task, procedure, function, or operation. The operation of the entire set of modules may be controlled or coordinated by an Operating System (OS) or other form of organizational platform. Each application module or sub-module may correspond to a particular function, method, process, or operation implemented by the module or sub-module. The functions, methods, procedures or operations may include functions, methods or operations for implementing or representing one or more aspects of the present systems and methods, including but not limited to the aspects described with reference to fig. 1(a), 1(b), 1(c), 1(d), 2(a), 2(b) and 3.

For example, an application module or sub-module may contain software instructions that, when executed, cause a system or device to perform one or more of the following operations or functions:

● generating a user interface to enable a user to enter a search term or concept C1 (e.g., a topic of interest or a variable related to the topic) for initiating a statistical and/or semantic search, and/or one or more controls for the search;

note that examples of such user interfaces are described with reference to FIGS. 1(b), 1(c), and 1(d);

● determining a concept (C2) semantically associated with C1 (this may be an optional feature and may be based on access to an appropriate ontology or reference);

● determining variables (Vx) semantically associated with C1 and/or C2 by performing a search on the feature graph;

● determining variables statistically associated with each variable Vx by performing a search on the feature graph;

● determining one or more measures of the identified statistical associations;

● identifying data sets that measure each variable Vx and/or that demonstrate or support the statistical associations of the variables statistically associated with each variable Vx; and

● presenting the user with a ranking or listing of the identified data sets, filtered by one or more user-specified criteria if desired.

The application modules and/or sub-modules may comprise any suitable computer-executable code or set of instructions (e.g., to be executed by a suitably programmed processor, microprocessor, GPU or CPU), such as computer-executable code corresponding to a programming language. For example, programming language source code may be compiled into computer executable code. Alternatively, or additionally, the programming language may be an interpreted programming language such as a scripting language.

As described above, the systems, apparatus, methods, procedures, functions and/or operations for implementing embodiments of the present invention may be implemented in whole or in part in the form of a set of instructions executed by one or more programmed computer processors (e.g., a Central Processing Unit (CPU), GPU or microprocessor). Such processors may be incorporated in devices, servers, clients, or other computing or data processing apparatus that are operated by or in communication with other components of the system.

By way of example, fig. 4 is a diagram illustrating elements or components that may be present in a computer device or system 400 configured to implement methods, processes, functions or operations according to embodiments of the invention. The subsystems shown in fig. 4 are interconnected by a system bus 402. Additional subsystems include a printer 404, a keyboard 406, a fixed disk 408, and a monitor 410, which monitor 410 is coupled to a display adapter 412. Peripheral devices and input/output (I/O) devices coupled to I/O controller 414 may be connected to the computer system by any number of means known in the art, such as serial port 416. For example, serial port 416 or external interface 418 may be utilized to connect computer device 400 to other devices and/or systems not shown in FIG. 4, including a wide area network, such as the Internet, a mouse input device, and/or a document scanner. The interconnection via system bus 402 allows one or more electronic processors 420 to communicate with each subsystem and to control the execution of instructions, which may be stored in system memory 422 and/or fixed disk 408, and the exchange of information between subsystems. System memory 422 and/or fixed disk 408 may contain tangible computer-readable media.

As described above, the methods, processes, functions or operations described with reference to fig. 1 to 3 may be implemented as a service for one or more users or a group of users. In some embodiments, the service may be provided through the use of a service platform operable to provide service to a plurality of customers, each customer having a separate account. Such a platform may have an architecture similar to a multi-tenant platform or system, which may be referred to as a SaaS (software-as-a-Service) platform. An example architecture of such a platform is described with reference to fig. 5.

FIG. 5 is a diagram illustrating an example system architecture 500 of a service platform that may be used to implement embodiments of the systems and methods described herein. In some embodiments, a service platform (e.g., a multi-tenant or other "cloud-based" system) that provides access to one or more of data, applications, and data processing capabilities includes a website (e.g., serviceplatform.com), an API (RESTful web service), and other support services. The website operation follows a standard MVC (model-view-controller) architecture:

● model - model objects are the parts of the application that implement the logic for the application's data domain. Typically, model objects retrieve and store model state in a database. For example, a Bill object might retrieve information from a database, operate on it, and then write the updated information back to a Bills table in a SQL Server database;

● view - views are the components that display the application's User Interface (UI). Typically, this UI is created from the model data. For example, an edit view of the Bills table displays text boxes, drop-down lists, and check boxes based on the current state of a Bill object; and

● controller - controllers are the components that handle user interaction, work with the model, and ultimately select a view to render (i.e., to display the UI). In an MVC application, the view only displays information; the controller handles and responds to user input and interaction. For example, the controller handles query string values and passes these values to the model, which in turn may use these values to query the database.
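
The division of responsibilities described above can be sketched with the Bill example. This is an illustration only: it uses an in-memory SQLite database in place of the database named in the text, and the class and function names (`BillModel`, `bill_view`, `BillController`) and the `id` and `action` query parameters are hypothetical.

```python
import sqlite3
from urllib.parse import parse_qs


class BillModel:
    """Model: retrieves and stores Bill state in the Bills table."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS Bills "
            "(id INTEGER PRIMARY KEY, amount REAL, paid INTEGER)"
        )

    def add(self, amount):
        cur = self.conn.execute(
            "INSERT INTO Bills (amount, paid) VALUES (?, 0)", (amount,)
        )
        self.conn.commit()
        return cur.lastrowid

    def get(self, bill_id):
        row = self.conn.execute(
            "SELECT id, amount, paid FROM Bills WHERE id = ?", (bill_id,)
        ).fetchone()
        if row is None:
            return None
        return {"id": row[0], "amount": row[1], "paid": bool(row[2])}

    def mark_paid(self, bill_id):
        self.conn.execute("UPDATE Bills SET paid = 1 WHERE id = ?", (bill_id,))
        self.conn.commit()


def bill_view(bill):
    """View: only displays information derived from the model data."""
    if bill is None:
        return "Bill not found"
    status = "paid" if bill["paid"] else "unpaid"
    return f"Bill #{bill['id']}: {bill['amount']:.2f} ({status})"


class BillController:
    """Controller: handles input, works with the model, selects the view."""

    def __init__(self, model):
        self.model = model

    def handle(self, query_string):
        params = parse_qs(query_string)
        bill_id = int(params["id"][0])
        if params.get("action", ["show"])[0] == "pay":
            self.model.mark_paid(bill_id)
        return bill_view(self.model.get(bill_id))
```

A controller built as `BillController(BillModel(sqlite3.connect(":memory:")))` can then respond to a query string such as `id=1&action=pay` by updating the model and rendering the view.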

In one embodiment, the serviceplatform.com website (element, component, or process 502) provides access to one or more of data, data storage, applications, and data processing capabilities. The applications or data processing capabilities or functions may include, but are not necessarily limited to, one or more of the data processing operations described with reference to FIGS. 1-3. The website architecture is based on a standard MVC architecture, and its controller uses the API web services (element, component, or process 504) to interact indirectly with the service processes and resources (e.g., models or data). The API web services are composed of a web service module (element, component, or process 508) and one or more modules (element, component, or process 510) that execute embodiments of the processes or functions disclosed herein, namely the feature graph construction and search (or other application) service modules. On receiving a request from the serviceplatform.com controller, the web service module (508) reads data from the input and launches or instantiates the service module (510). Both the web service module 508 and the feature graph service module 510 may be part of the web services layer 506 of the architecture or platform.

The API services may be implemented in the form of standard "RESTful" web services, which are one way of providing interoperability between computer systems on the Internet. REST-compliant web services allow requesting systems to access and manipulate textual representations of web resources using a uniform, predefined set of stateless operations.
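
As a minimal sketch of what "a uniform, predefined set of stateless operations" means in practice, the function below maps HTTP-style methods onto JSON representations of resources. It is not a web server and not the platform's API: the function name `handle_request`, the resource paths, and the in-memory `store` are assumptions made for illustration.

```python
import json


def handle_request(method, path, body, store):
    """Apply one stateless operation to the resource identified by `path`.

    Every call carries all the information needed to serve it; no session
    state is kept between calls. Returns (status code, response text).
    """
    if method == "GET":
        if path not in store:
            return 404, json.dumps({"error": "not found"})
        return 200, json.dumps(store[path])
    if method == "PUT":
        created = path not in store
        store[path] = json.loads(body)  # replace the stored representation
        return (201 if created else 200), json.dumps(store[path])
    if method == "DELETE":
        if path not in store:
            return 404, json.dumps({"error": "not found"})
        del store[path]
        return 204, ""
    return 405, json.dumps({"error": "method not allowed"})
```

For example, with `store = {}`, a `PUT` to the hypothetical path `/variables/V1` creates the resource, a later `GET` on the same path returns its textual (JSON) representation, and a `DELETE` removes it.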

Referring to FIG. 5, as previously described, embodiments of one or more of the processes described with reference to FIGS. 1-3 may be accessed or utilized via service platform website 502 or service platform API 504. The service platform will include one or more processors or other data processing elements, typically implemented as part of a server. The service platform may be implemented as a collection of layers or multiple layers, including a UI layer 520, an application layer 530, a web services layer 506, and a data store layer 540. The user interface layer 520 may include one or more user interfaces 522, where each user interface is comprised of one or more user interface elements 524.

The application layer 530 is typically comprised of one or more application modules 532, where each application module is comprised of one or more sub-modules 534. As described herein, each sub-module may represent executable software instructions or code that, when executed by a programmed processor, performs a particular function or process, such as those described with reference to fig. 1-3.

Thus, each application module 532 or sub-module 534 may correspond to a particular function, method, process, or operation implemented by the module or sub-module (e.g., a function, method, process, or operation associated with providing certain functionality to a platform user). Such functions, methods, procedures or operations may include those for implementing one or more aspects of the present systems and methods, such as by:

● generating a user interface to enable a user to enter search terms or concepts C1 for initiating statistical and/or semantic searches and/or one or more controls for searching;

● determining a concept (C2) semantically associated with C1 (this may be an optional feature and may be based on access to an appropriate ontology or reference);

● determining variables (Vx) semantically associated with C1 and/or C2 by performing a search on the feature graph;

● determining variables statistically associated with each variable Vx by performing a search on the feature graph;

● determining a metric or measure of the identified statistical associations;

● identifying data sets that measure each variable Vx and/or that demonstrate or support the statistical associations of the variables statistically associated with each variable Vx; and

● presenting the user with a ranking or listing of the identified data sets, filtered by one or more user-specified criteria if desired.

Note that in addition to the listed operations or functions, the application modules 532 or sub-modules 534 may contain computer-executable instructions that, when executed by a programmed processor, cause a system or device to perform functions related to the operation of the service platform. Such functions may include, but are not limited to, functions related to user registration, user account management, data security between accounts, allocation of data processing and/or storage capabilities, providing access to data sources (e.g., ontologies, reference profiles, etc.) other than system databases.

The application modules and/or sub-modules may comprise any suitable computer-executable code or set of instructions (e.g., instructions executed by a suitably programmed processor, microprocessor or CPU), such as computer-executable code corresponding to a programming language. For example, programming language source code may be compiled into computer executable code. Alternatively, or additionally, the programming language may be an interpreted programming language such as a scripting language. Each application server may include each application module. Alternatively, different application servers may include different sets of application modules. These sets may be disjoint or overlapping.

Similarly, the Web services layer 506 may be comprised of one or more Web services modules 508, and each module also includes one or more sub-modules (and each sub-module represents executable instructions that when executed by a programmed processor perform a particular function or process). For example, web services module 508 may include modules or sub-modules for providing support services (as suggested by support services module 512) and providing functionality associated with the services and processes described herein (as suggested by feature map services module 510). Accordingly, in some embodiments, module 510 may include software instructions that, when executed, implement one or more of the functions described with reference to other figures (in particular, fig. 1-3).

Data store layer 540 can include one or more data objects 542, wherein each data object is comprised of one or more object components 544, such as attributes and/or behaviors. For example, a data object may correspond to a table of a relational database, and a data object component may correspond to a column or field of such a table. Alternatively, or in addition, a data object may correspond to a data record having fields and associated services. Alternatively, or in addition, a data object may correspond to a persistent instance of a programmatic data object, such as a structure or class. Each data store in the data store layer may include each data object. Alternatively, different data stores may include different sets of data objects. These sets may be disjoint or overlapping.

The architecture of figure 5 is an example of a multi-tenant architecture that may be used to provide users with access to various data stores and executable applications or functions (sometimes referred to as providing software as a service (SaaS)). Although fig. 5 and its accompanying description focus on a service platform for providing functionality associated with the processes described with reference to fig. 1-3, it is noted that a more general form of a multi-tenant platform may be used that includes the ability to provide other services or functionality. For example, the service provider may also provide the user with the ability to perform certain data analysis, billing, account maintenance, scheduling, e-commerce, ERP functions, CRM functions, and the like.

Note that the example computing environments described with reference to the figures are not intended to be limiting examples. Alternatively, or in addition, computing environments in which embodiments of the invention may be implemented include any suitable system that allows a user to provide, access, process, and utilize data stored in a data storage element (e.g., a database) that is remotely accessible over a network. Other example environments in which embodiments of the invention may be implemented include devices (including mobile devices), software applications, systems, apparatuses, networks, or other configurable components that may be used by multiple users for data input, data processing, application execution, or data review, and that have a user interface or user interface components that may be configured to present an interface to a user. Although further examples may refer to the example computing environments depicted in the drawings, it will be apparent to those skilled in the art that the examples may be applicable to alternative computing devices, systems, apparatuses, processes, and environments. Note that embodiments of the inventive method may be implemented in an application, a subroutine that is part of a larger application, a "plug-in," an extension to the functionality of a data processing system or platform, or any other suitable form.

It should be understood that the invention as described above may be implemented in the form of control logic using computer software in a modular or integrated manner. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will know and appreciate other ways and/or methods to implement the present invention using hardware and a combination of hardware and software.

Different arrangements of the components depicted in the drawings or described above, as well as components and steps not shown or described, are possible. Similarly, some features and subcombinations are of utility and may be employed without reference to other features and subcombinations. Embodiments of the present invention have been described for purposes of illustration and not limitation, and alternative embodiments will become apparent to the reader of this patent. Accordingly, the present invention is not limited to the embodiments described above or depicted in the drawings, and various embodiments and modifications may be made without departing from the scope of the following claims.

Any of the software components, processes, or functions described in this application may be implemented as software code to be executed by a processor using any suitable computer language (e.g., Python, Java, JavaScript, C++, or Perl) and using, for example, conventional or object-oriented techniques. The software code may be stored as a series of instructions or commands in (or on) a non-transitory computer-readable medium, such as a Random Access Memory (RAM), a Read Only Memory (ROM), a magnetic medium (e.g., a hard drive or a floppy disk), or an optical medium (e.g., a CD-ROM). In this context, a non-transitory computer-readable medium is almost any medium suitable for storing data or an instruction set, other than a transitory waveform. Any such computer-readable medium may reside on or within a single computing device, or may be present on or within different computing devices within a system or network.

According to an example embodiment, the term "processing element" or "processor" as used herein may be a Central Processing Unit (CPU), or conceptualized as a CPU (e.g., a virtual machine). In this example implementation, the CPU or a device containing the CPU may be coupled, connected, and/or in communication with one or more peripheral devices (e.g., a display). In another implementation, the processing element or processor may be incorporated into a mobile computing device such as a smartphone or tablet.

Non-transitory computer readable storage media as referred to herein may include a plurality of physical drive units, such as Redundant Array of Independent Disks (RAID), floppy disk drives, flash memory, USB flash drives, external hard drives, thumb drives, pen drives, key drives, high-density digital versatile disk (HD-DVD) optical disk drives, internal hard drives, blu-ray disk drives, or Holographic Digital Data Storage (HDDS) optical disk drives, Synchronous Dynamic Random Access Memory (SDRAM), or similar devices or other forms of memory based on similar technology. Such computer-readable storage media allow the processing element or processor to access computer-executable process steps, application programs, etc., stored on the removable and non-removable memory media to offload data from or upload data to the device. As described above, with respect to the embodiments described herein, a non-transitory computer-readable medium includes virtually any structure, technique, or method, except for a transitory waveform or similar medium.

Certain implementations of the disclosed technology are described herein with reference to block diagrams of systems and/or to flowcharts or flow diagrams of functions, operations, processes, or methods. It will be understood that one or more blocks of the block diagrams, or one or more stages or steps of the flowcharts or flow diagrams, and combinations of blocks in the block diagrams and stages or steps of the flowcharts or flow diagrams, respectively, may be implemented by computer-executable program instructions. Note that in some embodiments, one or more of the blocks, stages, or steps do not necessarily need to be performed in the order presented, or may not need to be performed at all.

These computer-executable program instructions may be loaded onto a general purpose computer, special purpose computer, processor, or other programmable data processing apparatus to produce a particular example of a machine, such that the instructions which execute on the computer, processor, or other programmable data processing apparatus create means for implementing one or more of the functions, operations, processes, or methods described herein. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement one or more of the functions, operations, processes, or methods described herein.

While certain implementations of the disclosed technology have been described in connection with what are presently considered to be the most practical and various implementations, it is to be understood that the disclosed technology is not limited to the disclosed implementations. On the contrary, the disclosed implementations are intended to cover various modifications and equivalent arrangements included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.

This written description uses examples to disclose certain implementations of the disclosed technology, and also to enable any person skilled in the art to practice certain implementations of the disclosed technology, including making and using any devices or systems and performing any incorporated methods. The patentable scope of certain implementations of the disclosed technology is defined in the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims if they have structural and/or functional elements that do not differ from the literal language of the claims, or if they include structural and/or functional elements with insubstantial differences from the literal language of the claims.

All references cited herein, including publications, patent applications, and patents, are hereby incorporated by reference to the same extent as if each reference were individually and specifically indicated to be incorporated by reference and/or were set forth in its entirety herein.

The use of the terms "a" and "an" and "the" and similar referents in the specification and the following claims is to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. Unless otherwise indicated, the terms "having," "including," "containing," and similar terms in the specification and the following claims are to be construed as open-ended terms (e.g., meaning "including, but not limited to"). Recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., "such as") provided herein, is intended merely to better illuminate embodiments of the invention and does not pose a limitation on the scope of the invention unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the embodiments of the invention.
