Computing cross products using mapreduce

Document No.: 1866277. Published: 2021-11-19.

Summary: This technology, Computing cross products using mapreduce, was created by 阿司瓦斯·马诺哈兰 and 尼古劳斯·桑塔格 on 2020-01-24. Its main content is as follows. A request to generate cross products from a data set based on a join field (connection field) is received. The join field indicates that each cross product is to be generated from a corresponding subset of the data set, where the subset is associated with the same key. In response to receiving the request to generate a plurality of cross products for the data set based on the join field, a mapreduce job set is executed on the data set to generate the cross products. Executing the mapreduce job set generates key-value pair groups from corresponding subsets of the data set. Each key-value pair of a respective key-value pair group includes the same key. In response to executing the mapreduce job set, final output data identifying the cross product of each key-value pair group is received.

1. A method, comprising:

receiving, by a processing device, a request to perform a mapreduce job set that generates a plurality of cross products from a dataset based on a connection field, wherein the connection field indicates that each of the plurality of cross products is to be generated from a corresponding subset of the dataset, the subset being associated with a same key;

in response to receiving the request to execute the mapreduce job set, executing the mapreduce job set on the data set to generate the plurality of cross products, wherein executing the mapreduce job set generates key-value pair groups from corresponding subsets of the data set, wherein each key-value pair of a respective key-value pair group includes the same key; and

storing final output data of the mapreduce job set, the final output data including the plurality of cross products for each of the key-value pair groups, wherein a cross product of the plurality of cross products pairs each value of the respective key-value pair group with each remaining value of the respective key-value pair group to form a value pair.

2. The method of claim 1, wherein the connection field indicates that each of the plurality of cross products is to be generated on a per-key basis such that each of the plurality of cross products is to be generated from the corresponding subset associated with the same key, but not from data of the dataset associated with a different key.

3. The method of claim 1, wherein performing the mapreduce job set on the data set to generate the plurality of cross products comprises:

executing a first mapping phase of a first job of the mapreduce job set using the data set to generate first intermediate data, wherein the first intermediate data comprises the key-value pair groups, wherein key-value pairs from different key-value pair groups of the key-value pair groups have different keys; and

executing a first reduction phase of the first job of the mapreduce job set using the first intermediate data to generate first output data, wherein the first output data includes a first key-value pair group of the key-value pair groups having a first modified key that indicates a sort order and a number of key-value pairs in the first key-value pair group.

4. The method of claim 3, wherein performing the first mapping phase of the first job of the mapreduce job set using the data set to generate first intermediate data comprises:

identifying a parameter associated with the request to generate the plurality of cross products, wherein the parameter indicates a unit of time;

for each entry of the data set in question,

identifying a time range indicated in the entry;

incrementing the time range by the time unit, wherein the incrementing starts from an earliest time identified by the time range to a last time identified by the time range, wherein at each increment a timestamp is generated reflecting a time in the time range at the increment; and

generating one or more key-value pairs for the entry based on the incrementing, wherein a key of the one or more key-value pairs identifies the timestamp of the incrementing, and wherein a value of the one or more key-value pairs identifies data of the entry; and

generating the key-value pair groups, wherein each key-value pair of a respective key-value pair group comprises the same timestamp.

5. The method of claim 3, wherein executing the mapreduce job set on the data set to generate the plurality of cross products further comprises:

executing a second mapping phase of a second job of the mapreduce job set using the first output data of the first job to generate second intermediate data, wherein the second intermediate data comprises the first key-value pair group having the first modified key; and

executing a second reduction phase of the second job of the mapreduce job set using the second intermediate data to generate second output data, wherein the second output data includes a first subset of the first key-value pair group having a second modified key, wherein the first modified key of the first key-value pair group is modified to generate the second modified key identifying the first subset of the first key-value pair group.

6. The method of claim 5, wherein performing the second reduction phase of the second job of the mapreduce job set using the second intermediate data to generate second output data comprises:

sorting the key-value pairs of the first key-value pair group according to the sort order identified by the first modified key;

identifying a number of key-value pairs in the first key-value pair group indicated by an initial key-value pair of the sorted key-value pairs;

determining a number of key-value pairs for each of the sub-groups of key-value pairs of the first group of key-value pairs, wherein the number of key-value pairs in the first sub-group does not exceed a maximum number of key-value pairs; and

generating the second modified key for the first subset of key-value pairs such that the second modified key identifies a particular subset from the first subset and the particular subset does not exceed the maximum number of key-value pairs.

7. The method of claim 5, wherein performing a mapreduce job set on the data set to generate the plurality of cross products further comprises:

performing a third mapping phase of a third job of the mapreduce job set using the second output data of the second job to generate third intermediate data, wherein the third intermediate data comprises the first subset of key-value pairs; and

performing a third reduction phase of the third job of the mapreduce job set using the third intermediate data to generate third output data, wherein the third output data comprises a second subset of key-value pairs generated from the first subset of key-value pairs, wherein the second subset of key-value pairs comprises the first subset of key-value pairs and repeated key-value pairs of the first subset of key-value pairs, wherein at least keys of the repeated key-value pairs are modified.

8. The method of claim 7, wherein executing the mapreduce job set on the data set to generate the plurality of cross products further comprises:

performing a fourth mapping stage of a fourth job of the mapreduce job set using the third output data of the third job to generate fourth intermediate data, wherein the fourth intermediate data comprises the second subset of key-value pairs; and

performing a fourth reduction phase of the fourth job of the mapreduce job set using the fourth intermediate data to generate fourth output data, wherein each reducer of the fourth reduction phase receives respective key-value pairs of the fourth intermediate data having a same key, wherein at each reducer each value of the respective key-value pairs is paired with each remaining value of the respective key-value pairs to generate new key-value pairs having a new key, wherein the fourth output data includes the new key-value pairs from each of the reducers of the fourth job.

9. The method of claim 8, wherein performing a mapreduce job set on the data set to generate the plurality of cross products further comprises:

performing a fifth mapping phase of a fifth job of the mapreduce job set using the fourth output data of the fourth job to generate fifth intermediate data, wherein the fifth intermediate data comprises the new key-value pairs from each of the reducers of the fourth job; and

executing a fifth reduction phase of the fifth job of the mapreduce job set to perform a deduplication operation to remove duplicate key-value pairs from the new key-value pairs from each of the reducers of the fourth job and to provide a cross-product of the plurality of cross-products for the first key-value pair group.

10. A system, comprising:

a memory; and

a processing device coupled to the memory to:

receive a request to perform a mapreduce job set that generates a plurality of cross products from a dataset based on a connection field, wherein the connection field indicates that each of the plurality of cross products is to be generated from a corresponding subset of the dataset, the subset being associated with a same key;

11. The system of claim 10, wherein the connection field indicates that each of the plurality of cross products is to be generated on a per-key basis such that each of the plurality of cross products is to be generated from the corresponding subset associated with the same key, but not from data of the dataset associated with a different key.

12. The system of claim 10, wherein to execute the mapreduce job set on the data set to generate the plurality of cross products, the processing device is further to:

13. The system of claim 12, wherein to perform the first mapping phase of the first job of the mapreduce job set using the data set to generate first intermediate data, the processing device is to:

identify a parameter associated with the request to generate the plurality of cross products, wherein the parameter indicates a unit of time;

for each entry of the data set,

identify a time range indicated in the entry;

and generate the key-value pair groups, wherein each key-value pair of a respective key-value pair group comprises the same timestamp.

14. The system of claim 12, wherein to perform the first mapping phase of the first job of the mapreduce job set using the data set to generate first intermediate data, the processing device is to:

15. The system of claim 14, wherein to perform the first mapping phase of the first job of the mapreduce job set using the data set to generate first intermediate data, the processing device is to:

16. The system of claim 15, wherein to perform the first mapping phase of the first job of the mapreduce job set using the data set to generate first intermediate data, the processing device is to:

17. The system of claim 16, wherein to perform the first mapping phase of the first job of the mapreduce job set using the data set to generate first intermediate data, the processing device is to:

execute a fifth reduction phase of the fifth job of the mapreduce job set to perform a deduplication operation to remove duplicate key-value pairs from the new key-value pairs from each of the reducers of the fourth job and to provide a cross product of the plurality of cross products for the first key-value pair group.

18. A non-transitory computer-readable medium comprising instructions that, in response to execution by a processing device, cause the processing device to perform operations comprising:

receiving, by the processing device, a request to execute a mapreduce job set that generates a plurality of cross products from a dataset based on a connection field, wherein the connection field indicates that each of the plurality of cross products is to be generated from a corresponding subset of the dataset, the subset being associated with a same key;

19. The non-transitory computer-readable medium of claim 18, wherein performing the mapreduce job set on the data set to generate the plurality of cross products comprises:

20. The non-transitory computer-readable medium of claim 19, wherein performing the first mapping phase of the first job of the mapreduce job set using the data set to generate first intermediate data comprises:

identifying a parameter associated with the request to generate the plurality of cross products, wherein the parameter indicates a unit of time;

for each entry of the data set in question,

identifying a time range indicated in the entry;

and generating the key-value pair groups, wherein each key-value pair of a respective key-value pair group comprises the same timestamp.

Technical Field

The present disclosure relates to the field of data processing systems, and more particularly to computing cross products using a mapreduce framework.

Background

Large-scale data processing involves extracting data of interest from raw data in one or more data sets and processing the raw data into useful data products. Large-scale data processing in parallel and distributed processing environments typically involves distributing data and computations among multiple disks and processing devices to efficiently utilize aggregate storage space and computing power.

Drawings

Various embodiments of the present disclosure will be understood more fully from the detailed description given below and from the accompanying drawings of various embodiments of the disclosure.

Fig. 1 illustrates an example system architecture according to an embodiment of this disclosure.

Fig. 2 is a flow diagram illustrating a method for generating a cross product based on a connection field in accordance with an embodiment of the present disclosure.

Fig. 3 is a flow diagram illustrating a method for performing a mapreduce job set on a data set to generate cross products, according to an embodiment of the present disclosure.

Fig. 4A illustrates a diagram of a first job of a mapreduce job set for generating cross products from a data set based on a connection field, according to an embodiment of the present disclosure.

Fig. 4B illustrates a diagram of a second job of a mapreduce job set for generating cross products from a data set based on a connection field, according to an embodiment of the present disclosure.

Fig. 4C illustrates a diagram of a third job of a mapreduce job set for generating cross products from a data set based on a connection field, according to an embodiment of the present disclosure.

Fig. 4E illustrates a diagram of a fifth job of the mapreduce job set for generating cross products from the data set based on the connection fields, according to an embodiment of the present disclosure.

Fig. 5 is a block diagram illustrating an example computer system in accordance with an embodiment of the present disclosure.

Detailed Description

The following description sets forth numerous specific details, such as examples of specific systems, components, and methods, etc., in order to provide a thorough understanding of several embodiments of the present disclosure. It will be apparent, however, to one skilled in the art that at least some embodiments of the present disclosure may be practiced without these specific details. In other instances, well-known components or methods are not described in detail or are presented in simple block diagram format in order to avoid unnecessarily obscuring the present disclosure. Accordingly, the specific details set forth are merely exemplary. Particular embodiments may vary from these exemplary details and still be considered within the scope of the present disclosure.

Aspects of the present disclosure relate to cross product generation using a mapreduce framework. Modern data centers typically include thousands of hosts that operate collectively to service requests from many more remote clients. During operation, the components of these data centers produce large amounts of machine-generated data. In general, the data may be converted into useful data products, and the converted data may be used in downstream processes, such as input to a trained machine learning model or used to perform certain operations, such as similarity analysis and scoring analysis.

One such data transformation is the cross product (also known as the "Cartesian product"). A cross product may refer to a set of values derived from an operation (e.g., a cross product operation) that pairs each value of a data set with each other value of the same data set, or with each value of one or more other data sets. For example, data set A may include 4 entries: { value 1, value 2, value 3, value 4 }. The cross product of data set A pairs each value of data set A with each remaining value of data set A. The cross product of data set A includes the set of values: { [value 1, value 2], [value 1, value 3], [value 1, value 4], [value 2, value 3], [value 2, value 4], [value 3, value 4] }. Creating cross products from large data sets can consume significant computer resources, such as computing, memory, and storage resources.
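The pairing described above can be sketched in a few lines of Python. This is a minimal illustration rather than the disclosed implementation; the function name `cross_product` is hypothetical, and pairs are treated as unordered, as in the four-entry example.

```python
from itertools import combinations


def cross_product(values):
    """Pair each value with each remaining value (unordered, no self-pairs)."""
    return [list(pair) for pair in combinations(values, 2)]


data_set_a = ["value 1", "value 2", "value 3", "value 4"]
pairs = cross_product(data_set_a)
for pair in pairs:
    print(pair)
```

Four entries produce six pairs, matching the set listed in the example.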

Mapreduce is a programming framework for processing and generating large data sets in parallel on computer clusters. A mapreduce job includes a map task and a reduce task. The map task may include one or more map operations, and the reduce task may include one or more reduce operations. The map task performs filtering and sorting of the data set, and the reduce task performs a summary operation.
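As a rough single-process illustration of the map, shuffle, and reduce steps, the sketch below counts entries per IP address. The helper `run_mapreduce` and the mapper and reducer names are hypothetical and stand in for a real cluster framework, which would distribute the same steps across many nodes.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for one mapreduce job (hypothetical helper)."""
    # Map phase: each record yields zero or more (key, value) pairs.
    intermediate = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            # Shuffle: collect the values that share a key.
            intermediate[key].append(value)
    # Reduce phase: one reducer call per key, visited in sorted key order.
    output = []
    for key in sorted(intermediate):
        output.extend(reducer(key, intermediate[key]))
    return output


def ip_mapper(record):
    user, ip = record
    yield ip, user


def count_reducer(key, values):
    # Summary operation: how many entries share this key.
    yield key, len(values)


records = [
    ("User1", "IP1"), ("User2", "IP1"), ("User3", "IP1"),
    ("User1", "IP2"), ("User2", "IP2"),
]
counts = run_mapreduce(records, ip_mapper, count_reducer)
print(counts)
```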

In some cases, a single cross product derived from all data of a data set may not be a useful data product. Rather, a useful data product may include a plurality of cross products generated from a data set, where each cross product is based on a particular value (e.g., joined on a particular key). For example, a data set may have 5 entries whose values are as follows: [User1: IP1], [User2: IP1], [User3: IP1], [User1: IP2], [User2: IP2]. The cross product of the entire data set pairs each of the five values with every other value. By contrast, generating a plurality of cross products of the data set, each based on a particular value (e.g., joined on a particular key), yields one cross product for the value "IP1", { [User1: IP1, User2: IP1], [User1: IP1, User3: IP1], [User2: IP1, User3: IP1] }, and another cross product for the value "IP2", { [User1: IP2, User2: IP2] }.
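The difference between one cross product over the whole data set and one cross product per value can be sketched as follows. The function name `cross_products_by_key` is hypothetical, and the code groups in memory rather than with mapreduce.

```python
from collections import defaultdict
from itertools import combinations


def cross_products_by_key(entries):
    """Build one cross product per IP instead of one for the whole data set."""
    groups = defaultdict(list)
    for user, ip in entries:
        groups[ip].append((user, ip))
    return {ip: list(combinations(members, 2)) for ip, members in groups.items()}


entries = [
    ("User1", "IP1"), ("User2", "IP1"), ("User3", "IP1"),
    ("User1", "IP2"), ("User2", "IP2"),
]
per_ip = cross_products_by_key(entries)
whole_data_set = list(combinations(entries, 2))

print(len(whole_data_set))   # pairs when every entry is paired with every other
print(len(per_ip["IP1"]))    # pairs restricted to entries sharing IP1
print(per_ip["IP2"])         # the single pair sharing IP2
```

For the five-entry example, the whole-data-set cross product has ten pairs, while the per-IP cross products have three and one.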

In some conventional systems, mapreduce may be used to generate a single cross product for all values of a data set. The data set may be large, and generating cross products from large data sets yields even larger data sets. For a data set with 1 million entries, the resulting cross product may have 1 trillion entries. Generating a single cross product using mapreduce for all values of a data set (especially a large data set) may be inefficient and consume significant computing, memory, and storage resources. In other conventional systems, a data set may be split by value into multiple data sets, such that each data set has entries that contain a particular value. A mapreduce job (or set of jobs) may be run on each data set to generate a cross product for that data set. However, for large data sets, splitting the data set in this manner may produce thousands or even millions of smaller data sets. A mapreduce job (or set of jobs) would have to be created for each smaller data set, which may itself be impractical or untenable. The individual mapreduce jobs typically run in series, which can be slow and an inefficient use of computer resources. Furthermore, performing efficient parallel processing can be challenging when cross products are generated using mapreduce. For example, data may be skewed such that the data associated with a particular value (e.g., a key) is much larger than the data associated with a different value (e.g., a key). Data skew can result in inefficient use of computational resources in the mapreduce framework, as some processing nodes may spend a significant amount of time processing large data blocks while other nodes sit idle after processing small data blocks.
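The growth and skew described above can be made concrete with a short calculation. The group sizes below are hypothetical. Counting unordered pairs gives n(n-1)/2 per group, which is roughly half of the n-squared figure obtained when ordered pairs are counted.

```python
def pair_count(n):
    """Number of unordered pairs among n values: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical, heavily skewed group sizes (entries per key).
group_sizes = {"IP1": 1_000_000, "IP2": 1_000, "IP3": 10}

work = {key: pair_count(n) for key, n in group_sizes.items()}
total = sum(work.values())

for key, pairs in work.items():
    print(f"{key}: {pairs:,} pairs ({pairs / total:.4%} of all pairs)")
```

If each key were handled by a single reducer, the reducer for "IP1" would produce nearly all of the pairs while the others finished almost immediately.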

Aspects of the present disclosure address the above and other challenges by generating a plurality of cross products for a data set, each cross product based on a join field (the "connection field"). The connection field may indicate that the key of each key-value pair is to be generated from a particular data field of a data set entry. If the values in the data field identified by the connection field are the same, the keys of the generated key-value pairs are the same. Cross products may be generated for key-value pairs having the same key, such that the plurality of cross products of the data set are each based on a particular value.
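A minimal sketch of how a connection field might select the key is shown below. The entry layout (dictionaries with "user" and "ip" fields) and the function name `map_on_join_field` are assumptions made for illustration; the source does not specify a record format.

```python
from collections import defaultdict


def map_on_join_field(entries, join_field):
    """Emit (key, entry) pairs keyed on the join field, then group by key."""
    groups = defaultdict(list)
    for entry in entries:
        key = entry[join_field]      # the same field value gives the same key
        groups[key].append(entry)
    return dict(groups)


entries = [
    {"user": "User1", "ip": "IP1"},
    {"user": "User2", "ip": "IP1"},
    {"user": "User3", "ip": "IP1"},
    {"user": "User1", "ip": "IP2"},
    {"user": "User2", "ip": "IP2"},
]

by_ip = map_on_join_field(entries, "ip")
by_user = map_on_join_field(entries, "user")
print({key: len(group) for key, group in by_ip.items()})
print({key: len(group) for key, group in by_user.items()})
```

Changing the connection field from "ip" to "user" regroups the same entries, so the resulting cross products would pair IP addresses per user rather than users per IP address.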

In some embodiments, a mapreduce job set converts a data set into a plurality of key-value pair groups, wherein the key-value pairs of each group share the same key. The mapreduce job set performed on the data set may further produce a plurality of cross products, where each cross product is generated for a group of key-value pairs having the same key, without generating cross products for key-value pairs that do not share the same key.

In an embodiment, the mapreduce job set modifies the keys of the key-value pair groups to control the number of key-value pairs sent to any one reducer. By controlling the number of key-value pairs sent to any one reducer, the computational load of generating the cross product is distributed among the available reducers, which allows the cross product to be computed faster and more efficiently using computational, memory, and storage resources.
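One way to bound the reducer load by modifying keys is sketched below. This is an assumption-laden illustration, not the disclosed job set: it splits a group into subgroups of at most `max_size` values, replicates each subgroup under modified keys so that every two subgroups meet at one reducer, pairs the values at each reducer, and deduplicates the result. All function names are hypothetical, and values within a group are assumed to be distinct.

```python
from collections import defaultdict
from itertools import combinations


def split_into_subgroups(values, max_size):
    """Sort the group and cut it into subgroups of at most max_size values."""
    ordered = sorted(values)
    return [ordered[i:i + max_size] for i in range(0, len(ordered), max_size)]


def replicate_with_modified_keys(key, subgroups):
    """Emit (modified_key, value) so every two subgroups meet at one reducer."""
    if len(subgroups) == 1:
        for value in subgroups[0]:
            yield (key, 0, 0), value
        return
    for i, j in combinations(range(len(subgroups)), 2):
        for value in subgroups[i] + subgroups[j]:
            yield (key, i, j), value


def bounded_cross_product(key, values, max_size):
    """Return the deduplicated pairs and the largest reducer input size."""
    reducer_inputs = defaultdict(list)
    subgroups = split_into_subgroups(values, max_size)
    for modified_key, value in replicate_with_modified_keys(key, subgroups):
        reducer_inputs[modified_key].append(value)
    pairs = set()  # the set removes pairs produced by more than one reducer
    for received in reducer_inputs.values():
        pairs.update(combinations(sorted(received), 2))
    largest = max((len(v) for v in reducer_inputs.values()), default=0)
    return sorted(pairs), largest


users = [f"User{i}" for i in range(1, 8)]
pairs, largest_input = bounded_cross_product("IP1", users, max_size=3)
print(len(pairs), largest_input)
```

In this sketch no reducer receives more than twice `max_size` values, regardless of how large the original group is. The cost is that values are replicated and pairs within a subgroup may be produced by several reducers, which is why a deduplication step follows.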

Thus, the techniques described herein allow multiple cross products to be generated from a data set using a mapreduce job set. The foregoing reduces the consumption of computational resources (e.g., processing resources), memory resources, and storage resources in two ways: by creating cross products based on the join field, which results in cross products that are each based on a particular value (e.g., joined on a particular key) rather than a single cross product for the entire data set; and by modifying the keys to control the number of key-value pairs processed by each reducer in downstream mapreduce operations.

Fig. 1 illustrates an example system architecture 100 in accordance with embodiments of the present disclosure. System architecture 100 (also referred to herein as a "system") includes client devices 110A and 110B (also referred to herein generally as "client devices 110"), network 105, data store 106, collaboration platform 120, server 130, and computer cluster 150. It may be noted that system architecture 100 is provided for purposes of illustration and not limitation. In embodiments, the system architecture 100 may include the same, fewer, more, or different components configured in the same or different ways.

In one embodiment, the network 105 may include a public network (e.g., the Internet), a private network (e.g., a Local Area Network (LAN) or a Wide Area Network (WAN)), a wired network (e.g., Ethernet), a wireless network (e.g., an 802.11 network or a wireless LAN (WLAN)), a cellular network (e.g., a Long Term Evolution (LTE) network), a router, a hub, a switch, a server computer, or a combination thereof.

In one embodiment, the data store 106 can be a memory (e.g., random access memory), a cache, a drive (e.g., a hard drive), a flash drive, a database system, or another type of component or device capable of storing data. The data store 106 can also include multiple storage components (e.g., multiple drives or multiple databases) that can also span multiple computing devices (e.g., multiple server computers).

In embodiments, the server 130 may be one or more computing devices (e.g., a rack server, a server computer, a physical server cluster, etc.). In embodiments, the server 130 may be included in the collaboration platform 120, be a stand-alone system, or be part of another system or platform. The server 130 may include a cross product module 140.

In some embodiments, the collaboration platform 120 may be one or more computing devices (e.g., a rack server, a router computer, a server computer, a personal computer, a mainframe computer, a laptop computer, a tablet computer, a desktop computer, etc.), a data store (e.g., a hard disk, a memory, a database), a network, a software component, or a hardware component that may be used to perform the operations of the collaboration platform 120 and provide a user with access to the collaboration platform 120.

In embodiments, the collaboration platform 120 may also include a website (e.g., a web page) or application backend software that may be used to provide users with access to content provided by the collaboration platform 120. For example, a user may access the collaboration platform 120 using the collaboration application 114 on the client device 110. It may be noted that collaboration applications 114A and 114B may be referred to generally herein as collaboration applications 114. In some embodiments, collaboration application 114 may be two instances of the same application.

In embodiments, the collaboration platform 120 may be a social network providing connections between users or a user-generated content system that allows users (e.g., end users or consumers) to create content for the platform, where the created content may also be consumed by other users of the system. In embodiments of the present disclosure, a "user" may be represented as a single individual. However, other embodiments of the present disclosure encompass "users" (e.g., creating users) as entities controlled by a set of users or an automated source. For example, a set of individual users that are joined as a community or group in a user-generated content system may be considered a "user".

In one embodiment, the collaboration platform 120 may be a gaming platform, such as an online gaming platform or a virtual gaming platform. For example, the gaming platform may provide a single-player game or a multiplayer game to a community of users that may access games 122A-122Z or interact with games 122A-122Z via network 105 using client devices 110. In embodiments, for example, game 122 (also referred to as a "video game," "online game," or "virtual game") may be a two-dimensional (2D) game, a three-dimensional (3D) game (e.g., a 3D user-generated game using creator module 126), a Virtual Reality (VR) game, or an Augmented Reality (AR) game. In embodiments, a user (such as a game-playing user) may participate in a game with other game-playing users. In an embodiment, the game 122 may be played in real time with other users of the game 122.

In some embodiments, the game 122 may include an electronic file that may be executed or loaded using software, firmware, or hardware configured to present game content (e.g., digital media items) to an entity. In an embodiment, game 122 may be executed and presented using game engine 124. In some embodiments, games 122 may have a common rule set or common goal, and the environments of games 122 share a common rule set or common goal. In embodiments, different games may have different rules or goals from one another.

It may be noted that collaboration platform 120 hosting game 122 is provided for purposes of illustration and not limitation. In some embodiments, collaboration platform 120 may host one or more media items. Media items may include, but are not limited to, digital videos, digital movies, digital photos, digital music, audio content, website content, social media updates, electronic books, electronic magazines, digital newspapers, digital audio books, electronic periodicals, web blogs, Really Simple Syndication (RSS) feeds, electronic comics, software applications, and so forth. In embodiments, the media items may be electronic files that may be executed or loaded using software, firmware, or hardware configured to present the digital media items to a user.

In some embodiments, the collaboration platform 120 or the client device 110 may include a game engine 124. In embodiments, game engine 124 may be used for development or execution of game 122. For example, game engine 124 may include a rendering engine ("renderer") for 2D, 3D, VR, or AR graphics, physics engines, collision detection engines (and collision responses), sound engines, scripting functions, animation engines, artificial intelligence engines, network functions, streaming functions, memory management functions, threading functions, scene graph functions, or video support for cut scenes, among other features. Components of game engine 124 may generate commands (e.g., rendering commands, collision commands, physics commands, etc.) that aid in computing and rendering game 122. In some embodiments, the game engine 124 of the client device 110 may work independently, in cooperation with the game engine 124 of the collaboration platform 120, or a combination of both.

In embodiments, the collaboration platform 120 may include a creator module 126. In embodiments, the creator module 126 may allow users of the collaboration platform 120 to become creating users who design or create environments in existing games 122, create new games, or create new game objects in games or environments.

In embodiments, the creator module 126 may allow a user to create, modify, or customize a character. In an embodiment, a character (or game object in general) is made up of components, where one or more components are selectable by a user, and these components are automatically linked together to assist the user in editing. One or more characters (also referred to herein as "avatars" or "models") may be associated with a user (also referred to herein as "game-playing user"), wherein the user may control the characters to facilitate user interaction with the game 122. In embodiments, a character may include components such as body parts (e.g., hair, arms, legs, etc.) and accessories (e.g., T-shirts, eyeglasses, decorative images, tools, etc.). In an embodiment, the customizable body parts of a character include a head type, a body part type (arms, legs, torso, and hands), a face type, a hair type, a skin type, and the like. In embodiments, the customizable accessories include clothing (e.g., shirts, pants, hats, shoes, glasses, etc.), weapons, or other tools. In embodiments, the user may also control the dimensions (e.g., height, width, or depth) of the character or the dimensions of the components of the character. In an embodiment, a user may control the proportions of a character (e.g., blocky, anatomical, etc.). It may be noted that in some embodiments, a character may not include character game objects (e.g., body parts, etc.), but a user may control the character (without character game objects) to facilitate user interaction with the game (e.g., a puzzle game in which character game objects are not rendered, but the user still controls the character to control actions within the game).

In an embodiment, the collaboration platform 120 executing the creator module 126 includes a user interface website or application (e.g., collaboration application 114) in which users (also referred to herein as "creating users," "creators," "owners," or "owning users") may access online computing resources (e.g., cloud resources) hosted by the collaboration platform 120 for purposes of building, managing, editing, and interacting with the personally owned games 122 or gaming environment. In embodiments, the creator module 126 includes tools that a user can use to create and instantiate a three-dimensional virtual game or environment. In embodiments, the creator module 126 is available to users who wish to create and manage their own private virtual games 122. In embodiments, a user may access the creator module 126 using the collaboration application 114. In embodiments, the creator module 126 may use a user interface (also referred to herein as a "developer interface") through the collaboration application 114 to allow a user to access the functionality of the creator module 126. In embodiments, the developer interface may be part of the collaboration application 114. For example, a developer interface of the collaboration application 114 may allow a user to access a library of game objects that may be selected by the user to build a game environment or to build the game 122. Users may publish their game objects through a developer interface so that the game is available to users of the collaboration platform 120.

In embodiments, the collaboration platform 120 may include a messaging module 128. In an embodiment, messaging module 128 may be a system, application, or module that allows users to exchange electronic messages via a communication system, such as network 105. The messaging module 128 may be associated with the collaboration application 114 (e.g., a module of the collaboration application 114 or a separate application). In embodiments, users may interact with the messaging module 128 and exchange electronic messages between users of the collaboration platform 120. The messaging module 128 may be, for example, an instant messaging application, a text messaging application, an email application, a voice messaging application, a video messaging application, a combination thereof, or the like.

In embodiments, the messaging module 128 may facilitate the exchange of electronic messages between users. For example, one user may be logged into a messaging application on client device 110A, while another user may be logged into a messaging application on client device 110B. Two users may begin a conversation, such as an instant messaging conversation. The messaging module 128 may help facilitate messaging conversations by sending and receiving electronic messages between users of the collaboration platform 120. In another embodiment, two users may use respective messaging applications to participate in an in-game conversation with each other, where the conversation may be part of a view that includes a playable game (gameplay).

In an embodiment, the client devices 110A-110B may each include a computing device, such as a Personal Computer (PC), a mobile device (e.g., a laptop, mobile phone, smartphone, tablet, or netbook computer), a network-connected television, a game console, and so forth. In some embodiments, client devices 110A-110B may also be referred to as "user devices." In embodiments, one or more client devices 110 may connect to the collaboration platform 120 via the collaboration application 114 at any given moment. It may be noted that the number of client devices 110 is provided as an illustration and not a limitation. In embodiments, any number of client devices 110 may be used.

In an embodiment, each client device 110 may include an instance of collaboration application 114. In one embodiment, the collaboration application 114 may be an application that allows users to use and interact with the collaboration platform 120, such as to control virtual characters in a virtual game hosted by the collaboration platform 120, or to view or upload content, such as games 122, images, video items, web pages, documents, and so forth. In one example, the collaboration application 114 can be a web application (e.g., an application operating in conjunction with a web browser) that can access, retrieve, render, or navigate content (e.g., a virtual character in a virtual environment, etc.) provided by a web server. In another example, the collaboration application 114 may be a local application (e.g., a mobile application or game program) that is installed and executed locally at the client device 110 and that allows the user to interact with the collaboration platform 120. The collaboration application 114 may render, display, or present content (e.g., web pages, media viewers) to the user. In one embodiment, the collaboration application 114 may also include an embedded media player (e.g., a player) embedded in a web page.

In general, the functions described as being performed by the collaboration platform 120 in one embodiment may also be performed by the client devices 110A-110B or the server 130 in other embodiments (as appropriate). Further, functionality attributed to a particular component may be performed by different or multiple components operating together. The collaboration platform 120 may also be accessed as a service provided to other systems or devices through an appropriate Application Programming Interface (API).

In embodiments, the collaboration platform 120 may generate large amounts of data in the operation of the collaboration platform 120. For example, collaboration platform 120 may have millions of users participating in a user session per day to play or create game 122. A large amount of raw data related to a user session may be stored in one or more databases associated with data store 106. A session (also referred to herein as a "user session") may refer to a period of time that begins when an application (e.g., collaboration application 114) is opened to access collaboration platform 120 and ends when the application is closed. In some embodiments, the session may span a period of time (e.g., a time range) that begins when the application is open and the user is interacting with the collaboration platform 120. When the user is inactive for a threshold period of time, the session may end (e.g., even if the application is still open). The session information may include contextual information describing a particular session (e.g., start and end timestamps, client device type, internet protocol address used to access the collaboration platform 120, etc.) and user activity information describing user interactions with the collaboration platform 120 (e.g., user inputs to control character actions, text messages, etc.).

In an embodiment, the cross product module 140 may be used to determine the cross product of the data set. The cross product module 140 can use the mapreduce job set to determine a cross product for one or more data sets. A mapreduce job may refer to two phases of mapreduce (e.g., a map phase and a reduce phase). In the mapping phase, one or more mapping operations (e.g., mapping tasks) retrieve data (e.g., key-value pairs) from an input data file and generate intermediate data values in accordance with the mapping operations. In the reduction phase, one or more reduction operations (e.g., reduction tasks) merge or otherwise combine intermediate data values in accordance with the reduction operations (e.g., combining intermediate values that share the same key) to produce output data. A mapreduce job set may refer to two or more mapreduce jobs that are typically executed serially. For example, two mapreduce jobs executed in series may include a first mapreduce job (e.g., a map phase and a reduce phase) that produces an output that serves as an input to a second mapreduce job (e.g., another map phase and another reduce phase).
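The two phases and the serial chaining of jobs described above can be sketched in a few lines of Python. This is a minimal single-process illustration, not the framework run by any particular cluster; the helper `run_job` and the two example jobs (counting entries per key, then grouping keys by count) are hypothetical names chosen only for this sketch.

```python
from collections import defaultdict

def run_job(records, map_fn, reduce_fn):
    """Run one map phase followed by one reduce phase, in-process."""
    # Map phase: each input record yields zero or more (key, value) pairs.
    intermediate = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            intermediate[key].append(value)  # group intermediate values by key
    # Reduce phase: combine the intermediate values that share a key.
    output = []
    for key, values in intermediate.items():
        output.extend(reduce_fn(key, values))
    return output

# Hypothetical job 1: count how many entries carry each key.
def count_map(record):
    yield record, 1

def count_reduce(key, values):
    yield key, sum(values)

# Hypothetical job 2: consumes job 1's output and groups keys by their count.
def invert_map(record):
    key, count = record
    yield count, key

def invert_reduce(count, keys):
    yield count, sorted(keys)

# Serial execution: the output of the first job is the input of the second.
first_output = run_job(["IP 1", "IP 2", "IP 1"], count_map, count_reduce)
second_output = run_job(first_output, invert_map, invert_reduce)
```

In a real cluster the map and reduce calls would be spread over worker nodes; only the data flow between the phases is shown here.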

In some embodiments, a system for large scale processing of data in a parallel processing environment includes one or more computer clusters 150. It may be noted that computer cluster 150 is shown as a single cluster for purposes of illustration and not limitation. Computer cluster 150 may include one or more computer clusters. In an embodiment, computer cluster 150 includes one or more interconnected nodes 132 and 134A through 134N that perform common tasks such that computer cluster 150 may be considered a single computer system. For example, the computer cluster 150 includes a master node 132 (generally referred to as "node 132") and worker nodes 134A-134N (generally referred to as "nodes 134" or "worker nodes 134"). Each node 132 and 134 of computer cluster 150 may include, but is not limited to, any data processing device, such as a desktop computer, laptop computer, mainframe computer, personal digital assistant, server computer, handheld device, a processor, one or more dies of a multi-die processor, or any other device configured to process data. The nodes 132 and 134 of the computer cluster 150 may be connected to each other through a network, such as network 105. Each node 132 and 134 may run its own operating system instance.

In an embodiment, each node 132 and 134 of computer cluster 150 may have its own physical or virtual memory. The memory may include, but is not limited to, a main memory such as a Read Only Memory (ROM), flash memory, Dynamic Random Access Memory (DRAM), or Static Random Access Memory (SRAM). Each node of computer cluster 150 may have data stored on local storage (not shown), such as local storage disks. Computer cluster 150 and each node 132 and 134 of computer cluster 150 may further implement various network accessible server-based functions (not shown) or include other data processing devices.

In some embodiments, master node 132 may control aspects of the mapreduce job. For example, the master node 132 may determine how many mapping operations to use, how many reduction operations to use, which processes and processing devices (e.g., nodes) to use to perform the operations, where to store intermediate and output data, how to respond to processing failures, and so forth. The master node 132 may direct one or more worker nodes 134 to perform various operations of the map-reduce job. It may be noted that a single mapreduce job may run in parallel on one or more nodes 134 of computer cluster 150.

The nodes 134 of the computer cluster may perform mapping operations, reduction operations, or both. Individual nodes 134 may perform one or more mapping operations in parallel or in series. Individual nodes 134 may perform one or more reduction operations in parallel or in series. A "mapper" may refer to a node 134 that performs one or more mapping operations. A "reducer" may refer to the same or different nodes 134 that perform one or more reduction operations. In some embodiments, a single node 134 may include one or more mappers, one or more reducers, or both.

In an embodiment, computer cluster 150 may run a mapreduce framework. Computer cluster 150 may be configured to run a particular mapreduce framework, such as Apache™ Infinispan or Apache™ Spark™.

Computer cluster 150 may be associated with one or more queues 136. Queue 136 may include a data structure that stores elements. Queue 136 may assist computer cluster 150 with scheduling information associated with one or more mapreduce jobs.

In some embodiments, the elements stored in the queue 136 may include a tag 138. In some examples, the tag 138 includes the actual data units on which computer cluster 150 performs one or more mapreduce operations. In other examples, the tag 138 may identify the location of the data unit stored at the data store 106. For example, each tag 138 may be associated with one or more rows of data in a database. Each tag 138 may identify a database, a start address of data (e.g., a start row in the database), and an end address of data (e.g., an end row in the database). For example, the tag 138 may be associated with rows 1 through 10,000 in the database. Each tag 138 may identify a fixed size address range. For example, the first tag may identify rows 1-10,000 of the database, while the second tag may identify rows 10,001-20,000 of the database.

In an embodiment, the elements in the queue 136 may be maintained in order and operations on the data structure may include adding and removing elements from the data structure. For example, queue 136 may be a first-in-first-out (FIFO) queue, where the first element added to the queue will be the first element to be removed from the queue.
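As an illustration of tags that identify fixed-size row ranges and of FIFO ordering, the following sketch builds tags for a database and places them in a queue. The `Tag` type, the `make_tags` helper, and the database name `sessions_db` are assumptions made for this example only.

```python
from collections import deque
from typing import NamedTuple

class Tag(NamedTuple):
    database: str   # hypothetical database name
    start_row: int  # first row covered by the tag (inclusive)
    end_row: int    # last row covered by the tag (inclusive)

def make_tags(database, total_rows, rows_per_tag=10_000):
    """Split rows 1..total_rows into consecutive ranges of rows_per_tag rows."""
    tags = []
    for start in range(1, total_rows + 1, rows_per_tag):
        # The final range is shorter when total_rows is not a multiple
        # of rows_per_tag.
        end = min(start + rows_per_tag - 1, total_rows)
        tags.append(Tag(database, start, end))
    return tags

# A deque used as a FIFO queue: append on the right, remove from the left.
queue = deque(make_tags("sessions_db", 25_000))
first = queue.popleft()  # the first tag added is the first tag removed
```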

In some embodiments, queue 136 is hosted by computer cluster 150, such as by master node 132. In other embodiments, the queue 136 may be hosted by another component. For example, queue 136 may be hosted by a component external to computer cluster 150. The data of the queue 136 may be stored in a memory (e.g., random access memory), a cache, a drive (e.g., a hard drive), a flash drive, a database system, or another type of component or device capable of storing data.

In some embodiments, a user using client device 110 may submit a request for one or more mapreduce jobs to be executed by computer cluster 150. The master node 132 of the computer cluster 150 may receive the mapreduce job and determine the mapreduce operation to perform, and request one or more worker nodes 134 to perform various mapreduce operations. In an embodiment, aspects of the present disclosure may be implemented by the cross-product module 140 executed by the master node 132. In other embodiments, the cross-product module 140 executed by the master node 132, the worker nodes 134, or both may implement aspects of the disclosure.

For purposes of illustration and not limitation, cross-product module 140 is described as being implemented at master node 132. In other embodiments, the cross product module 140 may be partially or fully implemented at the collaboration platform 120. In other embodiments, the cross product module 140 may be partially or fully implemented at one or more client devices 110. In other embodiments, the cross product modules 140 operating at one or more of the client devices 110, the computer cluster 150, or the collaboration platform 120 may work in concert to perform the operations described herein. Although embodiments of the present disclosure are discussed in terms of a collaboration platform, embodiments may also apply generally to any type of platform that generates or stores data. The cross product module 140 may help facilitate the operations described herein, such as the operations described with respect to fig. 2-4. In some embodiments, the cross product module 140 may be part of another application, such as a plug-in. In some embodiments, the cross-product module 140 may be a stand-alone application executing on a computing device.

Where the system discussed herein collects or otherwise makes available personal information about a user, the user may be provided with the following opportunities: control whether the collaboration platform 120 collects user information (e.g., information about the user's social network, social actions or activities, profession, user preferences, or the user's current location); or to control whether or how content is received from the collaboration platform 120 that may be more relevant to the user. In addition, some data may be processed in one or more ways before being stored or used in order to delete personally identifiable information. For example, the user's identity may be processed so that personally identifiable information of the user cannot be determined, or the user's geographic location may be generalized when obtaining location information, such as a city, postal (ZIP) code, or state level, so that a particular location of the user cannot be determined. Thus, the user may control how information about the user is collected and used by the collaboration platform 120.

Fig. 2 is a flow diagram illustrating a method 200 for generating cross products based on a join field, in accordance with an embodiment of the present disclosure. Method 200 may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, programmable logic, microcode), software (e.g., instructions run on a processing device to perform hardware simulation), or a combination thereof. In some embodiments, cross product module 140 executing at computer cluster 150 (e.g., at one or more of nodes 132 or 134) may perform some or all of the operations. In other embodiments, the cross product module 140 executing at the collaboration platform 120, the client device 110A, the client device 110B, the server 130, or a combination thereof may perform some or all of the operations. It may be noted that in some embodiments, method 200 may include the same, different, fewer, or greater number of operations performed in any order.

At block 205 of method 200, processing logic executing method 200 receives a request to execute a mapreduce job set to generate a cross product from a data set based on a join field. The join field indicates that each cross product is to be generated from a corresponding subset of the data set. Each subset is associated with the same key and different subsets are associated with different keys.

In some embodiments, the join field indicates that each cross product is to be generated on a per-key basis, such that each cross product is to be generated from a corresponding subset associated with the same key rather than from data of a dataset associated with a different key.

In some embodiments, a request (e.g., a single request) is received from the client device 110 requesting a mapreduce job set for generating a cross product of the data set. A data set includes one or more entries, and each entry of the data set includes data specific to the particular entry. The request may include a join field parameter (e.g., "IP") that indicates that a cross product is to be generated over the join field (e.g., over "IP") to create a cross product for each subset of entries of the data set associated with the same key, but not across entries of the data set associated with different keys. For example, each entry in a subset of the data set contains data "IP 1," while each entry in another subset of the data set contains data "IP 2." A cross product is generated for the subset of data associated with "IP 1" and another cross product is generated for the subset of data associated with "IP 2", but no cross product is generated that pairs an entry containing "IP 1" with an entry containing "IP 2".

At block 210, in response to receiving a request to execute a mapreduce job set to generate a cross product of the data set based on the join field, processing logic executes the mapreduce job set on the data set to generate the cross product. To execute a mapreduce job set, processing logic generates key-value pair groups from corresponding subsets of the data set. Each key-value pair of the corresponding key-value pair group comprises the same key.

It may be noted that one or more mapping operations of a mapreduce job may be used to convert entries of a dataset into key-value pairs. In some embodiments, with some data of an entry as a key, all data of the entry is retained and becomes a value in a key-value pair. The join field indicates which data of the entry is to be a key. If the key data in multiple entries is the same, the entries will generate key-value pairs with the same or identical keys. Executing the mapreduce job set will be further described with reference to fig. 3 and 4A-4F.

At block 215, processing logic stores final output data of the mapreduce job set, the final output data including the cross products for each of the key-value pair groups. A cross product pairs each value of the respective key-value pair group with each remaining value of the respective key-value pair group to form value pairs.
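The intended result of method 200 can be illustrated without any mapreduce machinery. The sketch below groups entries by the join field and pairs each value with each remaining value inside its own group only. The entry layout, the `user` field, and the choice to treat value pairs as unordered (each pair listed once) are assumptions of this example.

```python
from collections import defaultdict
from itertools import combinations

def cross_products_by_key(entries, join_field):
    """Return, for each key, every pair of values that share that key."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry[join_field]].append(entry["user"])
    # combinations(values, 2) pairs each value with each remaining value once.
    return {key: list(combinations(values, 2)) for key, values in groups.items()}

# Hypothetical data set with two keys, "IP 1" and "IP 2".
entries = [
    {"user": "User 1", "IP": "IP 2"},
    {"user": "User 2", "IP": "IP 1"},
    {"user": "User 3", "IP": "IP 1"},
    {"user": "User 4", "IP": "IP 2"},
    {"user": "User 5", "IP": "IP 1"},
]
result = cross_products_by_key(entries, "IP")
```

No pair mixes a user from "IP 1" with a user from "IP 2", which is the per-key behavior the join field requests.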

FIG. 3 is a flow diagram illustrating a method 300 for performing a mapreduce job set on a data set to generate a cross-product, in accordance with an embodiment of the present disclosure. Method 300 may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, programmable logic, microcode), software (e.g., instructions run on a processing device to perform hardware simulation), or a combination thereof. In some embodiments, cross product module 140 executing at computer cluster 150 (e.g., at one or more of nodes 132 or 134) may perform some or all of the operations. In other embodiments, the cross product module 140 executing at the collaboration platform 120, the client device 110A, the client device 110B, or the server 130, or a combination thereof, may perform some or all of the operations. It may be noted that in some embodiments, method 300 may include the same, different, fewer, or greater number of operations performed in any order.

At operation 305, processing logic executing method 300 executes a first mapping phase of a first job of the mapreduce job set using the data set to generate first intermediate data. The first intermediate data includes key-value pair groups. The key-value pairs of a particular key-value pair group have the same key. Different key-value pair groups have different keys.

At operation 310, processing logic performs a first reduction phase of the first job of the mapreduce job set using the first intermediate data to generate first output data. The first output data includes a first key-value pair group of the key-value pair groups, the first key-value pair group having first modified keys that indicate a sorting order and a number of key-value pairs in the first key-value pair group. The first job of the mapreduce job set will be further described with reference to fig. 4A.

In some embodiments, a data set may have entries that include time ranges. In an embodiment where the dataset has entries that include a time range, to execute a first mapping stage of a first job of the mapreduce job set to generate first intermediate data using the dataset, processing logic identifies a parameter associated with a request to execute the mapreduce job set that generates a plurality of cross products. In some embodiments, the parameter indicates a time unit. For each entry of the data set, processing logic identifies a time range indicated in the entry. Processing logic increments the time range by the time unit. Incrementing starts from the earliest time identified by the time range to the last time identified by the time range. At each increment, processing logic generates a timestamp reflecting a time within the time range at the increment. Processing logic generates one or more key value pairs for the entry based on the incrementing. The keys of the one or more key-value pairs identify incremental timestamps, and the values of the one or more key-value pairs identify data of the entries. Processing logic generates key-value pair groups, wherein each key-value pair of a respective key-value pair group includes the same timestamp (e.g., the same key). The above mapping operation will be further described with reference to fig. 4F.
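A minimal sketch of the time-range mapping described above follows. It assumes each entry carries hypothetical `start` and `end` fields and that the last time of the range is included; the time unit is passed as a `timedelta`.

```python
from datetime import datetime, timedelta

def expand_time_range(entry, unit):
    """Yield (timestamp, entry) pairs, stepping through the entry's time range."""
    current = entry["start"]
    # Step from the earliest time to the last time (assumed inclusive).
    while current <= entry["end"]:
        yield current, entry  # key is the timestamp, value is the entry data
        current += unit

# Hypothetical entry spanning three minutes, expanded with a one-minute unit.
entry = {
    "user": "User 1",
    "start": datetime(2020, 1, 24, 10, 0),
    "end": datetime(2020, 1, 24, 10, 2),
}
pairs = list(expand_time_range(entry, timedelta(minutes=1)))
```

Entries whose ranges overlap produce key-value pairs with the same timestamp key, so they fall into the same key-value pair group.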

At operation 315, processing logic performs a second mapping phase of a second job of the mapreduce job set using the first output data of the first job to generate second intermediate data. The second intermediate data includes the first key-value pair group having the first modified keys. It may be noted that for clarity, operation 315 and subsequent operations 320-350 are described for the first key-value pair group to produce a cross product for the first key-value pair group. It will be appreciated that similar operations may be performed on other key-value pair groups, where a cross product is generated for each key-value pair group. In general, a plurality of cross products are generated for a data set, each cross product being generated on a per-key basis (e.g., joined on a particular key).

At operation 320, processing logic performs a second reduction phase of the second job of the mapreduce job set using the second intermediate data to generate second output data. The second output data includes first subgroups of the first key-value pair group having second modified keys. The first modified keys of the first key-value pair group are modified to generate the second modified keys, each of which identifies a first subgroup of the first key-value pair group.

In some embodiments, to perform the second reduction phase of the second job of the mapreduce job set using the second intermediate data to generate the second output data, processing logic sorts the key-value pairs of the first key-value pair group in the sorting order identified by the first modified keys. Processing logic identifies the number of key-value pairs in the first key-value pair group, as indicated by the key of the initial key-value pair of the sorted key-value pairs. Processing logic determines a number of key-value pairs for each subgroup of the first key-value pair group. The number of key-value pairs in each subgroup does not exceed the maximum number of key-value pairs identified in the request. Processing logic generates second modified keys for the subgroups such that each second modified key identifies a particular subgroup. The second mapreduce job of the mapreduce job set will be further described with reference to fig. 4B.
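The second reduction phase can be sketched as follows. The key layout is an assumption made for illustration: a first modified key is written as `prefix:sort_id` and the initial pair additionally carries the group size as `prefix:0:size`. The round-robin assignment of pairs to subgroups is likewise only one possible choice.

```python
def split_into_subgroups(pairs, max_size):
    """Sort by sort ID, read the group size, and assign subgroup keys."""
    # Sort by the sort ID (second key field); the pair with sort ID 0 comes first.
    ordered = sorted(pairs, key=lambda kv: int(kv[0].split(":")[1]))
    # The initial key carries the group size as its third field.
    prefix, _sort_id, size = ordered[0][0].split(":")
    group_size = int(size)
    # Ceiling division: fewest subgroups such that none exceeds max_size.
    num_subgroups = -(-group_size // max_size)
    output = []
    for index, (_key, value) in enumerate(ordered):
        subgroup = index % num_subgroups  # hypothetical round-robin assignment
        output.append((f"{prefix}:{subgroup}", value))  # second modified key
    return output

# Hypothetical first output data for the "IP 1" group (five pairs).
first_output = [
    ("IP 1:1", "User 2"),
    ("IP 1:1", "User 3"),
    ("IP 1:1", "User 5"),
    ("IP 1:1", "User 7"),
    ("IP 1:0:5", "User 9"),
]
second_output = split_into_subgroups(first_output, max_size=2)
```

Because the size is read from the first sorted pair, a reducer that streams its input could choose the number of subgroups before it has seen the remaining pairs.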

At operation 325, processing logic performs a third mapping stage of a third job of the mapreduce job set using second output data of the second job to generate third intermediate data. The third intermediate data comprises a first subset of key-value pairs.

At operation 330, processing logic performs a third reduction phase of a third job of the mapreduce job set using third intermediate data to generate third output data. The third output data includes a second subset of key-value pairs generated from the first subset of key-value pairs. The second subset of key-value pairs includes a first subset of key-value pairs and duplicate key-value pairs of the first subset of key-value pairs, wherein at least keys of the duplicate key-value pairs are modified. A third mapreduce job of the mapreduce job set will be further described with reference to fig. 4C.

At operation 335, processing logic performs a fourth mapping stage of a fourth job of the mapreduce job set using third output data of the third job to generate fourth intermediate data. The fourth intermediate data comprises a second subset of key-value pairs.

At operation 340, processing logic performs a fourth reduction phase of the fourth job of the mapreduce job set using the fourth intermediate data to generate fourth output data. Each reducer of the fourth reduction phase receives respective key-value pairs of the fourth intermediate data having the same key. At each reducer, each value of the respective key-value pairs is paired with each remaining value of the respective key-value pairs to generate new values (e.g., a cross product) with new keys. The fourth output data includes the new key-value pairs from each reducer of the fourth job. The fourth mapreduce job of the mapreduce job set is further described with reference to fig. 4D.
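The pairing performed at a single reducer of the fourth job might look like the sketch below. The form of the new key (the two values joined by `|`) and the canonical ordering of each pair are assumptions of this example; the ordering is chosen so that the same pair produced at two different reducers receives the same new key.

```python
from itertools import combinations

def pair_values(key, values):
    """Pair each value with each remaining value received by one reducer."""
    # Sorting first gives every pair a canonical order (assumed convention).
    for left, right in combinations(sorted(values), 2):
        new_key = f"{left}|{right}"  # hypothetical new key for the value pair
        yield new_key, (left, right)

# Values that arrived at one reducer under the same key.
new_pairs = list(pair_values("IP 1:0", ["User 3", "User 2", "User 5"]))
```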

At operation 345, processing logic performs a fifth mapping stage of a fifth job of the mapreduce job set using fourth output data of the fourth job to generate fifth intermediate data. The fifth intermediate data includes new key-value pairs from each reducer of the fourth job.

At operation 350, the processing logic performs a fifth reduction phase of the fifth job of the mapreduce job set to perform a deduplication operation that removes duplicate key-value pairs from the new key-value pairs from each reducer of the fourth job and provides a cross product of the plurality of cross products for the first key-value pair group. As described above, similar operations 315-350 may be performed on the first key-value pair group and the other key-value pair groups (generated in the first job) to generate a plurality of cross products for the data set. Each cross product is a cross product over a subset (e.g., a non-overlapping subset) of the original data set.
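A deduplication step of this kind reduces to keeping one key-value pair per key. The sketch below assumes, as a hypothetical convention, that duplicate value pairs produced by different reducers of the fourth job carry identical keys.

```python
def deduplicate(pairs):
    """Keep the first key-value pair seen for each key and drop the repeats."""
    seen = {}
    for key, value in pairs:
        seen.setdefault(key, value)  # later duplicates of a key are ignored
    return list(seen.items())

# Hypothetical fourth output data: two reducers both produced the first pair.
fourth_output = [
    ("User 2|User 3", ("User 2", "User 3")),
    ("User 2|User 5", ("User 2", "User 5")),
    ("User 2|User 3", ("User 2", "User 3")),
]
cross_product = deduplicate(fourth_output)
```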

Fig. 4A illustrates a diagram of a first job of a mapreduce job set for generating cross products from a data set based on a join field, according to an embodiment of the present disclosure. First mapreduce job 400 illustrates mapper 410 along with reducer 414A and reducer 414B (generally referred to as "reducer 414"). A request is received to execute a mapreduce job set that generates a cross product from a data set based on a join field. In the illustrated example, the input 411 is received by the mapper 410. Input 411 may include a data set on which the mapreduce job set is executed. As shown, the data set includes 10 entries, each entry including data. For example, entry 416 includes data "User 1: IP 2". The join field may be included as a parameter of the request to execute the mapreduce job set that generates cross products from the data set. In the current example, the join field is "IP" (e.g., internet protocol address), which indicates that cross products are to be generated for subsets of the data set, where the entries of each subset have a particular value. For example, all entries with "IP 1" are included in a subset for which a cross product is to be generated, and all entries with "IP 2" are included in another subset for which another cross product is to be generated. A cross product between an entry including "IP 1" and an entry including "IP 2" is not generated.

In an embodiment, mapper 410 performs a mapping task using a data set (e.g., input 411) and generates a plurality of key-value pair groups. For example, mapper 410 uses the data within each entry to identify a key based on the join field "IP". Mapper 410 identifies a first subset of entries of the data set that include "IP 1" and generates a key-value pair group 413A based on the first subset of entries. Mapper 410 identifies a second subset of entries of the data set that include "IP 2" and generates a key-value pair group 413B based on the second subset of entries.

It may be noted that in some embodiments, a data set may include any number of entries. For example, a data set may include millions of entries. Thousands of key-value pair groups may be generated from the data set, where each key-value pair group is based on the same key. Each key-value pair group may in turn have thousands to millions of key-value pairs all having the same key. It may also be noted that in fig. 4A-4F, one or more mappers may be used for each mapping phase of a corresponding mapreduce job and one or more reducers may be used for each reduction phase of a corresponding mapreduce job.

As described above, one or more mapping operations of a mapping task may be used to convert an entry of a data set to a key-value pair. In some embodiments, some data of an entry is used as a key (e.g., "IP 1" or "IP 2"), and all data of the entry is retained and becomes a value in a key-value pair. The join field (e.g., "IP") indicates which data of the entry is to be a key (e.g., key data). If the key data (e.g., "IP 1" or "IP 2") in multiple entries is the same, then these entries will generate key-value pairs for the same key-value pair group (e.g., key-value pair group 413A or key-value pair group 413B).

The output 412 of the mapper 410 (e.g., the first intermediate data) includes a key-value pair group 413A and a key-value pair group 413B. Output 412 serves as an input (not shown) to the reduction phase of first mapreduce job 400. The key-value pairs of each group are sent to different reducers 414A and 414B according to key, such that the key-value pairs of group 413A having the same key ("IP 1") are sent to reducer 414A, and the key-value pairs of group 413B having the same key ("IP 2") are sent to reducer 414B. In an embodiment, key-value pairs having the same key are sent to a particular reducer.
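The conversion of entries into key-value pairs and their routing by key can be sketched as below. The entry layout `"User N: IP M"`, with the join field value after the colon, is assumed for this example, and `shuffle` is a hypothetical stand-in for the framework step that delivers pairs with the same key to the same reducer.

```python
from collections import defaultdict

def map_entry(entry):
    """Turn one entry into a (key, value) pair keyed by its IP."""
    # Assumed layout "User N: IP M": the text after the colon is the key data.
    _user, ip = entry.split(":")
    return ip.strip(), entry  # the whole entry is retained as the value

def shuffle(pairs):
    """Group key-value pairs by key, one group per reducer."""
    by_key = defaultdict(list)
    for key, value in pairs:
        by_key[key].append((key, value))
    return dict(by_key)

data_set = ["User 1: IP 2", "User 2: IP 1", "User 3: IP 1"]
groups = shuffle(map_entry(entry) for entry in data_set)
```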

During the reduction phase, reducer 414 receives output 412 from mapper 410. Reducer 414 modifies output 412 such that the keys of key-value pair groups 413A and 413B are modified (e.g., into first modified keys) to affect how output 415A and output 415B are sorted and regrouped (e.g., how subgroups are created) in the next reduction phase of second mapreduce job 420.

For example, the key-value pair group 413A is passed to reducer 414A. Reducer 414A performs a reduction task that counts the total number of key-value pairs in the group 413A. One of the key-value pairs of the group 413A is modified with the sort ID 417 so that, when the group 413A is sorted in the next reduction phase of the second mapreduce job 420, the position of the key-value pair with the sort ID 417 in the sort order is known.

In the current example, the sort ID 417 includes a value that allows the selected key-value pair in the key-value pair group 413A to become the first or initial key-value pair after the sort operation. In the current example, the key "IP 1" of the last key-value pair 419 of the key-value pair group 413A is modified with the sort ID 417 ("0") to identify that the key-value pair 419 will be the first key-value pair in the sorted order. The other key-value pairs of the group 413A are modified with other sort IDs. The other sort IDs are numbers greater than the sort ID 417 ("0") to identify the other key-value pairs as being ranked after the key-value pair 419 in the sort order. In this example, the keys of the other key-value pairs are modified with "1", indicating that they will be sorted after the key-value pair 419. It may be noted that the key-value pair 419 may contain additional information (e.g., group size ID 418) that may be used by the next reduction phase of the second mapreduce job 420. Knowing the location (placement in sorted order) of the key-value pair 419 allows the next reduction phase of the second mapreduce job 420 to know where to look to extract information (e.g., group size ID 418) from the modified key of the key-value pair 419.

In an embodiment, reducer 414A counts the number of key-value pairs in group 413A. The key of the key-value pair 419 is further modified (e.g., with group size ID 418) to indicate the number of key-value pairs in the key-value pair group 413A. In this example, there are 5 key-value pairs, so the key of the key-value pair 419 is further modified with "5" to indicate the number of key-value pairs in the key-value pair group 413A.
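As an illustrative, non-limiting sketch, the reduction task of reducer 414A described above might look as follows in Python. The function name `tag_group`, the sample values, and the exact key layout `prefix:sortID:groupSize` are assumptions made for illustration; the description only states that the sort ID follows the first colon and that the group size is carried in the same key.

```python
from typing import List, Tuple

KV = Tuple[str, str]

def tag_group(group: List[KV]) -> List[KV]:
    """Rewrite the keys of one key-value pair group (all pairs share one key).

    The last pair receives sort ID 0 plus the group size; every other pair
    receives sort ID 1. The "prefix:sortID:size" layout is hypothetical.
    """
    size = len(group)
    tagged: List[KV] = []
    for index, (key, value) in enumerate(group):
        if index == size - 1:
            # Selected pair: sorts first and carries the group size.
            tagged.append((f"{key}:0:{size}", value))
        else:
            # Remaining pairs: any sort ID greater than 0 works.
            tagged.append((f"{key}:1", value))
    return tagged

# Hypothetical sample group with five pairs sharing the key "IP 1".
group_413a = [("IP 1", f"User {n}: IP 1") for n in range(1, 6)]
tagged = tag_group(group_413a)
```

Only one pair needs to carry the group size, because the later sort places that pair at a known position.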

It may be noted that the modified key of the key-value pair group 413A is also referred to as a first modified key. The first modified key indicates the sort order of the key-value pairs in the group 413A and the number of key-value pairs in the group. It may also be noted that similar operations may be performed on the key-value pair group 413B. For clarity, the mapreduce jobs described subsequently with respect to FIGS. 4B-4F describe the mapping and reduction tasks performed on the key-value pair group 413A. It is to be appreciated that similar mapping and reduction tasks may be performed on the key-value pair group 413B, even if not explicitly recited.

FIG. 4B illustrates a diagram of a second job of a mapreduce job set for generating cross products from a data set based on a connection field, according to an embodiment of the present disclosure. Second mapreduce job 420 includes mapper 430 and reducer 424. The second mapreduce job 420 divides the key-value pair group 413A into subgroups for more efficient processing in subsequent mapreduce jobs of the mapreduce job set. Each subgroup of the key-value pair group 413A created by the second mapreduce job 420 contains no more than a maximum number of key-value pairs.

Mapper 430 receives an output 415A of reducer 414A of the first mapreduce job 400 of the mapreduce job set. The output 415A includes the key-value pair group 413A with the first modified key. Output 415A from reducer 414A of first mapreduce job 400 is used as an input to mapper 430. The mapper 430 uses this input to generate second intermediate data, such as output 421. In this example, the mapper 430 does not modify the key-value pair group 413A, but reads the key-value pair group 413A and passes it to the reducer 424. The output 421 of the mapper 430 serves as the input 422 of the reducer 424. Logic may be implemented such that all key-value pairs having the same prefix (e.g., "IP 1"), that is, the information in the key preceding the colon, are sent to the same reducer.

In an embodiment, reducer 424 performs the reduction phase of second mapreduce job 420 using intermediate data (e.g., output 421) from mapper 430 to generate output 423. Reducer 424 modifies the first modified key of the key-value pair group 413A to generate a second modified key that identifies a subgroup (e.g., subgroups 425A, 425B, and 425C) of the key-value pair group 413A.

In an embodiment, reducer 424 uses the first modified key (specifically, the sort ID) to sort the key-value pair group 413A in a sort order. The key-value pair 419, whose key contains the sort ID 417 (represented by "0" following the first colon), is sorted into the initial position of the sort order. The remaining key-value pairs of group 413A have larger sort IDs and are sorted after key-value pair 419.

In an embodiment, reducer 424 identifies the number of key-value pairs in the key-value pair group 413A as indicated by the key of the initial key-value pair (key-value pair 419). For example, reducer 424 knows which position in the sort order holds the key-value pair whose key contains the group size ID 418, in this example the initial position. Reducer 424 parses the key of the initial key-value pair (e.g., key-value pair 419) to identify the group size ID 418 (e.g., "5"), which identifies the number of key-value pairs in the group 413A.

In an embodiment, reducer 424 determines the number of key-value pairs for each of subgroups 425A, 425B, and 425C (collectively "subgroups 425"). Subgroups 425 are non-overlapping subgroups of key-value pair group 413A. In an embodiment, a parameter (e.g., a maximum number parameter) indicates a maximum number of key-value pairs to include per subgroup 425, such that the number of key-value pairs in any one of subgroups 425 does not exceed the maximum number of key-value pairs identified by the maximum number parameter. In some embodiments, the initial request to perform the mapreduce job set to generate cross products includes the maximum number parameter. In other embodiments, the maximum number parameter may be predetermined and part of the script or source code of the mapreduce job set. In the current example, the maximum number parameter indicates that each subgroup 425 cannot include more than two key-value pairs.
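As an illustrative sketch of the sort-and-parse step described above, the following Python assumes the same hypothetical key layout `prefix:sortID:groupSize` for the selected pair and `prefix:sortID` for the others. The helper names `sort_id` and `sort_and_read_size` and the sample data are likewise assumptions.

```python
from typing import List, Tuple

KV = Tuple[str, str]

def sort_id(key: str) -> int:
    """Return the sort ID, i.e. the field that follows the first colon."""
    return int(key.split(":")[1])

def sort_and_read_size(group: List[KV]) -> Tuple[List[KV], int]:
    """Sort a group by sort ID and read the group size from the first key."""
    ordered = sorted(group, key=lambda kv: sort_id(kv[0]))
    # The pair with sort ID 0 is now in the initial position, so its key
    # is the only one that has to be parsed for the group size.
    group_size = int(ordered[0][0].split(":")[2])
    return ordered, group_size

# Hypothetical input: the pair carrying the size arrives last.
incoming = [
    ("IP 1:1", "User 1: IP 1"),
    ("IP 1:1", "User 2: IP 1"),
    ("IP 1:1", "User 3: IP 1"),
    ("IP 1:1", "User 4: IP 1"),
    ("IP 1:0:5", "User 5: IP 1"),
]
ordered, group_size = sort_and_read_size(incoming)
```

Because the size arrives with the first sorted record, the reducer in this sketch does not need to buffer the whole group to learn how many pairs it contains.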

In an embodiment, reducer 424 generates a second modified key for a subgroup of key-value pairs (e.g., subgroup 425) such that the second modified key identifies a particular subgroup from subgroup 425 and the particular subgroup does not exceed the maximum number of key-value pairs. For example, subgroup 425A includes a maximum number of key-value pairs (e.g., two key-value pairs). The keys of subgroup 425A are modified after the colon to "0-2", which indicates that the key-value pair of subgroup 425A belongs to subgroup "0" of the three subgroups. Similarly, the key of subgroup 425B is modified after the colon to "1-2", which indicates that the key-value pair of subgroup 425B belongs to subgroup "1" of the three subgroups. The key of subgroup 425C is modified after the colon to "2-2", which indicates that the key-value pair of subgroup 425C belongs to subgroup "2" of the three subgroups. It may be noted that subgroup 425C includes only one key-value pair, as no additional key-value pairs are available to populate the subgroup.
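The subgroup keys described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name `split_into_subgroups`, the parameter name `max_pairs`, and the sample values are assumptions. The suffix written after the colon is `subgroupID-lastSubgroupID`, matching the "0-2", "1-2", "2-2" keys of the example.

```python
from typing import List, Tuple

KV = Tuple[str, str]

def split_into_subgroups(prefix: str, values: List[str], max_pairs: int) -> List[KV]:
    """Assign each value to a non-overlapping subgroup of at most max_pairs."""
    if max_pairs < 1:
        raise ValueError("max_pairs must be at least 1")
    # Ceiling division: number of subgroups needed to hold every value.
    subgroup_count = -(-len(values) // max_pairs)
    last_id = subgroup_count - 1
    result: List[KV] = []
    for index, value in enumerate(values):
        subgroup_id = index // max_pairs
        result.append((f"{prefix}:{subgroup_id}-{last_id}", value))
    return result

# Hypothetical values of a five-pair group, with at most two pairs per subgroup.
values = [f"User {n}: IP 1" for n in (5, 1, 2, 3, 4)]
subgrouped = split_into_subgroups("IP 1", values, max_pairs=2)
subgroup_keys = [key for key, _ in subgrouped]
```

With five values and a maximum of two per subgroup, the last subgroup holds a single value, as in the example above.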

FIG. 4C illustrates a diagram of a third job of a mapreduce job set that generates cross products from a data set based on a join field, according to an embodiment of the disclosure. Third mapreduce job 440 includes mapper 431 and reducers 433A, 433B, and 433C (collectively "reducers 433"). The third mapreduce job 440 makes duplicate copies of the key-value pairs of the subgroups 425 and combines the copies with the key-value pairs of the respective subgroups 425. In some cases, the key of the copied key-value pair is modified. Duplicating the key-value pairs of subgroups 425 and modifying at least the keys of the copied key-value pairs causes key-value pairs having the same value (but different keys) to be sent to different reducers in fourth mapreduce job 450. The fourth mapreduce job 450 generates the cross product for the key-value pair group (e.g., key-value pair group 413A). As a result of the duplication and key modification at third mapreduce job 440, the processing load is more evenly distributed across the reducers of fourth mapreduce job 450, and each value of the key-value pair group 413A can be paired with each remaining value of the key-value pair group 413A even though the operations are distributed across multiple reducers. Each reducer at fourth mapreduce job 450 may receive no more than a predetermined maximum number of key-value pairs (based at least in part on the earlier generation of subgroups 425 at mapreduce job 420, each of which includes no more than the maximum number of key-value pairs).

In an embodiment, to perform the mapping phase of the third mapreduce job 440, the mapper 431 receives the output 423 from the reducer 424 of the second mapreduce job 420. Output 423 becomes an input to mapper 431 and comprises the subgroups 425 of key-value pair group 413A. In a particular example, the mapper 431 does not modify the subgroups 425 of the key-value pair group 413A. Subgroups 425 become intermediate data (e.g., output 432). The intermediate data is read and passed to reducers 433 such that each subgroup 425A, 425B, and 425C is passed to a different reducer 433A, 433B, and 433C, respectively, based on the subgroup key. In an embodiment, key-value pairs having the same key are passed to the same reducer.

In an embodiment, during the reduction phase of third mapreduce job 440, reducers 433A, 433B, and 433C receive subgroups 425A, 425B, and 425C, respectively. Subgroups 425A, 425B, and 425C are used as inputs 434A, 434B, and 434C by respective reducers 433. Reducers 433A, 433B, and 433C generate outputs 435A, 435B, and 435C, respectively. Outputs 435A, 435B, and 435C of reducers 433 include second subgroups 436A, 436B, and 436C, respectively, of key-value pairs. Subgroups 436A, 436B, and 436C are generated from first subgroups 425A, 425B, and 425C. Subgroups 436A, 436B, and 436C include key-value pairs from respective subgroups 425A, 425B, and 425C and duplicate key-value pairs of respective subgroups 425A, 425B, and 425C. At least the key of each duplicated key-value pair is modified.

For example, reducer 433A receives input 434A from mapper 431. Input 434A includes a key-value pair from subgroup 425A. Reducer 433A passes key-value pairs from subgroup 425A to output 435A (e.g., the second and fourth key-value pairs of subgroup 436A). Reducer 433A also repeats the key-value pairs of subgroup 425A and modifies the keys of the repeated key-value pairs (e.g., the first and third key-value pairs of subgroup 436A).

In another example, reducer 433C receives input 434C from mapper 431. Input 434C includes a key-value pair from subgroup 425C. Reducer 433C passes key-value pairs from subgroup 425C to output 435C, repeats the key-value pairs of subgroup 425C and modifies the keys of both the repeated key-value pairs and the passed key-value pairs from subgroup 425C.

It may be noted that reducer 433 modifies the key to include the reducer ID. The reducer ID may be used so that keys with the same reducer ID are sent to the same reducer in a subsequent mapreduce job (e.g., fourth mapreduce job 450). The reducer ID appears after the colon of the key. For example, in subgroup 436A, the first key-value pair has a reducer ID of "0-1", the second key-value pair has a reducer ID of "0-2", and the third key-value pair has a reducer ID of "0-1". In some embodiments, key generation for subgroups 436 is optimized to reduce the number of duplicate cross-product records generated at fourth mapreduce job 450, while ensuring that each value of key-value pair group 413A is paired with each remaining value of key-value pair group 413A.

In one example, key optimization may be shown at reducer 433A. Input 434A includes subgroup 425A. The information after the colon in the keys of subgroup 425A (i.e., "0-2") indicates the subgroup ID and the number of subgroup IDs. For subgroup 425A, the subgroup ID is "0" and the number of subgroup IDs is "2" (the highest subgroup ID, where the total number of subgroups is 3: subgroups "0", "1", and "2"). Reducer 433A iterates over all possible subgroup IDs. For example, reducer 433A generates the following subgroup ID-subgroup ID combinations: "0-0", "0-1", and "0-2". If a combination does not repeat a number (e.g., "0-1", "0-2"), the combination is used as part of the keys of subgroup 436A. If a combination does repeat a number (e.g., "0-0"), the combination is not used as part of the keys of subgroup 436A.
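The iteration described above can be sketched as follows. The function name `fan_out` is hypothetical, and writing the smaller subgroup ID first in each reducer ID is an assumption of this sketch, made so that the copies emitted for subgroup "0" and for subgroup "2" both carry the reducer ID "0-2" and therefore meet at one reducer of the fourth job.

```python
from typing import List, Tuple

KV = Tuple[str, str]

def fan_out(key: str, value: str) -> List[KV]:
    """Emit one copy of a pair per other subgroup, keyed by a reducer ID.

    The input key is assumed to look like "IP 1:0-2", i.e.
    prefix:subgroupID-lastSubgroupID.
    """
    prefix, ids = key.rsplit(":", 1)
    subgroup_id, last_id = (int(part) for part in ids.split("-"))
    copies: List[KV] = []
    for other in range(last_id + 1):
        if other == subgroup_id:
            # A combination such as "0-0" repeats a number and is skipped.
            continue
        low, high = sorted((subgroup_id, other))
        copies.append((f"{prefix}:{low}-{high}", value))
    return copies

from_subgroup_0 = fan_out("IP 1:0-2", "User 1: IP 1")
from_subgroup_2 = fan_out("IP 1:2-2", "User 5: IP 1")
```

With three subgroups this sketch yields three reducer IDs ("0-1", "0-2", "1-2"), one per pair of subgroups. Two values from the same subgroup still meet at more than one reducer, which is a source of the duplicate cross-product pairs removed later.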

FIG. 4D illustrates a diagram of a fourth job of the mapreduce job set for generating cross products from the data set based on the connection field, according to an embodiment of the present disclosure. Fourth mapreduce job 450 includes mapper 441 and reducers 443A, 443B, and 443C (collectively "reducers 443"). At the fourth mapreduce job 450, each reducer 443 generates pairs of values (e.g., cross-product pairs) such that each value of a key-value pair is paired with each remaining value of the remaining key-value pairs at the particular reducer 443. In an embodiment, the fourth mapreduce job 450 effectively generates a cross product for the key-value pair group 413A. In an embodiment, the output of the reduction phase of the fourth mapreduce job 450 may include duplicate key-value pairs (e.g., cross-product pairs) that may be deduplicated at the fifth mapreduce job 460. Outputs 445A, 445B, and 445C (collectively "outputs 445") of the respective reducers 443 include new key-value pairs whose values include pairs of values (e.g., cross-product pairs).

In an embodiment, output 435 of third mapreduce job 440, which includes subgroups 436 of key-value pair group 413A, is sent to mapper 441 of fourth mapreduce job 450 to be used as input to mapper 441. During the mapping phase of the fourth mapreduce job 450, the mapper 441 generates intermediate data using the output 435 of the third mapreduce job 440. The intermediate data includes subgroups 436 (e.g., the second subgroups of key-value pairs). Mapper 441 reads the input (e.g., subgroups 436 of key-value pairs) and sends each of the subgroups 436 to a respective one of reducers 443. In an embodiment, key-value pairs with the same key are sent to the same reducer, such that subgroup 436A is sent to reducer 443A, subgroup 436B is sent to reducer 443B, and subgroup 436C is sent to reducer 443C. Subgroups 436A, 436B, and 436C serve as inputs 444A, 444B, and 444C, respectively, at respective reducers 443.

In an embodiment, reducer 443 performs the reduction phase of fourth mapreduce job 450 to generate output 445A, output 445B, and output 445C (collectively "output 445"), respectively. At the respective reducer 443, each value of the respective key-value pair of the subgroup 436 is paired with each remaining value of the respective key-value pair of the subgroup 436 to generate a new value with the new key (e.g., a new key-value pair as a cross-product pair).

For example, reducer 443C receives input 444C that includes the key-value pairs of subgroup 436C, all of which share the same key "IP 1: 0-2". The values of subgroup 436C are all different. Reducer 443C creates new key-value pairs, as shown by output 445C, in which each value of subgroup 436C is paired with each other value of subgroup 436C. As shown, the values of the key-value pairs of output 445C include "[User 1: IP 1, User 4: IP 1]", "[User 1: IP 1, User 5: IP 1]", and "[User 4: IP 1, User 5: IP 1]".
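The pairing performed at a single reducer of the fourth job can be sketched with the standard-library `itertools.combinations`, which yields every unordered pair of distinct positions exactly once. The function name `cross_pairs` is hypothetical; the three sample values are those of the example above.

```python
from itertools import combinations
from typing import List, Tuple

def cross_pairs(values: List[str]) -> List[Tuple[str, str]]:
    """Pair each value with each remaining value received by one reducer."""
    return list(combinations(values, 2))

# Values that share the key "IP 1: 0-2" in the example above.
values_at_reducer = ["User 1: IP 1", "User 4: IP 1", "User 5: IP 1"]
pairs = cross_pairs(values_at_reducer)
```

For n values at one reducer this produces n(n-1)/2 pairs, which is one reason for bounding the number of key-value pairs each reducer receives.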

In some embodiments, the new key of the new key-value pair shown at output 445 uses a unique ID associated with an entry of the data set from which the value of the new key-value pair originated (e.g., was derived). For example, the first key-value pair at output 445A of reducer 443A includes the key "Q1: Q5". "Q1" represents the unique ID associated with the first entry of the data set (e.g., input 411 of FIG. 4A) from which the value "User 1: IP 1" is derived, and "Q5" represents the unique ID associated with the fifth entry of the data set from which the value "User 5: IP 1" is derived. Using the unique IDs to associate the values of the new key-value pairs at output 445 with their entries in the original data set facilitates deduplication of the data (e.g., deleting redundant data). For example, there may be instances where the values of different key-value pairs (e.g., the cross-product pairs of the new key-value pairs at output 445) are the same but are derived from different entries of the data set. In such a case, an administrator or other user may wish to retain duplicate values originating from different entries.

In some embodiments, a unique ID identifying the particular entry of the data set from which a value of a key-value pair is obtained may be associated with the value by the mapreduce job set. For example, a unique ID associated with a value may be inserted into an entry of the data set such that the unique ID becomes part of the value of the key-value pair at the first mapreduce job 400 and is effectively carried with the value through multiple mapreduce jobs, as described herein. In the fourth mapreduce job 450, the unique ID may be retrieved from the corresponding value of the key-value pair and used as part of the new key. In some cases, the unique ID may be removed from the value during the reduction phase of the fourth mapreduce job 450.

In some embodiments, a physical node may execute multiple reducers. The mapreduce framework may establish a memory unit (e.g., a cache) such that each physical node may access a certain amount of memory of the memory unit to store data. In some embodiments, the memory unit associated with a physical node may be configured to store keys (or key-value pairs) created by reducers associated with that physical node. The memory unit can be used for deduplication: before a new key-value pair is written to the output of a reducer, the cache is checked to see whether the key already exists. If the key is present in the memory unit, the key-value pair is not written to the output of the reducer. If the key is not present in the memory unit, the key-value pair is written to the output of the reducer.
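As one possible illustration of carrying a unique ID inside the value, the sketch below assumes a hypothetical encoding in which the ID is prepended to the value and separated by "|". It also assumes that the two IDs are placed in sorted order in the new key, so that the same pair produced at two different reducers receives an identical key. Neither the separator nor the ordering is stated in the description.

```python
from typing import Tuple

def keyed_pair(value_a: str, value_b: str) -> Tuple[str, Tuple[str, str]]:
    """Build a new key from two unique IDs and strip the IDs from the values."""
    # Each value is assumed to look like "Q1|User 1: IP 1".
    first, second = sorted([value_a.split("|", 1), value_b.split("|", 1)])
    new_key = f"{first[0]}:{second[0]}"
    return new_key, (first[1], second[1])

new_key, new_value = keyed_pair("Q5|User 5: IP 1", "Q1|User 1: IP 1")
```

Because the key is built from entry IDs rather than from the values themselves, two pairs with equal values that originate from different entries keep distinct keys.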
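The cache check described above can be sketched with an in-memory set standing in for the memory unit of a physical node. The class name `DedupCache` and the method name `write_if_new` are hypothetical, and a real mapreduce framework would supply its own cache facility.

```python
from typing import List, Set, Tuple

class DedupCache:
    """Stand-in for the per-node memory unit shared by that node's reducers."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def write_if_new(self, key: str, value: Tuple[str, str],
                     output: List[Tuple[str, Tuple[str, str]]]) -> bool:
        """Write the pair to the reducer output only if the key is unseen."""
        if key in self._seen:
            return False
        self._seen.add(key)
        output.append((key, value))
        return True

cache = DedupCache()
reducer_output: List[Tuple[str, Tuple[str, str]]] = []
pair = ("User 1: IP 1", "User 5: IP 1")
first_write = cache.write_if_new("Q1:Q5", pair, reducer_output)
second_write = cache.write_if_new("Q1:Q5", pair, reducer_output)
```

In this sketch a cache of this kind only removes duplicates among reducers that share the same memory unit.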

Fig. 4E illustrates a diagram of a fifth job of a mapreduce job set that generates cross products from a data set based on a connection field, according to an embodiment of the present disclosure. A fifth map reduce job 460 illustrates mapper 461 and reducer 463. At the fifth mapreduce job 460, duplicate key-value pairs are removed (e.g., deduplicated), and cross products (without duplicate key-value pairs) for the key-value pair group 413A are stored at the data store.

In an embodiment, during the mapping phase of the fifth map reduce job 460, mapper 461 receives output 445 of reducer 443 of the fourth map reduce job 450. Output 445 serves as an input to mapper 461. The mapper generates intermediate data using the input. The intermediate data includes the new key-value pair generated by reducer 443 of fourth mapreduce job 450. Mapper 461 reads the data and passes the data unaltered to reducer 463.

In an embodiment, the reduction phase of the fifth mapreduce job 460 is performed by reducer 463. The reduction phase performs a deduplication operation to remove duplicate key-value pairs from new key-value pairs. Output 462 provides the cross product of the subset of the data set (all entries with "IP 1") from which the key-value pair group 413A was derived. Output 462 is stored in a data store. It may be noted that a single reducer 463 is shown for clarity and not limitation. In some embodiments, each key-value pair with the same key in the output 462 of the mapper 461 is sent to a specific reducer. For example, key-value pairs keyed "Q1: Q5" are sent to the first reducer, and key-value pairs keyed "Q2: Q3" are sent to the second reducer. Each reducer may perform deduplication on received key-value pairs.
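A minimal sketch of the deduplication performed by the fifth job is given below. Grouping by key stands in for the shuffle that sends equal keys to the same reducer, and each group is then reduced to a single pair. The function name `dedup_job` and the sample records are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, str]

def dedup_job(records: List[Tuple[str, Pair]]) -> List[Tuple[str, Pair]]:
    """Group records by key, then keep one record per key."""
    grouped: Dict[str, List[Pair]] = defaultdict(list)
    for key, value in records:
        # Stand-in for the shuffle: equal keys end up in the same group.
        grouped[key].append(value)
    # Reduction: emit the first value seen for each key.
    return [(key, values[0]) for key, values in grouped.items()]

# Hypothetical output of the fourth job with one repeated record.
records = [
    ("Q1:Q5", ("User 1: IP 1", "User 5: IP 1")),
    ("Q2:Q3", ("User 2: IP 1", "User 3: IP 1")),
    ("Q1:Q5", ("User 1: IP 1", "User 5: IP 1")),
]
deduplicated = dedup_job(records)
```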

It may also be noted that the job set described herein includes a first mapreduce job 400, a second mapreduce job 420, a third mapreduce job 440, a fourth mapreduce job 450, and a fifth mapreduce job 460. In other embodiments, one or more mapreduce jobs may be modified. For example, FIG. 4F depicts an alternative mapping phase that may be implemented in the first mapreduce job 400. In another example, a cache may be used in mapreduce job 450 to facilitate deduplication operations. In some embodiments, one or more operations of the mapreduce jobs in a mapreduce job set may be combined. For example, the operations of mapreduce job 420 and mapreduce job 440 may be combined into a single mapreduce job. In some embodiments, not all mapreduce jobs in a mapreduce job set are executed. For example, in some implementations, mapreduce job 460 may be replaced with the cache-based deduplication operation described with respect to mapreduce job 450.

FIG. 4F illustrates a diagram of a mapping phase of a first job of a mapreduce job set that generates cross products from a data set that includes a time range, according to an embodiment of the disclosure. In an embodiment, mapping phase 470 may be used as an alternative to the mapping phase shown in first mapreduce job 400 of FIG. 4A. In some embodiments, the entries of the data set from which the cross product is to be generated may include a temporal data type, such as a time range. For example, the entry 473 of the data set (e.g., input 472) includes the time range "12:01-12:02". In some cases, it may be beneficial to generate cross products of the data to see whether the data overlap in time. For example, the collaboration platform 120 may determine whether multiple users are playing a game at the same time. The mapping phase 470 may create key-value pairs for different times (e.g., 12:02, 12:03) within a time range. In an embodiment, the mapping phase increments through the time range in time units and creates a new key-value pair at each incremented timestamp.

In an embodiment, a request is received to perform a mapreduce job set that generates a cross product from a data set based on a connection field. In the illustrated example, the input 472 is received by a mapper 471. Input 472 may include the data set on which the mapreduce job set is executed. As shown, the data set includes two entries, each of which includes data. For example, entry 473 includes the data "User 2 Session: 12:01-12:02". The connection field may be included as a parameter in the request to execute the mapreduce job set. In the current example, the connection field is a "timeframe", which indicates that a key is to be generated based on the timeframe field of each data set entry and that cross products are generated for subsets of the data set associated with the same key (e.g., the same timestamp).

In some embodiments, mapping stage 470 identifies a time unit parameter associated with the request to generate the cross product. The time unit parameter represents a time unit. In the present example, the time unit is 1 minute. The time unit may be set to any value, such as hours, days, etc.

In an embodiment, for each entry of the data set, the mapper 471 identifies the data in the time range field of the entry. The data in the time range field may be referred to as a time range. The mapper 471 increments through the time range in time units, starting from the earliest time identified by the time range and ending at the last time identified by the time range. At each increment, a timestamp is generated that reflects the time within the time range at that increment. The mapper 471 generates one or more key-value pairs for the entry based on the incrementing. The keys of the one or more key-value pairs identify the incremented timestamps. The values of the one or more key-value pairs identify the data of the entry. In an embodiment, the mapper 471 generates key-value pair groups. Each key-value pair of a respective key-value pair group includes the same timestamp.

For example, entry 476 includes the time range "12:01-12:03" and entry 473 includes the time range "12:01-12:02". The time unit is one minute. For entry 476, the mapper starts with the earliest time identified by the time range "12:01-12:03", i.e., "12:01", and increments the earliest time by one minute (e.g., to "12:02"). "12:02" is the generated timestamp, which reflects the time within the time range at that increment. The time is incremented again by one minute to "12:03". "12:03" is both a generated timestamp and the last time identified by the time range. Timestamps "12:02" and "12:03" become keys, and the values are the data in the corresponding entry (e.g., "User 1 Session: 12:01-12:03"). The resulting key-value pairs are grouped by key into groups 475A and 475B and passed to the reduction phase of mapreduce job 400 of FIG. 4A.
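The time-range mapping phase can be sketched as follows. The function name `expand_time_range`, the "HH:MM" parsing format, and the entry strings are assumptions made for illustration. Following the worked example above, the sketch emits a key at each increment after the earliest time, so the earliest time itself does not become a key.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

def expand_time_range(entry: str, time_range: str,
                      unit: timedelta = timedelta(minutes=1)) -> List[Tuple[str, str]]:
    """Emit one (timestamp, entry) pair per time-unit increment in the range."""
    if unit <= timedelta(0):
        raise ValueError("unit must be positive")
    start_text, end_text = time_range.split("-")
    current = datetime.strptime(start_text.strip(), "%H:%M") + unit
    end = datetime.strptime(end_text.strip(), "%H:%M")
    pairs: List[Tuple[str, str]] = []
    while current <= end:
        pairs.append((current.strftime("%H:%M"), entry))
        current += unit
    return pairs

entries = [
    ("User 1 Session: 12:01-12:03", "12:01-12:03"),
    ("User 2 Session: 12:01-12:02", "12:01-12:02"),
]
groups: Dict[str, List[str]] = defaultdict(list)
for entry, time_range in entries:
    for timestamp, value in expand_time_range(entry, time_range):
        groups[timestamp].append(value)
```

Under these assumptions the two entries produce two groups, one keyed "12:02" that holds both sessions and one keyed "12:03" that holds only the first session, so a later cross product of the "12:02" group pairs the two sessions that overlap at that time.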

Fig. 5 is a block diagram illustrating an exemplary computer system 500, according to an embodiment. Computer system 500 executes one or more sets of instructions that cause the machine to perform any one or more of the methodologies discussed herein. Instruction sets, instructions, and the like, may refer to instructions that, when executed by computer system 500, cause computer system 500 to perform one or more operations of cross product module 140. The machine may operate in the capacity of a server or a client device in a client-server network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The machine may be a Personal Computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a mobile telephone, a web appliance, a server, a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term "machine" shall also be taken to include any collection of machines that individually or jointly execute a set of instructions to perform any one or more of the methodologies discussed herein.

Computer system 500 includes a processing device 502, a main memory 504 (e.g., Read Only Memory (ROM), flash memory, Dynamic Random Access Memory (DRAM) such as synchronous DRAM (SDRAM) or Rambus DRAM (RDRAM), etc.), a static memory 506 (e.g., flash memory, Static Random Access Memory (SRAM), etc.), and a data storage device 516, which communicate with each other via a bus 508.

Processing device 502 represents one or more general-purpose processing devices such as a microprocessor or central processing unit or the like. More specifically, the processing device 502 may be a Complex Instruction Set Computing (CISC) microprocessor, Reduced Instruction Set Computing (RISC) microprocessor, Very Long Instruction Word (VLIW) microprocessor, or a processing device implementing other instruction sets or a processing device implementing a combination of instruction sets. The processing device 502 may also be one or more special-purpose processing devices such as an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), network processor, or the like. The processing device 502 is configured to execute instructions of the system architecture 100 and the cross-product module 140 to perform the operations discussed herein.

The computer system 500 may also include a network interface device 522 that provides communication with other machines over a network 518, such as a Local Area Network (LAN), an intranet, an extranet, or the internet. Computer system 500 may also include a display device 510 (e.g., a Liquid Crystal Display (LCD) or a Cathode Ray Tube (CRT)), an alphanumeric input device 512 (e.g., a keyboard), a cursor control device 514 (e.g., a mouse), and a signal generation device 520 (e.g., a speaker).

The data storage device 516 may include a non-transitory computer-readable storage medium 524 on which is stored a set of instructions of the system architecture 100 and the cross-product module 140, the set of instructions embodying any one or more of the methods or operations described herein. The instruction sets of system architecture 100 and cross-product module 140 may also reside, completely or at least partially, within the main memory 504 and/or within the processing device 502 during execution thereof by computer system 500, main memory 504 and processing device 502 also constituting computer-readable storage media. The set of instructions may also be transmitted or received over a network 518 via the network interface device 522.

While the example of the computer-readable storage medium 524 is shown as a single medium, the term "computer-readable storage medium" may include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the sets of instructions. The term "computer-readable storage medium" may include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present disclosure. The term "computer readable storage medium" can include, but is not limited to, solid-state memories, optical media, and magnetic media.

In the preceding description, numerous details have been set forth. However, it will be apparent to one having ordinary skill in the art having had the benefit of the present disclosure that the present disclosure may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present disclosure.

Some portions of the detailed description have been presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, considered to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.

It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, it may be appreciated that throughout the description, discussions utilizing terms such as "hosting," "determining," "receiving," "providing," "sending," "identifying," "monitoring," "adding," or "executing" or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (e.g., electronic) quantities within the computer system memories or registers into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.

The present disclosure also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer-readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), magneto-optical disks, read-only memories (ROMs), Random Access Memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, or any type of media suitable for storing electronic instructions.

The word "example" or "exemplary" is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as "example" or "exemplary" is not necessarily to be construed as preferred or advantageous over other aspects or designs. Rather, use of the word "example" or "exemplary" is intended to present concepts in a concrete fashion. As used in this application, the term "or" is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless otherwise specified, or clear from context, "X includes A or B" is intended to mean any of the natural inclusive permutations. That is, if X includes A; X includes B; or X includes both A and B, then "X includes A or B" is satisfied under any of the foregoing instances. In addition, the articles "a" and "an" as used in this application and the appended claims should generally be construed to mean "one or more" unless specified otherwise or clear from context to be directed to a singular form. Furthermore, the use of the terms "an implementation" or "one implementation" or "an embodiment" or "one embodiment" or the like throughout is not intended to denote the same implementation or embodiment, unless so described. One or more implementations or embodiments described herein may be combined in a particular implementation or embodiment. The terms "first," "second," "third," "fourth," and the like as used herein are intended as labels to distinguish between different elements and may not necessarily have an ordinal meaning according to their numerical designation.

It is to be understood that the above description is intended to be illustrative, and not restrictive. Other embodiments will be apparent to those of skill in the art upon reading and understanding the above description. The scope of the disclosure should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

In further embodiments, one or more processing devices are disclosed for performing the operations of the above-described embodiments. Further, in embodiments of the present disclosure, a non-transitory computer-readable storage medium stores instructions for performing operations of the described embodiments. In still other embodiments, systems for performing the operations of the described embodiments are disclosed.
