Data processing method and device, computer equipment and storage medium

Document No.: 1904786 Publication date: 2021-11-30

Reading note: The technology "Data processing method and device, computer equipment and storage medium" (数据处理方法、装置、计算机设备及存储介质) was designed by 容汉铿曾凡聂利权 on 2021-03-11. Main content: The application discloses a data processing method and apparatus, a computer device, and a storage medium, and belongs to the field of network technologies. Web page requests within a target time period are accumulated; based on the TGIs of the web pages accessed by those requests, first web pages meeting a first target condition are preliminarily screened out, and second web pages meeting a second target condition are then screened out from the first web pages. This greatly reduces the labor cost of mining web pages with high business value, allows such web pages to be mined automatically from massive numbers of web page requests, and thereby improves the efficiency of processing web page request data.

1. A method of data processing, the method comprising:

determining a plurality of web pages accessed by a plurality of web page requests based on the plurality of web page requests within a target time period;

acquiring a plurality of Target Group Indexes (TGIs) corresponding to the plurality of webpages respectively, wherein the TGI of any webpage in the plurality of webpages is used for expressing the importance degree of any webpage in the plurality of webpages;

determining a plurality of first web pages from the plurality of web pages based on the plurality of TGIs of the plurality of web pages, the TGIs of the plurality of first web pages meeting a first target condition;

and determining a plurality of second web pages from the plurality of first web pages based on a plurality of feature importance parameters of the plurality of first web pages, wherein the feature importance parameters of the plurality of second web pages meet a second target condition.

2. The method of claim 1, wherein the acquiring a plurality of Target Group Indexes (TGIs) corresponding to the plurality of web pages respectively comprises:

determining at least one first webpage request triggered by at least one first account and at least one second webpage request triggered by at least one second account from the plurality of webpage requests;

for any web page in the plurality of web pages, determining, based on the at least one first web page request, a first TGI of the at least one first account for that web page;

determining, based on the at least one second web page request, a second TGI of the at least one second account for that web page;

and determining the TGI corresponding to that web page based on the first TGI and the second TGI.

3. The method of claim 2, wherein the determining, based on the at least one first web page request, a first TGI of the at least one first account for that web page comprises:

determining, from the at least one first web page request, at least one third web page request that accesses that web page;

determining the ratio of the number of the at least one third web page request to the number of the at least one first web page request as the first TGI.

4. The method of claim 2, wherein the determining, based on the at least one second web page request, a second TGI of the at least one second account for that web page comprises:

determining, from the at least one second web page request, at least one fourth web page request that accesses that web page;

determining the ratio of the number of the at least one fourth web page request to the number of the at least one second web page request as the second TGI.

5. The method of claim 2, wherein the determining the TGI corresponding to that web page based on the first TGI and the second TGI comprises:

determining the ratio of the first TGI to the second TGI as the TGI corresponding to that web page.

6. The method of claim 1, wherein prior to determining a plurality of second web pages from the plurality of first web pages based on a plurality of feature importance parameters of the plurality of first web pages, the method further comprises:

acquiring a first access frequency of at least one first account to the plurality of first webpages respectively and a second access frequency of at least one second account to the plurality of first webpages respectively based on webpage requests corresponding to the plurality of first webpages;

training to obtain an account classification model based on the first access frequency and the second access frequency, wherein the account classification model is used for identifying account categories of input accounts, and the account categories at least comprise the first account and the second account;

acquiring, based on the account classification model, the feature importance parameters of the plurality of first web pages, wherein a feature importance parameter represents how important the input account's access frequency to any one first web page is, compared with its access frequency to the other first web pages, in the process of identifying the account category.

7. The method of claim 6, wherein the training to obtain an account classification model based on the first access frequency and the second access frequency comprises:

constructing a positive sample frequency feature based on the first access frequency;

constructing a negative sample frequency feature based on the second access frequency;

and iteratively training an initial classification model based on the positive sample frequency feature and the negative sample frequency feature to obtain the account classification model.

8. A data processing apparatus, characterized in that the apparatus comprises:

a first determining module, configured to determine, based on a plurality of web page requests within a target time period, a plurality of web pages visited by the plurality of web page requests;

a first obtaining module, configured to obtain a plurality of Target Group Indexes (TGIs) corresponding to the plurality of web pages respectively, wherein the TGI of any web page in the plurality of web pages represents the importance degree of that web page among the plurality of web pages;

a second determining module for determining a plurality of first web pages from the plurality of web pages based on the plurality of TGIs of the plurality of web pages, the TGIs of the plurality of first web pages meeting a first target condition;

the second determining module is further configured to determine a plurality of second webpages from the plurality of first webpages based on a plurality of feature importance parameters of the plurality of first webpages, where the feature importance parameters of the plurality of second webpages meet a second target condition.

9. A computer device, characterized in that the computer device comprises one or more processors and one or more memories in which at least one computer program is stored, the at least one computer program being loaded and executed by the one or more processors to implement the data processing method according to any one of claims 1 to 7.

10. A storage medium having stored therein at least one computer program which is loaded and executed by a processor to implement the data processing method of any one of claims 1 to 7.

Technical Field

The present application relates to the field of network technologies, and in particular, to a data processing method and apparatus, a computer device, and a storage medium.

Background

With the development of network technology, a user can send an HTTP (Hypertext Transfer Protocol) request to a server at any time and any place through an application installed on a terminal to load a web resource provided by the server. In a service scene, a server receives a large number of HTTP requests of users every day, and how to analyze data of the large number of HTTP requests to mine a webpage resource with a high service value gradually becomes a research hotspot.

At present, HTTP request data can be analyzed manually: based on an understanding of the service, a user analyzes the relevant traffic packets with tools such as Burp Suite (an integrated platform for attacking web applications) or Wireshark (an application for analyzing network packets) to dig out web resources with high service value.

Disclosure of Invention

The embodiments of the present application provide a data processing method and apparatus, a computer device, and a storage medium, which can reduce the labor cost of processing HTTP request data and improve the efficiency of processing HTTP request data. The technical solution is as follows:

in one aspect, a data processing method is provided, and the method includes:

determining a plurality of web pages accessed by a plurality of web page requests based on the plurality of web page requests within a target time period;

In one aspect, a data processing apparatus is provided, the apparatus comprising:

a first determining module, configured to determine, based on a plurality of web page requests within a target time period, a plurality of web pages visited by the plurality of web page requests;

In one possible implementation, the first obtaining module includes:

a first determining unit, configured to determine, from the multiple web page requests, at least one first web page request triggered by at least one first account and at least one second web page request triggered by at least one second account;

a second determining unit, configured to determine, for any web page of the plurality of web pages, based on the at least one first web page request, a first TGI of the at least one first account for that web page;

a third determining unit, configured to determine, based on the at least one second web page request, a second TGI of the at least one second account for that web page;

and a fourth determining unit, configured to determine the TGI corresponding to that web page based on the first TGI and the second TGI.

In one possible implementation, the second determining unit is configured to:

determine, from the at least one first web page request, at least one third web page request that accesses that web page;

determine the ratio of the number of the at least one third web page request to the number of the at least one first web page request as the first TGI.

In one possible implementation, the third determining unit is configured to:

determine, from the at least one second web page request, at least one fourth web page request that accesses that web page;

determine the ratio of the number of the at least one fourth web page request to the number of the at least one second web page request as the second TGI.

In one possible implementation, the fourth determining unit is configured to:

determine the ratio of the first TGI to the second TGI as the TGI corresponding to that web page.

In one possible embodiment, the apparatus further comprises:

a second obtaining module, configured to obtain, based on the web page requests corresponding to the multiple first web pages, first access frequencies of the at least one first account on the multiple first web pages, respectively, and second access frequencies of the at least one second account on the multiple first web pages, respectively;

a training module, configured to train to obtain an account classification model based on the first access frequency and the second access frequency, where the account classification model is used to identify account categories of input accounts, and the account categories at least include the first account and the second account;

a third obtaining module, configured to obtain, based on the account classification model, the feature importance parameters of the first webpages, where the feature importance parameters are used to indicate an importance degree of access frequency of an input account to any one first webpage compared with access frequency of the input account to other first webpages in a process of identifying an account category.

In one possible embodiment, the training module is configured to:

constructing a positive sample frequency characteristic based on the first access frequency;

constructing a negative sample frequency characteristic based on the second access frequency;

In one aspect, a computer device is provided, the computer device comprising one or more processors and one or more memories, the one or more memories storing therein at least one computer program, the at least one computer program being loaded and executed by the one or more processors to implement the data processing method of any one of the possible implementations as described above.

In one aspect, a storage medium is provided, in which at least one computer program is stored, the at least one computer program being loaded and executed by a processor to implement the data processing method according to any one of the above possible implementations.

In one aspect, a computer program product or computer program is provided that includes one or more program codes stored in a computer readable storage medium. The one or more processors of the computer device can read the one or more program codes from the computer-readable storage medium, and the one or more processors execute the one or more program codes, so that the computer device can execute the data processing method of any one of the above-mentioned possible embodiments.

The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:

By accumulating the web page requests within the target time period, preliminarily screening out, based on the TGIs of the web pages accessed by those requests, the first web pages that meet the first target condition, and then screening out from the first web pages the second web pages that meet the second target condition, the labor cost of mining web pages with high business value is greatly reduced, web pages with high business value can be mined automatically from massive numbers of web page requests, and the efficiency of processing web page request data is improved.
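The two-stage screening summarized above can be illustrated with a minimal Python sketch. The text does not specify the first and second target conditions, so the sketch assumes both are simple lower-bound thresholds; the function name `screen_pages`, the parameter names, and the sample values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def screen_pages(
    tgi: Dict[str, float],
    importance: Dict[str, float],
    tgi_threshold: float,
    importance_threshold: float,
) -> List[str]:
    """Two-stage screening: by TGI first, then by feature importance."""
    # Stage 1 (assumed first target condition): TGI at or above a threshold.
    first_pages = [page for page, value in tgi.items() if value >= tgi_threshold]
    # Stage 2 (assumed second target condition): feature importance at or
    # above a threshold, evaluated only for pages that passed stage 1.
    second_pages = [
        page
        for page in first_pages
        if importance.get(page, 0.0) >= importance_threshold
    ]
    return sorted(second_pages)


# Illustrative values only; they are not taken from the application.
tgi = {"a.com/pay": 3.2, "a.com/home": 0.9, "a.com/cart": 2.1}
importance = {"a.com/pay": 0.6, "a.com/cart": 0.05}
print(screen_pages(tgi, importance, tgi_threshold=1.5, importance_threshold=0.1))
```

Other conditions, such as keeping the top-k pages by TGI, would fit the same two-stage structure.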

Drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of an implementation environment of a data processing method according to an embodiment of the present application;

Fig. 2 is a flowchart of a data processing method provided in an embodiment of the present application;

Fig. 3 is a flowchart of a data processing method provided in an embodiment of the present application;

Fig. 4 is a block diagram of a data processing platform according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;

Fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

The terms "first," "second," and the like in this application are used for distinguishing between similar items and items that have substantially the same function or similar functionality, and it should be understood that "first," "second," and "nth" do not have any logical or temporal dependency or limitation on the number or order of execution.

The term "at least one" in this application means one or more, and the meaning of "a plurality" means two or more, for example, a plurality of first locations means two or more first locations.

The embodiment of the present application relates to a big data analysis processing process in the cloud technology field, and therefore before introducing the embodiment of the present application, some basic concepts in the cloud technology field are introduced first, and the following description is given.

Cloud technology: a management technology that unifies hardware, software, network, and other resources within a wide area network or a local area network to realize the computation, storage, processing, and sharing of data. It is also a general term for the network technology, information technology, integration technology, management platform technology, application technology, and the like that are applied on the basis of the cloud computing business model; these can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support in the cloud technology field. The background services of technical network systems, such as video websites, picture websites, and portal websites, require large amounts of computing and storage resources. As the internet industry develops and is applied further, each item may come to have its own identification mark that needs to be transmitted to a background system for logical processing; data of different levels are processed separately, and all kinds of industry data require strong back-end system support, which can be realized through cloud computing.

Big data: a data set that cannot be captured, managed, and processed with conventional software tools within a certain time range; it is a massive, fast-growing, and diversified information asset that yields stronger decision-making power, insight discovery, and process optimization capability only under new processing modes. With the advent of the cloud era, big data has attracted more and more attention, and special techniques are needed to process large amounts of data effectively within a tolerable elapsed time. Technologies suitable for big data include massively parallel processing databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.

The embodiments of the present application also relate to machine learning techniques in the field of artificial intelligence, and some basic concepts in the field of artificial intelligence are introduced below.

Artificial Intelligence (AI): artificial intelligence is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

Artificial intelligence is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.

Machine Learning (ML): machine learning is a multi-disciplinary field that involves probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning.

With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like. The scheme provided by the embodiment of the application relates to the technology such as machine learning of artificial intelligence, and is explained by the following embodiment.

Fig. 1 is a schematic diagram of an implementation environment of a data processing method according to an embodiment of the present application. Referring to fig. 1, the implementation environment includes a terminal 101 and a server 102, and is described as follows:

The terminal 101 has installed and runs an application supporting web services, including but not limited to: a browser application, a social application, an ordering application, a payment application, a ride-hailing application, a short video application, and the like. In one example, the terminal 101 sends a web page request for accessing a web page directly to the server 102 through a browser application; in another example, the terminal 101 sends the web page request to the server 102 through an embedded program built into a social application. This is not specifically limited in the embodiments of the present application.

The terminal 101 may be directly or indirectly connected to the server 102 through wired or wireless communication, and the connection manner is not limited in this embodiment of the application.

The server 102 is configured to provide a background service for the application program, and the server 102 includes at least one of a server, a plurality of servers, a cloud computing platform, or a virtualization center. Optionally, the server 102 undertakes primary computational work and the terminal 101 undertakes secondary computational work; or, the server 102 undertakes the secondary computing work, and the terminal 101 undertakes the primary computing work; alternatively, the server 102 and the terminal 101 perform cooperative computing by using a distributed computing architecture.

In some embodiments, the server 102 is an independent physical server, or a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, web services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), big data and artificial intelligence platforms, and the like.

In some embodiments, the terminal 101 is a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, an e-book reader, and the like, but is not limited thereto.

Those skilled in the art will appreciate that the number of terminals 101 described above may be greater or fewer. For example, the number of the terminals 101 may be only one, or the number of the terminals 101 may be several tens or hundreds, or more. The number and the device type of the terminals 101 are not limited in the embodiment of the present application.

Fig. 2 is a flowchart of a data processing method according to an embodiment of the present application. Referring to Fig. 2, the embodiment is applied to a computer device; the following description takes the case in which the computer device is a server as an example. The embodiment includes the following steps:

201. The server determines a plurality of web pages accessed by the plurality of web page requests based on the plurality of web page requests within the target time period.

Optionally, the server may be a website server, or may be a server cluster integrated by multiple servers and used for providing multiple services, where the multiple services at least include a web service, and this is not specifically limited in this embodiment of the present application.

In some embodiments, the server may use a data collection module to collect, in real time, the web page requests sent by the terminals, and store the collected web page requests in a target database. The data collection module may be a traffic data collection and analysis platform, and the target database may be a data storage and analysis system. Optionally, the target database accumulates the web page requests received each day with each day as a partition. For example, in a segment-page storage structure, the web page requests received on the same day are stored in the same storage page; in a key-value storage structure, the date of each day is used as the key and the web page requests received on that day as the value, so that the web page requests are stored as key-value pairs.
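As a minimal sketch of the key-value variant described above, the following Python snippet uses an in-memory dictionary as a stand-in for the target database. The names `store` and `save_request` and the request field names are hypothetical; a real deployment would use an actual storage system, which the text does not name.

```python
from collections import defaultdict
from datetime import date
from typing import DefaultDict, List

# Hypothetical in-memory stand-in for the target database:
# key = ISO date of the day, value = web page requests received on that day.
store: DefaultDict[str, List[dict]] = defaultdict(list)


def save_request(day: date, request: dict) -> None:
    """Append a web page request to the partition keyed by its date."""
    store[day.isoformat()].append(request)


save_request(date(2021, 3, 10), {"host": "a.com", "cgi": "/pay"})
save_request(date(2021, 3, 10), {"host": "a.com", "cgi": "/home"})
save_request(date(2021, 3, 11), {"host": "a.com", "cgi": "/pay"})
print({day: len(requests) for day, requests in store.items()})
```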

In some embodiments, the server reads all the web page requests within the target time period from the target database as the plurality of web page requests; or the server randomly selects a target number of web page requests from all the web page requests within the target time period in the target database as the plurality of web page requests; or the server selects a target number of web page requests from all the web page requests within the target time period in the target database according to a target policy as the plurality of web page requests. This is not specifically limited in the embodiments of the present application. The target time period may be any time period; for example, from N days ago (N ≥ 0) to the current day, from the first day of the current month to the current day, or the last week.
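A minimal sketch of reading the requests in a target time period, with optional random selection of a target number of requests, might look as follows. It assumes a date-keyed dictionary as the store and a window running from `n_days` ago through the current day; the function name, the parameters, and the fixed seed are illustrative assumptions.

```python
import random
from datetime import date, timedelta
from typing import Dict, List, Optional


def requests_in_window(
    store: Dict[str, List[dict]],
    today: date,
    n_days: int,
    sample_size: Optional[int] = None,
    seed: int = 0,
) -> List[dict]:
    """Collect requests from `n_days` ago through `today`, optionally sampled."""
    selected: List[dict] = []
    for offset in range(n_days + 1):
        key = (today - timedelta(days=offset)).isoformat()
        selected.extend(store.get(key, []))
    if sample_size is not None and sample_size < len(selected):
        # Random selection of a target number of requests (seeded for repeatability).
        selected = random.Random(seed).sample(selected, sample_size)
    return selected


store = {
    "2021-03-09": [{"id": 1}],
    "2021-03-10": [{"id": 2}, {"id": 3}],
    "2021-03-11": [{"id": 4}],
}
window = requests_in_window(store, date(2021, 3, 11), n_days=1)
print(sorted(request["id"] for request in window))
```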

Optionally, the target policy includes, but is not limited to: selecting web page requests whose data volume is greater than a target data volume, selecting web page requests whose access frequency is greater than a frequency threshold, or selecting web page requests that carry a specified character string. The target policy can be customized by a developer, and its content is not specifically limited in the embodiments of the present application.
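One way to picture such a developer-defined policy is as a predicate applied to each request. The sketch below is an assumption about how this could be organized, not the method of the application; the request fields `size` and `url` and all function names are hypothetical.

```python
from typing import Callable, List

# A policy is modeled here as a predicate over a single request (an assumption).
Policy = Callable[[dict], bool]


def larger_than(min_bytes: int) -> Policy:
    """Keep requests whose data volume exceeds `min_bytes`."""
    return lambda request: request.get("size", 0) > min_bytes


def carries(substring: str) -> Policy:
    """Keep requests whose URL carries the specified character string."""
    return lambda request: substring in request.get("url", "")


def select(requests: List[dict], policy: Policy) -> List[dict]:
    """Return the requests that satisfy the policy, in their original order."""
    return [request for request in requests if policy(request)]


requests = [
    {"url": "a.com/pay?order=1", "size": 2048},
    {"url": "a.com/home", "size": 128},
]
print(select(requests, larger_than(1024)))
print(select(requests, carries("home")))
```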

In some embodiments, after the server obtains the plurality of web page requests from the target database, a data preprocessing module may be used to preprocess the plurality of web page requests. Optionally, the data preprocessing operation includes: for any web page request in the plurality of web page requests, the server reads a first target field and a second target field of the web page request, and concatenates the first target field and the second target field as character strings to obtain the http_req of the web page accessed by that web page request.

In some embodiments, the first target field is used to indicate a website name requested to be accessed by the any web page request, for example, the first target field may be a HOST field of an HTTP request, the second target field is used to indicate a page name requested to be accessed by the any web page request, for example, the second target field may be a CGI (Common Gateway Interface) field of an HTTP request.

In some embodiments, after the server concatenates the character strings to obtain the http_req, the server may further replace any run of consecutive digits between two '/' characters in the http_req with the character string 'NUM'. This simplifies the http_req data to be processed and facilitates the subsequent mining of traffic with high business value. Alternatively, the server may skip this replacement step, which simplifies the data processing flow.
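The concatenation and the optional digit replacement can be sketched in Python as follows. The sketch assumes the HOST and CGI values are already available as strings and interprets "between two '/' characters" strictly, so a trailing numeric segment without a closing '/' is left unchanged; the function name and the example values are hypothetical.

```python
import re

# A run of digits that sits directly between two '/' characters.
_DIGITS_BETWEEN_SLASHES = re.compile(r"(?<=/)\d+(?=/)")


def build_http_req(host: str, cgi: str, normalize_numbers: bool = True) -> str:
    """Concatenate HOST and CGI, optionally replacing numeric path segments."""
    http_req = host + cgi
    if normalize_numbers:
        http_req = _DIGITS_BETWEEN_SLASHES.sub("NUM", http_req)
    return http_req


print(build_http_req("example.com", "/user/12345/orders/678/detail"))
# prints "example.com/user/NUM/orders/NUM/detail"
```

With this normalization, requests that differ only in a numeric identifier map to the same http_req and are counted as visits to the same web page.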

202. The server acquires a plurality of Target Group Indexes (TGIs) corresponding to the plurality of web pages respectively, wherein the TGI of any one of the plurality of web pages is used for representing the importance degree of any one of the web pages in the plurality of web pages.

TGI refers to the Target Group Index, an index that reflects the strength or weakness of a target group within a specific research scope and represents how the concerns of users with different characteristics differ.

In some embodiments, taking any one of the plurality of web pages as an example, the server may take the ratio of the number of web page requests accessing that web page to the total number of the plurality of web page requests as the TGI of that web page, so that the TGI reflects the importance of that web page among all the web pages. The TGI of the web page may be expressed by the following formula:

$$TGI = \frac{num_{http\_req}}{num_{all}}$$

where $num_{http\_req}$ denotes the number of accesses to the web page http_req (that is, the number of web page requests accessing http_req), and $num_{all}$ denotes the number of accesses to all http_req (that is, the number of the plurality of web page requests).
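As a small worked example of this ratio, the following Python sketch counts how often each http_req appears in a list of requests and divides by the total. The function name `page_tgi` and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def page_tgi(http_reqs: List[str]) -> Dict[str, float]:
    """TGI of each page: requests to the page divided by all requests."""
    total = len(http_reqs)
    if total == 0:
        return {}
    counts = Counter(http_reqs)
    return {page: count / total for page, count in counts.items()}


# Four requests in total, two of which access a.com/pay.
print(page_tgi(["a.com/pay", "a.com/pay", "a.com/home", "a.com/cart"]))
```

Here a.com/pay receives 2 of the 4 requests, so its TGI is 0.5, while the other two pages each have a TGI of 0.25.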

In some embodiments, on the basis of the above TGI formula, the TGI of each web page may be obtained separately for different accounts in a specific service scenario, and a comprehensive TGI of each web page may then be obtained from those per-account TGIs.

That is, the server may determine, from the plurality of web page requests, at least one first web page request triggered by at least one first account and at least one second web page request triggered by at least one second account; for any web page in the plurality of web pages, determine, based on the at least one first web page request, a first TGI of the at least one first account for that web page; determine, based on the at least one second web page request, a second TGI of the at least one second account for that web page; and determine the TGI corresponding to that web page based on the first TGI and the second TGI.
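The steps above can be sketched in Python for the implementation in which the TGI of a web page is the ratio of the first TGI to the second TGI. The text does not say what happens when no second account accesses a page, so the sketch assumes such a page receives an infinite TGI; that choice, the function names, and the sample data are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def group_tgi(http_reqs: List[str]) -> Dict[str, float]:
    """Share of one account group's requests that go to each page."""
    total = len(http_reqs)
    counts = Counter(http_reqs)
    return {page: count / total for page, count in counts.items()} if total else {}


def combined_tgi(first_reqs: List[str], second_reqs: List[str]) -> Dict[str, float]:
    """Ratio of the first TGI to the second TGI for every observed page."""
    first = group_tgi(first_reqs)
    second = group_tgi(second_reqs)
    result: Dict[str, float] = {}
    for page in set(first) | set(second):
        first_value = first.get(page, 0.0)
        second_value = second.get(page, 0.0)
        # Assumption: a page never accessed by second accounts gets an infinite TGI.
        result[page] = first_value / second_value if second_value else math.inf
    return result


first_reqs = ["pay", "pay", "home", "cart"]   # requests triggered by first accounts
second_reqs = ["home", "home", "home", "pay"]  # requests triggered by second accounts
print(combined_tgi(first_reqs, second_reqs))
```

In this example the page "pay" has a first TGI of 0.5 and a second TGI of 0.25, giving a TGI of 2.0, which indicates that first accounts access it relatively more often than second accounts do.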

Optionally, the first account and the second account belong to different account categories, and whether a certain user account is the first account or the second account may be divided according to a service situation in a specific service scenario. For example, the user account is classified according to the consumption amount, or the user account is classified according to the daily active time, or the user account is classified according to the total access times, which is not specifically limited in this embodiment of the present application.

In one example, the user accounts are classified according to consumption amount, the user accounts with consumption amount larger than or equal to a target amount are acquired as a first account, and the user accounts with consumption amount smaller than the target amount are acquired as a second account. Because the first account and the second account have different business characteristics, that is, represent that a first user corresponding to the first account and a second user corresponding to the second account are different user groups, in the embodiment of the present application, the first user is referred to as a conversion crowd, and the second user is referred to as a common crowd.

Optionally, the server may divide the first account and the second account according to the service condition, or the server may further acquire all the first accounts from a third-party organization related to the specified service scenario, and acquire user accounts other than the first accounts as the second accounts, which is not specifically limited in this embodiment of the present application.

In some embodiments, upon obtaining the first TGI, the server may determine, from the at least one first web page request, at least one third web page request to access the any web page; determining a ratio between the number of the at least one third web page request and the number of the at least one first web page request as the first TGI.

In the above process, the server obtains, based on a first web page request triggered by the first account, a first TGI of the any web page relative to the first account, where the first TGI is also capable of representing a TGI index of the any web page in a conversion crowd corresponding to the first account.

In some embodiments, the server may determine, from the at least one second web page request, at least one fourth web page request to access the any web page when obtaining the second TGI; determining a ratio between the number of the at least one fourth web page request and the number of the at least one second web page request as the second TGI.

In the above process, the server obtains, based on the second web page requests triggered by the second account, a second TGI of the web page relative to the second account; the second TGI also represents the TGI of the web page in the common crowd corresponding to the second account.

In some embodiments, when obtaining the comprehensive TGI, the server may determine the ratio between the first TGI and the second TGI as the TGI corresponding to the web page. That is, the final TGI of any web page is the ratio of the first TGI of that web page in the conversion crowd to the second TGI of that web page in the common crowd, and can be expressed by the following formula:

$$TGI_{http\_req} = \frac{TGI_{http\_req}^{conv}}{TGI_{http\_req}^{common}}$$

where $TGI_{http\_req}^{conv}$ denotes the first TGI of any web page http_req in the conversion crowd, and $TGI_{http\_req}^{common}$ denotes the second TGI of that web page http_req in the common crowd.

In the above process, a TGI is obtained for each web page from the web page requests accumulated in the target time period, so that the TGI reflects the importance of each web page among all web pages and web pages with higher business value can be mined based on the TGI (i.e., high-value traffic is mined). Further, when the TGI of each web page is obtained, the first TGI and the second TGI are obtained for the first account and the second account respectively, and the ratio of the first TGI to the second TGI is used as the final TGI of the web page, so that the TGI is more expressive and reflects how the importance of each web page differs between user groups.
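The two-group computation can be sketched as follows. This is a minimal sketch under stated assumptions: requests are assumed to be `(account, page)` pairs, every account outside `first_accounts` is treated as a second account, and the names `request_share` and `composite_tgi` are hypothetical. The text does not say what to do when no second-account request accesses a page, so the sketch returns `None` in that case.

```python
from collections import Counter


def request_share(pages):
    """Share of requests per page within one account group."""
    total = len(pages)
    return {page: n / total for page, n in Counter(pages).items()} if total else {}


def composite_tgi(requests, first_accounts):
    """requests: iterable of (account, page); returns {page: first TGI / second TGI}."""
    first_pages = [page for account, page in requests if account in first_accounts]
    second_pages = [page for account, page in requests if account not in first_accounts]
    first_tgi = request_share(first_pages)    # TGI in the conversion crowd
    second_tgi = request_share(second_pages)  # TGI in the common crowd
    result = {}
    for page in set(first_tgi) | set(second_tgi):
        second = second_tgi.get(page, 0.0)
        # Assumption: the ratio is undefined without second-account requests.
        result[page] = first_tgi.get(page, 0.0) / second if second > 0 else None
    return result


# Hypothetical sample: account "a" is a first account, "b" and "c" are second accounts.
requests = [("a", "/pay"), ("a", "/pay"), ("a", "/home"),
            ("b", "/home"), ("b", "/home"), ("c", "/home"), ("c", "/pay")]
tgi = composite_tgi(requests, first_accounts={"a"})
```

In this sample, "/pay" takes 2/3 of the first-account requests but only 1/4 of the second-account requests, so its ratio is well above 1, while "/home" falls below 1.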

203. The server determines a plurality of first web pages from the plurality of web pages based on the plurality of TGIs of the plurality of web pages, the TGIs of the plurality of first web pages meeting a first target condition.

In some embodiments, the first target condition is that the TGI is greater than a TGI threshold; in this case, the server may determine, from the plurality of web pages, the web pages whose TGI is greater than the TGI threshold as the plurality of first web pages. The TGI threshold can be customized by a developer and can be any value greater than or equal to 0 and less than or equal to 100; for example, the TGI threshold is 80.

In some embodiments, the first target condition is that the TGI ranking is located at the top K (topK, K ≧ 1), and then the server may rank the web pages in the order of TGI from large to small, and determine the web page ranked at the top K as the first web pages.

In the above process, the web pages are preliminarily screened according to the values of their TGIs to obtain the first web pages, which are potential web pages with high business value; the final second web pages with high business value are then obtained by a secondary screening of the first web pages in step 204 below.

In some embodiments, in addition to screening the first web pages by TGI, the server may obtain a plurality of crowd coverage amounts of the plurality of web pages, because in a specified service scenario it may only be necessary to focus on web pages with large traffic. Optionally, the crowd coverage amount of a web page indicates the number of accounts among the at least one first account that access the web page; that is, it indicates how widely the web page is covered in the conversion crowd.

In some embodiments, the first target condition is that the TGI is greater than the TGI threshold and the crowd coverage amount is greater than a coverage amount threshold; in this case, the server may determine the web pages whose TGI is greater than the TGI threshold and whose crowd coverage amount is greater than the coverage amount threshold as the plurality of first web pages.

In some embodiments, the first target condition is that the TGI is greater than the TGI threshold and the crowd coverage rate is greater than a coverage rate threshold. In this case, the server may obtain a plurality of crowd coverage rates of the plurality of web pages based on the plurality of crowd coverage amounts of the plurality of web pages, where the crowd coverage rate of each web page is the ratio between the crowd coverage amount of that web page and the number of all first accounts. The server may then determine the web pages whose TGI is greater than the TGI threshold and whose crowd coverage rate is greater than the coverage rate threshold as the plurality of first web pages.

In some embodiments, the first target condition is that the TGI ranks in the top K and the crowd coverage is greater than the coverage threshold; in this case, the server may determine the web pages whose TGI ranks in the top K and whose crowd coverage is greater than the coverage threshold as the plurality of first web pages.

It should be noted that the first target conditions described above are only exemplary illustrations, and a developer may set different first target conditions according to business requirements, and the content of the first target conditions is not specifically limited in the embodiments of the present application.
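The first target conditions listed above can be sketched as simple filters. The sketch below is illustrative only: the function names, the sample TGI values and the thresholds are hypothetical, and the crowd coverage rate is computed as the number of first accounts that access a page divided by the number of all first accounts.

```python
def select_first_pages(tgi, covered_accounts, num_first_accounts,
                       tgi_threshold, coverage_rate_threshold):
    """Keep pages whose TGI and crowd coverage rate both exceed their thresholds."""
    selected = []
    for page, value in tgi.items():
        coverage_amount = len(covered_accounts.get(page, ()))
        coverage_rate = coverage_amount / num_first_accounts
        if value > tgi_threshold and coverage_rate > coverage_rate_threshold:
            selected.append(page)
    # Return the kept pages ordered by TGI from large to small.
    return sorted(selected, key=lambda page: tgi[page], reverse=True)


def top_k_pages(tgi, k):
    """Alternative condition: the K pages with the largest TGI."""
    return sorted(tgi, key=tgi.get, reverse=True)[:k]


# Hypothetical values: TGI per page and the first accounts that accessed each page.
tgi = {"/pay": 2.5, "/home": 0.4, "/vip": 3.0}
covered = {"/pay": {"a", "b", "c"}, "/home": {"a", "b", "c", "d"}, "/vip": {"a"}}
first_pages = select_first_pages(tgi, covered, num_first_accounts=4,
                                 tgi_threshold=1.0, coverage_rate_threshold=0.5)
top_two = top_k_pages(tgi, 2)
```

Here "/vip" has the largest TGI but is accessed by only one of the four first accounts, so the combined condition drops it while the pure top-K condition keeps it.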

204. The server determines a plurality of second webpages from the plurality of first webpages based on a plurality of characteristic importance parameters of the plurality of first webpages, wherein the characteristic importance parameters of the plurality of second webpages meet a second target condition.

In some embodiments, before the server filters the second web pages from the first web pages, the server may obtain the feature importance parameter of each first web page by: acquiring a first access frequency of at least one first account to the plurality of first webpages respectively and a second access frequency of at least one second account to the plurality of first webpages respectively based on webpage requests corresponding to the plurality of first webpages; training to obtain an account classification model based on the first access frequency and the second access frequency, wherein the account classification model is used for identifying account types of input accounts, and the account types at least comprise the first account and the second account; based on the account classification model, the feature importance parameters of the first webpages are obtained, and the feature importance parameters are used for representing the importance degree of the access frequency of the input account to any first webpage compared with the access frequency of the input account to other first webpages in the process of identifying the account category.

In some embodiments, the server may count a first access frequency of each first account to each first web page and a second access frequency of each second account to each first web page. Based on the first access frequency and the second access frequency, the server may construct a frequency feature of the first account and a frequency feature of the second account, respectively, use the frequency feature of the first account as a positive sample and the frequency feature of the second account as a negative sample, and train a binary classification model, that is, the account classification model.

In some embodiments, the server may construct a positive sample frequency feature based on the first access frequency when training the account classification model; constructing a negative sample frequency characteristic based on the second access frequency; and iteratively training an initial classification model based on the positive sample frequency characteristic and the negative sample frequency characteristic to obtain the account classification model.

Optionally, for each first account, the server constructs a frequency feature vector according to the first access frequency of each first account to all first webpages, and repeats the above steps for each first account, so as to obtain the respective frequency feature vectors of all first accounts. Similarly, for each second account, the server constructs a frequency characteristic vector according to the second access frequency of each second account to all the first webpages, and repeatedly executes the steps for each second account, so that the respective frequency characteristic vectors of all the second accounts can be obtained.
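The construction of frequency feature vectors can be sketched as below. This is a sketch under assumptions: requests are `(account, page)` pairs, the helper names `frequency_vectors` and `build_samples` are hypothetical, and accounts that never access a first web page simply receive an all-zero vector.

```python
def frequency_vectors(requests, first_pages):
    """Return {account: [access count for each first page, in first_pages order]}."""
    index = {page: i for i, page in enumerate(first_pages)}
    vectors = {}
    for account, page in requests:
        vector = vectors.setdefault(account, [0] * len(first_pages))
        if page in index:  # requests to pages outside the first pages are ignored
            vector[index[page]] += 1
    return vectors


def build_samples(vectors, first_accounts):
    """Label first accounts as positive (1) and all other accounts as negative (0)."""
    features, labels = [], []
    for account in sorted(vectors):
        features.append(vectors[account])
        labels.append(1 if account in first_accounts else 0)
    return features, labels


# Hypothetical requests; "/pay" and "/vip" are assumed to be the first web pages.
requests = [("a", "/pay"), ("a", "/pay"), ("a", "/vip"), ("b", "/home"), ("b", "/pay")]
vectors = frequency_vectors(requests, first_pages=["/pay", "/vip"])
X, y = build_samples(vectors, first_accounts={"a"})
```

Each position of a vector corresponds to one first web page, which is what later allows a per-feature importance to be read back as a per-page importance.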

Optionally, taking the account classification model being a binary classification model as an example, the server sets the first accounts as positive samples and the second accounts as negative samples, and inputs the frequency feature vectors of all accounts (including the first accounts and the second accounts) into an initial classification model. The initial classification model predicts the account category of each account to obtain a prediction result for each account, and the prediction result of each account is compared with the real result of that account to obtain the loss function value of the current iteration. If the stop condition is not met, the parameters of the initial classification model are adjusted and the above steps are executed iteratively; when the stop condition is met, the iteration stops and the account classification model is obtained.

In some embodiments, the account classification model may be a classification model having any architecture, such as GB (Gradient Boosting), GBDT (Gradient Boosting Decision Tree), XGBoost (eXtreme Gradient Boosting), LightGBM (Light Gradient Boosting Machine), and the like.

Optionally, the stop condition is that the loss function value is smaller than the loss threshold, or the stop condition is that the iteration number is greater than the target number, which is not specifically limited in this embodiment of the application.

In some embodiments, the server may employ a loss function in different forms, for example, a loss function in the form of a mean square error, or a loss function in the form of cross entropy, and the like, which is not specifically limited in this embodiment.
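The iterate-until-stop-condition procedure can be sketched with a deliberately simple stand-in model. Note the substitution: the sketch trains a plain logistic regression by gradient descent with a cross-entropy loss, not one of the tree-based models named above, and all names, sample vectors and hyperparameters are hypothetical. It only illustrates the two stop conditions (loss below a loss threshold, or iteration count reaching a target number).

```python
import math


def predict(w, b, x):
    """Predicted probability that the account with feature vector x is a first account."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(X, y, w, b):
    eps = 1e-12  # avoids log(0)
    losses = []
    for x, label in zip(X, y):
        p = predict(w, b, x)
        losses.append(-(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)))
    return sum(losses) / len(losses)


def train(X, y, lr=0.5, loss_threshold=0.1, max_iters=5000):
    w, b = [0.0] * len(X[0]), 0.0
    for iteration in range(1, max_iters + 1):
        loss = cross_entropy(X, y, w, b)
        if loss < loss_threshold:  # stop condition 1: loss below the loss threshold
            break
        errors = [predict(w, b, x) - label for x, label in zip(X, y)]
        for j in range(len(w)):  # adjust the model parameters
            w[j] -= lr * sum(e * x[j] for e, x in zip(errors, X)) / len(X)
        b -= lr * sum(errors) / len(X)
    # stop condition 2: the loop ends once max_iters iterations have run
    return w, b, loss, iteration


# Hypothetical frequency feature vectors: two positive and two negative samples.
X = [[3, 1], [2, 2], [0, 1], [1, 0]]
y = [1, 1, 0, 0]
w, b, loss, iterations = train(X, y)
```

Swapping in a mean square error loss would only change `cross_entropy` and the gradient; the stop-condition logic stays the same.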

In an exemplary embodiment, taking the account classification model being an XGBoost model as an example, the XGBoost model is a strong learner integrated from a plurality of weak learners, where each weak learner may be a CART (Classification And Regression Tree) or a linear classifier (gblinear), which is not specifically limited in this embodiment of the present application. As an ensemble learning model, the XGBoost model can reduce variance and bias and improve the prediction effect; ensemble learning mainly comprises machine learning algorithms such as Boosting, Bagging, and Stacking, of which XGBoost belongs to Boosting.

For the XGBoost model, the server inputs the frequency feature vector of any account into the XGBoost model, that is, into each of the weak learners. Each weak learner performs feature splitting on the frequency feature vector of the account until it reaches a leaf node of the decision tree corresponding to that weak learner, so that the weak learner outputs a leaf node score for the account. Finally, the server weights the leaf node scores output by the weak learners to obtain a prediction probability of whether the account belongs to the first account. Optionally, the prediction probability is a value greater than or equal to 0 and less than or equal to 1: the larger the prediction probability, the more likely the XGBoost model considers the account to be a first account, and the smaller the prediction probability, the less likely. Optionally, each decision tree may be a binary tree, that is, each feature split performed by a weak learner divides a node into a left subtree and a right subtree.
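How leaf node scores turn into a prediction probability can be sketched as follows. This is a toy illustration, not the XGBoost implementation: the nested-dictionary tree layout, the field names and the leaf values are all hypothetical, and the sketch assumes that the leaf scores are summed and passed through a logistic function, which is a common way for gradient-boosted binary classifiers to produce a value between 0 and 1.

```python
import math


def leaf_score(tree, features):
    """Walk one binary decision tree until a leaf is reached and return its score."""
    node = tree
    while "leaf" not in node:
        # Each split sends the sample to the left or the right subtree.
        branch = "left" if features[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]


def predict_probability(trees, features, base_score=0.0):
    """Combine the leaf scores of all weak learners into a probability in [0, 1]."""
    margin = base_score + sum(leaf_score(tree, features) for tree in trees)
    return 1.0 / (1.0 + math.exp(-margin))


# Two hypothetical one-split trees over a two-element frequency feature vector.
trees = [
    {"feature": 0, "threshold": 2, "left": {"leaf": -0.4}, "right": {"leaf": 0.6}},
    {"feature": 1, "threshold": 1, "left": {"leaf": -0.2}, "right": {"leaf": 0.3}},
]
p_frequent = predict_probability(trees, [3, 0])  # leaf scores 0.6 and -0.2
p_rare = predict_probability(trees, [0, 0])      # leaf scores -0.4 and -0.2
```

An account whose summed leaf score is positive receives a probability above 0.5, and one whose summed score is negative receives a probability below 0.5.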

In some embodiments, after the account classification model is trained, a feature importance function may be called to obtain a feature importance parameter of each first web page, and taking the XGBoost model as an example, the feature importance function may be an xgb.

In some embodiments, the feature importance parameter includes, but is not limited to, the following types:

(1) Gain (gain) parameter: the relative contribution of the corresponding feature to the model, calculated from the contribution of that feature in each decision tree of the model. A higher gain than that of other features means the feature is more important for generating predictions.

(2) Coverage (cover) parameter: the relative number of observations associated with the feature. For example, suppose there are 100 observations, 4 features and 3 trees, and feature 1 is used to decide the leaf nodes of 10, 5 and 2 observations in tree 1, tree 2 and tree 3, respectively; the coverage of feature 1 is then counted as 10 + 5 + 2 = 17 observations. The same count is made for all 4 features, and the coverage parameter of feature 1 is 17 expressed as a percentage of the total coverage of all features.

(3) Frequency (freq) parameter: a percentage representing the relative number of times a particular feature occurs in the trees of the model. In the above example, if feature 1 is used in 2 splits, 1 split and 3 splits in tree 1, tree 2 and tree 3, respectively, the weight of feature 1 is 2 + 1 + 3 = 6. The frequency of feature 1 is its weight as a percentage of the total weight of all features.
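The arithmetic behind the coverage and frequency parameters can be worked through in a short sketch. The per-tree numbers for feature 1 come from the example above; the numbers for features 2 to 4 are hypothetical and were chosen only so that the totals are easy to check. The helper name `relative_share` is also hypothetical.

```python
def relative_share(per_tree_values):
    """Sum each feature's per-tree values and normalize by the total over all features."""
    totals = {feature: sum(values) for feature, values in per_tree_values.items()}
    grand_total = sum(totals.values())
    return {feature: total / grand_total for feature, total in totals.items()}


# Observations decided by each feature in tree 1, tree 2 and tree 3.
cover_counts = {
    "feature1": [10, 5, 2],   # from the example: 10 + 5 + 2 = 17
    "feature2": [20, 10, 3],  # hypothetical
    "feature3": [8, 0, 12],   # hypothetical
    "feature4": [5, 5, 20],   # hypothetical
}
# Number of splits that use each feature in tree 1, tree 2 and tree 3.
split_counts = {
    "feature1": [2, 1, 3],    # from the example: 2 + 1 + 3 = 6
    "feature2": [1, 2, 1],    # hypothetical
    "feature3": [3, 0, 2],    # hypothetical
    "feature4": [1, 1, 3],    # hypothetical
}
cover = relative_share(cover_counts)
freq = relative_share(split_counts)
```

With these numbers the coverage totals add up to 100 and the split totals to 20, so feature 1 has a coverage share of 17/100 and a frequency share of 6/20.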

Alternatively, the server may adopt any one of a gain parameter, a coverage parameter, or a frequency parameter as the feature importance parameter for indicating how to filter the second web page from the first web page.

In some embodiments, the second target condition is that the feature importance parameter is greater than the feature importance threshold, and the server may determine, from the plurality of first web pages, each first web page whose feature importance parameter is greater than the feature importance threshold as the plurality of second web pages. Wherein, the feature importance threshold can be set by a developer in a self-defined way.

In some embodiments, the second target condition is that the feature importance parameter is ranked at the top T (topT, T ≧ 1), and then the server may rank the first web pages in the descending order of the feature importance parameter, and determine the first web page ranked at the top T as the plurality of second web pages.

In the above process, the first web pages obtained by the preliminary screening in step 203 are screened a second time according to the values of their feature importance parameters to obtain the second web pages, which are the web pages with high business value that are finally mined and may also be referred to as the mined high-value traffic.

In some embodiments, the mined second web pages can be applied to traffic analysis for a specified service scenario, which helps guide how to better carry out business promotion and content recommendation, so that more users in the common crowd can be converted into the conversion crowd.

All the above optional technical solutions can be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.

According to the method provided in this embodiment of the present application, the web page requests within the target time period are accumulated, the first web pages meeting the first target condition are preliminarily screened out based on the TGIs of the web pages accessed by the web page requests, and the second web pages meeting the second target condition are then screened out from the first web pages. This greatly reduces the labor cost of mining web pages with high business value, allows such web pages to be mined automatically from massive web page requests, and thus improves the processing efficiency of web page request data.

Fig. 3 is a flowchart of a data processing method according to an embodiment of the present application. Referring to Fig. 3, the embodiment is applied to a computer device and is described by taking the computer device being a server as an example. The embodiment includes the following steps:

301. The server determines a plurality of web pages accessed by the plurality of web page requests based on the plurality of web page requests within the target time period.

Step 301 is similar to step 201 and will not be described herein.

302. The server determines at least one first webpage request triggered by at least one first account and at least one second webpage request triggered by at least one second account from the plurality of webpage requests.

In some embodiments, for any web page request among the plurality of web page requests, the server reads a third target field in the web page request to obtain the user account that triggered the web page request. Whether the user account is a first account or a second account may then be determined automatically according to the service situation. Alternatively, after obtaining a list of all first accounts from a third-party organization related to the specified service scenario, the server queries whether the user account is in the list: if the user account matches any entry in the list, the user account is determined to be a first account; otherwise, if the user account matches no entry in the list, the user account is determined to be a second account. Optionally, the third target field may be a user identification (User ID) field, which is not specifically limited in this embodiment of the present application.
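The list lookup described above can be sketched as follows. The sketch assumes that each web page request has already been parsed into a dictionary; the field names `user_id` and `url` and the function name `split_requests` are hypothetical, not fields defined by the application.

```python
def split_requests(requests, first_account_list, user_field="user_id"):
    """Split requests into those triggered by first accounts and by second accounts."""
    first_accounts = set(first_account_list)  # a set gives constant-time lookups
    first_requests, second_requests = [], []
    for request in requests:
        if request[user_field] in first_accounts:
            first_requests.append(request)   # the account matches an entry in the list
        else:
            second_requests.append(request)  # the account matches no entry in the list
    return first_requests, second_requests


# Hypothetical parsed requests and a hypothetical list of first accounts.
requests = [
    {"user_id": "u1", "url": "/pay"},
    {"user_id": "u2", "url": "/home"},
    {"user_id": "u1", "url": "/vip"},
]
first_requests, second_requests = split_requests(requests, ["u1", "u9"])
```

Doing this split once during data preprocessing means the later TGI and access-frequency steps can work directly on the two request groups.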

In some embodiments, the server may determine in advance whether the user account corresponding to each web page request is the first account in the data preprocessing work, so that a subsequent data analysis process can be simplified.

303. The server determines, for any web page of the plurality of web pages, a first TGI triggered by the at least one first account and accessing the any web page based on the at least one first web page request.

In some embodiments, the server determines at least one third web page request for accessing the any web page from the at least one first web page request; determining a ratio between the number of the at least one third web page request and the number of the at least one first web page request as the first TGI.

304. The server determines a second TGI triggered by the at least one second account and accessing the any web page based on the at least one second web page request.

In some embodiments, the server determines at least one fourth web page request for accessing the arbitrary web page from the at least one second web page request; determining a ratio between the number of the at least one fourth web page request and the number of the at least one second web page request as the second TGI.

305. The server determines the TGI corresponding to the any web page based on the first TGI and the second TGI.

In some embodiments, the server determines the ratio between the first TGI and the second TGI as the TGI corresponding to the web page. That is, the final TGI of any web page is the ratio of the first TGI of that web page in the conversion crowd to the second TGI of that web page in the common crowd, and can be expressed by the following formula:

$$TGI_{http\_req} = \frac{TGI_{http\_req}^{conv}}{TGI_{http\_req}^{common}}$$

where $TGI_{http\_req}^{conv}$ denotes the first TGI of any web page http_req in the conversion crowd, and $TGI_{http\_req}^{common}$ denotes the second TGI of that web page http_req in the common crowd.

In step 302-.

In other embodiments, the server may also directly obtain the TGI of any web page from the ratio between the number of web page requests for accessing any web page and the number of the web page requests, instead of obtaining different TGIs for different account numbers, so that the TGI obtaining process of each web page can be simplified.

Optionally, the TGI of any web page in the above manner may be expressed as the following formula:

$$TGI_{http\_req} = \frac{num_{http\_req}}{num_{all}}$$

where $num_{http\_req}$ denotes the number of accesses to any web page http_req (i.e., the number of web page requests accessing the web page http_req), and $num_{all}$ denotes the number of accesses to all http_req (i.e., the number of the plurality of web page requests).

306. The server determines a plurality of first web pages from the plurality of web pages based on the plurality of TGIs of the plurality of web pages, the TGIs of the plurality of first web pages meeting a first target condition.

Step 306 is similar to step 203, and is not described herein.

307. The server obtains a first access frequency of at least one first account to the plurality of first webpages respectively and a second access frequency of at least one second account to the plurality of first webpages respectively based on webpage requests corresponding to the plurality of first webpages.

308. The server trains and obtains an account classification model based on the first access frequency and the second access frequency, wherein the account classification model is used for identifying account types of input accounts, and the account types at least comprise the first account and the second account.

In some embodiments, the server constructs a positive sample frequency feature based on the first access frequency; constructing a negative sample frequency characteristic based on the second access frequency; and iteratively training an initial classification model based on the positive sample frequency characteristic and the negative sample frequency characteristic to obtain the account classification model.

In the above process, the server respectively constructs the frequency feature of the first account and the frequency feature of the second account based on the first access frequency and the second access frequency, so that the frequency feature of the first account is used as a positive sample, the frequency feature of the second account is used as a negative sample, and a two-classification model, that is, an account classification model, is obtained through training.

309. The server obtains a plurality of characteristic importance parameters of the first webpages based on the account classification model, wherein the characteristic importance parameters are used for indicating the importance degree of the access frequency of the input account to any first webpage compared with the access frequency of the input account to other first webpages in the process of identifying the account category.

In some embodiments, the feature importance parameter includes, but is not limited to, the gain parameter, the coverage parameter, and the frequency parameter described in step 204 above.

310. The server determines a plurality of second webpages from the plurality of first webpages based on a plurality of characteristic importance parameters of the plurality of first webpages, wherein the characteristic importance parameters of the plurality of second webpages meet a second target condition.

All the above optional technical solutions can be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.

Fig. 4 is a schematic structural diagram of a data processing platform according to an embodiment of the present application. Referring to Fig. 4, the data processing platform includes a data acquisition module 401, a data preprocessing module 402, a high-value traffic mining module 403, and a result storage module 404. The data acquisition module 401 is configured to collect, in real time, the HTTP requests (i.e., web page requests) received by the server and store them in the target database. The data preprocessing module 402 is configured to preprocess the HTTP requests stored in the target database. The high-value traffic mining module 403 is configured to: obtain, based on the preprocessed HTTP requests, a first TGI of each web page in the conversion crowd and a second TGI of each web page in the common crowd; obtain a comprehensive TGI of each web page from its first TGI and second TGI; obtain the crowd coverage of each web page with respect to the conversion crowd; screen out the first web pages (i.e., potential high-value traffic) based on the comprehensive TGI and the crowd coverage of each web page; construct positive and negative sample data sets based on the access frequencies of the conversion crowd and the common crowd to the first web pages and train an account classification model; and screen out the second web pages (i.e., the final high-value traffic) according to the feature importance parameters obtained in the training process. The result storage module 404 is configured to store the results output by the high-value traffic mining module 403 in the target database.

Fig. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application, please refer to fig. 5, where the apparatus includes:

a first determining module 501, configured to determine, based on a plurality of web page requests in a target time period, a plurality of web pages accessed by the plurality of web page requests;

a first obtaining module 502, configured to obtain a plurality of target group indexes TGI corresponding to the plurality of web pages, where the TGI of any one of the plurality of web pages is used to indicate an importance degree of the any one of the plurality of web pages;

a second determining module 503, configured to determine a plurality of first web pages from the plurality of web pages based on the TGIs of the plurality of web pages, where the TGIs of the plurality of first web pages meet a first target condition;

the second determining module 503 is further configured to determine a plurality of second webpages from the plurality of first webpages based on a plurality of feature importance parameters of the plurality of first webpages, where the feature importance parameters of the plurality of second webpages meet a second target condition.

According to the apparatus provided in this embodiment of the present application, the web page requests within the target time period are accumulated, the first web pages meeting the first target condition are preliminarily screened out based on the TGIs of the web pages accessed by the web page requests, and the second web pages meeting the second target condition are then screened out from the first web pages. This greatly reduces the labor cost of mining web pages with high business value, allows such web pages to be mined automatically from massive web page requests, and thus improves the processing efficiency of web page request data.

In a possible implementation manner, based on the apparatus composition of fig. 5, the first obtaining module 502 includes:

a second determining unit, configured to determine, for any one of the multiple webpages, based on the at least one first webpage request, a first TGI that is triggered by the at least one first account and accesses the any webpage;

a third determining unit, configured to determine, based on the at least one second web page request, a second TGI that is triggered by the at least one second account and accesses the any web page;

and a fourth determining unit, configured to determine a TGI corresponding to the any web page based on the first TGI and the second TGI.

In one possible embodiment, the second determination unit is configured to:

determining at least one third webpage request for accessing any webpage from the at least one first webpage request;

determining a ratio between the number of the at least one third web page request and the number of the at least one first web page request as the first TGI.

In one possible embodiment, the third determination unit is configured to:

determining at least one fourth webpage request for accessing any webpage from the at least one second webpage request;

determining a ratio between the number of the at least one fourth web page request and the number of the at least one second web page request as the second TGI.

In one possible implementation, the fourth determining unit is configured to:

and determining the ratio of the first TGI to the second TGI as the TGI corresponding to the any webpage.

In a possible embodiment, based on the apparatus composition of fig. 5, the apparatus further comprises:

the training module is used for training to obtain an account classification model based on the first access frequency and the second access frequency, the account classification model is used for identifying account types of input accounts, and the account types at least comprise the first account and the second account;

the third obtaining module is configured to obtain, based on the account classification model, the feature importance parameters of the first webpages, where the feature importance parameters are used to indicate an importance degree of access frequency of an input account to any one first webpage compared with access frequency of the input account to other first webpages in a process of identifying account categories.

In one possible embodiment, the training module is configured to:

construct a positive sample frequency feature based on the first access frequency;

construct a negative sample frequency feature based on the second access frequency; and

iteratively train an initial classification model based on the positive sample frequency feature and the negative sample frequency feature to obtain the account classification model.
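The training and importance-extraction steps above can be illustrated with a small sketch. This section does not name the model type, so the example swaps in a plain logistic regression trained by batch gradient descent and uses the normalized absolute weight of each page as a stand-in for its feature importance parameter; both choices are assumptions. The page names, the frequency values, and the helper names `train_logistic` and `feature_importance` are hypothetical.

```python
import math


def train_logistic(samples, labels, epochs=500, lr=0.5):
    """Fit logistic regression by batch gradient descent; return (weights, bias)."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    m = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of label 1
            err = p - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


def feature_importance(weights):
    """Normalized absolute weights, used here as a proxy for importance."""
    total = sum(abs(w) for w in weights)
    if total == 0:
        return [0.0] * len(weights)
    return [abs(w) / total for w in weights]


# One row per account, one column per first web page (toy access frequencies).
pages = ["/loan", "/games", "/news"]
positive = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.9, 0.0, 0.1]]  # first accounts
negative = [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1], [0.0, 0.9, 0.1]]  # second accounts
samples = positive + negative
labels = [1] * len(positive) + [0] * len(negative)

weights, bias = train_logistic(samples, labels)
importance = dict(zip(pages, feature_importance(weights)))
```

In this toy data both account groups visit `/news` equally often, so its access frequency carries no signal and it receives the lowest importance, while `/loan` and `/games` separate the two groups. A tree-based model that exposes its own feature importances could be substituted for the logistic regression without changing the surrounding steps.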

All the above optional technical solutions can be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.

It should be noted that, when the data processing apparatus provided in the above embodiment processes data, the division into the above functional modules is only an example. In practical applications, the above functions can be assigned to different functional modules as needed; that is, the internal structure of the computer device is divided into different functional modules to complete all or part of the functions described above. In addition, the data processing apparatus and the data processing method provided in the above embodiments belong to the same concept; the specific implementation process of the apparatus is described in detail in the data processing method embodiments and is not described herein again.

Fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device 600 may vary considerably depending on its configuration or performance. The computer device 600 includes one or more processors (Central Processing Units, CPUs) 601 and one or more memories 602, where the memories 602 store at least one computer program, and the at least one computer program is loaded and executed by the one or more processors 601 to implement the data processing method provided in the above embodiments. Optionally, the computer device 600 further has components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and further includes other components for implementing device functions, which are not described herein again.

In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory including at least one computer program, where the at least one computer program is executable by a processor in a computer device to perform the data processing method in the above embodiments. For example, the computer-readable storage medium includes a ROM (Read-Only Memory), a RAM (Random-Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, and the like.

In an exemplary embodiment, a computer program product or computer program is also provided, comprising one or more program codes stored in a computer-readable storage medium. One or more processors of a computer device can read the one or more program codes from the computer-readable storage medium and execute them, so that the computer device performs the data processing method in the above embodiments.

Those skilled in the art will appreciate that all or part of the steps of the above embodiments can be implemented by hardware, or by a program instructing relevant hardware. Optionally, the program is stored in a computer-readable storage medium; optionally, the storage medium is a read-only memory, a magnetic disk, an optical disk, or the like.

The above description is only exemplary of the present application and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present application shall fall within the protection scope of the present application.
