Model training monitoring method, device, equipment and storage medium

Document No.: 1953955 Publication date: 2021-12-10

Reading note: This technology, "Model training monitoring method, device, equipment and storage medium", was designed by Dong Ping (董萍) on 2021-09-18. Main content: The invention relates to the field of artificial intelligence and discloses a model training monitoring method, device, equipment and storage medium. The method comprises: receiving a model training monitoring request, parsing the intent-triggering script text carried by the request, and generating a test case; performing word segmentation on the test case to obtain test case characters, and performing intent recognition on those characters to obtain a first intent; performing correlation analysis on the test case against a basic script intent set to identify a second intent of the test case; and comparing the first intent with the second intent and generating a test report from the comparison result. The invention automates model training monitoring and improves its efficiency and accuracy. In addition, the invention relates to the field of blockchain: the intent-triggering script text and the basic script intent set may be stored in a blockchain.

1. A model training monitoring method is characterized by comprising the following steps:

acquiring a training period of a machine learning model, analyzing the training period, determining the total training step number of the machine learning model, and determining a fixed training step number according to the total training step number of the machine learning model;

taking each time node at which the machine learning model completes the fixed number of training steps as a checkpoint of the model;

acquiring index data generated by the machine learning model at each check point;

according to a preset index monitoring strategy of each index, carrying out abnormity monitoring on each index in the index data, and judging whether each index in the index data is abnormal or not to obtain an abnormity monitoring result;

and generating a model training monitoring report according to the abnormal monitoring result.

2. The model training monitoring method according to claim 1, wherein the monitoring of the abnormality of each index in the index data according to a preset index monitoring strategy of each index, and determining whether each index in the index data is abnormal, and obtaining an abnormality monitoring result comprises:

according to a preset sample index monitoring strategy, carrying out abnormity monitoring on sample indexes in the index data, judging whether the sample indexes are abnormal or not, and obtaining an abnormity monitoring result;

or, according to a preset training duration index monitoring strategy, carrying out abnormity monitoring on a training duration index in the index data, and judging whether the training duration index is abnormal or not to obtain an abnormity monitoring result;

or according to a preset data index monitoring strategy, performing abnormity monitoring on data indexes in the index data, and judging whether the data indexes are abnormal to obtain an abnormity monitoring result, wherein the data indexes comprise deviation values, resource data quantity and data quantity of available data storage space.

3. The model training monitoring method according to claim 2, wherein the monitoring of the abnormality of the sample index in the index data according to a preset sample index monitoring strategy to determine whether the sample index is abnormal or not, and obtaining the abnormality monitoring result comprises:

extracting sample indexes and training samples in the index data, and performing equal-frequency binning processing on the training samples to obtain a plurality of bins;

calculating a model stability analysis value of the samples in each bin according to a preset sample index monitoring strategy and the sample indexes;

judging whether the model stability analysis value is smaller than a preset model stability threshold value or not;

if so, determining that the sample index is abnormal, and obtaining an abnormal monitoring result corresponding to the sample index.

4. The model training monitoring method according to claim 3, wherein the monitoring of the abnormality of the training duration index in the index data according to a preset training duration index monitoring strategy to determine whether the training duration index is abnormal or not, and obtaining the abnormality monitoring result comprises:

extracting a training duration index in the index data and the training duration of the machine learning model when each fixed number of training steps is completed;

judging whether the training time length is greater than a preset training time length threshold value or not according to a preset training time length index monitoring strategy;

if so, determining that the training duration index is abnormal, and obtaining an abnormal monitoring result corresponding to the training duration index.

5. The model training monitoring method according to any one of claims 2 to 4, wherein the monitoring of the abnormality of the data index in the index data according to a preset data index monitoring strategy to determine whether the data index is abnormal or not, and obtaining the result of the abnormality monitoring comprises:

extracting data indexes in the index data and loss values of a loss function of the machine learning model;

calculating a mean and a standard deviation of the loss values;

acquiring a current loss value of the machine learning model, and taking the difference between the current loss value of the machine learning model and the mean as a deviation value;

judging, according to a preset data index monitoring strategy, whether the ratio of the deviation value to the standard deviation exceeds a preset multiple;

if so, determining that the data index is abnormal, and obtaining an abnormal monitoring result corresponding to the data index.

6. The model training monitoring method according to any one of claims 2 to 4, wherein the monitoring of the abnormality of the data index in the index data according to a preset data index monitoring strategy to determine whether the data index is abnormal or not, and obtaining the result of the abnormality monitoring comprises:

extracting data indexes in the index data and resource data amount occupied by the machine learning model in the check point;

judging whether the resource data volume is larger than a preset occupied data volume threshold value or not according to a preset data index monitoring strategy;

if so, determining that the data index is abnormal, and obtaining an abnormal monitoring result corresponding to the data index.

7. The model training monitoring method according to any one of claims 2 to 4, wherein the monitoring of the abnormality of the data index in the index data according to a preset data index monitoring strategy to determine whether the data index is abnormal or not, and obtaining the result of the abnormality monitoring comprises:

extracting data indexes in the index data and the data amount of the data storage space available to the machine learning model at the checkpoint;

judging whether the data volume of the data storage space is smaller than a preset available data volume threshold value or not according to a preset data index monitoring strategy;

if so, determining that the data index is abnormal, and obtaining an abnormal monitoring result corresponding to the data index.

8. A model training monitoring device, the model training monitoring device comprising:

the analysis module is used for acquiring a training period of a machine learning model, analyzing the training period, determining the total training step number of the machine learning model, and determining a fixed training step number according to the total training step number of the machine learning model;

a checkpoint determining module, configured to use a time node at which the machine learning model completes each fixed training step as a checkpoint of the model;

the acquisition module is used for acquiring index data generated by the machine learning model at each check point;

the monitoring module is used for monitoring the abnormality of each index in the index data according to a preset index monitoring strategy of each index, judging whether each index in the index data is abnormal or not and obtaining an abnormal monitoring result;

and the report generating module is used for generating a model training monitoring report according to the abnormal monitoring result.

9. A model training monitoring device, comprising:

a memory having instructions stored therein and at least one processor, the memory and the at least one processor interconnected by a line;

the at least one processor invoking the instructions in the memory to cause the model training monitoring device to perform the steps of the model training monitoring method of any of claims 1-7.

10. A computer readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, implement the steps of the model training monitoring method according to any one of claims 1-7.

Technical Field

The invention relates to the field of artificial intelligence, in particular to a model training monitoring method, a model training monitoring device, model training monitoring equipment and a storage medium.

Background

A model is an object that, by means of a physical or virtual representation and through subjective understanding, expresses an objective description of a form or structure (the object here is not limited to physical or virtual forms, nor to planar or three-dimensional ones). The entity under study is simplified as necessary and its main features are described using appropriate forms or rules. The resulting imitation of the system is called a model. Models degrade: as the input data is updated, the quality of the results a model produces declines, so model performance needs to be monitored in time so that the model can be maintained and updated.

In the prior art, a machine learning model is trained on an open-source machine learning platform. A general training algorithm is built into the platform, so a machine learning model can be obtained simply by supplying training data, and the training process runs automatically inside the platform. However, this approach cannot continuously monitor the training process on the platform or obtain the training state in time. When a problem arises during training, the model therefore cannot be corrected promptly, model training monitoring is inefficient, and the trained model is inaccurate.

Disclosure of Invention

The invention mainly aims to solve the technical problem of low efficiency of model training monitoring in the prior art.

The invention provides a model training monitoring method in a first aspect, which comprises the following steps: acquiring a training period of a machine learning model, analyzing the training period, determining the total training step number of the machine learning model, and determining a fixed training step number according to the total training step number of the machine learning model; taking the time node of the machine learning model for completing each fixed training step number as a check point of the model; acquiring index data generated by the machine learning model at each check point; according to a preset index monitoring strategy of each index, carrying out abnormity monitoring on each index in the index data, and judging whether each index in the index data is abnormal or not to obtain an abnormity monitoring result; and generating a model training monitoring report according to the abnormal monitoring result.
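As a rough illustration of the five steps of the first aspect, the following Python sketch walks the checkpoints in order, applies a per-index monitoring strategy, and collects the findings into a report. The function name `monitor_training`, the index names, and the thresholds are hypothetical; the source does not prescribe a data layout or an API.

```python
from typing import Callable

# A strategy returns True when the observed value is considered abnormal.
Strategy = Callable[[float], bool]


def monitor_training(
    index_data_by_checkpoint: dict[int, dict[str, float]],
    strategies: dict[str, Strategy],
) -> list[dict]:
    """Apply each index's strategy at every checkpoint and list the anomalies."""
    report = []
    for step in sorted(index_data_by_checkpoint):
        for name, value in index_data_by_checkpoint[step].items():
            check = strategies.get(name)
            if check is not None and check(value):
                report.append({"checkpoint": step, "index": name, "value": value})
    return report


# Hypothetical index data collected at two checkpoints.
index_data = {
    100: {"training_duration_s": 60.0, "free_storage_gb": 50.0},
    200: {"training_duration_s": 150.0, "free_storage_gb": 8.0},
}
# Hypothetical preset strategies (thresholds are illustrative only).
strategies = {
    "training_duration_s": lambda v: v > 120.0,
    "free_storage_gb": lambda v: v < 10.0,
}

report = monitor_training(index_data, strategies)
for entry in report:
    print(entry)
```

Keeping the strategies in a dictionary keyed by index name is one way to let each index carry its own preset rule; other layouts would serve equally well.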

Optionally, in a first implementation manner of the first aspect of the present invention, the monitoring abnormality of each index in the index data according to a preset index monitoring policy of each index, and determining whether each index in the index data is abnormal, where obtaining an abnormality monitoring result includes: according to a preset sample index monitoring strategy, carrying out abnormity monitoring on sample indexes in the index data, judging whether the sample indexes are abnormal or not, and obtaining an abnormity monitoring result; or, according to a preset training duration index monitoring strategy, carrying out abnormity monitoring on a training duration index in the index data, and judging whether the training duration index is abnormal or not to obtain an abnormity monitoring result; or according to a preset data index monitoring strategy, performing abnormity monitoring on data indexes in the index data, and judging whether the data indexes are abnormal to obtain an abnormity monitoring result, wherein the data indexes comprise deviation values, resource data quantity and data quantity of available data storage space.

Optionally, in a second implementation manner of the first aspect of the present invention, performing abnormity monitoring on the sample indexes in the index data according to a preset sample index monitoring strategy, judging whether the sample indexes are abnormal, and obtaining an abnormity monitoring result includes: extracting sample indexes and training samples in the index data, and performing equal-frequency binning processing on the training samples to obtain a plurality of bins; calculating a model stability analysis value of the samples in each bin according to the preset sample index monitoring strategy and the sample indexes; judging whether the model stability analysis value is smaller than a preset model stability threshold value; if so, determining that the sample index is abnormal, and obtaining an abnormity monitoring result corresponding to the sample index.
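The equal-frequency binning step can be sketched as follows. This is a minimal illustration that covers only the binning: the passage does not give a formula for the model stability analysis value, so none is computed here. The function name `equal_frequency_bins` and the sample values are hypothetical, and the rank-based split is a simplification that may place tied values in adjacent bins.

```python
def equal_frequency_bins(values, n_bins: int) -> list[list]:
    """Split values into n_bins groups holding (nearly) the same number of items."""
    if n_bins <= 0:
        raise ValueError("n_bins must be positive")
    ordered = sorted(values)
    base, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional item each.
        size = base + (1 if i < extra else 0)
        bins.append(ordered[start:start + size])
        start += size
    return bins


# Hypothetical training-sample scores.
samples = [0.91, 0.15, 0.42, 0.77, 0.30, 0.58, 0.66, 0.23, 0.84, 0.49]
bins = equal_frequency_bins(samples, n_bins=5)
print([len(b) for b in bins])  # [2, 2, 2, 2, 2]
```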

Optionally, in a third implementation manner of the first aspect of the present invention, performing abnormity monitoring on the training duration index in the index data according to a preset training duration index monitoring strategy, judging whether the training duration index is abnormal, and obtaining an abnormity monitoring result includes: extracting a training duration index in the index data and the training duration of the machine learning model when each fixed number of training steps is completed; judging, according to the preset training duration index monitoring strategy, whether the training duration is greater than a preset training duration threshold value; if so, determining that the training duration index is abnormal, and obtaining an abnormity monitoring result corresponding to the training duration index.
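A minimal sketch of the duration check, assuming the training duration for each fixed block of steps is recorded in seconds and keyed by the checkpoint's step number. The function name `duration_anomalies`, the units, and the threshold are assumptions made for illustration.

```python
def duration_anomalies(
    durations_by_checkpoint: dict[int, float],
    threshold_seconds: float,
) -> list[int]:
    """Return the checkpoints whose training duration exceeds the threshold."""
    return [
        step
        for step, duration in sorted(durations_by_checkpoint.items())
        if duration > threshold_seconds
    ]


# Hypothetical durations (seconds) for the steps leading up to each checkpoint.
durations = {100: 58.2, 200: 61.0, 300: 145.7}
slow_checkpoints = duration_anomalies(durations, threshold_seconds=120.0)
print(slow_checkpoints)  # [300]
```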

Optionally, in a fourth implementation manner of the first aspect of the present invention, performing abnormity monitoring on the data indexes in the index data according to a preset data index monitoring strategy, judging whether the data indexes are abnormal, and obtaining an abnormity monitoring result includes: extracting data indexes in the index data and loss values of a loss function of the machine learning model; calculating the mean and the standard deviation of the loss values; acquiring a current loss value of the machine learning model, and taking the difference between the current loss value and the mean as a deviation value; judging, according to the preset data index monitoring strategy, whether the ratio of the deviation value to the standard deviation exceeds a preset multiple; if so, determining that the data index is abnormal, and obtaining an abnormity monitoring result corresponding to the data index.
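The loss-deviation check might look like the sketch below. Two choices here are assumptions rather than something the text specifies: the population standard deviation (`statistics.pstdev`) is used, and the deviation is compared by absolute value so that unusually low losses are flagged as well as unusually high ones. The function name `loss_is_abnormal` and the numbers are illustrative.

```python
import statistics


def loss_is_abnormal(history: list[float], current: float, max_multiple: float) -> bool:
    """Flag the current loss when it lies more than max_multiple std devs from the mean."""
    mean = statistics.mean(history)
    std = statistics.pstdev(history)  # assumption: population standard deviation
    deviation = current - mean
    if std == 0:
        # All past losses are identical, so any difference counts as a deviation.
        return deviation != 0
    return abs(deviation) / std > max_multiple  # assumption: absolute deviation


# Hypothetical loss values recorded at earlier checkpoints.
history = [0.9, 0.8, 0.7, 0.6, 0.5]
print(loss_is_abnormal(history, current=1.5, max_multiple=3.0))   # True
print(loss_is_abnormal(history, current=0.65, max_multiple=3.0))  # False
```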

Optionally, in a fifth implementation manner of the first aspect of the present invention, the monitoring abnormality of the data index in the index data according to a preset data index monitoring policy, and determining whether the data index is abnormal, where obtaining an abnormality monitoring result includes: extracting data indexes in the index data and resource data amount occupied by the machine learning model in the check point; judging whether the resource data volume is larger than a preset occupied data volume threshold value or not according to a preset data index monitoring strategy; if so, determining that the data index is abnormal, and obtaining an abnormal monitoring result corresponding to the data index.

Optionally, in a sixth implementation manner of the first aspect of the present invention, performing abnormity monitoring on the data indexes in the index data according to a preset data index monitoring strategy, judging whether the data indexes are abnormal, and obtaining an abnormity monitoring result includes: extracting data indexes in the index data and the data amount of the data storage space available to the machine learning model at the checkpoint; judging, according to the preset data index monitoring strategy, whether the data amount of the data storage space is smaller than a preset available data amount threshold value; if so, determining that the data index is abnormal, and obtaining an abnormity monitoring result corresponding to the data index.
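The two threshold checks on resource usage and available storage space can be sketched together. The `ResourceSnapshot` type, its field names, the byte units, and the thresholds are all hypothetical; the text only states that occupied resources above a threshold, or available storage below a threshold, count as abnormal.

```python
from dataclasses import dataclass


@dataclass
class ResourceSnapshot:
    step: int                 # checkpoint step number
    used_bytes: int           # resource data amount occupied by the model
    free_storage_bytes: int   # data storage space still available


def check_resources(
    snapshot: ResourceSnapshot,
    max_used_bytes: int,
    min_free_bytes: int,
) -> list[str]:
    """Return a description of each resource rule the snapshot violates."""
    findings = []
    if snapshot.used_bytes > max_used_bytes:
        findings.append("occupied resource amount above threshold")
    if snapshot.free_storage_bytes < min_free_bytes:
        findings.append("available storage space below threshold")
    return findings


GIB = 1024 ** 3
snapshot = ResourceSnapshot(step=200, used_bytes=14 * GIB, free_storage_bytes=2 * GIB)
findings = check_resources(snapshot, max_used_bytes=12 * GIB, min_free_bytes=5 * GIB)
print(findings)
```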

A second aspect of the present invention provides a model training monitoring apparatus, including: the analysis module is used for acquiring a training period of a machine learning model, analyzing the training period, determining the total training step number of the machine learning model, and determining a fixed training step number according to the total training step number of the machine learning model; a checkpoint determining module, configured to use a time node at which the machine learning model completes each fixed training step as a checkpoint of the model; the acquisition module is used for acquiring index data generated by the machine learning model at each check point; the monitoring module is used for monitoring the abnormality of each index in the index data according to a preset index monitoring strategy of each index, judging whether each index in the index data is abnormal or not and obtaining an abnormal monitoring result; and the report generating module is used for generating a model training monitoring report according to the abnormal monitoring result.

Optionally, in a first implementation manner of the second aspect of the present invention, the monitoring module includes: the sample monitoring unit is used for carrying out abnormity monitoring on the sample indexes in the index data according to a preset sample index monitoring strategy, judging whether the sample indexes are abnormal or not and obtaining an abnormity monitoring result; the time length monitoring unit is used for monitoring the training time length index in the index data in an abnormal mode according to a preset training time length index monitoring strategy, judging whether the training time length index is abnormal or not and obtaining an abnormal monitoring result; and the data monitoring unit is used for monitoring the data indexes in the index data in an abnormal mode according to a preset data index monitoring strategy, judging whether the data indexes are abnormal or not, and obtaining an abnormal monitoring result, wherein the data indexes comprise deviation values, resource data quantity and data quantity of available data storage space.

Optionally, in a second implementation manner of the second aspect of the present invention, the sample monitoring unit specifically includes: a binning subunit, used for extracting sample indexes and training samples in the index data, and performing equal-frequency binning processing on the training samples to obtain a plurality of bins; a calculating subunit, used for calculating a model stability analysis value of the samples in each bin according to a preset sample index monitoring strategy and the sample indexes; a first judgment subunit, used for judging whether the model stability analysis value is smaller than a preset model stability threshold value; and a first determining subunit, configured to determine that the sample index is abnormal if the model stability analysis value is smaller than the preset model stability threshold value, so as to obtain an abnormity monitoring result corresponding to the sample index.

Optionally, in a third implementation manner of the second aspect of the present invention, the duration monitoring unit specifically includes: an extraction subunit, used for extracting a training duration index in the index data and the training duration of the machine learning model when each fixed number of training steps is completed; a second judgment subunit, used for judging, according to a preset training duration index monitoring strategy, whether the training duration is greater than a preset training duration threshold value; and a second determining subunit, configured to determine that the training duration index is abnormal if the training duration is greater than the preset training duration threshold value, so as to obtain an abnormity monitoring result corresponding to the training duration index.

Optionally, in a fourth implementation manner of the second aspect of the present invention, the data monitoring unit is specifically configured to: extract data indexes in the index data and loss values of a loss function of the machine learning model; calculate the mean and the standard deviation of the loss values; acquire a current loss value of the machine learning model, and take the difference between the current loss value and the mean as a deviation value; judge, according to a preset data index monitoring strategy, whether the ratio of the deviation value to the standard deviation exceeds a preset multiple; and, if the ratio exceeds the preset multiple, determine that the data index is abnormal and obtain an abnormity monitoring result corresponding to the data index.

Optionally, in a fifth implementation manner of the second aspect of the present invention, the data monitoring unit is further specifically configured to: extracting data indexes in the index data and resource data amount occupied by the machine learning model in the check point; judging whether the resource data volume is larger than a preset occupied data volume threshold value or not according to a preset data index monitoring strategy; and if the resource data amount is larger than a preset occupied data amount threshold value, determining that the data index is abnormal, and obtaining an abnormal monitoring result corresponding to the data index.

Optionally, in a sixth implementation manner of the second aspect of the present invention, the data monitoring unit is further specifically configured to: extract data indexes in the index data and the data amount of the data storage space available to the machine learning model at the checkpoint; judge, according to a preset data index monitoring strategy, whether the data amount of the data storage space is smaller than a preset available data amount threshold value; and, if the data amount of the data storage space is smaller than the preset available data amount threshold value, determine that the data index is abnormal and obtain an abnormity monitoring result corresponding to the data index.

A third aspect of the present invention provides a model training monitoring apparatus, including: a memory having a computer program stored therein and at least one processor, the memory and the at least one processor interconnected by a line; the at least one processor invokes the computer program in the memory to cause the model training monitoring device to perform the steps of the model training monitoring method described above.

A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when run on a computer, causes the computer to perform the steps of the model training monitoring method described above.

In the technical scheme provided by the invention, the total training step number and the fixed training step number of the machine learning model are determined by analyzing the training period of the machine learning model; taking a time node of each fixed training step completed by the machine learning model as a check point of the model; acquiring index data generated by the machine learning model at each check point, performing abnormity monitoring on each index in the index data according to a preset index monitoring strategy of each index, and judging whether each index in the index data is abnormal to obtain an abnormity monitoring result; the method and the device realize model training monitoring of the machine learning model, and can intuitively judge whether each index is abnormal in the model training process according to index data generated by model training, so that the efficiency of model training monitoring is improved, and the accuracy and the reliability of the trained model in actual application are improved.

Drawings

FIG. 1 is a schematic diagram of a first embodiment of a model training monitoring method according to an embodiment of the present invention;

FIG. 2 is a diagram of a second embodiment of a model training monitoring method according to an embodiment of the present invention;

FIG. 3 is a diagram of a third embodiment of a model training monitoring method according to an embodiment of the present invention;

FIG. 4 is a diagram of a fourth embodiment of a model training monitoring method according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of an embodiment of a model training monitoring apparatus according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of another embodiment of a model training monitoring device according to an embodiment of the present invention;

FIG. 7 is a diagram of an embodiment of a model training monitoring device in an embodiment of the present invention.

Detailed Description

The embodiment of the invention provides a model training monitoring method, a model training monitoring device, model training monitoring equipment and a storage medium, wherein the total training step number and the fixed training step number of a machine learning model are determined by analyzing the training period of the machine learning model; taking a time node of each fixed training step completed by the machine learning model as a check point of the model; acquiring index data generated by the machine learning model at each check point, performing abnormity monitoring on each index in the index data according to a preset index monitoring strategy of each index, and judging whether each index in the index data is abnormal to obtain an abnormity monitoring result; the embodiment of the invention realizes the model training monitoring of the machine learning model, and can intuitively judge whether each index is abnormal in the model training process according to the index data generated by the model training, thereby improving the efficiency of the model training monitoring and improving the accuracy and reliability of the trained model in the actual application.

The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

For ease of understanding, a specific flow of an embodiment of the present invention is described below. Referring to FIG. 1, a first embodiment of the model training monitoring method according to an embodiment of the present invention includes:

101, acquiring a training period of a machine learning model, analyzing the training period, determining the total training step number of the machine learning model, and determining a fixed training step number according to the total training step number of the machine learning model;

102, taking each time node at which the machine learning model completes the fixed number of training steps as a checkpoint of the machine learning model;

The server obtains the training period set for the model training process of the machine learning model, analyzes the training period, and determines the total number of training steps of the machine learning model, where inputting training data into the machine learning model to complete one round of model training counts as one training step. A fixed number of training steps is then determined from the total number of training steps, and each time node at which the machine learning model completes one fixed number of training steps is taken as a checkpoint of the machine learning model. For example, if the total number of training steps is set to 2000, the fixed number of training steps is determined to be 100, and the node reached after every 100 training steps is taken as a preset checkpoint: the node at which the machine learning model completes the 100th training step is the first checkpoint, the node at which it completes the 200th step is the second checkpoint, the node at which it completes the 300th step is the third checkpoint, and so on. The training process of the machine learning model therefore contains a plurality of checkpoints.
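The checkpoint schedule in this example can be reproduced with a few lines of Python. This is a sketch only: the function name `checkpoint_steps` is hypothetical, and how the fixed number of steps is derived from the total is not specified in the text, so both values are passed in directly.

```python
def checkpoint_steps(total_steps: int, fixed_steps: int) -> list[int]:
    """Return the step numbers at which a checkpoint is taken."""
    if total_steps <= 0 or fixed_steps <= 0:
        raise ValueError("step counts must be positive")
    # One checkpoint each time another `fixed_steps` steps have been completed.
    return list(range(fixed_steps, total_steps + 1, fixed_steps))


checkpoints = checkpoint_steps(total_steps=2000, fixed_steps=100)
print(checkpoints[:3], len(checkpoints))  # [100, 200, 300] 20
```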

In this embodiment, the server creates a configuration file in the model training process, where the configuration file includes pre-wide table information, feature engineering information, and model information in the model training process, the pre-wide table information includes a database name, a list name, and a table name, and the feature engineering information includes a binning strategy, an index monitoring strategy, and the like; the model information includes a name of a model used, parameters of the model, names of features to which the model relates, and the like.
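One possible shape for such a configuration file is sketched below as a Python dictionary serialized to JSON. The three top-level sections follow the description above, but every key name, every value, and the choice of JSON are placeholders invented for illustration; the source does not specify a file format.

```python
import json

# All names and values below are hypothetical placeholders.
config = {
    "wide_table": {
        "database_name": "example_db",
        "list_name": "example_list",
        "table_name": "example_table",
    },
    "feature_engineering": {
        "binning_strategy": "equal_frequency",
        "index_monitoring_strategy": {"training_duration_max_seconds": 120.0},
    },
    "model": {
        "name": "example_model",
        "parameters": {"total_steps": 2000},
        "feature_names": ["feature_a", "feature_b"],
    },
}

REQUIRED_SECTIONS = ("wide_table", "feature_engineering", "model")


def missing_sections(cfg: dict) -> list[str]:
    """List the required top-level sections that are absent from cfg."""
    return [section for section in REQUIRED_SECTIONS if section not in cfg]


serialized = json.dumps(config, indent=2)
print(missing_sections(config))  # []
```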

The server creates a Hive table in which to store the result data of model training monitoring. During model training of the machine learning model, the results of model training monitoring are first placed in a Spark temporary table, and the contents of the Spark temporary table are then inserted into the Hive table. The Spark task is encapsulated in a preset shell script, and the information in the configuration file is uploaded to a preset Hadoop Distributed File System (HDFS), a distributed file system designed to run on commodity hardware. The model is then brought online according to a preset linkdo standard online procedure and the model application scenario (mainly the scheduling period) set during model training.

In addition, the embodiment of the invention can acquire and analyze the training period of the machine learning model based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best result. Artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly comprises computer vision, robotics, biometric recognition, voice processing, natural language processing, and machine learning/deep learning.

103, acquiring index data generated by the machine learning model at each check point;

The server develops a Spark task, reads the configuration file through the Spark task, and calls the model monitoring class. The Spark task reads the wide table data information, the feature engineering information (mainly binning and index monitoring result information), and the model file. It then reads the wide table and performs data analysis and feature engineering processing, such as data binning, encoding, and feature combination, on the data contained in the wide table to form index data corresponding to each index. Model training monitoring is then carried out on the machine learning model according to the index data. The server obtains the index data corresponding to each index generated by the machine learning model at each check point, wherein the index data comprises a sample index, a training duration index, and a data index.

104, monitoring the abnormality of each index in the index data according to a preset index monitoring strategy of each index, and judging whether each index in the index data is abnormal or not to obtain an abnormal monitoring result;

The server presets an index monitoring strategy for each index; that is, the index monitoring strategies comprise a sample index monitoring strategy, a training duration index monitoring strategy, and a data index monitoring strategy. Anomaly monitoring is carried out on each index according to its corresponding index monitoring strategy, and whether each index is abnormal is judged to obtain an abnormal monitoring result. The server takes the abnormal monitoring results corresponding to all of the indexes as the abnormal monitoring result of the machine learning model in the model training process. Alternatively, any one of the abnormal monitoring result of the sample index, the abnormal monitoring result of the training duration index, and the abnormal monitoring result of the data index can be used as the abnormal monitoring result of the machine learning model.

The server carries out anomaly monitoring on the sample index in the index data according to the corresponding sample index monitoring strategy, judges whether the sample index is abnormal, and obtains the abnormal monitoring result corresponding to the sample index. Alternatively, the server carries out anomaly monitoring on the training duration index in the index data according to the corresponding training duration index monitoring strategy, judges whether the training duration index is abnormal, and obtains the abnormal monitoring result corresponding to the training duration index according to the judgment result. Alternatively, the server carries out anomaly monitoring on the data index in the index data according to the corresponding data index monitoring strategy, judges whether the data index is abnormal, and generates the abnormal monitoring result corresponding to the data index according to the judgment result. In this embodiment, the abnormal monitoring results obtained for the sample index, the training duration index, and the data index can together be used as the abnormal monitoring result of the machine learning model in the model training process.
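The pairing of each index with its own monitoring strategy can be sketched as a lookup from index name to a judging function. The names `monitor_indexes`, `strategies` and `index_data`, and all threshold values, are hypothetical and chosen only for illustration.

```python
from typing import Any, Callable, Dict

# A strategy takes the value of one index and returns True when it is abnormal.
Strategy = Callable[[Any], bool]


def monitor_indexes(index_data: Dict[str, Any],
                    strategies: Dict[str, Strategy]) -> Dict[str, bool]:
    """Apply to each index the monitoring strategy registered under its name."""
    results = {}
    for name, value in index_data.items():
        strategy = strategies.get(name)
        if strategy is not None:
            results[name] = strategy(value)
    return results


# Hypothetical thresholds; the comparison directions follow the description.
strategies = {
    "sample": lambda psi_value: psi_value < 0.1,
    "training_duration": lambda seconds: seconds > 600.0,
    "data": lambda multiple: multiple > 3.0,
}
index_data = {"sample": 0.25, "training_duration": 720.0, "data": 1.2}

results = monitor_indexes(index_data, strategies)
print(results, any(results.values()))
```

The per-index results can be reported together, or any single one of them can be taken as the result for the model, as the text allows.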

105, generating a model training monitoring report according to the abnormal monitoring result.

The server monitors the index data of each check point, generates a model training monitoring report according to the abnormal monitoring result of the machine learning model, analyzes the abnormal monitoring result, and, when any index in the index data is found to be abnormal, sends prompt information indicating that model training is abnormal to the server. The server can subsequently analyze the model training monitoring report to determine which index became abnormal during model training, the reason for the abnormality, and the like, and can continuously adjust the model parameters and update the training state of the model according to this analysis until the machine learning model finishes training and a model that meets the expected result is obtained.
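A minimal sketch of how per-check-point results could be collected into a report is given below. The report layout, the function name `build_report` and the field names are assumptions made for illustration; the embodiment does not prescribe a report format.

```python
from typing import Dict


def build_report(results_by_checkpoint: Dict[int, Dict[str, bool]]) -> Dict[str, object]:
    """Summarize abnormal monitoring results gathered at each check point.

    `results_by_checkpoint` maps a check-point step number to a mapping of
    index name -> True when that index was judged abnormal.
    """
    abnormal = []
    for step in sorted(results_by_checkpoint):
        for index_name, is_abnormal in results_by_checkpoint[step].items():
            if is_abnormal:
                abnormal.append({"check_point": step, "index": index_name})
    return {
        "check_points": len(results_by_checkpoint),
        "abnormal": abnormal,
        "alert": bool(abnormal),  # prompt information is needed if anything is abnormal
    }


report = build_report({
    100: {"sample": False, "training_duration": False},
    200: {"sample": False, "training_duration": True},
})
print(report)
```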

In this embodiment, a scheduling system is adopted in the process of realizing model training monitoring, so that problems can be found in time, and the wide table data and the abnormal monitoring results do not need to be repeatedly copied to a plurality of environments. Bringing the Spark tasks online allows distributed monitoring, which improves the efficiency of model training monitoring and enhances the stability of model training. The whole model training monitoring process does not require multiple systems, so problems such as access permissions do not arise. A model training monitoring report is generated according to the abnormal monitoring result, and abnormal conditions of the model are analyzed and prompted in real time, so that the model can be optimized in time and its efficient and stable operation is ensured. Business personnel can check the index data corresponding to each monitored index to understand objectively how the model is running, which reduces the workload of modeling personnel in monitoring the model.

In the embodiment of the invention, the index data generated by the machine learning model at each check point is obtained, and each index in the index data is abnormally monitored according to the index monitoring strategy corresponding to each index, so that whether each index is abnormal is judged, and an abnormal monitoring result is obtained. The embodiment of the invention realizes the automation of model training monitoring, monitors the abnormality of each index according to the index monitoring strategy, can intuitively judge whether each index is abnormal in the model training process, and improves the efficiency of model training monitoring.

Referring to fig. 2, a second embodiment of the model training monitoring method according to the embodiment of the present invention includes:

201, acquiring a training period of a machine learning model, analyzing the training period, determining the total training step number of the machine learning model, and determining a fixed training step number according to the total training step number of the machine learning model;

202, taking the time node of the machine learning model for finishing each fixed training step number as a check point of the machine learning model;

203, acquiring index data generated by the machine learning model at each check point;

204, extracting sample indexes in the index data and training samples of the machine learning model, and performing equal-frequency binning processing on the training samples to obtain a plurality of bins;

The server obtains the sample indexes in the index data and the training samples input into the machine learning model, and performs equal-frequency binning processing on the training samples to obtain a plurality of bins, so that the sample amount of the training samples in each bin is substantially the same. Because the sample indexes of the training samples have many dimensions, three representative sample indexes are used in the embodiment of the invention: the bin value (i.e., the number of bins), the total sample size of each bin, and the ratio of the original sample size to the total sample size in each bin.

The training samples of the machine learning model are binned at equal frequency according to the sample size. For example, if the sample size of the machine learning model in one day is 40,000 and the samples are divided into 20 bins, each bin contains 2000 samples. The sample size in the first bin is 2000, of which the original sample size accounts for 60% of the total sample size; the sample size in the second bin is 2000, of which the original sample size accounts for 65% of the total sample size; and so on, until the three sample indexes have been calculated for the training samples in all 20 bins.
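The equal-frequency binning in this example can be sketched as follows. The function name `equal_frequency_bins` and the synthetic sample values are hypothetical; sorting the samples before splitting is one common way to obtain bins of equal size and is assumed here.

```python
from typing import List, Sequence


def equal_frequency_bins(samples: Sequence[float], n_bins: int) -> List[List[float]]:
    """Split the sorted samples into n_bins bins of (nearly) equal size."""
    if n_bins <= 0:
        raise ValueError("n_bins must be positive")
    ordered = sorted(samples)
    base, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional sample when the
        # sample count is not an exact multiple of n_bins.
        size = base + (1 if i < extra else 0)
        bins.append(ordered[start:start + size])
        start += size
    return bins


# 40,000 synthetic samples divided into 20 bins, as in the example above.
bins = equal_frequency_bins(list(range(40000)), 20)
print(len(bins), len(bins[0]))
```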

205, calculating a model stability analysis value of the training sample in each sub-box according to a preset sample index monitoring strategy and a sample index;

The server calculates the model stability analysis value (PSI value) of the training samples in each bin according to the preset sample index monitoring strategy and the sample indexes. A mapping calculation is performed on the three sample indexes, namely the bin value, the sample size of each bin, and the proportion of the original sample size in the total sample size of each bin, to obtain the PSI value of each bin; the PSI values of all bins are then summed to obtain the PSI index data of the machine learning model, which is the model stability analysis value of the machine learning model.
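The text does not spell out the mapping calculation. The sketch below assumes the widely used population stability index formula, in which each bin contributes (actual − expected) · ln(actual / expected) and the contributions are summed; the function name `psi`, the small floor `eps`, and the example proportions are assumptions.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float], eps: float = 1e-6) -> float:
    """Sum the per-bin PSI terms for two sets of bin proportions.

    `expected` and `actual` hold the proportion of samples falling in each
    bin for the reference data and the monitored data respectively.
    """
    if len(expected) != len(actual):
        raise ValueError("both inputs must describe the same bins")
    total = 0.0
    for e, a in zip(expected, actual):
        # Floor the proportions so that empty bins do not cause log(0).
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


print(psi([0.5, 0.5], [0.5, 0.5]))
print(round(psi([0.5, 0.5], [0.6, 0.4]), 6))
```

Identical distributions give a value of 0, and the value grows as the two distributions diverge.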

206, judging whether the analysis value of the model stability is smaller than a preset threshold value of the model stability;

207, if the model stability analysis value is smaller than a preset model stability threshold value, determining that the sample index is abnormal, and obtaining an abnormal monitoring result corresponding to the sample index;

The server compares the model stability analysis value (PSI value) of the machine learning model with a preset model stability threshold, that is, judges whether the PSI value of the machine learning model is smaller than the model stability threshold. When the PSI value is smaller than the model stability threshold, the sample index of the machine learning model is determined to be abnormal; when the PSI value is not smaller than the model stability threshold, the sample index of the machine learning model is determined to be normal. An abnormal monitoring result corresponding to the sample index is generated according to the judgment result.

208, generating a model training monitoring report according to the abnormal monitoring result corresponding to the sample index.

The server monitors the sample indexes of the check points, generates a model training monitoring report according to the abnormal monitoring result corresponding to the sample indexes of the machine learning model, analyzes that abnormal monitoring result, and, when the sample indexes are found to be abnormal, sends prompt information indicating that model training is abnormal to the server. The server can subsequently analyze the model training monitoring report to determine the reason for the index abnormality in the model training process and the like, and can continuously adjust the model parameters and update the training state of the model according to this analysis until the machine learning model finishes training and a model that meets the expected result is obtained.

In the embodiment of the present invention, the steps 201-203 are the same as the steps 101-103 in the first embodiment of the model training monitoring method, and are not described herein again.

In the embodiment of the invention, the sample indexes and the training samples in the index data are extracted, and the training samples are subjected to binning and model stability analysis value calculation, so that the sample indexes are subjected to abnormity monitoring according to the model stability analysis value and a model training monitoring report is generated. The embodiment of the invention realizes the abnormal monitoring of the sample indexes in the model training process, and the abnormal monitoring is carried out according to the model stability analysis value, thereby improving the accuracy of the abnormal monitoring.

Referring to fig. 3, a third embodiment of the model training monitoring method according to the embodiment of the present invention includes:

301, acquiring a training period of the machine learning model, analyzing the training period, determining the total training step number of the machine learning model, and determining a fixed training step number according to the total training step number of the machine learning model;

302, taking the time node of the machine learning model for completing each fixed training step number as a check point of the machine learning model;

303, acquiring index data generated by the machine learning model at each check point;

304, extracting training duration indexes in the index data and training duration when the machine learning model completes each fixed training step number;

305, judging whether the training duration is greater than a preset training duration threshold value according to a preset training duration index monitoring strategy;

306, if the training duration is judged to be greater than the preset training duration threshold, determining that the training duration index is abnormal, and obtaining an abnormal monitoring result corresponding to the training duration index;

During model training of the machine learning model, the server records the number of training iterations and the training duration each time the machine learning model completes a fixed number of training steps while executing the model training task, and these records form the training duration index of the machine learning model.

The server extracts from the training duration index the training duration taken by the machine learning model to complete each fixed number of training steps, and compares it with a preset training duration threshold according to the preset training duration index monitoring strategy, that is, judges whether the training duration for each fixed number of training steps is greater than the preset training duration threshold. When the training duration is greater than the training duration threshold, the training duration index of the machine learning model is determined to be abnormal; when the training duration is not greater than the training duration threshold, the training duration index is determined to be normal. An abnormal monitoring result corresponding to the training duration index is generated according to the judgment result.
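The duration comparison can be sketched as below. The function name `check_training_durations`, the use of seconds as the unit, and the example durations and threshold are hypothetical.

```python
from typing import Dict


def check_training_durations(durations_s: Dict[int, float],
                             threshold_s: float) -> Dict[int, bool]:
    """Map each check-point step to True when its duration is abnormal.

    A duration is abnormal only when it is strictly greater than the
    threshold; a duration equal to the threshold is treated as normal.
    """
    return {step: duration > threshold_s for step, duration in durations_s.items()}


# Hypothetical durations (seconds) for three consecutive check points.
durations = {100: 42.0, 200: 45.5, 300: 130.0}
flags = check_training_durations(durations, threshold_s=60.0)
print(flags)
```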

307, generating a model training monitoring report according to the abnormal monitoring result corresponding to the training duration index.

The server monitors the training time length indexes of the check points, generates a model training monitoring report according to the abnormal monitoring results corresponding to the training time length indexes of the machine learning model, analyzes the abnormal monitoring results corresponding to the training time length indexes, and sends abnormal prompt information of model training abnormity to the server when the training time length indexes are found to be abnormal. The subsequent server can analyze the model training monitoring report, determine the occurrence reason of the index abnormality in the model training process and the like, continuously adjust the model parameters and update the training state of the model according to the analysis of the model training monitoring report until the machine learning model finishes training to obtain the model which accords with the expected result.

In the embodiment of the present invention, the steps 301-303 are the same as the steps 101-103 in the first embodiment of the model training monitoring method, and are not described herein again.

According to the embodiment of the invention, the training duration index and the training duration in the index data are extracted, and the training duration index is abnormally monitored according to the training duration monitoring strategy and the training duration to generate the model training monitoring report. The embodiment of the invention realizes the abnormal monitoring of the training duration index in the model training process and improves the efficiency of model monitoring.

Referring to fig. 4, a fourth embodiment of the model training monitoring method according to the embodiment of the present invention includes:

401, acquiring a training period of a machine learning model, analyzing the training period, determining a total training step number of the machine learning model, and determining a fixed training step number according to the total training step number of the machine learning model;

402, taking the time node of the machine learning model for completing each fixed training step number as a check point of the machine learning model;

403, acquiring index data generated by the machine learning model at each check point;

404, extracting data indexes in the index data and loss values of a loss function of the machine learning model;

After the machine learning model starts to be trained, the server monitors in real time the training progress of the machine learning model and the loss value of the machine learning model corresponding to that training progress, wherein the training progress is the number of training steps or the training duration of the machine learning model. The function value of the loss function is the loss value, which measures the degree of inconsistency between the predicted value and the true value of the machine learning model; the smaller the loss value of the loss function, the better the robustness of the machine learning model. After each training step, the machine learning model calculates the loss value of its loss function corresponding to the number of training steps completed so far.

405, calculating the mean and standard deviation of the loss values;

and the server extracts loss values contained in the data indexes of the machine learning model in all the check points, takes the loss values as historical loss values, and carries out mean value and standard deviation operation on the historical loss values to obtain a mean value and a standard deviation corresponding to the loss values of the machine learning model.

406, obtaining a current loss value of the machine learning model, and taking a difference value between the current loss value and the mean value of the machine learning model as a deviation value;

The server acquires the current loss value of the machine learning model in order to judge whether the current loss value is abnormal, takes the difference between the current loss value and the mean value as a deviation value, and judges according to the deviation value whether the machine learning model is abnormal in the model training process.

407, judging whether the multiple of the deviation value and the standard deviation is greater than a preset multiple according to a preset data index monitoring strategy;

408, if the multiple of the deviation value and the standard deviation is larger than the preset multiple, determining that the data index is abnormal, and obtaining an abnormal monitoring result corresponding to the data index;

The server carries out anomaly monitoring on the data index according to the preset data index monitoring strategy, that is, judges whether the multiple of the deviation value relative to the standard deviation is greater than a preset multiple. The preset multiple may be two, three, or four, and is set according to the actual condition of the machine learning model in the model training process. For example, with the preset multiple set to three, when the deviation value is 3.5 times the standard deviation, which exceeds the preset multiple of 3, the data index of the machine learning model is determined to be abnormal, and an abnormal monitoring result corresponding to the data index is generated.
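Steps 405 to 408 can be sketched as follows. Using the absolute difference as the deviation value, using the population standard deviation, and the handling of a zero standard deviation are all assumptions of this sketch, as are the function name `loss_deviation_check` and the example loss values.

```python
import statistics
from typing import Sequence, Tuple


def loss_deviation_check(history: Sequence[float], current: float,
                         preset_multiple: float = 3.0) -> Tuple[float, bool]:
    """Return (multiple of the standard deviation, whether it is abnormal)."""
    mean = statistics.mean(history)
    std = statistics.pstdev(history)   # population standard deviation (assumed)
    deviation = abs(current - mean)    # absolute difference (assumed)
    if std == 0:
        # All historical losses are identical: any deviation is unbounded.
        return (0.0, False) if deviation == 0 else (float("inf"), True)
    multiple = deviation / std
    return multiple, multiple > preset_multiple


# Historical loss values chosen so that the mean is 3.0 and the standard deviation is 1.0.
history = [2.0, 4.0, 2.0, 4.0]
print(loss_deviation_check(history, 6.5))  # deviation is 3.5 standard deviations
print(loss_deviation_check(history, 5.0))  # deviation is 2.0 standard deviations
```

With a preset multiple of three, the first call is judged abnormal and the second is not, matching the 3.5-times example in the text.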

In addition, monitoring the data index also includes monitoring the resource data amount occupied by the machine learning model, that is, the server records the calculation resource condition consumed by executing the model training task in the model training process of the machine learning model, such as the resource group to which the consumed calculation resource belongs, the consumed memory data amount, the consumed CPU data amount, the consumed GPU data amount, and the like, to form the data index of the machine learning model. The server extracts the resource data volume occupied by the machine learning model at each check point in the data indexes, compares the resource data volume occupied by the machine learning model with a preset occupied data volume threshold according to a preset data index monitoring strategy, and judges whether the resource data volume occupied by the machine learning model is larger than the preset occupied data volume threshold. When the resource data volume occupied by the machine learning model is larger than a preset occupied data volume threshold value, determining that the data index of the machine learning model is abnormal; when the resource data volume occupied by the machine learning model is not greater than the occupied data volume threshold value, the data index of the machine learning model is normal; and generating an abnormal monitoring result corresponding to the data index according to the judgment result. In this embodiment, the preset occupied data amount threshold may be set according to an actual situation, and is not limited herein.

In addition, monitoring the data index also includes monitoring the data amount of the data storage space available for the machine learning model, that is, the server records the data storage space occupied by executing the model training task in the model training process of the machine learning model, such as the data storage space occupied by the training data, the data storage space occupied by the machine learning model obtained by training, the data storage space occupied by the result obtained by predicting by using the machine learning model, and the like, so as to form the data index of the machine learning model. The server extracts the data volume of the available data storage space of the machine learning model in each check point, the data volume of the available data storage space in the data index is monitored according to a preset data index monitoring strategy, and the data volume of the available data storage space is compared with a preset available data volume threshold value, namely whether the data volume of the available data storage space is smaller than the preset available data volume threshold value is judged; when the data quantity of the available data storage space is smaller than a preset available data quantity threshold value, determining that the data index of the machine learning model is abnormal; and when the data quantity of the available data storage space is not less than the threshold value of the available data quantity, determining that the data index of the machine learning model is normal, and generating an abnormal monitoring result corresponding to the data index according to the judgment result. In this embodiment, the preset threshold of the available data amount may be set according to actual situations, and is not limited herein.
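The two threshold rules described above, one for occupied resources and one for available storage, differ only in the direction of the comparison. A minimal sketch follows; the function names and example numbers are hypothetical, and `shutil.disk_usage` from the Python standard library is used merely as one possible way to read the free space of a local path.

```python
import shutil


def resource_usage_is_abnormal(used: float, occupied_threshold: float) -> bool:
    """Occupied resources are abnormal when they exceed the threshold."""
    return used > occupied_threshold


def available_storage_is_abnormal(free_bytes: int, available_threshold: int) -> bool:
    """Available storage is abnormal when it falls below the threshold."""
    return free_bytes < available_threshold


def free_space_bytes(path: str = ".") -> int:
    """Read the free space, in bytes, of the file system containing `path`."""
    return shutil.disk_usage(path).free


# Hypothetical values: 9 GB of memory used against an 8 GB limit, and the
# free space of the current directory checked against a 1 GiB minimum.
print(resource_usage_is_abnormal(9.0, 8.0))
print(available_storage_is_abnormal(free_space_bytes("."), 1024 ** 3))
```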

409, generating a model training monitoring report according to the abnormal monitoring result corresponding to the data index.

The server monitors the data indexes of the check points, generates a model training monitoring report according to the abnormal monitoring result corresponding to the data indexes of the machine learning model, analyzes that abnormal monitoring result, and, when the data indexes are found to be abnormal, sends prompt information indicating that model training is abnormal to the server. The server can subsequently analyze the model training monitoring report to determine the reason for the index abnormality in the model training process and the like, and can continuously adjust the model parameters and update the training state of the model according to this analysis until the machine learning model finishes training and a model that meets the expected result is obtained.

In the embodiment of the present invention, the steps 401-403 are the same as the steps 101-103 in the first embodiment of the model training monitoring method, and are not described herein again.

In the embodiment of the invention, the data indexes and the loss values in the index data are extracted, the deviation values of model training are calculated according to the loss values, and the data indexes are subjected to abnormity monitoring according to the deviation values to generate the model training monitoring report. According to the embodiment of the invention, the abnormal monitoring of the data indexes in the model training process is realized, and the abnormal monitoring of the model training process is carried out according to the calculated deviation value, so that the accuracy of the abnormal monitoring is improved.

With reference to fig. 5, the model training monitoring device in the embodiment of the present invention includes:

an analysis module 501, configured to obtain a training period of a machine learning model, analyze the training period, determine a total training step number of the machine learning model, and determine a fixed training step number according to the total training step number of the machine learning model;

a checkpoint determining module 502, configured to use a time node at which the machine learning model completes each fixed training step as a checkpoint of the model;

an obtaining module 503, configured to obtain index data generated by the machine learning model at each of the check points;

a monitoring module 504, configured to perform anomaly monitoring on each index in the index data according to a preset index monitoring policy for each index, and determine whether each index in the index data is abnormal, so as to obtain an anomaly monitoring result;

and a report generating module 505, configured to generate a model training monitoring report according to the abnormal monitoring result.

In the embodiment of the invention, the model training monitoring device is used for acquiring the index data generated by the machine learning model at each check point, carrying out abnormity monitoring on each index in the index data according to the index monitoring strategy corresponding to each index, and judging whether each index is abnormal or not so as to obtain an abnormity monitoring result. The embodiment of the invention realizes the automation of model training monitoring, monitors the abnormality of each index according to the index monitoring strategy, can intuitively judge whether each index is abnormal in the model training process, and improves the efficiency of model training monitoring.

Referring to fig. 6, another embodiment of the model training monitoring apparatus in the embodiment of the present invention includes:

a checkpoint determining module 502, configured to use a time node at which the machine learning model completes each fixed training step as a checkpoint of the model;

an obtaining module 503, configured to obtain index data generated by the machine learning model at each of the check points;

and a report generating module 505, configured to generate a model training monitoring report according to the abnormal monitoring result.

Wherein the monitoring module 504 comprises:

the sample monitoring unit 5041 is configured to perform anomaly monitoring on a sample index in the index data according to a preset sample index monitoring policy, determine whether the sample index is abnormal, and obtain an anomaly monitoring result;

a time duration monitoring unit 5042, configured to perform anomaly monitoring on the training time duration indicator in the indicator data according to a preset training time duration indicator monitoring strategy, determine whether the training time duration indicator is abnormal, and obtain an anomaly monitoring result;

and the data monitoring unit 5043 is configured to perform anomaly monitoring on data indexes in the index data according to a preset data index monitoring policy, determine whether the data indexes are abnormal, and obtain an anomaly monitoring result, where the data indexes include a deviation value, a resource data amount, and a data amount of an available data storage space.

Wherein the sample monitoring unit 5041 comprises:

a binning unit 50411, configured to extract sample indexes and training samples in the index data, and perform equal-frequency binning processing on the training samples to obtain multiple bins;

the calculating subunit 50412 is configured to calculate a model stability analysis value of each of the samples in the bins according to a preset sample index monitoring policy and according to the sample index;

a first judging subunit 50413, configured to judge whether the model stability analysis value is smaller than a preset model stability threshold;

a first determining subunit 50414, configured to determine that the sample index is abnormal if the model stability analysis value is smaller than a preset model stability threshold, so as to obtain an abnormal monitoring result corresponding to the sample index.

Wherein the duration monitoring unit 5042 includes:

an extracting subunit 50421, configured to extract a training duration indicator in the indicator data and a training duration when the machine learning model completes each fixed training step number;

a second judging subunit 50422, configured to judge, according to a preset training duration index monitoring policy, whether the training duration is greater than a preset training duration threshold;

a second determining subunit 50423, configured to determine that the training duration index is abnormal if the training duration is greater than a preset training duration threshold, so as to obtain an abnormal monitoring result corresponding to the training duration index.

The data monitoring unit 5043 is specifically configured to:

extracting data indexes in the index data and loss values of a loss function of the machine learning model;

calculating a mean and a standard deviation of the loss values;

acquiring a current loss value of the machine learning model, and taking a difference value between the current loss value of the machine learning model and the average value as a deviation value;

judging whether the multiple of the deviation value and the standard deviation exceeds a preset multiple or not according to a preset data index monitoring strategy;

and if the multiple of the deviation value and the standard deviation exceeds a preset multiple, determining that the data index is abnormal, and obtaining an abnormal monitoring result corresponding to the data index.

Wherein, the data monitoring unit 5043 is further specifically configured to:

extracting data indexes in the index data and the amount of resource data occupied by the machine learning model in the checkpoint;

judging whether the resource data volume is larger than a preset occupied data volume threshold value or not according to a preset data index monitoring strategy;

and if the resource data amount is larger than a preset occupied data amount threshold value, determining that the data index is abnormal, and obtaining an abnormal monitoring result corresponding to the data index.

Wherein, the data monitoring unit 5043 is further specifically configured to:

extracting data indexes in the index data and the data volume of the data storage space available to the machine learning model in the checkpoint;

judging whether the data volume of the data storage space is smaller than a preset available data volume threshold value or not according to a preset data index monitoring strategy;

and if the data quantity of the data storage space is smaller than a preset available data quantity threshold value, determining that the data index is abnormal, and obtaining an abnormal monitoring result corresponding to the data index.
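The two checkpoint checks described above (occupied resource data volume and available storage space) can be combined in a single routine, sketched here. The function name `check_checkpoint_resources`, the byte units, and the message strings are hypothetical and chosen only for illustration.

```python
from typing import List


def check_checkpoint_resources(
    used_bytes: int, free_bytes: int, max_used_bytes: int, min_free_bytes: int
) -> List[str]:
    """Return a list of abnormal findings for one checkpoint.

    The data index is treated as abnormal if the occupied data volume is
    greater than `max_used_bytes`, or if the available storage space is
    smaller than `min_free_bytes`. An empty list means no anomaly was found.
    """
    findings = []
    if used_bytes > max_used_bytes:
        findings.append("occupied data volume above threshold")
    if free_bytes < min_free_bytes:
        findings.append("available storage space below threshold")
    return findings


if __name__ == "__main__":
    gib = 1024 ** 3
    print(check_checkpoint_resources(
        used_bytes=12 * gib, free_bytes=1 * gib,
        max_used_bytes=10 * gib, min_free_bytes=2 * gib,
    ))
```

In practice the free space for a checkpoint directory could be read with the standard-library call `shutil.disk_usage(path).free`; how the embodiment obtains these values is not specified.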

In the embodiment of the invention, the model training monitoring device extracts each index in the index data and performs anomaly monitoring on each index to generate a model training monitoring report. The embodiment of the invention thereby realizes anomaly monitoring of each index during model training and improves the monitoring accuracy of the model training process.

Referring to Fig. 7, a model training monitoring device according to an embodiment of the present invention is described in detail below from the perspective of hardware processing.

Fig. 7 is a schematic structural diagram of a model training monitoring device 700 according to an embodiment of the present invention. The model training monitoring device 700 may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPUs) 710 (e.g., one or more processors), a memory 720, and one or more storage media 730 (e.g., one or more mass storage devices) storing applications 733 or data 732. The memory 720 and the storage medium 730 may be transient storage or persistent storage. The program stored on the storage medium 730 may include one or more modules (not shown), each of which may include a series of instruction operations for the model training monitoring device 700. Further, the processor 710 may be configured to communicate with the storage medium 730 and to execute, on the model training monitoring device 700, the series of instruction operations in the storage medium 730.

The model training monitoring device 700 may also include one or more power supplies 740, one or more wired or wireless network interfaces 750, one or more input-output interfaces 760, and/or one or more operating systems 731, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. Those skilled in the art will appreciate that the configuration illustrated in FIG. 7 does not limit the model training monitoring device, which may include more or fewer components than illustrated, combine some components, or arrange the components differently.

The server referred to in the present invention may be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms.

A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, each of which contains information on a batch of network transactions that is used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.

The present invention also provides a computer readable storage medium, which may be a non-volatile or a volatile computer readable storage medium, having stored therein instructions which, when run on a computer, cause the computer to perform the steps of the model training monitoring method.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
