Method and device for acquiring training data of solubility prediction model

Document No.: 1289197 Publication date: 2020-08-28

Reading note: This technology, "Method and device for acquiring training data of solubility prediction model", was designed by Meng Jintao (孟金涛) on 2020-07-08. Main content: The application discloses a method and an apparatus for acquiring training data of a solubility prediction model, a computer device and a storage medium, and belongs to the field of computer technologies. Duplicate data in each training data set is merged to determine the second solubility data corresponding to the training data set and the repetition degree of each datum. A model is trained with each training data set, and a second weight, which indicates the data quality of the training data set, is assigned to the training data set based on the model training result. A training data set to be repaired is then repaired based on the second solubility data of the training data sets with high data quality, yielding target training data that contains weight information. In this scheme, high-quality data is used for data repair, so erroneous data need not be corrected manually; the target training data includes weight information indicating accuracy, and low-accuracy data receives a small weight, which reduces the influence of low-accuracy target training data on model training.

1. A method for obtaining training data of a solubility prediction model, the method comprising:

obtaining first solubility data of at least two training data sets, one first solubility data comprising a solubility value of a molecular data;

respectively merging the repeated first solubility data in each training data set to obtain second solubility data corresponding to each training data set and a first weight of each second solubility data, wherein the first weight is used for indicating the repetition degree of the first solubility data corresponding to the second solubility data;

training a solubility prediction model based on the first solubility data of each training data set, and determining a second weight corresponding to each training data set based on a model prediction result of the solubility prediction model, wherein the second weight is used for indicating the data accuracy of each training data set;

for any training data set, determining at least one training data set from the at least two training data sets based on the second weight corresponding to each training data set, as at least one reference data set corresponding to the any training data set;

and performing data restoration on any training data set based on a second weight of a reference data set corresponding to any training data set, second solubility data corresponding to the reference data set and a first weight of each second solubility data to obtain target training data, wherein one target training data comprises a solubility value of one molecular data and a target weight of the solubility value, and the target weight is used for indicating the accuracy of the solubility data.

2. The method of claim 1, wherein the combining the first solubility data repeated in each training data set to obtain the second solubility data corresponding to each training data set and the first weight of each second solubility data comprises:

for each training data set, grouping the first solubility data corresponding to the same molecular data into a group to obtain at least two groups of solubility data;

for each group of solubility data, respectively combining the first solubility data comprising the same solubility value to obtain at least one second solubility data;

determining the first weight of the second solubility data based on a number of the first solubility data included with the second solubility data.

3. The method of claim 1, wherein the training a solubility prediction model based on the first solubility data of each training data set, and the determining a second weight corresponding to each training data set based on a model prediction result of the solubility prediction model comprises:

for each training data set, training the solubility prediction model based on a first target amount of the first solubility data in the training data set to obtain a trained solubility prediction model;

for each training data set, determining model prediction accuracy of the trained solubility prediction model based on a second target amount of the first solubility data in the training data set;

and determining a second weight corresponding to each training data set based on the model prediction accuracy corresponding to each training data set, wherein the second weight is positively correlated with the model prediction accuracy.

4. The method according to claim 1, wherein the determining, for any training data set, at least one training data set from the at least two training data sets as at least one reference data set corresponding to the any training data set based on the second weight corresponding to each training data set comprises:

comparing the second weight corresponding to each training data set with the second weight corresponding to any training data set;

and acquiring the training data set of which the corresponding second weight is greater than or equal to the second weight corresponding to any training data set as a reference data set corresponding to any training data set.

5. The method according to claim 1, wherein the performing data restoration on any training data set based on the second weight of the reference data set corresponding to the training data set, the second solubility data corresponding to the reference data set, and the first weight of each second solubility data to obtain target training data comprises:

generating a repair data set based on the second weight corresponding to the reference data set, the second solubility data corresponding to the reference data set, and the first weight of each of the second solubility data, the repair data set including the second solubility data corresponding to the reference data set and a third weight of each of the second solubility data, the third weight indicating an accuracy of the second solubility data;

and performing data restoration on any training data set based on the restoration data set to obtain target training data.

6. The method of claim 5, wherein generating a repair data set based on the second weight for the reference data set, the second solubility data for the reference data set, and the first weight for each of the second solubility data comprises:

multiplying a first weight of the second solubility data by a second weight corresponding to a reference data set to which the second solubility data belongs to obtain the third weight of the second solubility data;

and generating the repair data set based on the second solubility data corresponding to the at least one reference data set and the third weight of each second solubility data.

7. The method according to claim 5, wherein the performing data restoration on any training data set based on the restoration data set to obtain target training data comprises:

determining the molecular data corresponding to the second solubility data of the any training data set as molecular data to be repaired;

grouping the second solubility data corresponding to the repair data set based on the molecular data to be repaired to obtain a repair data group corresponding to each molecular data to be repaired;

and for each repair data group, performing data repair on the any training data set based on the second solubility data in the repair data group and the third weight of the second solubility data to obtain at least one target training data.

8. The method of claim 7, wherein for each repair data group, performing data repair on the any training data set based on the second solubility data in the repair data group and the third weight of the second solubility data to obtain at least one target training data comprises:

for the second solubility data in each repair data group, sorting the second solubility data according to the solubility values in the second solubility data;

sequentially acquiring the solubility difference of two adjacent second solubility data from the sorted second solubility data;

comparing the solubility difference with a first threshold;

determining the at least one target training data based on the comparison result, the two adjacent second solubility data and a third weight of each second solubility data.

9. The method of claim 8, wherein determining the at least one target training data based on the comparison result, the two adjacent second solubility data, and a third weight of each second solubility data comprises:

in response to the solubility difference being less than or equal to the first threshold, merging the solubility values of the two adjacent second solubility data into a solubility value of one target training data, determining a sum of third weights of the two adjacent second solubility data as a target weight of the one target training data;

and in response to the solubility difference value being larger than the first threshold value, determining the solubility values of the two adjacent second solubility data and the third weight of each second solubility data as target training data respectively.

10. The method according to claim 1, wherein after the data restoration is performed on any training data set based on the second weight of the reference data set corresponding to the training data set, the second solubility data corresponding to the reference data set, and the first weight of each second solubility data, and the target training data is obtained, the method further comprises:

and based on a second threshold value, performing regularization processing on the target weight of the target training data.

11. The method of claim 1, wherein before the combining the first solubility data repeated in each training data set separately to obtain the second solubility data corresponding to each training data set and the first weight of each second solubility data, the method further comprises:

screening the first solubility data based on at least one of a molecule normalization result of the molecule data corresponding to the first solubility data, a molecule composition, and data measurement environment information of the first solubility data;

and based on the screened first solubility data, executing the step of respectively merging the repeated first solubility data in each training data set to obtain second solubility data corresponding to each training data set and first weights of the second solubility data.

12. The method of claim 11, wherein the screening the first solubility data based on at least one of a molecular normalization result of molecular data corresponding to the first solubility data, a molecular composition, and data measurement environment information of the first solubility data comprises at least one of:

carrying out molecular structure standardization on the molecular data corresponding to the first solubility data, and removing the first solubility data corresponding to the molecular data which is not subjected to molecular structure standardization;

acquiring data measurement environment information of the first solubility data, and removing the first solubility data of which the data measurement environment information does not meet a target condition;

and removing, based on the molecular composition of the molecular data corresponding to the first solubility data, the first solubility data whose molecular data includes target particles in its molecular composition.

13. A training data acquisition apparatus for a solubility prediction model, the apparatus comprising:

a first acquisition module for acquiring first solubility data of at least two training data sets, one first solubility data comprising a solubility value of a molecular data;

a second obtaining module, configured to respectively combine the repeated first solubility data in each training data set to obtain second solubility data corresponding to each training data set and a first weight of each second solubility data, where the first weight is used to indicate a repetition degree of the first solubility data corresponding to the second solubility data;

a first determining module, configured to train a solubility prediction model based on the first solubility data of each training data set, and determine a second weight corresponding to each training data set based on a model prediction result of the solubility prediction model, where the second weight is used to indicate data accuracy of each training data set;

a second determining module, configured to determine, for any training data set, at least one training data set from the at least two training data sets based on a second weight corresponding to each training data set, as at least one reference data set corresponding to the any training data set;

a data restoration module, configured to perform data restoration on any training data set based on a second weight of a reference data set corresponding to any training data set, second solubility data corresponding to the reference data set, and a first weight of each second solubility data, to obtain target training data, where one target training data includes a solubility value of a molecular data and a target weight of the solubility value, and the target weight is used to indicate accuracy of the solubility data.

14. A computer device comprising one or more processors and one or more memories having stored therein at least one program code, the at least one program code loaded into and executed by the one or more processors to perform operations performed by a training data acquisition method for a solubility prediction model in accordance with any one of claims 1 to 12.

15. A computer-readable storage medium having at least one program code stored therein, the at least one program code being loaded into and executed by a processor to perform operations performed by a training data acquisition method of a solubility prediction model according to any one of claims 1 to 12.

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for obtaining training data of a solubility prediction model, a computer device, and a storage medium.

Background

Determining the solubility of molecules is an important step in drug development. To speed up drug development, data-driven artificial intelligence methods are usually applied; that is, a trained solubility prediction model is used to predict the solubility of molecules. The solubility prediction model can be trained on existing molecular solubility data sets such as OCHEM, AQSOL and ESOL, but these data sets contain relatively little training data and some labeling errors, so their data quality is low.

At present, training data is usually obtained by manually repairing each datum in an existing solubility data set and using the repaired solubility data set as the training data set for model training. For example, relevant documents are consulted manually, and each solubility datum in the data set is corrected based on the data described in those documents. However, this approach is time-consuming, labor-intensive and extremely inefficient, and it cannot ensure that every erroneous datum in the solubility data set is repaired; that is, it cannot ensure that every training datum in the obtained training data set is correct, so erroneous training data still affects the model training result. Therefore, how to obtain training data with higher accuracy and reduce the influence of inaccurate data on the model training result when training a solubility prediction model is an important research direction.

Disclosure of Invention

Embodiments of the present application provide a method and an apparatus for acquiring training data of a solubility prediction model, a computer device, and a storage medium, which can improve both the efficiency of acquiring training data and the accuracy of the training data. The technical solution is as follows:

in one aspect, a method for obtaining training data of a solubility prediction model is provided, and the method includes:

obtaining first solubility data of at least two training data sets, one first solubility data comprising a solubility value of a molecular data;

for any training data set, determining at least one training data set from the at least two training data sets based on the second weight corresponding to each training data set, and using the at least one training data set as at least one reference data set corresponding to the any training data set;

and performing data restoration on any training data set based on the second weight of the reference data set corresponding to the training data set, the second solubility data corresponding to the reference data set and the first weight of each second solubility data to obtain target training data, wherein one target training data comprises a solubility value of molecular data and a target weight of the solubility value, and the target weight is used for indicating the accuracy of the solubility data.

In one possible implementation, the regularizing the target weights of the target training data based on the second threshold includes:

comparing the target weight of the target training data to the second threshold;

in response to the target weight being greater than the second threshold, setting the value of the target weight to the second threshold; in response to the target weight being less than or equal to the second threshold, not modifying the target weight;

and dividing the target weight by the second threshold value to obtain the regularized target weight.
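The regularization steps above (cap the target weight at the second threshold, then divide by that threshold) can be sketched as follows. This is a minimal illustration, not the applicant's implementation; the function name `regularize_weight` is hypothetical, and the sketch assumes the second threshold is strictly positive.

```python
def regularize_weight(target_weight: float, second_threshold: float) -> float:
    """Cap a target weight at the second threshold, then scale it to at most 1."""
    if second_threshold <= 0:
        # Assumption: the text does not discuss non-positive thresholds.
        raise ValueError("second_threshold must be positive")
    # Weights above the threshold are set to the threshold; others are kept.
    capped = min(target_weight, second_threshold)
    # Dividing by the threshold maps the capped weight to a value no larger than 1.
    return capped / second_threshold
```

With a second threshold of 2.0, a target weight of 4.0 becomes 1.0 and a target weight of 1.0 becomes 0.5, so very frequently repeated values cannot dominate training.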

In one aspect, a training data obtaining apparatus for a solubility prediction model is provided, the apparatus including:

a first acquisition module for acquiring first solubility data of at least two training data sets, one first solubility data comprising a solubility value of a molecular data;

and the data restoration module is used for performing data restoration on any training data set based on the second weight of the reference data set corresponding to any training data set, the second solubility data corresponding to the reference data set and the first weight of each second solubility data to obtain target training data, wherein one target training data comprises a solubility value of one molecular data and a target weight of the solubility value, and the target weight is used for indicating the accuracy of the solubility data.

In one possible implementation manner, the second obtaining module is configured to:

for each training data set, dividing the first solubility data corresponding to the same molecular data into a group to obtain at least two groups of solubility data;

for each group of solubility data, respectively merging the first solubility data comprising the same solubility value to obtain at least one second solubility data;

determining the first weight of the second solubility data based on a number of the first solubility data included in the second solubility data.
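The grouping and merging steps above can be sketched as follows. This is a minimal illustration under stated assumptions: each first solubility datum is represented as a `(molecule, solubility value)` tuple, the helper name `merge_duplicates` is hypothetical, and the first weight is taken to be the raw repetition count (the text only says the weight is based on that number).

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def merge_duplicates(
    first_data: Iterable[Tuple[str, float]],
) -> Dict[str, List[Tuple[float, int]]]:
    """Group records by molecule and merge records with identical solubility values.

    Returns, per molecule, a list of (solubility value, first weight) pairs.
    """
    # Count how often each (molecule, value) pair occurs in the data set.
    counts = Counter(first_data)
    merged: Dict[str, List[Tuple[float, int]]] = {}
    for (molecule, value), repetitions in counts.items():
        # Assumption: the first weight equals the repetition count.
        merged.setdefault(molecule, []).append((value, repetitions))
    return merged
```

For example, the made-up records `[("mol_A", -0.5), ("mol_A", -0.5), ("mol_A", -1.0)]` merge into two second solubility data for `mol_A`, with first weights 2 and 1.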

In one possible implementation, the first determining module is configured to:

for each training data set, training the solubility prediction model based on a first target amount of the first solubility data in the training data set to obtain a trained solubility prediction model;

for each training data set, determining model prediction accuracy of the trained solubility prediction model based on a second target amount of the first solubility data in the training data set;
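A rough sketch of this per-data-set train-and-evaluate step is given below. Everything beyond the split itself is an assumption: the model is passed in through two hypothetical callables (`fit` and `score`), the 80/20 split is an arbitrary choice for the first and second target amounts, and the accuracy is used directly as the second weight (clamped at zero), which is only one of many mappings that are positively correlated with accuracy.

```python
import random
from typing import Callable, Sequence, Tuple

Record = Tuple[str, float]  # (molecule identifier, solubility value)


def second_weight(
    records: Sequence[Record],
    fit: Callable[[Sequence[Record]], object],
    score: Callable[[object, Sequence[Record]], float],
    train_fraction: float = 0.8,
    seed: int = 0,
) -> float:
    """Train on one part of a data set, score on the rest, and return a weight."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    # First target amount: records used for training.
    train_part = shuffled[:cut]
    # Second target amount: held-out records used to measure accuracy.
    eval_part = shuffled[cut:]
    model = fit(train_part)
    accuracy = score(model, eval_part)
    # Assumption: weight equals accuracy, clamped so it is never negative.
    return max(0.0, accuracy)
```

Running this once per training data set gives one second weight per data set; a data set on which the model predicts held-out values poorly receives a small weight.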

In one possible implementation manner, the second determining module is configured to:

comparing the second weight corresponding to each training data set with the second weight corresponding to any training data set;
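The comparison can be sketched as below, using the selection rule stated in claim 4: a data set serves as a reference when its second weight is greater than or equal to that of the data set being repaired. The function name `reference_sets` and the dictionary layout are hypothetical.

```python
from typing import Dict, List


def reference_sets(second_weights: Dict[str, float], target: str) -> List[str]:
    """Return the names of data sets whose second weight is >= the target's."""
    own_weight = second_weights[target]
    # The target itself always qualifies because the comparison is >=.
    return [name for name, weight in second_weights.items() if weight >= own_weight]
```

Under this rule the highest-weighted data set is repaired only from itself, while the lowest-weighted data set draws on every data set.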

In one possible implementation, the data repair module includes:

a generation submodule, configured to generate a repair dataset based on the second weight corresponding to the reference dataset, the second solubility data corresponding to the reference dataset, and the first weight of each of the second solubility data, where the repair dataset includes the second solubility data corresponding to the reference dataset and a third weight of each of the second solubility data, and the third weight is used to indicate accuracy of the second solubility data;

and the repair submodule is used for performing data repair on any training data set based on the repair data set to obtain target training data.

In one possible implementation, the generating submodule is configured to:

multiplying the first weight of the second solubility data by a second weight corresponding to a reference data set to which the second solubility data belongs to obtain the third weight of the second solubility data;

the repair data set is generated based on the second solubility data corresponding to the at least one reference data set and the third weight of each second solubility data.
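A minimal sketch of building the repair data set follows. It assumes the merged data of each reference data set is stored as a mapping from molecule to `(solubility value, first weight)` pairs; the name `build_repair_set` and this layout are illustrative choices, not part of the application.

```python
from typing import Dict, List, Tuple

Merged = Dict[str, List[Tuple[float, float]]]  # molecule -> [(value, first weight)]


def build_repair_set(
    reference_data: Dict[str, Merged],
    second_weights: Dict[str, float],
) -> List[Tuple[str, float, float]]:
    """Pool the second solubility data of all reference sets with third weights."""
    repair_set: List[Tuple[str, float, float]] = []
    for set_name, merged in reference_data.items():
        for molecule, entries in merged.items():
            for value, first_weight in entries:
                # Third weight = first weight x second weight of the owning data set.
                third_weight = first_weight * second_weights[set_name]
                repair_set.append((molecule, value, third_weight))
    return repair_set
```

A value that is repeated often inside a highly weighted data set therefore ends up with a large third weight, and a value seen once in a weakly weighted data set ends up with a small one.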

In one possible implementation, the repair submodule includes:

a data determining unit, configured to determine the molecular data corresponding to the second solubility data of the any training data set as the molecular data to be repaired;

a data group obtaining unit, configured to group the second solubility data corresponding to the repair data set based on the molecular data to be repaired, to obtain a repair data group corresponding to each molecular data to be repaired;

and a data repair unit, configured to, for each repair data group, perform data repair on the any training data set based on the second solubility data in the repair data group and the third weight of the second solubility data, to obtain at least one target training data.

In one possible implementation, the data repair unit includes:

a sorting subunit, configured to, for the second solubility data in each repair data group, sort the second solubility data according to the solubility values in the second solubility data;

a difference obtaining subunit, configured to sequentially obtain, from the sorted second solubility data, a solubility difference between two adjacent second solubility data;

a comparison subunit, configured to compare the solubility difference with a first threshold;

a data determination subunit, configured to determine the at least one target training data based on the comparison result, the two adjacent second solubility data, and a third weight of each second solubility data.

In one possible implementation, the data determination subunit is configured to:

in response to the solubility difference being less than or equal to the first threshold, merging the solubility values of the two adjacent second solubility data into a solubility value of one target training data, determining the sum of the third weights of the two adjacent second solubility data as the target weight of the one target training data;

and in response to the solubility difference being larger than the first threshold, determining the solubility values of the two adjacent second solubility data and the third weight of each second solubility data as target training data respectively.
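The sort, compare and merge procedure for one molecule can be sketched as follows. Two points are assumptions because this passage does not spell them out: the merged solubility value is taken to be the weighted average of the merged values, and a chain of close values is merged pairwise in sorted order. The name `cluster_repair` is hypothetical.

```python
from typing import List, Sequence, Tuple


def cluster_repair(
    entries: Sequence[Tuple[float, float]],
    first_threshold: float,
) -> List[Tuple[float, float]]:
    """Merge adjacent (solubility value, third weight) pairs that are close.

    Returns (solubility value, target weight) pairs for one molecule.
    """
    targets: List[Tuple[float, float]] = []
    previous_value = 0.0
    # Sort by solubility value so that close values become neighbours.
    for value, weight in sorted(entries):
        if targets and value - previous_value <= first_threshold:
            merged_value, merged_weight = targets[-1]
            total = merged_weight + weight
            # Assumption: merged value is the weighted average of the two values.
            if total > 0:
                new_value = (merged_value * merged_weight + value * weight) / total
            else:
                new_value = (merged_value + value) / 2
            # The target weight is the sum of the merged third weights.
            targets[-1] = (new_value, total)
        else:
            # Difference above the threshold: keep the value as its own target.
            targets.append((value, weight))
        previous_value = value
    return targets
```

Values that several reference data sets agree on accumulate weight, while an isolated outlier keeps only its own third weight and so contributes less to training.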

In one possible implementation, the apparatus further includes:

and the regularization module is used for regularizing the target weight of the target training data based on a second threshold value.

In one possible implementation, the regularization module is to:

comparing the target weight of the target training data to the second threshold;

and dividing the target weight by the second threshold value to obtain the regularized target weight.

In one possible implementation, the apparatus further includes:

a screening module, configured to screen the first solubility data based on at least one of a molecule normalization result of the molecular data corresponding to the first solubility data, a molecular composition of the molecular data, and data measurement environment information of the first solubility data; and, based on the screened first solubility data, to execute the step of respectively merging the repeated first solubility data in each training data set to obtain second solubility data corresponding to each training data set and a first weight of each second solubility data.

In one possible implementation, the screening module is configured to perform at least one of:

and removing, based on the molecular composition of the molecular data corresponding to the first solubility data, the first solubility data whose molecular data includes target particles in its molecular composition.

In one aspect, a computer device is provided that includes one or more processors and one or more memories having at least one program code stored therein, the at least one program code loaded into and executed by the one or more processors to implement the operations performed by the training data acquisition method for the solubility prediction model.

In one aspect, a computer-readable storage medium having at least one program code stored therein is provided, the at least one program code being loaded into and executed by a processor to perform operations performed by a training data acquisition method of the solubility prediction model.

In one aspect, a computer program product is provided that includes at least one program code stored in a computer readable storage medium. The at least one program code is read from the computer-readable storage medium by a processor of the computer device, and the at least one program code is executed by the processor to cause the computer device to implement the operations performed by the training data acquisition method of the solubility prediction model.

In the technical solution provided by the embodiments of the present application, duplicate data in each training data set is merged to determine the second solubility data corresponding to each training data set and the repetition degree of each datum. A model is trained with each training data set, and a second weight, which indicates the data quality of the training data set, is assigned to the training data set based on the model training result. The training data set to be repaired is then repaired based on the second solubility data of the training data sets with high data quality, to obtain target training data containing weight information. In this solution, data repair uses high-quality data, so erroneous data does not need to be corrected manually; the target training data includes weight information indicating the accuracy of the data, and low-accuracy data receives a small weight, so the influence of low-accuracy target training data on model training can be reduced.

Drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of an implementation environment of a method for acquiring training data of a solubility prediction model according to an embodiment of the present application;

Fig. 2 is a flowchart of a training data obtaining method of a solubility prediction model according to an embodiment of the present application;

Fig. 3 is a specific flowchart of a method for acquiring training data of a solubility prediction model according to an embodiment of the present disclosure;

Fig. 4 is a schematic diagram of a correspondence relationship between a training data set to be repaired and a reference data set according to an embodiment of the present application;

Fig. 5 is a flowchart of a cluster repair algorithm provided in an embodiment of the present application;

Fig. 6 is a flowchart of data repair provided by an embodiment of the present application;

Fig. 7 is a diagram illustrating the results of a model training provided by an embodiment of the present application;

Fig. 8 is a schematic structural diagram of a training data obtaining apparatus of a solubility prediction model according to an embodiment of the present application;

Fig. 9 is a schematic structural diagram of a terminal according to an embodiment of the present application;

Fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application.

Detailed Description

To make the objects, technical solutions and advantages of the present application clearer, embodiments of the present application are described in further detail below with reference to the accompanying drawings.

Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning, and the like. Artificial intelligence often applies deep neural network models, which are trained on a large amount of training data so that they learn the characteristics of the training data and then perform inference, prediction and the like based on the learned characteristics; the performance of a deep neural network model is therefore closely related to the quality of its training data. In the embodiments of the present application, the training data is repaired to obtain training data with higher accuracy, so that a deep neural network model with better performance can be obtained when model training is performed on that training data.

Fig. 1 is a schematic diagram of an implementation environment of a method for obtaining training data of a solubility prediction model according to an embodiment of the present disclosure. The implementation environment includes: a terminal 110 and a solubility prediction platform 140.

The terminal 110 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. An application program supporting solubility prediction is installed and runs on the terminal 110. The application program may be a drug development application program or the like. The terminal 110 may generally refer to one of a plurality of terminals, and this embodiment is illustrated with the terminal 110 only.

The solubility prediction platform 140 provides background services for applications that support solubility prediction. Optionally, the solubility prediction platform 140 undertakes the primary solubility prediction work and the terminal 110 undertakes the secondary solubility prediction work; or the solubility prediction platform 140 undertakes the secondary solubility prediction work and the terminal 110 undertakes the primary solubility prediction work; or the solubility prediction platform 140 or the terminal 110 undertakes the solubility prediction work alone. Optionally, the solubility prediction platform 140 includes an access server, a solubility prediction server, and a database. The access server provides access services for the terminal 110. The solubility prediction server provides background services related to solubility prediction. There may be one or more solubility prediction servers. When there are multiple solubility prediction servers, at least two of them provide different services, and/or at least two of them provide the same service, for example in a load-balancing manner, which is not limited in the embodiments of the present application. A solubility prediction model can be deployed on the solubility prediction server, which supports the training and application of the model.
The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform, and the like.

The terminal 110 and the solubility prediction platform 140 may be directly or indirectly connected through wired or wireless communication, which is not limited in this embodiment of the present invention.

Those skilled in the art will appreciate that the number of terminals described above may be greater or fewer. For example, there may be only one terminal, or tens or hundreds of terminals, or more, in which case the implementation environment further includes other terminals. The number of terminals and the type of the device are not limited in the embodiments of the present application.

The technical scheme provided by the embodiments of the application can be combined with various application scenarios. In the embodiments of the application, the training data of the solubility prediction model is obtained by applying the training data obtaining method provided by the technical scheme. The process requires no manual data repair: clustering repair is carried out on the repeated data in each training data set, so that a large amount of training data with high accuracy can be obtained. Each training data corresponds to a target weight, and the higher the accuracy of the training data, the larger the corresponding target weight. When the training data is applied, the smaller weights reduce the influence of training data with low accuracy on the model training result.

Fig. 2 is a flowchart of a training data obtaining method of a solubility prediction model according to an embodiment of the present disclosure. In the embodiment of the present application, the training data obtaining method is described by taking the computer device as an execution subject, and with reference to fig. 2, the method may specifically include the following steps:

201. The computer device obtains first solubility data for at least two training data sets, one first solubility data comprising a solubility value for one molecular data.

The training data set may be a data set containing solubility values that is stored in the computer device, a data set acquired by the computer device from a network, or a data set constructed by the computer device from a plurality of solubility values.

In one possible implementation, the computer device, upon receiving a training data acquisition instruction, acquires first solubility data of a plurality of training data sets in response to the training data acquisition instruction. The embodiment of the present application does not limit the triggering manner of the training data acquisition instruction and the specific acquisition manner of the training data set.

202. The computer device respectively merges the repeated first solubility data in each training data set to obtain second solubility data corresponding to each training data set and a first weight of each second solubility data, wherein the first weight is used for indicating the repetition degree of the first solubility data corresponding to the second solubility data.

In the embodiment of the present application, a plurality of first solubility data in one training data set may correspond to the same molecular data, and the solubility values recorded in the first solubility data may be the same or different. In one possible implementation, the computer device may combine the repeated first solubility data based on the solubility values of the molecular data in the first solubility data to obtain second solubility data, and assign a first weight to the second solubility data based on the repetition of the first solubility data corresponding to each second solubility data.

In the embodiment of the application, duplicate data merging is performed for each training data set to obtain second solubility data and a first weight that indicates the data repetition degree. The subsequent training data acquisition steps are then executed based on the second solubility data and the first weight, which reduces redundant data in the subsequent data processing, reduces the data processing amount, and improves the data processing efficiency.

203. The computer device trains the solubility prediction model based on the first solubility data of each training data set, and determines a second weight corresponding to each training data set based on a model prediction result of the solubility prediction model, wherein the second weight is used for indicating the data accuracy of each training data set.

The solubility prediction model may be a model constructed based on a deep neural network, and the specific structure of the solubility prediction model is not limited in the embodiment of the present application. For example, the solubility prediction model may be a Chemprop model (a message-passing neural network for molecular property prediction).

Taking the determination of the second weight corresponding to one training data set as an example, in one possible implementation manner, the computer device performs model training based on the first solubility data in the training data set to obtain a trained solubility prediction model, and inputs first solubility data of the training data set into the trained solubility prediction model to obtain a model prediction result. The computer device may determine the second weight of each training data set based on the model prediction result for that training data set. For example, the model prediction result may include a model prediction accuracy, and the computer device may determine the second weight based on the model prediction accuracy: if the model prediction accuracy corresponding to a training data set is higher, the computer device may determine that the quality of the first solubility data in that training data set is higher and assign it a larger second weight.

The above description of the second weight determination method is only an exemplary description, and the embodiment of the present application does not limit the description. In the embodiment of the application, based on the model training result, a larger second weight is assigned to the training data set with higher data accuracy, and a smaller second weight is assigned to the training data set with lower data accuracy, so that the influence of inaccurate data on the subsequent data repairing process can be reduced.

204. For any training data set, the computer device determines, based on the second weight corresponding to each training data set, at least one training data set from the at least two training data sets as at least one reference data set corresponding to that training data set.

The reference data set is used for data repair of that training data set.

In a possible implementation manner, the computer device may select training data sets with larger second weights as the reference data sets of the training data set, so as to ensure a good repair effect in the subsequent data repair process. Of course, the computer device may also determine the reference data set based on other conditions, which is not limited in the embodiments of the present application.

205. The computer device performs data repair on the training data set based on the second weight corresponding to its reference data set, the second solubility data corresponding to the reference data set, and the first weight of each second solubility data, to obtain target training data, wherein one target training data comprises a solubility value of one molecular data and a target weight of the solubility value, and the target weight is used for indicating the accuracy of the solubility value.

In a possible implementation manner, the computer device may use a cluster repair algorithm that applies the second weight corresponding to the reference data set, the second solubility data corresponding to the reference data set, and the first weight of each second solubility data to repair the training data set, that is, to repair the second solubility data corresponding to the training data set, so as to obtain target training data with higher data accuracy. The target training data includes a target weight indicating the accuracy of a solubility value, and when the target training data is applied to model training, the target weight reduces the influence of less accurate data on the model training result. It should be noted that the embodiment of the present application does not limit the specific method for data repair.

The foregoing embodiment is a brief introduction to an implementation manner of the present application, and fig. 3 is a specific flowchart of a training data acquisition method of a solubility prediction model provided in the embodiment of the present application, and with reference to fig. 3, a computer device is taken as an execution subject to describe the training data acquisition process:

301. The computer device obtains first solubility data for at least two training data sets.

In this embodiment, the first solubility data is further labeled with a molecular identifier of the molecular data, data measurement environment information of the solubility value, and the like. The molecular identifier uniquely indicates a molecule and may be a chemical formula of the molecular data, a name of the molecular data, and the like; the data measurement environment information may include temperature, pH value, and the like. Of course, other information may be labeled in the first solubility data, which is not limited in the embodiments of the present application.

In the embodiment of the present application, the number and specific types of training data sets acquired by the computer device are not limited. In the embodiments of the present application, 6 training data sets, AQUA, PHYS, ESOL, OCHEM, AQSOL, and CHEMBL, are taken as examples for explanation. The training data sets AQUA, PHYS, and ESOL contain less data of higher quality, and the training data sets OCHEM, AQSOL, and CHEMBL contain more data of poorer quality. The 6 training data sets are thermodynamic data sets.

302. The computer device screens the first solubility data in each of the training data sets.

In an embodiment of the present application, the computer device may screen the first solubility data based on at least one of a molecular standardization result of the molecular data corresponding to the first solubility data, the molecular composition, and the data measurement environment information of the first solubility data. That is, the computer device may perform data screening on the first solubility data before performing data repair. The computer device may screen the first solubility data of each training data set separately, taking the training data set as the unit; the following description takes the screening of the first solubility data in one training data set as an example.

In one possible implementation, the computer device may perform data screening based on the SMILES (Simplified Molecular Input Line Entry System) standardization result of the molecular data. SMILES is a specification that describes a molecular structure unambiguously with an ASCII character string, and each molecular data corresponds to one SMILES expression. In the embodiment of the present application, the computer device standardizes the molecular structure of the molecular data corresponding to the first solubility data, and removes from the training data set the first solubility data whose molecular data cannot be standardized. For example, the computer device may apply MolVS (a molecule standardization tool) and input the chemical formula of the molecular data corresponding to each first solubility data into MolVS. In response to MolVS outputting a SMILES expression of the molecular data, the computer device determines that the molecular data can be standardized; in response to MolVS not outputting a SMILES expression of the molecular data, the computer device determines that the molecular data cannot be standardized. Screening the first solubility data by whether its molecular data can be standardized, and removing the first solubility data whose molecular data cannot be, improves the generality of the training data and avoids errors when the training data is used to train models that have no molecular structure standardization function of their own.

In one possible implementation, the computer device may screen the first solubility data based on the data measurement environment information. For example, the computer device may obtain the data measurement environment information recorded in the first solubility data, and remove from the training data set the first solubility data whose data measurement environment information does not satisfy the target condition. The target condition may be set by a developer, which is not limited in the embodiment of the present application; for example, the target condition may be set to a data measurement temperature of 25 ± 5 ℃ and a pH of 7 ± 2. Screening the first solubility data by data measurement environment removes training data obtained in extreme experimental environments. Such data may differ considerably in value from data obtained in a normal experimental environment, so removing it avoids its influence on the subsequent model training process.
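The environment-based screening can be sketched as follows. This is a minimal illustration only: the `SolubilityRecord` type, its field names, and the choice to reject records with missing environment information are assumptions made for the example, not details given in the application. The window of 25 ± 5 ℃ and pH 7 ± 2 is the example target condition from the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SolubilityRecord:
    """Hypothetical container for one first solubility data."""
    molecule_id: str
    solubility: float
    temperature_c: Optional[float] = None  # measurement temperature, if recorded
    ph: Optional[float] = None             # measurement pH, if recorded


def meets_target_condition(rec: SolubilityRecord,
                           temp_center: float = 25.0, temp_tol: float = 5.0,
                           ph_center: float = 7.0, ph_tol: float = 2.0) -> bool:
    """Return True when the recorded environment lies inside the target window."""
    # Assumption: a record without environment information is rejected.
    if rec.temperature_c is None or rec.ph is None:
        return False
    return (abs(rec.temperature_c - temp_center) <= temp_tol
            and abs(rec.ph - ph_center) <= ph_tol)


def screen_by_environment(records: List[SolubilityRecord]) -> List[SolubilityRecord]:
    """Keep only the records measured under the target condition."""
    return [r for r in records if meets_target_condition(r)]


records = [
    SolubilityRecord("A", 1.0, temperature_c=25.0, ph=7.0),   # inside the window
    SolubilityRecord("B", 2.0, temperature_c=60.0, ph=7.0),   # too hot
    SolubilityRecord("C", 3.0, temperature_c=22.0, ph=10.0),  # too alkaline
    SolubilityRecord("D", 0.5),                               # no environment info
]
kept = screen_by_environment(records)
```

Whether records with missing environment information should be dropped or kept is a design choice the text leaves open; keeping them would only require changing the early return.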

In one possible implementation, the computer device may screen the first solubility data based on the molecular composition of the molecular data to which the first solubility data corresponds. For example, the computer device may remove the first solubility data whose molecular data includes a target particle in its molecular composition. The target particles may be set by developers, which is not limited in the embodiment of the present application. For example, in the process of drug development, the molecular data used should be non-toxic, so the target particles may be set as heavy metal particles such as U, Ge, Pr, La, Dy, Ti, Zr, Rh, Lu, Mo, Sm, Sb, Nd, Gd, Cd, Ce, In, Pt, Sb, As, Ir, Ba, B, Hg, Se, Sn, Ti, Fe, Si, Al, Bi, Pb, Pd, Ag, Au, Cu, Pt, Co, Ni, Ru, Mg, Zn, Mn, Cr, Ca, K, Li, and the like, and may further include groups that are used very rarely in drug development, such as SF5, SF6, and the like. In the embodiment of the application, based on the molecular composition of the molecular data corresponding to the training data and the actual application scenario of the training data, the first solubility data corresponding to molecular data that is not used in the actual application scenario is removed. This improves the usability of the training data set, and when the screened first solubility data is applied to model training, the trained model better fits the actual application scenario.
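A minimal sketch of composition-based screening is shown below. It assumes the molecular identifier is a plain chemical formula such as `C6H12O6`, and it uses only a small illustrative subset of the target particles listed above; the function names are hypothetical. Detecting groups such as SF5 would need structure-aware matching and is not covered here.

```python
import re
from typing import List, Set, Tuple

# Illustrative subset of the target particles named in the text (assumption:
# a real deployment would configure the full list).
TARGET_ELEMENTS: Set[str] = {"U", "Hg", "Pb", "Cd", "As", "Pt"}

# An element symbol is one uppercase letter optionally followed by one lowercase letter.
_ELEMENT_RE = re.compile(r"[A-Z][a-z]?")


def elements_in_formula(formula: str) -> Set[str]:
    """Extract the element symbols from a plain chemical formula."""
    return set(_ELEMENT_RE.findall(formula))


def screen_by_composition(records: List[Tuple[str, float]],
                          targets: Set[str] = TARGET_ELEMENTS) -> List[Tuple[str, float]]:
    """Drop (formula, solubility) records whose formula contains a target element."""
    return [(formula, value) for formula, value in records
            if not (elements_in_formula(formula) & targets)]


records = [("C6H12O6", 0.9), ("HgCl2", 0.07), ("C2H6O", 1.1), ("PbSO4", 0.004)]
kept = screen_by_composition(records)
```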

It should be noted that the above description of the first solubility data screening method is merely an exemplary description of several implementation manners, and the embodiment of the present application does not limit the specific application manner of the first solubility data screening. In the embodiment of the present application, the plurality of first solubility data screening methods may be combined arbitrarily, and the specific combination manner and execution sequence of the first solubility data screening methods are not limited in the embodiment of the present application.

303. The computer device respectively merges the repeated first solubility data in each training data set to obtain second solubility data corresponding to each training data set and the first weight of each second solubility data.

In one possible implementation, for each training data set, the computer device may divide the first solubility data corresponding to the same molecular data into one group, obtaining at least two groups of solubility data; then, within each group, the first solubility data that contain the same solubility value are merged to obtain at least one second solubility data. In the embodiment of the present application, a plurality of first solubility data in a training data set may correspond to the same molecular data, and the solubility values in these first solubility data may be the same or different. For example, a training data set may contain 4 first solubility data that all correspond to molecular data A, where first solubility data 1 records a solubility value of 9/100g of water, first solubility data 2 records 9.01/100g of water, first solubility data 3 records 9.5/100g of water, and first solubility data 4 records 9.7/100g of water. The computer device may merge first solubility data 1 and first solubility data 2 into one second solubility data, and take first solubility data 3 and first solubility data 4 each as a separate second solubility data. In the embodiment of the present application, if the absolute value of the difference between the solubility values of two first solubility data corresponding to the same molecular data is less than 0.01, the two first solubility data can be determined to be the same.

In one possible implementation, the computer device may determine the first weight corresponding to a second solubility data based on the number of first solubility data included in the second solubility data. The first weight is positively correlated with the number of first solubility data included in the second solubility data, that is, the first weight indicates the repetition degree of the first solubility data corresponding to the second solubility data. In this embodiment, the total weight of the second solubility data corresponding to each molecular data is 1. Suppose one molecular data corresponds to three second solubility data, where second solubility data 1 is obtained by merging two first solubility data, and second solubility data 2 and second solubility data 3 are each determined by one first solubility data. The computer device may then assign a first weight to each second solubility data based on the total weight and the number of first solubility data included in each second solubility data: the first weight of second solubility data 1 is 0.5, and the first weights of second solubility data 2 and second solubility data 3 are both 0.25.
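The merging and first-weight assignment can be sketched as below. This is an illustration under stated assumptions, not the application's implementation: values are sorted and a value joins the current cluster when it differs from the cluster's first value by less than the tolerance of 0.01 given in the text, and the mean of each cluster is used as the merged solubility value (the text does not say which representative value is kept). The function name and record layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def merge_duplicates(records: List[Tuple[str, float]],
                     tol: float = 0.01) -> Dict[str, List[Tuple[float, float]]]:
    """Merge repeated (molecule_id, solubility) records.

    Returns {molecule_id: [(merged_value, first_weight), ...]}, where the first
    weights of one molecule sum to 1.
    """
    by_molecule: Dict[str, List[float]] = defaultdict(list)
    for molecule_id, value in records:
        by_molecule[molecule_id].append(value)

    merged: Dict[str, List[Tuple[float, float]]] = {}
    for molecule_id, values in by_molecule.items():
        values.sort()
        clusters: List[List[float]] = [[values[0]]]
        for value in values[1:]:
            # Treat the value as a repeat if it is within `tol` of the cluster's first value.
            if abs(value - clusters[-1][0]) < tol:
                clusters[-1].append(value)
            else:
                clusters.append([value])
        total = len(values)
        # Assumption: the cluster mean is the merged value; weight = share of records.
        merged[molecule_id] = [(sum(c) / len(c), len(c) / total) for c in clusters]
    return merged


records = [("A", 9.0), ("A", 9.005), ("A", 9.5), ("A", 9.7)]
second_data = merge_duplicates(records)
```

With four records for molecule A, two of which are repeats, the three resulting second solubility data carry first weights 0.5, 0.25, and 0.25, matching the weighting example in the text.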

It should be noted that the above description of merging duplicate first solubility data and determining the first weight is only exemplary, and the embodiment of the present application does not limit which method is specifically used to merge duplicates and determine the first weight. In the embodiment of the application, merging duplicate data reduces the data redundancy of the training data set. Assigning weights to the second solubility data based on how often the same first solubility data occurs can reduce the impact of erroneous data on model training. For example, first solubility data that occurs frequently is more likely to be accurate, so the second solubility data obtained by merging it receives a larger first weight; first solubility data that occurs rarely is less likely to be accurate, so the second solubility data obtained from it receives a smaller first weight. That is, data that may contain errors receives a smaller weight and therefore has less influence on model training.

304. The computer device trains the solubility prediction model based on the first solubility data of each training data set, and determines a second weight corresponding to each training data set based on a training result of the solubility prediction model.

Wherein the second weight is used to indicate the data accuracy of each training data set.

In one possible implementation, for each training data set, first, the computer device may train the solubility prediction model based on a first target amount of the first solubility data in the training data set to obtain a trained solubility prediction model; then, determining model prediction accuracy of the trained solubility prediction model based on a second target amount of the first solubility data in the training dataset; finally, a second weight corresponding to each training data set is determined based on the model prediction precision corresponding to each training data set. The second weight is positively correlated with the model prediction accuracy, that is, the higher the model prediction accuracy of the solubility prediction model obtained by training a certain training data set is, the larger the second weight corresponding to the certain training data set is. In the embodiment of the present application, the maximum value of the second weight is 1, and both the first target number and the second target number may be set by a developer, which is not limited in the embodiment of the present application, for example, the ratio between the first target number and the second target number may be 8: 2. Of course, the first solubility data in a training data set may also be divided according to a ratio of 8:1:1, wherein 80% of the first solubility data is used for model training, 10% of the first solubility data is used for model testing, and 10% of the first solubility data is used for model prediction accuracy evaluation. In a possible implementation manner, taking 6 training data sets of AQUA, PHYS, ESOL, OCHEM, AQSOL, and CHEMBL as an example, the solubility prediction model is trained based on the 6 training data sets, and the model training results corresponding to the training data sets are shown in table 1.

TABLE 1

RMSE (Root Mean Square Error) may be used to indicate the model prediction accuracy; the larger the RMSE, the lower the model prediction accuracy. As can be seen from the data in table 1, both when the data are divided randomly and when they are divided based on the Scaffold (molecular fragment), the models trained on the three training data sets AQUA, PHYS, and ESOL perform better and the models trained on the two training data sets AQSOL and CHEMBL perform worse. Based on these data, the second weights corresponding to the training data sets are determined to be 1, 0.85, 0.5, and 0.4, respectively.

It should be noted that the above description of the method for dividing the first solubility data and the method for determining the model training accuracy is only an exemplary description, and the embodiment of the present application does not limit which method is specifically used to divide the first solubility data and determine the model training accuracy.

In the embodiment of the application, the second weight is distributed to each training data set, a larger weight value is distributed to the training data set with a better corresponding model training effect, and a smaller weight value is distributed to the training data set with a poorer corresponding model training effect, so that the influence of data with poorer quality on the model training result can be reduced.
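The relationship between prediction error and the second weight can be sketched as follows. The application only requires that the second weight rise with model prediction accuracy and have a maximum of 1; the specific mapping used here (best RMSE divided by each data set's RMSE) is an assumption chosen for illustration, and the data set names and RMSE values are made up.

```python
import math
from typing import Dict, Sequence


def rmse(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Root mean square error between predicted and observed solubility values."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted))


def second_weights(rmse_by_dataset: Dict[str, float]) -> Dict[str, float]:
    """Map each data set's held-out RMSE to a weight in (0, 1].

    Assumed mapping: weight = best RMSE / this RMSE, so the most accurate
    data set gets weight 1 and the weight falls as the RMSE grows.
    """
    if not rmse_by_dataset or any(v <= 0 for v in rmse_by_dataset.values()):
        raise ValueError("RMSE values must be positive")
    best = min(rmse_by_dataset.values())
    return {name: best / value for name, value in rmse_by_dataset.items()}


# Hypothetical held-out RMSE values for two hypothetical data sets.
weights = second_weights({"set_a": 0.5, "set_b": 1.0})
```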

305. For any training data set, the computer device determines at least one training data set from the at least two training data sets as at least one reference data set based on the corresponding second weight of each training data set.

In one possible implementation, the computer device may compare the second weight corresponding to each of the training data sets with the second weight corresponding to any of the training data sets; and acquiring the training data set of which the corresponding second weight is greater than or equal to the second weight corresponding to any training data set as the reference data set. That is, the training data set with higher weight is used as the reference data set. Taking 6 training data sets of AQUA, PHYS, ESOL, OCHEM, AQSOL, and CHEMBL as an example, the second weights of the training data sets are 1, 0.85, 0.5, and 0.4, respectively, and for the training data set OCHEM, the corresponding reference data sets are AQUA, PHYS, ESOL, and OCHEM.
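The selection rule above can be sketched as a simple comparison of second weights. The text lists only four weight values for the six data sets, so the per-data-set assignment below is an assumption chosen solely so that the OCHEM example is reproduced; the function name is hypothetical.

```python
from typing import Dict, List


def select_reference_sets(target: str, second_weight: Dict[str, float]) -> List[str]:
    """Return every data set whose second weight is >= that of the target data set."""
    threshold = second_weight[target]
    return [name for name, weight in second_weight.items() if weight >= threshold]


# Assumed assignment of second weights (illustrative only).
second_weight = {
    "AQUA": 1.0, "PHYS": 1.0, "ESOL": 1.0,
    "OCHEM": 0.85, "AQSOL": 0.5, "CHEMBL": 0.4,
}
refs_for_ochem = select_reference_sets("OCHEM", second_weight)
```

Because the comparison is "greater than or equal to", a data set is always among its own reference data sets, which is how repeats inside the same data set can contribute to its repair.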

Taking the 6 training data sets AQUA, PHYS, ESOL, OCHEM, AQSOL, and CHEMBL as examples, the data sets overlap one another. Therefore, after a reference data set is determined, a training data set to be repaired can be repaired based on the data that is repeated between the reference data set and that training data set. Tables 2 and 3 show the overlap between the training data sets: table 2 gives the data repetition ratio between the training data sets, and table 3 gives the data non-repetition ratio between the training data sets.

TABLE 2

| | AQUA | PHYS | ESOL | OCHEM | AQSOL | CHEMBL |
|---|---|---|---|---|---|---|
| AQUA | 100% | 53.13% | 51.07% | 97.86% | 66.03% | 0.53% |
| PHYS | 34.78% | 100% | 23.99% | 77.81% | 65.67% | 1.75% |
| ESOL | 59.95% | 43.01% | 100% | 97.86% | 65.23% | 0.72% |
| OCHEM | 33.39% | 36.83% | 25.94% | 100% | 66.49% | 1.02% |
| AQSOL | 9.94% | 15.1% | 8.37% | 30.94% | 100% | 1.76% |
| CHEMBL | 0.02% | 0.11% | 0.03% | 0.14% | 0.5% | 100% |

TABLE 3

As can be seen from the data in tables 2 and 3, in each of the 6 training data sets, a large amount of molecular data has the same solubility value, and a large amount of molecular data has different solubility values. In the embodiment of the application, data with the same solubility value from different data sets are applied to data restoration, so that the confidence of the restored solubility value can be improved.

Fig. 4 is a schematic diagram of the correspondence between a training data set to be repaired and a reference data set provided in an embodiment of the present application. Referring to fig. 4, in a possible implementation manner, the training data sets may be grouped based on their second weights. For example, the training data sets AQUA, PHYS, and ESOL, which have higher second weights, may form a first group 401; the training data set OCHEM alone may form a second group 402; and the training data sets AQSOL and CHEMBL, which have lower second weights, form a third group 403. In this embodiment of the present application, data repair may include intra-group data repair and inter-group data repair: intra-group data repair takes a training data set of the same group as a reference data set, and inter-group data repair takes a training data set of another group as a reference data set. A group of training data sets with higher weight may be used to repair a group of training data sets with lower weight; for example, the first group 401 may be used to repair the second group 402 and the third group 403, and the second group 402 may be used to repair the third group 403. Taking the training data set OCHEM as an example, when data repair is performed on OCHEM, its reference data sets may include the training data set in its own group, that is, OCHEM itself, and may also include each training data set in the first group 401.

In the embodiment of the application, the reference data set is constructed based on the training data set with higher weight, that is, other data are repaired by applying the data with higher accuracy, so that a better data repairing effect can be obtained.

306. The computer device generates a repair data set based on the second weight corresponding to the reference data set, the second solubility data corresponding to the reference data set, and the first weight of each of the second solubility data.

Wherein the repair data set includes second solubility data corresponding to the reference data set and a third weight for each of the second solubility data, the third weight indicating an accuracy of the second solubility data.

In one possible implementation, the computer device may multiply the first weight of a second solubility data by the second weight corresponding to the reference data set to which the second solubility data belongs, to obtain the third weight of the second solubility data. The computer device then generates the repair data set based on the second solubility data corresponding to the at least one reference data set and the third weight of each second solubility data, where the repair data set is used for repairing any training data set to be repaired. It should be noted that the above description of the method for constructing the repair data set is only exemplary, and the embodiment of the present application does not limit which method is specifically adopted to construct the repair data set.
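The third-weight computation can be sketched as follows. The input layout (a mapping from a reference data set name to its second weight and its second solubility data) and the names `ref_a` and `ref_b` are hypothetical and chosen only for the example.

```python
from typing import Dict, List, Tuple

# (molecule_id, solubility_value, first_weight)
SecondData = Tuple[str, float, float]
# (molecule_id, solubility_value, third_weight)
RepairEntry = Tuple[str, float, float]


def build_repair_set(
    reference_sets: Dict[str, Tuple[float, List[SecondData]]]
) -> List[RepairEntry]:
    """Pool the second solubility data of all reference data sets.

    Each entry's third weight is its first weight multiplied by the second
    weight of the reference data set it comes from.
    """
    repair_set: List[RepairEntry] = []
    for _name, (second_weight, entries) in reference_sets.items():
        for molecule_id, value, first_weight in entries:
            repair_set.append((molecule_id, value, first_weight * second_weight))
    return repair_set


reference_sets = {
    "ref_a": (1.0, [("A", 9.0, 0.5), ("A", 9.5, 0.5)]),
    "ref_b": (0.5, [("A", 9.1, 1.0)]),
}
repair_set = build_repair_set(reference_sets)
```

In this example the single value from `ref_b` ends up with the same third weight (0.5) as each of the two competing values from `ref_a`, which shows how a lower data set weight offsets a higher repetition degree.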

307. The computer device groups the repair data set based on the molecular data to be repaired corresponding to the second solubility data in any training data set, to obtain a repair data group corresponding to each molecular data to be repaired.

In a possible implementation manner, the computer device may determine the molecular data corresponding to the second solubility data of any training data set as molecular data to be repaired, and group the repair data set in units of molecular data, that is, group the second solubility data in the repair data set based on the molecular data to be repaired, to obtain one repair data group corresponding to each molecular data to be repaired. For example, if the molecular data corresponding to the second solubility data of any training data set are molecular data A and molecular data B, the computer device may obtain, from the second solubility data included in the repair data set, at least one second solubility data corresponding to molecular data A as one repair data group, and at least one second solubility data corresponding to molecular data B as another repair data group.
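The grouping in step 307 can be sketched as follows. This is a minimal sketch, assuming the repair data set is a list of (molecule, solubility value, third weight) tuples; the function name `group_repair_set` and the sample values are hypothetical.

```python
from collections import defaultdict


def group_repair_set(repair_set, molecules_to_repair):
    """Build one repair data group per molecular data to be repaired."""
    wanted = set(molecules_to_repair)
    groups = defaultdict(list)
    for molecule, solubility, third_weight in repair_set:
        # Keep only molecules that occur in the training data set under repair.
        if molecule in wanted:
            groups[molecule].append((solubility, third_weight))
    return dict(groups)


# Illustrative values only.
repair_set = [
    ("A", -2.0, 1.0),
    ("B", -4.0, 0.75),
    ("A", -2.5, 0.25),
    ("C", -1.0, 0.5),  # not present in the data set under repair, so skipped
]
repair_groups = group_repair_set(repair_set, ["A", "B"])
```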

308. For each repair data group, the computer device performs data repair on any training data set based on the second solubility data in the repair data group and the third weight of the second solubility data to obtain at least one target training data.

In a possible implementation manner, for the second solubility data included in each repair data group, the computer device may sort the second solubility data by solubility value, sequentially obtain the solubility difference between each two adjacent second solubility data in the sorted order, compare the solubility difference with a first threshold, and determine at least one target training data based on the comparison result, the two adjacent second solubility data, and their third weights. The first threshold may be set by a developer; for example, it may be set to 0.5. The specific value of the first threshold is not limited in this embodiment of the application. In response to the solubility difference being less than or equal to the first threshold, the solubility values of the two adjacent second solubility data are merged into the solubility value of one target training data, and the sum of the third weights of the two adjacent second solubility data is determined as the target weight of that target training data. For example, the computer device may merge the data by weighted averaging of the solubility values of the two second solubility data, that is, the solubility value of each second solubility data is multiplied by its corresponding third weight, and the products are added to obtain the solubility value of the target training data. Taking the i-th second solubility data as an example, its solubility value is denoted S(i) and its corresponding third weight is denoted W(i); if the solubility values of the i-th and (i+1)-th second solubility data are merged, the solubility value of the resulting target training data is expressed as S(i) × W(i) + S(i+1) × W(i+1). In response to the solubility difference being greater than the first threshold, the two adjacent second solubility data are not merged, and two target training data are determined based on their respective solubility values and corresponding third weights.
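The merging rule of step 308 can be sketched as follows. This is a minimal sketch under several assumptions that are not stated in the text: the weighted sum S(i) × W(i) + S(i+1) × W(i+1) is divided by W(i) + W(i+1) so that the merged value stays between the two inputs, a merged value is compared with the next sorted value (a running cluster), and the function name `cluster_repair` is hypothetical. The default threshold of 0.5 is the example value given above.

```python
def cluster_repair(group, first_threshold=0.5):
    """Merge neighbouring solubility values of one repair data group.

    `group` is a list of (solubility value, third weight) pairs.
    Returns a list of (solubility value, target weight) pairs.
    """
    merged = []
    for value, weight in sorted(group):  # sort by solubility value
        if merged and value - merged[-1][0] <= first_threshold:
            prev_value, prev_weight = merged[-1]
            total = prev_weight + weight  # target weight: sum of third weights
            # Assumption: normalise the weighted sum by the total weight.
            average = (prev_value * prev_weight + value * weight) / total
            merged[-1] = (average, total)
        else:
            # Difference above the threshold: keep as separate target data.
            merged.append((value, weight))
    return merged


# Illustrative values only.
targets = cluster_repair([(-2.0, 3.0), (-2.5, 1.0), (-4.0, 2.0)])
```

In this example the values -2.5 and -2.0 differ by 0.5 and are merged with a target weight of 4.0, while -4.0 is more than 0.5 away from its neighbour and is kept as its own target training data.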

In the data merging process, the target weight of a target training data is obtained by accumulating third weights. If too many third weights are accumulated, the target weight becomes large, and the target training data may have an excessive influence on the model training result, for example, causing the model to overfit. In this embodiment, to prevent the training result of the model from being affected by an excessively large weight of a certain target training data, the computer device may regularize the target weight of each target training data based on a second threshold. In one possible implementation, the computer device may compare the target weight of the target training data with the second threshold. In response to the target weight being greater than the second threshold, the value of the target weight is set to the second threshold; in response to the target weight being less than or equal to the second threshold, the target weight is not modified. The computer device may then divide the target weight by the second threshold to obtain a regularized target weight. The second threshold may be set by a developer, and is not limited in this embodiment of the application.
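The regularization of the target weight can be sketched as follows. This is a minimal sketch, assuming the clipping and the division are applied in sequence so that the result lies in the interval (0, 1]; the function name `regularize_weight` and the threshold value 8.0 are hypothetical, since the text leaves the second threshold to the developer.

```python
def regularize_weight(target_weight, second_threshold):
    """Clip the target weight at the second threshold, then rescale it."""
    clipped = min(target_weight, second_threshold)  # cap oversized weights
    return clipped / second_threshold               # rescale to at most 1.0


SECOND_THRESHOLD = 8.0  # hypothetical value chosen for illustration
capped = regularize_weight(12.0, SECOND_THRESHOLD)
scaled = regularize_weight(2.0, SECOND_THRESHOLD)
```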

Fig. 5 is a flowchart of a cluster repair algorithm provided in an embodiment of the present application, and the data repair process is described below with reference to fig. 5. Take the reference data sets D(0), D(1), …, D(n-1) and the data set to be repaired D(n-1) as an example. As shown in (a) of fig. 5, the computer device first performs step 501 of constructing the repair data set D based on the reference data sets D(0), D(1), …, D(n-1), that is, performs the content of step 306. The computer device then performs step 502 of extracting, from the repair data set D, the second solubility data of the molecular data that appear in the data set D(n-1) to be repaired, and constructing the extracted second solubility data into a data set F, that is, performs the content of step 307 to construct the repair data group corresponding to each molecular data to be repaired. The computer device then performs step 503 of grouping the second solubility data in the data set F by corresponding molecular data and performing data repair on the grouped data based on the cluster repair algorithm. The specific process of the cluster repair algorithm is shown in (b) of fig. 5. The computer device first performs the second solubility data sorting step 504, that is, sorts the data based on the solubility values in the second solubility data; it then performs step 505 of determining whether the solubility difference between two adjacent second solubility data is smaller than the first threshold; if so, it performs step 506 of merging the solubility values of the two adjacent second solubility data to determine the solubility value of the target training data, and then performs step 507 of determining the target weight of the target training data and regularizing the target weight; if not, it continues with the next two adjacent second solubility data.

Fig. 6 is a flowchart of data repair provided in an embodiment of the present application, and the training data obtaining method is described below with reference to fig. 6. First, the computer device performs step 601 of data filtering and normalization, that is, performs step 302 and step 303 on the six training data sets AQUA, PHYS, ESOL, OCHEM, AQSOL, and CHEMBL to obtain filtered data. The computer device then performs step 602 of data set quality assessment and weight assignment, that is, performs the content of step 304. Finally, the computer device performs step 603 of cluster repair and quality improvement evaluation, that is, performs the content of steps 305 to 308 to complete data cleaning and data repair, obtaining cleaned data and repaired data respectively. Table 4 shows the amount of data included in each training data set after each data processing stage, and the second weight corresponding to each training data set.

TABLE 4

As can be seen from the data in table 4, the data amount of each training data set changes after the data cleaning and data repair stages, because solubility data are clustered and merged during data cleaning and data repair. Among the six training data sets, AQUA, PHYS, and ESOL have higher data quality, and their corresponding second weights are all larger than those of the other training data sets.

In the embodiment of the present application, model training may be performed based on training data sets obtained in each stage, and table 5 shows model prediction accuracy information of a solubility prediction model obtained by training when model training is performed based on the training data sets in each stage. Wherein, the model prediction precision information is expressed as the RMSE index of the solubility prediction model and the confidence interval thereof.

TABLE 5

Fig. 7 is a schematic diagram of a model training result provided in an embodiment of the present application. As can be seen from the data in table 5 and fig. 7, the RMSE index corresponding to a training data set that has undergone data repair is lower, that is, the model training effect is better. As can also be seen from the data in table 5, because the data cleaning stage involves clustering of identical data, the data after cleaning is increased, and the weighted solubility prediction model Chemprop is used for calculation during model training, the RMSE index rises after data cleaning and falls back after data repair.

When the technical solution provided by the embodiment of the present application is applied to data repair, the RMSE of a model trained on the repaired data shows a clear downward trend compared with a model trained on the original training data set. For example, the lowest RMSE was obtained with the model trained on CHEMBL, with an RMSE as low as 0.35 (confidence interval of 0.009). The RMSE index of the model trained on ESOL decreased from 0.594 to 0.551, that is, by 0.043 in LogS units. For the other repaired training data sets, under a random data partitioning strategy, the RMSE indexes of the models trained on AQUA, PHYS, OCHEM, AQSOL, and CHEMBL decreased by 0.044, 0.042, 0.004, 0.41, and 0.55, respectively. Under the scaffold data partitioning strategy, the RMSE indexes on AQUA, PHYS, OCHEM, AQSOL, and CHEMBL decreased by 0.12, 0.08, 0.06, 0.371, and 0.96, respectively.

All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.

Fig. 8 is a schematic structural diagram of a training data obtaining apparatus for a solubility prediction model according to an embodiment of the present application, and referring to fig. 8, the apparatus includes:

a first obtaining module 801, configured to obtain first solubility data of at least two training data sets, where one first solubility data includes a solubility value of one molecular data;

a second obtaining module 802, configured to respectively combine the repeated first solubility data in each training data set to obtain second solubility data corresponding to each training data set and a first weight of each second solubility data, where the first weight is used to indicate a repetition degree of the first solubility data corresponding to the second solubility data;

a first determining module 803, configured to train a solubility prediction model based on the first solubility data of each training data set, and determine a second weight corresponding to each training data set based on a model prediction result of the solubility prediction model, where the second weight is used to indicate data accuracy of each training data set;

a second determining module 804, configured to determine, for any training data set, at least one training data set from the at least two training data sets based on the second weight corresponding to each training data set, as at least one reference data set corresponding to that training data set;

a data repair module 805, configured to perform data repair on any training data set based on the second weight corresponding to the reference data set corresponding to the training data set, the second solubility data corresponding to the reference data set, and the first weight of each second solubility data, to obtain target training data, where one target training data includes a solubility value of one molecular data and a target weight of the solubility value, and the target weight is used to indicate the accuracy of the solubility value.

In one possible implementation manner, the second obtaining module 802 is configured to:

for each training data set, dividing the first solubility data corresponding to the same molecular data into a group to obtain at least two groups of solubility data;

for each group of solubility data, respectively merging the first solubility data comprising the same solubility value to obtain at least one second solubility data;

determining the first weight of the second solubility data based on a number of the first solubility data included in the second solubility data.
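The duplicate merging performed by the second obtaining module 802 can be sketched as follows. This is a minimal sketch, assuming each first solubility data is a (molecule, solubility value) pair and that the first weight is simply the number of merged duplicates; the text only states that the first weight is based on that number, and the function name `merge_duplicates` is hypothetical.

```python
from collections import Counter


def merge_duplicates(first_solubility_data):
    """Merge identical (molecule, solubility value) pairs of one data set.

    Returns (molecule, solubility value, first weight) triples, where the
    first weight is assumed to equal the number of merged duplicates.
    """
    counts = Counter(first_solubility_data)
    return [(mol, value, n) for (mol, value), n in counts.items()]


# Illustrative values only.
first_data = [("A", -2.0), ("A", -2.0), ("A", -2.5), ("B", -4.0)]
second_data = merge_duplicates(first_data)
```

Counting the (molecule, solubility value) pairs directly covers both the grouping by molecule and the merging of equal values within a group.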

In one possible implementation, the first determining module 803 is configured to:

for each training data set, determining model prediction accuracy of the trained solubility prediction model based on a second target amount of the first solubility data in the training data set;

In one possible implementation, the second determining module 804 is configured to:

comparing the second weight corresponding to each training data set with the second weight corresponding to any training data set;

In one possible implementation, the data repair module 805 includes: