Method and device for model training, antibody modification and binding site prediction

Document No.: 363816 | Publication date: 2021-12-07

Description: This technology, "Method and device for model training, antibody modification and binding site prediction" (模型训练、抗体改造和结合位点预测的方法与装置), was designed and created by 蒋彪彬, 许振雷, 刘伟 and 黄俊洲 on 2021-05-28. Main content: the embodiments of the present application provide a method and device for model training, antibody modification and binding site prediction. The training method includes pre-training a prediction model using N first antibody sequences to obtain a pre-trained prediction model, where the binding sites between the first antibody sequences and an antigen are not labeled in the first antibody sequences, and the pre-trained prediction model is used to predict values for masked amino acids in an antibody sequence. Because unlabeled first antibody sequences are available in large numbers, pre-training the prediction model on many first antibody sequences allows the model to be fully trained, improving its training accuracy. In addition, during pre-training the variable region of the first antibody sequence is learned more heavily to further improve training accuracy; when this prediction model is used for antibody-related prediction work, the prediction cost is low and the prediction efficiency is high.

1. A method for training a predictive model for an antibody, comprising:

acquiring N first antibody sequences, wherein N is a positive integer, and binding sites of the first antibody sequences and antigens are not marked in the first antibody sequences;

pre-training a prediction model by using the N first antibody sequences to obtain a pre-trained prediction model;

wherein the pre-trained prediction model is used for predicting the predicted value of the masked amino acid in the antibody sequence, and the learning frequency of the variable region of the first antibody sequence is higher than that of the non-variable region of the first antibody sequence in the pre-training process of the prediction model.

2. The method of claim 1, wherein said pre-training said predictive model using said N first antibody sequences to obtain a pre-trained predictive model comprises:

performing unsupervised pre-training on the prediction model using the N first antibody sequences to obtain the pre-trained prediction model.

3. The method of claim 2, wherein said unsupervised pre-training of said predictive model using said N first antibody sequences to obtain a pre-trained predictive model comprises:

pre-training the prediction model using the N first antibody sequences based on a MASK strategy to obtain the pre-trained prediction model.

4. The method according to claim 3, wherein the pre-training the prediction model using the N first antibody sequences based on MASK strategy to obtain a pre-trained prediction model comprises:

masking, for each of the N first antibody sequences, amino acids in the variable region of the first antibody sequence at a first masking frequency and amino acids in the non-variable region of the first antibody sequence at a second masking frequency, and obtaining predicted values of the masked amino acids from the prediction model;

pre-training the prediction model according to the loss between the predicted values and the true values of the masked amino acids to obtain the pre-trained prediction model.

5. The method of claim 4, wherein the first masking frequency is greater than the second masking frequency.
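The two-frequency masking scheme of claims 4 and 5 could be sketched as follows. This is a minimal illustration by the editor, not code from the application; the function name, the concrete masking rates, and the `[MASK]` token are hypothetical.

```python
import random

def mask_sequence(seq, variable_spans, p_var=0.25, p_fwr=0.10, mask_token="[MASK]"):
    """Mask amino acids in an antibody sequence, masking variable-region
    positions at a higher first frequency (p_var) than non-variable
    (framework) positions (p_fwr)."""
    variable_positions = {i for start, end in variable_spans for i in range(start, end)}
    masked, targets = [], {}
    for i, aa in enumerate(seq):
        p = p_var if i in variable_positions else p_fwr
        if random.random() < p:
            masked.append(mask_token)
            targets[i] = aa  # true value, later compared with the model's prediction in the loss
        else:
            masked.append(aa)
    return masked, targets
```

The masked sequence would then be fed to the prediction model, and the loss between the model's predictions and `targets` drives pre-training; because `p_var > p_fwr`, variable-region positions are learned more frequently.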

6. The method according to any one of claims 1-5, further comprising:

obtaining M second antibody sequences, M being a positive integer, the second antibody sequences being labeled with binding sites for the second antibody sequences to an antigen;

fine-tuning the pre-trained prediction model using the M second antibody sequences to obtain a target prediction model, wherein the target prediction model is used for predicting binding sites between an antibody sequence and an antigen.

7. The method of claim 6, wherein said fine-tuning of said pre-trained predictive model using said M second antibody sequences to obtain a target predictive model comprises:

for each second antibody sequence in the M second antibody sequences, inputting the second antibody sequence into the pre-trained prediction model to obtain the predicted binding site between the second antibody sequence and the antigen output by the pre-trained prediction model;

fine-tuning the pre-trained prediction model according to the loss between the predicted binding site of the second antibody sequence and the antigen and the true binding site of the second antibody sequence and the antigen, to obtain the target prediction model.

8. The method of claim 6, wherein the second antibody sequence is labeled with a tag sequence at the binding site of the second antibody sequence to the antigen, wherein the tag sequence is the same length as the second antibody sequence, and each value in the tag sequence indicates whether the amino acid corresponding to that value binds to the antigen.
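The tag-sequence labeling of claim 8 could look like the following sketch. The names are the editor's; the claim only fixes that the tag sequence has the same length as the antibody sequence and that each value flags whether the corresponding amino acid binds the antigen.

```python
def make_tag_sequence(antibody_seq, binding_positions):
    """Build a tag sequence of the same length as the antibody sequence:
    1 at positions whose amino acid binds the antigen, 0 elsewhere."""
    binding = set(binding_positions)
    return [1 if i in binding else 0 for i in range(len(antibody_seq))]

# A 10-residue sequence whose residues 2, 3 and 7 bind the antigen:
make_tag_sequence("QVQLVQSGAE", {2, 3, 7})  # → [0, 0, 1, 1, 0, 0, 0, 1, 0, 0]
```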

9. An antibody engineering method comprising:

obtaining a target antibody sequence to be modified;

receiving a masking operation of a user on target site amino acids to be modified in the target antibody sequence;

in response to the masking operation, inputting the target antibody sequence with the target site amino acid masked into a pre-trained prediction model to obtain a predicted value of the target site amino acid output by the pre-trained prediction model;

the pre-trained prediction model is obtained by training on a first antibody sequence, the binding site between the first antibody sequence and an antigen is not marked in the first antibody sequence, and in the pre-training process of the prediction model, the learning frequency of the variable region of the first antibody sequence is higher than that of the non-variable region of the first antibody sequence.

10. The method of claim 9, wherein the inputting the target antibody sequence with the target site amino acid masked into the pre-trained prediction model to obtain the predicted value of the target site amino acid output by the pre-trained prediction model comprises:

inputting the target antibody sequence with the target site amino acid masked into the pre-trained prediction model, and for each of Q preset values, obtaining the probability value, output by the pre-trained prediction model, that the target site amino acid is replaced by the preset value, wherein Q is a positive integer;

displaying, as the predicted values of the target site amino acid, the K preset values with the largest probability values among the Q preset values, wherein K is a positive integer less than or equal to Q.

11. The method of claim 10, further comprising:

displaying the probability value of each of the K displayed preset values.
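Claims 10 and 11 describe selecting and displaying the K most probable of Q preset values. A plain sketch of that selection, with hypothetical names and toy probabilities:

```python
def top_k_predictions(probs, k):
    """Given probability values for Q candidate amino acids (preset
    values), return the K candidates with the largest probabilities,
    each paired with its probability for display."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]

probs = {"A": 0.05, "G": 0.40, "S": 0.30, "Y": 0.25}
top_k_predictions(probs, 2)  # → [('G', 0.4), ('S', 0.3)]
```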

12. A method for predicting an antibody binding site, comprising:

obtaining a target antibody sequence to be predicted;

inputting the target antibody sequence into a target prediction model to predict the binding site of the target antibody sequence and an antigen;

the target prediction model is obtained by pre-training on a first antibody sequence and then fine-tuning on a second antibody sequence, wherein the binding site between the first antibody sequence and an antigen is not marked in the first antibody sequence, the binding site between the second antibody sequence and the antigen is marked in the second antibody sequence, and in the pre-training process of the prediction model, the learning frequency of the variable region of the first antibody sequence is higher than that of the non-variable region of the first antibody sequence.

13. The method of claim 12, further comprising:

displaying the binding sites of the target antibody sequence and the antigen predicted by the target prediction model.

14. A training apparatus for a predictive model of an antibody, comprising:

an obtaining unit, configured to obtain N first antibody sequences, where N is a positive integer, and a binding site between the first antibody sequence and an antigen is not marked in the first antibody sequence;

a training unit, configured to pre-train a prediction model using the N first antibody sequences to obtain a pre-trained prediction model, where the pre-trained prediction model is used to predict a predicted value of a masked amino acid in an antibody sequence, and in the pre-training process of the prediction model, the learning frequency of the variable region of the first antibody sequence is higher than that of the non-variable region of the first antibody sequence.

15. An antibody engineering device, comprising:

an acquisition unit for acquiring a target antibody sequence to be modified;

the receiving unit is used for receiving a masking operation of a user on the target site amino acid to be modified in the target antibody sequence;

a prediction unit, configured to, in response to the masking operation, input the target antibody sequence with the target site amino acid masked into a pre-trained prediction model and obtain a predicted value of the target site amino acid output by the pre-trained prediction model;

the pre-trained prediction model is obtained by training on a first antibody sequence, the binding site between the first antibody sequence and an antigen is not marked in the first antibody sequence, and in the pre-training process of the prediction model, the learning frequency of the variable region of the first antibody sequence is higher than that of the non-variable region of the first antibody sequence.

16. An antibody binding site prediction device comprising:

an acquisition unit for acquiring a target antibody sequence to be predicted;

a prediction unit, configured to input the target antibody sequence into a target prediction model, and predict a binding site of the target antibody sequence and an antigen;

the target prediction model is obtained by pre-training on a first antibody sequence and then fine-tuning on a second antibody sequence, wherein the binding site between the first antibody sequence and an antigen is not marked in the first antibody sequence, the binding site between the second antibody sequence and the antigen is marked in the second antibody sequence, and in the pre-training process of the prediction model, the learning frequency of the variable region of the first antibody sequence is higher than that of the non-variable region of the first antibody sequence.

17. A computing device, comprising: a processor and a memory;

the memory for storing a computer program;

the processor for executing the computer program to implement the method of any one of claims 1 to 8 or 9 to 11 or 12 to 13.

18. A computer-readable storage medium, characterized in that the storage medium comprises computer instructions which, when executed by a computer, cause the computer to carry out the method according to any one of claims 1 to 8 or 9 to 11 or 12 to 13.

Technical Field

The embodiment of the application relates to the technical field of computers, in particular to a method and a device for model training, antibody modification and binding site prediction.

Background

Antibodies are important immune proteins that are responsible for recognizing foreign invaders or intrinsic variants, i.e., antigens, within an organism and binding to the antigens to eliminate them.

An important property of an antibody is the affinity of binding to an antigen, the magnitude of which is determined by the binding site of the antibody to the antigen. When the affinity of the antibody is not high enough, the antibody needs to be modified to improve the affinity of the antibody.

At present, antibody-related prediction work mainly relies on structure-analysis experiments or molecular knockout screening experiments, which are costly and time-consuming.

Disclosure of Invention

The embodiment of the application provides a method and a device for model training, antibody modification and binding site prediction, so that the cost of antibody-related prediction work is reduced, and the prediction efficiency is improved.

In a first aspect, an embodiment of the present application provides a training method for a prediction model of an antibody, including:

acquiring N first antibody sequences, wherein N is a positive integer, and binding sites of the first antibody sequences and antigens are not marked in the first antibody sequences;

pre-training a prediction model using the N first antibody sequences to obtain a pre-trained prediction model, wherein the pre-trained prediction model is used for predicting the predicted value of a masked amino acid in an antibody sequence, and in the pre-training process of the prediction model, the learning frequency of the variable region of the first antibody sequence is higher than that of the non-variable region of the first antibody sequence.

In a second aspect, embodiments of the present application provide a method for predicting a value for a site to be modified in an antibody, comprising:

obtaining a target antibody sequence to be modified;

receiving the user's masking operation on the target site amino acid to be modified in the target antibody sequence;

in response to the masking operation, inputting the target antibody sequence with the target site amino acid masked into a pre-trained prediction model to obtain a predicted value of the target site amino acid output by the pre-trained prediction model;

the pre-trained prediction model is obtained by training on a first antibody sequence, the binding site between the first antibody sequence and an antigen is not marked in the first antibody sequence, and in the pre-training process of the prediction model, the learning frequency of the variable region of the first antibody sequence is higher than that of the non-variable region of the first antibody sequence.

In some embodiments, the obtaining of the target antibody sequence to be engineered comprises:

displaying an input box, and receiving the target antibody sequence to be modified input by the user in the input box.

In some embodiments, the inputting the target antibody sequence with the target site amino acid masked in response to the masking operation into a pre-trained prediction model to obtain a predicted value of the target site amino acid output by the pre-trained prediction model comprises:

when a prediction trigger operation by the user is detected, in response to the masking operation, inputting the target antibody sequence with the target site amino acid masked into a pre-trained prediction model to obtain the predicted value of the target site amino acid output by the pre-trained prediction model.

In a third aspect, the present embodiments provide a method for predicting an antibody binding site, comprising:

obtaining a target antibody sequence to be predicted;

inputting the target antibody sequence into a target prediction model to predict the binding site of the target antibody sequence and an antigen;

the target prediction model is obtained by pre-training on a first antibody sequence and then fine-tuning on a second antibody sequence, wherein the binding site between the first antibody sequence and an antigen is not marked in the first antibody sequence, the binding site between the second antibody sequence and the antigen is marked in the second antibody sequence, and in the pre-training process of the prediction model, the learning frequency of the variable region of the first antibody sequence is higher than that of the non-variable region of the first antibody sequence.

In some embodiments, the obtaining of the target antibody sequence to be predicted comprises:

displaying a prediction box, and receiving the target antibody sequence input by the user in the prediction box.

In some embodiments, the inputting the target antibody sequence into a target prediction model to predict the binding site of the target antibody sequence to an antigen comprises:

when a prediction trigger operation by the user is detected, inputting the target antibody sequence into a target prediction model and predicting the binding site between the target antibody sequence and the antigen.

In a fourth aspect, an embodiment of the present application provides a training apparatus for a prediction model of an antibody, including:

an obtaining unit, configured to obtain N first antibody sequences, where N is a positive integer, and a binding site between the first antibody sequence and an antigen is not marked in the first antibody sequence;

a pre-training unit, configured to pre-train the prediction model using the N first antibody sequences to obtain the pre-trained prediction model, where the pre-trained prediction model is used for predicting the predicted value of a masked amino acid in an antibody sequence, and in the pre-training process of the prediction model, the learning frequency of the variable region of the first antibody sequence is higher than that of the non-variable region of the first antibody sequence.

In a fifth aspect, the present embodiments provide a device for predicting a value for a site to be modified in an antibody, comprising:

an acquisition unit for acquiring a target antibody sequence to be modified;

a receiving unit, configured to receive the user's masking operation on the target site amino acid to be modified in the target antibody sequence;

a prediction unit, configured to, in response to the masking operation, input the target antibody sequence with the target site amino acid masked into a pre-trained prediction model and obtain a predicted value of the target site amino acid output by the pre-trained prediction model;

the pre-trained prediction model is obtained by training on a first antibody sequence, the binding site between the first antibody sequence and an antigen is not marked in the first antibody sequence, and in the pre-training process of the prediction model, the learning frequency of the variable region of the first antibody sequence is higher than that of the non-variable region of the first antibody sequence.

In a sixth aspect, the present embodiments provide a device for predicting an antibody binding site, comprising:

an acquisition unit for acquiring a target antibody sequence to be predicted;

a prediction unit, configured to input the target antibody sequence into a target prediction model, and predict a binding site of the target antibody sequence and an antigen;

the target prediction model is obtained by pre-training on a first antibody sequence and then fine-tuning on a second antibody sequence, wherein the binding site between the first antibody sequence and an antigen is not marked in the first antibody sequence, the binding site between the second antibody sequence and the antigen is marked in the second antibody sequence, and in the pre-training process of the prediction model, the learning frequency of the variable region of the first antibody sequence is higher than that of the non-variable region of the first antibody sequence.

In a seventh aspect, an embodiment of the present application provides a computing device, including a processor and a memory;

the memory for storing a computer program;

the processor is configured to execute the computer program to implement the method of any of the first to third aspects.

In an eighth aspect, the present application provides a computer-readable storage medium, which includes computer instructions, and when the instructions are executed by a computer, the computer implements the method according to any one of the first to third aspects.

In a ninth aspect, embodiments of the present application provide a computer program product, the program product comprising a computer program, the computer program being stored in a readable storage medium, the computer program being readable from the readable storage medium by at least one processor of a computer, the at least one processor executing the computer program to cause the computer to implement the method of any of the first to third aspects.

According to the model training, antibody modification and binding site prediction methods and devices of the embodiments of the present application, N first antibody sequences are used to pre-train a prediction model to obtain a pre-trained prediction model, where the binding sites between the first antibody sequences and antigens are not marked in the first antibody sequences, and the pre-trained prediction model is used to predict values for masked amino acids in an antibody sequence. Because unlabeled first antibody sequences exist in large numbers, pre-training on many first antibody sequences allows the prediction model to be fully trained; a model pre-trained in this unsupervised way has strong extensibility, which further improves its training accuracy, so that antibody-related prediction with an accurately pre-trained model is low in cost and high in efficiency. In addition, because modifications to an antibody sequence usually occur in its variable region, the variable region of the first antibody sequence is learned more heavily during pre-training to further improve the training accuracy of the prediction model.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.

FIG. 1 is a schematic diagram of the structure of an antibody sequence according to an embodiment of the present application;

FIG. 2 is a system architecture diagram according to an embodiment of the present application;

FIG. 3 is a schematic flow chart of a training method for a predictive model of an antibody according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram illustrating a training of a predictive model according to an embodiment of the present disclosure;

FIG. 5A is a block diagram of a BERT model according to an embodiment of the present application;

fig. 5B is a schematic diagram of a network structure of the BERT model according to an embodiment of the present application;

FIG. 5C is a schematic diagram of a training process of a target prediction model according to an embodiment of the present application;

FIG. 6 is a schematic flow chart of a training method for a predictive model of an antibody according to an embodiment of the present disclosure;

FIG. 7 is a schematic flow chart of a method for predicting an antibody binding site according to an embodiment of the present disclosure;

FIG. 8 is a schematic illustration of an interaction interface according to an embodiment of the present application;

FIG. 9 is a schematic flow chart of an antibody engineering method provided in an embodiment of the present application;

FIG. 10 is a schematic illustration of another interaction interface according to an embodiment of the present application;

FIG. 11A is a graph illustrating test results according to an embodiment of the present disclosure;

FIG. 11B is a graph showing the results of another test according to an embodiment of the present application;

FIG. 12 is a schematic structural diagram of a training apparatus for a prediction model of an antibody according to an embodiment of the present disclosure;

FIG. 13 is a schematic diagram of an antibody engineering apparatus provided in an embodiment of the present application;

FIG. 14 is a schematic structural diagram of a prediction device for antibody binding sites according to an embodiment of the present application;

fig. 15 is a block diagram of a computing device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.

It should be understood that, in the embodiments of the present application, "B corresponding to A" means that B is associated with A. In one implementation, B may be determined from A. It should also be understood that determining B from A does not mean that B is determined from A alone; B may also be determined from A and/or other information.

In the description of the present application, "plurality" means two or more than two unless otherwise specified.

In addition, in order to facilitate clear description of technical solutions of the embodiments of the present application, in the embodiments of the present application, terms such as "first" and "second" are used to distinguish the same items or similar items having substantially the same functions and actions. Those skilled in the art will appreciate that the terms "first," "second," etc. do not denote any order or quantity, nor do the terms "first," "second," etc. denote any order or importance.

The embodiments of the present application are applied to the field of artificial intelligence, and in particular to antibody-related prediction tasks such as model training, antibody modification, and the prediction of antibody-antigen binding sites.

In order to facilitate understanding of the embodiments of the present application, the related concepts related to the embodiments of the present application are first briefly described as follows:

antigen: molecules that can elicit an immune response in an organism are referred to as antigens. The antigen may be derived from outside the organism, such as a protein on a novel coronavirus; or may be a protein with a mutation produced in vivo, for example, by a tumor cell.

Antibody: under the stimulation of an antigen, B cells in an organism produce a protein molecule that binds to the antigen; after binding, an immune response is triggered to remove the antigen. An antibody protein sequence consists of 20 kinds of amino acids (as shown in FIG. 1), which fold to form a macromolecule with a 3D structure and biological activity.
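Before such a sequence can be fed to a prediction model, each of the 20 amino acids must be mapped to an integer token id. A minimal sketch: the vocabulary of 20 standard one-letter codes is standard biochemistry, while the function and variable names here are the editor's, not from the application.

```python
# The 20 standard amino acid one-letter codes
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode(seq):
    """Map an antibody sequence to integer token ids for model input."""
    return [AA_TO_ID[aa] for aa in seq]

encode("ACD")  # → [0, 1, 2]
```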

Variable region: also known as the CDRs (Complementarity-Determining Regions) (shown as the dark regions in FIG. 1), the variable region is the potential region where an antibody binds an antigen and has a highly flexible structure. B cells producing different antibodies can change the amino acids of this region through V(D)J gene recombination and somatic hypermutation, thereby enhancing binding to the antigen.

The V, D and J gene segments are clustered in the germline genome; gene segments must be selected from these clusters and recombined to encode a complete, functional Ig polypeptide chain, a process called gene rearrangement. Germline gene rearrangement occurs in a clearly timed order: the heavy chain variable region rearranges first, followed by the light chain. Upon antigen stimulation, the variable region gene is further linked to the constant region. Gene rearrangement is carried out mainly by a set of V(D)J recombinases (apart from a theoretically possible sister-chromatid exchange mechanism in the form of asymmetric exchange); their actions include recognition of the conserved sequences flanking the V, D and J gene segments, known as Recombination Signal Sequences (RSS), as well as DNA cleavage and repair.

At the germline DNA level, V-region gene recombination occurs first. The light chain V-region gene is formed by joining a V gene segment to a J gene segment. For the heavy chain V-region gene, a D gene segment is first joined to a J gene segment to form DJ, and a V gene segment is then joined to DJ to form VDJ, yielding a complete heavy chain V-region coding gene. The DNA is subsequently transcribed into primary RNA; at the RNA level, the C gene segment is linked to the VJ or VDJ gene by RNA splicing, and the leader (L) segment is linked to the VJ or VDJ gene in the same manner, forming mRNA. The heavy and light chain mRNAs are then translated into heavy and light chain proteins, which are post-translationally modified, and the light and heavy chains are linked by disulfide bonds to form Ig. The leader peptide encoded by the L sequence directs the Ig into the secretory pathway for secretion to the extracellular side, and the leader peptide is subsequently cleaved off.

Non-variable region: also known as the FWR (Framework Region) (shown as the uncolored regions in FIG. 1), this is the structural framework of the antibody; it is structurally stable and supports the molecule as a whole. The non-variable regions of different antibodies are highly similar, and their sequences and structures are highly conserved in evolution and resistant to change, hence the name.

Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the capabilities of perception, reasoning and decision-making.

Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big-data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.

Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specializes in studying how computers can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout the fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.

Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics; research in this field involves natural language, i.e., the language people use every day, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.

Natural language model: a large amount of human language text is converted into machine-readable representations through a statistical model, which are then used for cognition, understanding, and generation. Specific applications include machine translation, automatic question answering, and the like.

Pre-training: a language model is trained on a large amount of unlabeled text to obtain a set of model parameters; the model is then initialized with these parameters (a warm start), and the parameters are fine-tuned on the existing model architecture for a specific task to fit the labeled data provided by that task. Pre-training has been proven to give good results in both classification and sequence-labeling tasks of natural language processing.

Antibodies are important immune proteins responsible for recognizing foreign invaders or intrinsic variants, i.e., antigens, within an organism. When foreign species such as viruses and bacteria invade animals or humans, antibodies are generated to recognize them. These antibodies are also known as natural antibodies. Among natural antibodies, only a few have the high affinity, high specificity, stability, and solubility required for medicinal value; these can be used as drugs after mass production.

The most important property of an antibody is its affinity for binding to an antigen; the magnitude of the affinity is determined by the binding site of the antibody to the antigen. Other physicochemical properties of the antibody come second. The pharmaceutical industry needs to optimize these other physicochemical properties without changing the affinity, which means that recognizing the binding site of an antibody to an antigen is a very critical task. When the affinity of an antibody is not high enough, the binding site needs to be modified; when the affinity is high enough but the physicochemical properties are poor, non-binding sites need to be modified to improve those properties.

How to find good candidate antibodies among massive numbers of natural antibodies, and further identify their binding sites, is a main task of every large pharmaceutical company. In one possible implementation, the optimization of antibody drug candidates relies mainly on the experience of pharmaceutical chemists and is iteratively refined through trial-and-error validation; for example, the determination of antibody binding sites currently relies mainly on expensive structure-resolution experiments or time-consuming molecular knockout screening experiments. This requires extremely high manpower and material resources.

The greatest advantage of AI technology is that a large amount of training data can be digested in a short time through a self-learning process, achieving learning without explicit teaching.

Based on this, the embodiments of the present application use AI technology to identify the binding site of an antibody sequence. Specifically, a prediction model is trained with first antibody sequences. Because learning on the invariable region is mostly unproductive and contributes little to accuracy, and because, in antibody drug optimization and modification, modifications to the variable region are more frequent than modifications to the invariable region, the learning frequency of the variable region of the first antibody sequence is set higher than that of the invariable region during training. This makes the prediction model focus on learning the intrinsic rules of the variable region, while only a small amount of learning on the invariable region is needed to reach good prediction accuracy, improving both training efficiency and training accuracy. The trained prediction model can quickly and accurately predict the binding site of an antibody sequence and an antigen, at low recognition cost. That is, the embodiments of the present application use AI technology to assist in identifying the binding sites of antibody drugs, reducing manpower and material costs and improving the identification efficiency of antibody binding sites.

Application scenarios of the present application include, but are not limited to, the medical, biological, and scientific research fields, for example drug production, drug research and development, and vaccine research and development. The application rapidly and accurately identifies the binding site of an antibody sequence and an antigen; the whole identification process requires no human intervention, and the identification cost is low.

In some embodiments, the system architecture of embodiments of the present application is shown in fig. 2.

Fig. 2 is a schematic diagram of a system architecture according to an embodiment of the present application, which includes a user device 101, a data acquisition device 102, a training device 103, an execution device 104, a database 105, and a content library 106.

The data acquisition device 102 is configured to read training data from the content library 106 and store the read training data in the database 105. The training data referred to in the embodiments of the present application includes N first antibody sequences and M second antibody sequences.

In some embodiments, the user device 101 is configured to label the second antibody sequence in the database 105, i.e., to label the binding site of the second antibody sequence to the antigen.

The training device 103 trains the prediction model based on the training data maintained in the database 105, so that the trained target prediction model can accurately predict the binding sites of antibody sequences and antigens. The target prediction model obtained by the training device 103 may be applied to different systems or devices.

In fig. 2, the execution device 104 is configured with an I/O interface 107 for data interaction with an external device. For example, the target antibody sequence to be predicted, sent by the user device 101, is received through the I/O interface. The calculation module 109 in the execution device 104 processes the input target antibody sequence using the trained target prediction model, outputs the binding site of the target antibody sequence and the antigen, and sends the binding site to the user device 101 through the I/O interface.

The user device 101 may include a mobile phone, a tablet computer, a notebook computer, a palm computer, a Mobile Internet Device (MID), or other terminal devices with a browser installation function.

The execution device 104 may be a server.

For example, the server may be a rack server, a blade server, a tower server, or a cabinet server. The server may be an independent test server, or a test server cluster composed of a plurality of test servers.

In this embodiment, the execution device 104 is connected to the user device 101 through a network. The network may be a wireless or wired network, such as an Intranet, the Internet, Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, Wi-Fi, or another communication network.

It should be noted that fig. 2 is only a schematic diagram of a system architecture provided in an embodiment of the present application, and the positional relationships between the devices, modules, and the like shown in the diagram do not constitute any limitation. In some embodiments, the data acquisition device 102 may be the same device as the user device 101, the training device 103, or the execution device 104. The database 105 may be distributed over one server or a plurality of servers, and the content library 106 may likewise be distributed over one server or a plurality of servers.

The technical solutions of the embodiments of the present application are described in detail below with reference to some embodiments. The following several embodiments may be combined with each other and may not be described in detail in some embodiments for the same or similar concepts or processes.

First, a training process of the prediction model according to an embodiment of the present application will be described with reference to fig. 3.

Fig. 3 is a schematic flowchart of a training method for a prediction model of an antibody according to an embodiment of the present disclosure, as shown in fig. 3, including:

S301, obtaining N first antibody sequences.

Wherein N is a positive integer.

The execution subject of the embodiments of the present application is an apparatus having a model training function, such as an antibody binding site prediction apparatus, which may be a computing device, or a part of a computing device, such as a processor in a computing device. Illustratively, the prediction device of the antibody binding site can be the training device in fig. 2. Wherein the training device in fig. 2 may be understood as a computing device, or a processor in a computing device, etc.

For convenience of description, the following embodiments are described taking an execution subject as an example of a computing device.

It should be noted that an antibody is a very specific protein, and is a sequence consisting of a plurality of amino acids, and therefore, the antibody of the present embodiment can be understood as an antibody sequence.

Optionally, the first antibody sequence is a native antibody sequence.

In one example, N (e.g., more than 18 billion) natural antibody sequences are collected from the OAS (Observed Antibody Space) database, a vast collection of data from different animals and patients, generated under stimulation by a variety (e.g., more than 30) of different antigens. This large database can be considered an ideal sampling of the antibody sequence space. These N natural antibody sequences are used as the N first antibody sequences.

The N natural antibody sequences collected from the OAS database are all data whose binding sites are unknown, i.e., unlabeled data. Most of these antibody sequences are also structurally unknown, and their binding sites are unknown.

That is, in the embodiments of the present application, the first antibody sequence is unlabeled training data, i.e., the first antibody sequence is not labeled with the binding site of the first antibody sequence to the antigen.

S302, pre-training the prediction model by using the N first antibody sequences to obtain the pre-trained prediction model.

The pre-trained prediction model is used to predict the predicted values of masked amino acids in an antibody sequence.

Fig. 4 is a training diagram of a prediction model according to an embodiment of the present application. As shown in fig. 4, in the pre-training process, N first antibody sequences are used to pre-train the prediction model, where the first antibody sequences are unlabeled training data, and the prediction model is learned through an autonomous learning mechanism to obtain the pre-trained prediction model.

The prediction model in the embodiment of the present application is a deep neural network model. Since there are few antibody sequences with known binding sites, they are far from sufficient to train a robust deep neural network model, resulting in a low prediction accuracy (MCC, Matthews Correlation Coefficient) for the prediction model. Therefore, the prediction model is pre-trained using a large number of unlabeled first antibody sequences, so that it learns the internal structure of antibody sequences through a self-learning mechanism; the pre-trained prediction model can then predict the value of the amino acid at a masked position in an antibody sequence.

The embodiment of the present application does not limit the specific type of the prediction model; any deep neural network model capable of predicting the binding sites of antibody sequences may be used.

In one possible implementation, the prediction model of the embodiment of the present application is BERT (Bidirectional Encoder Representations from Transformers), which includes a plurality of bidirectional Transformers.

Fig. 5A is a framework diagram of a BERT model according to an embodiment of the present application, and fig. 5B is a schematic network structure diagram of the BERT model according to the embodiment of the present application, and it should be noted that Trm in fig. 5B represents a Transformer. As shown in fig. 5A, the BERT model is mainly composed of three parts: an embedding layer, an encoding layer, and a pooling layer. The network structure of the BERT model is briefly described below.

Embedding layer: converts the input sequence into a continuous distributed representation, i.e., converts the input sequence into word embeddings (word vectors).

In general, the input of BERT may be a sequence, such as an antibody sequence.

BERT first tokenizes the sequence with a tokenizer. The tokenizer performs rule-based tokenization on the sequence, followed by subword segmentation. Subword segmentation can compress the vocabulary, represent unknown words, capture internal structure information of words, and so on. The sequences in a data set are not necessarily of equal length; BERT solves this with a fixed input length (long sequences are truncated, short ones are padded). The first token of each sequence is always a special classification token ([CLS]), and the final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.
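The fixed-length input handling described above can be sketched as follows. This is an illustrative character-level tokenizer for amino-acid sequences; the vocabulary layout and special-token ids are assumptions for the sketch, not taken from the patent.

```python
# Minimal sketch (not the patent's implementation): character-level
# tokenization of an antibody sequence with BERT-style fixed-length
# input -- truncate long sequences, pad short ones, prepend [CLS].
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical vocabulary: special tokens first, then one id per amino acid.
VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[MASK]": 3}
VOCAB.update({aa: i + 4 for i, aa in enumerate(AMINO_ACIDS)})

def encode(sequence, max_len=16):
    """Map each amino acid to an integer id, truncating/padding to max_len."""
    tokens = ["[CLS]"] + list(sequence[: max_len - 2]) + ["[SEP]"]
    ids = [VOCAB[t] for t in tokens]
    ids += [VOCAB["[PAD]"]] * (max_len - len(ids))  # pad short sequences
    return ids

ids = encode("QVQLVQSG")       # toy heavy-chain fragment
print(len(ids))                # 16 -- every sequence maps to the same length
```

Here every amino acid becomes one token, matching the per-amino-acid tokens described below.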

After segmentation, each substring is treated as a token; for example, each amino acid in the antibody sequence serves as a token. The tokenizer maps these tokens to integer codes by means of a lookup table.

In some embodiments, a token is also referred to as a marker.

The entire sequence is represented by three types of encoding vectors: token encoding (also called token embedding), segment encoding (also called segment embedding), and position encoding (also called position embedding). The token encoding is the vector obtained after each token in the sequence is converted to its code; the segment encoding records which sequence each token belongs to, 0 for the first sequence and 1 for the second (note: the [CLS] token corresponds to code 0); the position encoding records the position of each token.

As shown in fig. 5B, the input embedding is denoted as E, the final hidden vector of the special [CLS] token is denoted as C, and the final hidden vector of the i-th input token is denoted as Ti.

For a given token, its input representation is constructed by summing the corresponding token embedding (token embedding), segment embedding (segmentation embedding) and position embedding (position embedding).
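The summation of the three embeddings can be sketched as follows; the table sizes and random values here are illustrative stand-ins for learned parameters, not the patent's actual configuration.

```python
import numpy as np

# Sketch of the BERT input representation: the vector for each token is
# the element-wise sum of its token, segment, and position embeddings.
rng = np.random.default_rng(0)
vocab_size, max_pos, n_segments, hidden = 30, 64, 2, 32  # illustrative sizes

token_table = rng.normal(size=(vocab_size, hidden))      # token embeddings
segment_table = rng.normal(size=(n_segments, hidden))    # segment embeddings
position_table = rng.normal(size=(max_pos, hidden))      # position embeddings

token_ids = np.array([1, 17, 21, 17, 13])    # e.g. [CLS] followed by amino acids
segment_ids = np.zeros_like(token_ids)       # single-sequence input -> all 0
positions = np.arange(len(token_ids))        # 0, 1, 2, ...

embeddings = (token_table[token_ids]
              + segment_table[segment_ids]
              + position_table[positions])
print(embeddings.shape)  # (5, 32): one hidden-size vector per input token
```

Each row of `embeddings` is the input representation E of one token, fed to the encoding layer.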

Encoding layer: performs a nonlinear transformation on the embedding vectors output by the embedding layer and extracts a feature representation from them.

Optionally, the coding layer is composed of a plurality of transformers with the same structure.

A Transformer is composed of an encoder and a decoder. Optionally, the encoder is composed of a plurality (e.g., 6) of identical layers with two sub-layers each: the first sub-layer is a multi-head attention layer, and the second is a feedforward neural network. The decoder is composed of a plurality (e.g., 6) of identical layers with three sub-layers each: the first is a masked multi-head attention layer, the second is a multi-head attention layer, and the third is a feedforward neural network.

Multi-head self-attention is the distinguishing feature of the Transformer. It enables the model to treat different inputs differently (i.e., assign different weights) regardless of spatial shape, size, and distance (i.e., whether the input vectors are arranged in a linear, planar, tree, graph, or other topological structure). In addition, the Transformer splits the computation of the vectors involved in attention into separate heads, improving representation capability.
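The core of each attention sub-layer can be sketched as single-head scaled dot-product self-attention; the dimensions and random weights below are illustrative, and a real Transformer runs several such heads in parallel.

```python
import numpy as np

# Sketch of scaled dot-product self-attention: every position attends to
# every other position with learned weights, regardless of distance.
def self_attention(x, wq, wk, wv):
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])           # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over positions
    return weights @ v                                # weighted mix of values

rng = np.random.default_rng(0)
seq_len, d = 5, 8                                     # toy sizes
x = rng.normal(size=(seq_len, d))                     # token representations
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, wq, wk, wv)
print(out.shape)  # (5, 8): one updated vector per position
```

Multi-head attention repeats this with independent projection matrices per head and concatenates the results.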

Pooling layer: extracts the representation corresponding to the [CLS] token, transforms it, and outputs it as the representation of the whole sequence; the feature representation of every other token output by the last encoding layer is passed through unchanged.

On the basis of the above description of the network structure of the prediction model, the pre-training process of the prediction model is described below.

The number of parameters in the BERT model is large; for the base BERT model, L = 12, H = 768, A = 12, for a total of about 110 million parameters, where L denotes the number of layers, H the hidden size, and A the number of self-attention heads. Training a BERT model with this many parameters requires pre-training on a large amount of unlabeled data.

As shown in fig. 5B, in the pre-training process, the prediction model is pre-trained using the N first antibody sequences; the process of training with each first antibody sequence is the same, so one first antibody sequence is taken as an example. Each amino acid in the first antibody sequence serves as an input. For example, the first antibody sequence comprises the amino acids AA1, AA2, …, AAN; these amino acids are input into the prediction model, and the prediction model is trained to obtain the pre-trained prediction model.

As shown in fig. 5B, the input amino acids AA1, AA2, …, AAN undergo an embedding process before serving as input to the encoder (e.g., a Transformer). Alternatively, the embedding process may be understood as the sum of the token embedding, segment embedding, and position embedding. Note that an amino acid can be understood as a word: for an input amino acid, the dictionary is searched to obtain the numerical index corresponding to that word (the token), and then a lookup table is searched with that numerical index to obtain the corresponding vector, which is the embedding.

In some embodiments, the S302 includes: and carrying out unsupervised pre-training on the prediction model by using the N first antibody sequences to obtain the pre-trained prediction model.

In some embodiments, the S302 includes:

S302-A, pre-training the prediction model by using N first antibody sequences based on a MASK strategy to obtain the pre-trained prediction model.

Taking one first antibody sequence as an example, one or several amino acids in the first antibody sequence are randomly masked, and the predicted values of the masked amino acids are predicted. The prediction model is then trained in reverse (by backpropagation) based on the loss between the predicted and true values of the masked amino acids.

For example, as shown in fig. 5B, assume the amino acid AA2 in the first antibody sequence is masked, i.e., AA2 is replaced by MASK or other data, and the sequence is input to the prediction model. The prediction model predicts the value of the masked amino acid AA2 from the context of the first antibody sequence and outputs it; i.e., the MASK output by the prediction model in fig. 5B is the predicted value of the amino acid at the corresponding position. Following this example, amino acids at other positions in the first antibody sequence are masked, their values are predicted, and the prediction model is trained in reverse based on the loss between the predicted and true values of the amino acids, to obtain the pre-trained prediction model.
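The masked-prediction step can be sketched as follows. The mask id, vocabulary size, and the uniform "model output" are illustrative stand-ins; a real prediction model would produce the probabilities itself.

```python
import numpy as np

# Sketch of the MASK training strategy: randomly replace amino-acid
# tokens with a [MASK] id, then score the model's predictions only at
# the masked positions with cross-entropy.
MASK_ID = 3                       # hypothetical [MASK] token id
rng = np.random.default_rng(42)

def mask_tokens(token_ids, mask_rate=0.15):
    ids = token_ids.copy()
    masked = rng.random(len(ids)) < mask_rate   # which positions to hide
    ids[masked] = MASK_ID
    return ids, masked

def masked_loss(probs, true_ids, masked):
    # cross-entropy restricted to the masked positions
    return -np.log(probs[masked, true_ids[masked]]).mean()

true_ids = np.array([17, 21, 17, 13, 21, 17, 19, 9])   # toy amino-acid ids
masked_ids, masked = mask_tokens(true_ids, mask_rate=0.5)
probs = np.full((len(true_ids), 24), 1 / 24)  # dummy uniform model output
loss = masked_loss(probs, true_ids, masked)   # equals -log(1/24) here
```

In training, this loss would be backpropagated through the model; here the dummy probabilities only illustrate the bookkeeping.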

Antibodies are a very special class of proteins with specific properties and characteristics that need to be considered during model training to better serve the task. One special feature of antibodies compared with general proteins is that antibody sequences are divided into variable regions (CDRs) and invariable regions (FWRs). An antibody consists of a heavy chain and a light chain; the functional domain in contact with the antigen (Fv domain) generally comprises three variable regions, flanked by four invariable regions, as shown in fig. 1. In antibody drug optimization and modification, the variable region is modified more frequently than the invariable region, so the model should focus on learning the internal rules of the variable region, while the invariable region can be predicted well with only a small amount of learning.

Based on this, in the process of pre-training the prediction model using the first antibody sequences, the learning frequency of the variable region of the first antibody sequence is set higher than that of the invariable region, so that the prediction model learns the characteristics of the variable region with emphasis. In later use, the trained prediction model can then accurately predict the characteristic information of the variable region in an antibody sequence.

In one possible implementation, the S302-A includes S302-A1 and S302-A2:

S302-A1, for each first antibody sequence in the N first antibody sequences, masking amino acids in a variable region of the first antibody sequence according to a first masking frequency, and masking amino acids in an invariable region of the first antibody sequence according to a second masking frequency to obtain a predicted value of the masked amino acids predicted by a prediction model;

S302-A2, pre-training the prediction model according to the loss between the predicted value and the true value of the masked amino acid, and obtaining the pre-trained prediction model.

Wherein the first masking frequency is greater than the second masking frequency.

According to the embodiment of the application, different regions of the first antibody sequence are masked with different probabilities according to their characteristics, so that the prediction model can better allocate its learning weight, improving the training efficiency of the prediction model and enabling accurate training.

Specifically, in the pre-training process, the amino acids in the variable region of the first antibody sequence are masked at the first masking frequency, and the amino acids in the invariable region are masked at the second masking frequency. Since the binding site of an antibody to an antigen is usually in the variable region, as shown in fig. 5A, the amino acids in the variable region are masked more often so that the prediction model learns the characteristics of the variable region with emphasis, while the amino acids in the invariable region are masked less often to improve the training speed of the prediction model.

It should be noted that, in the embodiment of the present application, specific values of the first masking frequency and the second masking frequency are not limited, but the first masking frequency is greater than the second masking frequency.

In one possible implementation, the ratio of the sequence lengths of the invariable regions (FWR) to the variable regions (CDR) is about 5:2, and the ratio of their mutation rates is about 1:10. For example, the mask learning weight ratio of the invariable region to the variable region may be set to 1:4, i.e., the first masking rate of the variable region is 20%, and the second masking rate of the invariable region is 5%.
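The region-weighted masking can be sketched as a per-position masking rate; the CDR layout below is a toy assumption, and in practice the CDR/FWR boundaries would come from an antibody numbering scheme.

```python
import numpy as np

# Sketch of region-weighted masking: variable-region (CDR) positions are
# masked at 20%, invariable-region (FWR) positions at 5%, so the model
# focuses its learning on the variable region.
rng = np.random.default_rng(0)

def region_mask(is_cdr, cdr_rate=0.20, fwr_rate=0.05):
    rates = np.where(is_cdr, cdr_rate, fwr_rate)   # per-position mask rate
    return rng.random(len(is_cdr)) < rates

# Toy layout: a repeating FWR/CDR pattern standing in for real annotations.
is_cdr = np.array([0, 0, 1, 1, 1, 0, 0] * 20, dtype=bool)
mask = region_mask(is_cdr)
print(mask[is_cdr].mean(), mask[~is_cdr].mean())  # CDR masked far more often
```

Averaged over many sequences, roughly four times as many CDR positions as FWR positions end up masked, matching the 1:4 weight ratio above.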

In some embodiments, the S302 includes: removing repeated first antibody sequences in the N first antibody sequences to obtain P first antibody sequences, wherein P is a positive integer smaller than N; and pre-training the prediction model by using the P first antibody sequences to obtain the pre-trained prediction model. The process of using P first antibody sequences to pre-train the prediction model to obtain the pre-trained prediction model refers to the above description, and is not repeated here.
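The deduplication step above can be sketched in a few lines; the helper name and toy sequences are illustrative only.

```python
# Sketch of removing repeated first antibody sequences: keep the first
# occurrence of each sequence, preserving order, so that the model is
# pre-trained on P <= N unique sequences.
def dedupe(sequences):
    seen, unique = set(), []
    for seq in sequences:
        if seq not in seen:
            seen.add(seq)
            unique.append(seq)
    return unique

n_sequences = ["QVQLVQSG", "EVQLVESG", "QVQLVQSG", "DIQMTQSP"]  # N = 4
p_sequences = dedupe(n_sequences)                               # P = 3
print(p_sequences)  # ['QVQLVQSG', 'EVQLVESG', 'DIQMTQSP']
```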

In the embodiment of the application, N first antibody sequences are used to pre-train the prediction model to obtain the pre-trained prediction model, where the binding sites of the first antibody sequences and antigens are not marked in the first antibody sequences, and the pre-trained prediction model is used to predict the values of masked amino acids in antibody sequences. Because the number of unlabeled first antibody sequences is large, pre-training the prediction model on a large number of first antibody sequences allows the model to be fully trained, gives the unsupervised pre-trained model strong generalization ability, and thereby improves the training accuracy of the prediction model. In addition, since modification of an antibody sequence usually occurs in its variable region, the variable region of the first antibody sequence is learned with emphasis during pre-training to further improve the training accuracy of the prediction model.

The above describes a pre-training process of the prediction model, and the embodiment of the present application further includes a process of fine-tuning the pre-trained prediction model.

Fig. 6 is a flowchart illustrating a method for training a prediction model for an antibody according to an embodiment of the present disclosure, as shown in fig. 6, including:

S401, obtaining N first antibody sequences, wherein N is a positive integer, and binding sites of the first antibody sequences and antigens are not marked in the first antibody sequences.

S402, pre-training the prediction model by using the N first antibody sequences to obtain a pre-trained prediction model, wherein the pre-trained prediction model is used for predicting the prediction value of the amino acid covered in the antibody sequences.

The specific implementation process of S401 and S402 refers to the description of S301 and S302 above, and is not described herein again.

S403, obtaining M second antibody sequences, wherein M is a positive integer, and the binding sites of the second antibody sequences and the antigen are marked in the second antibody sequences.

S404, fine-tuning the pre-trained prediction model by using M second antibody sequences to obtain a target prediction model, wherein the target prediction model is used for predicting the binding sites of the antibody sequences and the antigens.

It should be noted that S403 and S401 are not executed in a fixed order; that is, S403 may be executed before S401, after S401, or simultaneously with S401.

In some embodiments, N is greater than M; that is, in embodiments of the present application, the number of first antibody sequences used to train the prediction model is greater than the number of second antibody sequences. This is because obtaining the binding site of an antibody requires resolving the structure of the antibody-antigen complex, and protein structure resolution is a very time-consuming and labor-intensive scientific endeavor. As a result, few antibody sequences with known binding sites have been collected so far, for example only 1662. In the present embodiment, M antibody sequences are selected from these antibody sequences with known binding sites as the second antibody sequences, for example M = 1662. It should be noted that 1662 is only an example; the second antibody sequences in the present embodiment may be selected from those 1662 antibody sequences or obtained by other means. That is, the present embodiment does not limit the manner of obtaining the second antibody sequences, as long as each second antibody sequence is an antibody sequence with a known binding site.

The prediction model in the embodiment of the present application is a deep neural network model. Since there are few antibody sequences with known binding sites, they are far from sufficient to train a robust deep neural network model, resulting in a low prediction accuracy (MCC, Matthews Correlation Coefficient) for the prediction model. Therefore, in the embodiment of the application, the prediction model is pre-trained using a large number of unlabeled first antibody sequences, and the pre-trained prediction model is fine-tuned using a small number of second antibody sequences to obtain the target prediction model, thereby improving the prediction accuracy of the target prediction model.

Fig. 5C is a schematic diagram of the training process of the target prediction model according to an embodiment of the present application. Referring to fig. 5C, it includes a pre-training section and a fine-tuning section. The pre-training section is described in the above embodiments. After the pre-trained prediction model is obtained, it is fine-tuned using labeled second antibody sequences: the binding sites are labeled in the second antibody sequences input to the pre-trained prediction model, so that the model learns its task of obtaining the binding sites of antibody sequences; the pre-trained prediction model then learns from the second antibody sequences with known binding sites to predict the binding sites of second antibody sequences.

Specifically, taking one second antibody sequence as an example, each amino acid in the second antibody sequence is used as input to the pre-trained prediction model, which outputs a predicted value of whether each amino acid binds to the antigen. For example, in fig. 5C the output corresponding to amino acid AA1 is 0, indicating that the pre-trained prediction model predicts that AA1 does not bind to the antigen, while the output corresponding to amino acid AA2 is Bind, indicating that the model predicts that AA2 binds to the antigen. Because the binding sites of the second antibody sequence are known, the parameters of the pre-trained prediction model are fine-tuned according to the loss between the binding sites predicted by the model and the true binding sites of the second antibody sequence, so as to obtain a target prediction model with high prediction precision; the target prediction model can accurately predict the binding site of an antibody sequence and an antigen.
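The fine-tuning objective can be sketched as a per-amino-acid binary classifier on top of the encoder's hidden states. The random hidden states here stand in for a pre-trained encoder's outputs, and the sizes, head weights, and labels are illustrative assumptions.

```python
import numpy as np

# Sketch of the fine-tuning step: a linear head turns each per-token
# hidden state T_i into P(amino acid binds to the antigen), scored
# against the known 0/1 binding-site labels with binary cross-entropy.
rng = np.random.default_rng(1)
seq_len, hidden = 10, 32

hidden_states = rng.normal(size=(seq_len, hidden))  # stand-in encoder output
w = rng.normal(size=hidden) * 0.1                   # classification head
b = 0.0

logits = hidden_states @ w + b
probs = 1 / (1 + np.exp(-logits))                   # per-position P(bind)
labels = np.array([0, 1, 1, 0, 0, 0, 1, 0, 0, 0])   # known binding sites

bce = -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)).mean()
print(probs.shape)  # (10,): one binding probability per amino acid
```

During fine-tuning, this loss would be backpropagated through both the head and the pre-trained encoder parameters.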

In some embodiments, the second antibody sequence is labeled with a tag sequence at the binding site of the second antibody sequence to the antigen, wherein the tag sequence is the same length as the second antibody sequence, and each value in the tag sequence indicates whether the corresponding amino acid of the value binds to the antigen.

Alternatively, if an amino acid in the second antibody sequence binds to the antigen, the tag value at that amino acid position is 1; if the amino acid does not bind to the antigen, the tag value at that position is 0.

Whether the antibody binds to the antigen is determined from the structural complex of the antigen and the antibody. Illustratively, when the distance between an amino acid of the antigen and an amino acid of the antibody is smaller than 5 Å (i.e. 5 × 10^-10 m), the two amino acids are considered to bind.
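The distance criterion above can be sketched as follows; for illustration each amino acid is reduced to a single 3-D coordinate (in Å), and the coordinates are hypothetical rather than taken from a real structural complex.

```python
import math

ANGSTROM_CUTOFF = 5.0  # 5 Å = 5 × 10^-10 m, the cutoff named in the text

def is_binding_pair(coord_a, coord_b, cutoff=ANGSTROM_CUTOFF):
    """Two amino acids (each represented by one 3-D coordinate in Å)
    are considered bound when their distance is below the cutoff."""
    return math.dist(coord_a, coord_b) < cutoff

# hypothetical coordinates from an antigen-antibody complex
antigen_res = (10.0, 2.0, 3.0)
antibody_res = (12.5, 2.0, 3.0)   # 2.5 Å apart -> binding
print(is_binding_pair(antigen_res, antibody_res))  # True
```

In practice a residue has many atoms, so a real pipeline would typically take the minimum inter-atomic distance; the single-point version is a simplification.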

In the embodiment of the application, the prediction model takes the pre-trained parameters as initial values, is fine-tuned on the labeled data, and finally gives the probability of each amino acid in the sequence being a binding site.

In the embodiment of the application, N first antibody sequences and M second antibody sequences are obtained, where N and M are positive integers; the binding sites of the first antibody sequences and the antigen are not marked in the first antibody sequences, and the binding sites of the second antibody sequences and the antigen are marked in the second antibody sequences. The prediction model is pre-trained using the N first antibody sequences to obtain a pre-trained prediction model, and the pre-trained prediction model is fine-tuned using the M second antibody sequences to obtain a target prediction model. That is, a large number of unlabeled first antibody sequences are used for pre-training and a small number of labeled second antibody sequences are used for fine-tuning, and the resulting target prediction model can be used to predict the binding sites of antibody sequences and antigens.

The above describes the training process of the prediction model, and the following describes the use process of the prediction model.

The prediction model of the embodiment of the present application can be applied to at least the following two scenarios. In scenario 1, the binding sites of an antibody sequence are predicted using the target prediction model. In scenario 2, the predicted value of a certain position in an antibody sequence is predicted using the pre-trained prediction model.

It should be noted that the user can select between the two scenarios according to actual needs; for example, options for scenario 1 and scenario 2 are displayed on the interactive interface, and the user selects scenario 1 or scenario 2 as needed.

The following first describes, with reference to fig. 7, the specific process in scenario 1 of predicting the binding sites of an antibody sequence using the target prediction model.

Fig. 7 is a schematic flowchart of a method for predicting an antibody binding site according to an embodiment of the present application, as shown in fig. 7, including:

S601, obtaining a target antibody sequence to be predicted.

S602, inputting the target antibody sequence into a target prediction model, and predicting the binding site of the target antibody sequence and the antigen.

The target prediction model is obtained by pre-training with first antibody sequences and then fine-tuning with second antibody sequences, where the binding sites of the first antibody sequence and the antigen are not marked in the first antibody sequence, and the binding sites of the second antibody sequence and the antigen are marked in the second antibody sequence. For details, reference is made to the description of the above embodiments, which is not repeated herein.

Alternatively, the variable region of the first antibody sequence has a higher learning frequency than the non-variable region of the first antibody sequence.

In some embodiments, the present application further provides an interactive interface for the user, on which a prediction box is displayed. The prediction box may be understood as an input box in which the user may enter the target antibody sequence to be predicted. That is, S601 above includes: displaying the prediction box, and receiving the target antibody sequence input by the user in the prediction box.

In some embodiments of the present application, the above S602 includes the following S602-A1 and S602-A2:

S602-A1, when the prediction trigger operation of the user is detected, inputting the target antibody sequence into a target prediction model, and predicting the binding site of the target antibody sequence and the antigen.

Optionally, when the binding site of the target antibody sequence and the antigen predicted by the target prediction model is obtained, the binding site of the target antibody sequence and the antigen is displayed.

Illustratively, as shown in fig. 8, the interactive interface includes a prediction box in which a user can input a target antibody sequence to be predicted, a trigger button such as a submit button, and a display area for displaying a binding site of the target antibody sequence predicted by the target prediction model and an antigen.

Specifically, as shown in fig. 8, when the user needs to know the binding sites of the target antibody sequence to be predicted and the antigen, the user inputs the target antibody sequence in the prediction box and clicks the submit button. When the prediction triggering operation of the user is detected, the computing device inputs the target antibody sequence into the target prediction model, so that the target prediction model predicts the binding sites of the target antibody sequence and the antigen, and displays the predicted binding sites in the display area. In FIG. 8, the light regions are the binding sites of the target antibody sequence and the antigen, which are K, NTV, and RSGYYGVF, respectively.

In this embodiment, a target antibody sequence to be predicted is input into the trained target prediction model, which predicts the binding sites of the target antibody sequence. The whole process is simple, which reduces the expenditure of manpower and material resources. Moreover, because the target prediction model has high prediction precision, the prediction accuracy of the binding sites is improved.

The following describes, with reference to fig. 9, scenario 2, in which the predicted value of a certain position in an antibody sequence is predicted using the pre-trained prediction model.

Fig. 9 is a schematic flow chart of an antibody modification method provided in an embodiment of the present application, as shown in fig. 9, including:

S801, obtaining a target antibody sequence to be modified;

S802, receiving a masking operation of a user on a target site amino acid to be modified in the target antibody sequence;

S803, in response to the masking operation, inputting the target antibody sequence with the target site amino acid masked into the pre-trained prediction model to obtain a predicted value of the target site amino acid predicted by the pre-trained prediction model.

The pre-trained prediction model is obtained by training a first antibody sequence, and the binding site of the first antibody sequence and the antigen is not marked in the first antibody sequence. The pre-training process of the prediction model refers to the description of the above embodiments, and is not repeated herein.

Optionally, in the training process of the pre-trained prediction model, the learning frequency of the variable region of the first antibody sequence is higher than the learning frequency of the non-variable region of the first antibody sequence.

In some cases, for example after a pharmaceutical factory has experimentally screened candidate antibody sequences, it is desirable to optimize them: for example, modifying a murine site to reduce immunogenicity, modifying an oxidizable site to facilitate storage and preservation, modifying a charged site to avoid the adhesion-caused aggregation that reduces solubility, and the like. Under these conditions, the computing device acquires a target antibody sequence to be modified, receives a masking operation of a user on the target site amino acid to be modified in the target antibody sequence, and, in response to the masking operation, inputs the target antibody sequence with the target site amino acid masked into the pre-trained prediction model to obtain the predicted value of the target site amino acid predicted by the pre-trained prediction model.

In some embodiments, the present application provides an interactive interface for a user, as shown in fig. 10, the interactive interface includes an input box into which the user inputs a target antibody sequence to be modified, and a display area, and the computing device displays the target antibody sequence input by the user in the input box. The user performs a masking operation, specifically, the user replaces the target site amino acid to be modified in the target antibody sequence with "[ MASK ]". Then, the user clicks the submit option, the computing device receives a masking operation of the user on the target site amino acid of the target antibody sequence, and in response to the masking operation, the target antibody sequence with the target site amino acid replaced with "[ MASK ]" is input into the pre-trained prediction model, and the pre-trained prediction model predicts the predicted value of the replaced target site amino acid in combination with the context information.

In some embodiments, S803 above, namely inputting the target antibody sequence with the target site amino acid masked into the pre-trained prediction model to obtain the predicted value of the target site amino acid predicted by the pre-trained prediction model, includes the following S803-A1 and S803-A2:

S803-A1, inputting the target antibody sequence with the target site amino acid masked into the pre-trained prediction model, and obtaining, for each of Q preset values, the probability value when the target site amino acid predicted by the pre-trained prediction model is replaced by the preset value, where Q is a positive integer;

S803-A2, displaying the first K preset values with the largest probability values among the Q preset values as the predicted values of the target site amino acid, where K is a positive integer less than or equal to Q.

In this embodiment, Q preset values are preset, and the pre-trained prediction model predicts, for each of the Q preset values, the probability value when the target site amino acid replaced with "[ MASK ]" is that preset value. For example, as shown in FIG. 10, the probability value when the amino acid at the target site is G is 0.547, the probability value for S is 0.238, for F is 0.074, for V is 0.031, for Y is 0.025, and so on. The sum of the probability values corresponding to the Q preset values is 1. According to the magnitude of the probability values, the first K preset values with the largest probability values are selected from the Q preset values as the predicted values of the target site amino acid; for example, the preset value G with the largest probability value is selected as the predicted value, or, as shown in FIG. 10, the first 5 preset values with the largest probability values are selected as the predicted values of the target site amino acid and displayed in the display area. Optionally, when the number of predicted values of the target site amino acid is greater than 1, the predicted values may be displayed in order of probability value from large to small.
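The top-K selection described above amounts to sorting the Q probability values and keeping the K largest. A minimal sketch, where the probability table mirrors the illustrative values above:

```python
def top_k_predictions(probs, k=5):
    """Sort the Q preset values by predicted probability and keep the
    K with the largest probability values."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

# hypothetical model output for a masked target-site amino acid
probs = {"G": 0.547, "S": 0.238, "F": 0.074, "V": 0.031, "Y": 0.025}
for aa, p in top_k_predictions(probs, k=3):
    print(aa, p)  # G, then S, then F, in descending probability
```

The returned pairs already carry the per-value probabilities, which matches the optional display of probability values alongside the top-K candidates.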

In some embodiments, as shown in fig. 10, in addition to the first K preset values for the target site amino acid, the display area may display the probability value of each of the K preset values. Thus, the user can conveniently select the target predicted value of the target site amino acid from the K preset values according to the probability values.

In some embodiments, as shown in fig. 10, after the user masks the target site amino acid in the target antibody sequence, a prediction triggering operation is performed, for example, clicking a "submit" option in fig. 10, and when the prediction triggering operation of the user is detected, the computing device inputs the target antibody sequence with the target site amino acid masked into the pre-trained prediction model in response to the masking operation of the user, so as to obtain the predicted value of the target site amino acid output by the pre-trained prediction model.

In this embodiment, when the antibody needs to be modified, a target antibody sequence to be modified is obtained; a masking operation of a user on the target site amino acid of the target antibody sequence is received; and, in response to the masking operation, the target antibody sequence with the target site amino acid masked is input into the pre-trained prediction model to obtain the predicted value of the target site amino acid predicted by the pre-trained prediction model. The embodiment of the application thus reduces the difficulty of modifying an antibody sequence and improves the modification efficiency.

The technical effects of the embodiments of the present application will be further described below with reference to specific tests.

For scenario 2, namely predicting the value of a certain position in an antibody sequence using the pre-trained prediction model, a first number (for example, 100) of antibody sequences not used for training are selected, part of the amino acids of these sequences are masked, and it is tested whether the error of the prediction model of the present application in predicting the masked amino acids is lower than that of the protein sequence pre-training model ProtTrans (developed by the Technical University of Munich (TUM) together with organizations such as Google and NVIDIA). As shown in FIG. 11A, the prediction error (measured by cross-entropy loss) of the prediction model of the present application is significantly lower than that of the ProtTrans model, and a Wilcoxon rank-sum nonparametric test gives a p-value of 4 × 10^-15. Therefore, the embodiment of the application makes an outstanding contribution to prediction-model learning of antibody sequences.

For scenario 1, namely predicting the binding sites of an antibody sequence using the trained prediction model, a five-fold cross-validation method is adopted: 1023 labeled sequences are divided into five equal parts, four parts are used for training, and the remaining part is used for testing. This is repeated five times. As shown in fig. 11B, the average MCC (Matthews correlation coefficient) of the prediction performance is 0.976, where MCC is calculated as follows:

MCC = (TP × TN - FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

wherein TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
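The MCC formula can be computed directly from the four confusion-matrix counts; the counts below are illustrative, not results from the test above.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.
    Returns 0.0 when the denominator vanishes, a common convention."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# a perfect classifier gives MCC = 1.0
print(mcc(tp=10, tn=10, fp=0, fn=0))       # 1.0
# illustrative imperfect counts
print(round(mcc(tp=90, tn=95, fp=5, fn=10), 3))  # 0.851
```

MCC ranges from -1 to 1 and, unlike accuracy, remains informative when binding and non-binding residues are heavily imbalanced, which is why it is a common choice for binding-site prediction.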

As shown in FIG. 11B, the target prediction model of the embodiment of the present application improves by over 50% compared with proABC, an LSTM baseline model, Parapred, and AG-Fast-Parapred.

The preferred embodiments of the present application have been described in detail with reference to the accompanying drawings, however, the present application is not limited to the details of the above embodiments, and various simple modifications can be made to the technical solution of the present application within the technical idea of the present application, and these simple modifications are all within the protection scope of the present application. For example, the various features described in the foregoing detailed description may be combined in any suitable manner without contradiction, and various combinations that may be possible are not described in this application in order to avoid unnecessary repetition. For example, various embodiments of the present application may be arbitrarily combined with each other, and the same should be considered as the disclosure of the present application as long as the concept of the present application is not violated.

It should also be understood that, in the various method embodiments of the present application, the sequence numbers of the above-mentioned processes do not imply an execution sequence, and the execution sequence of the processes should be determined by their functions and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.

Method embodiments of the present application are described in detail above in conjunction with fig. 3-10, and apparatus embodiments of the present application are described in detail below in conjunction with fig. 12-14.

Fig. 12 is a schematic structural diagram of a training apparatus for a prediction model of an antibody according to an embodiment of the present application. The training apparatus 20 may be a computing device or a component of a computing device (e.g., an integrated circuit, a chip, etc.) for performing the model training method described above. As shown in fig. 12, the training apparatus 20 includes:

An obtaining unit 21, configured to obtain N first antibody sequences, where N is a positive integer, and a binding site between the first antibody sequence and an antigen is not marked in the first antibody sequence;

and a training unit 22, configured to pre-train the prediction model by using the N first antibody sequences to obtain a pre-trained prediction model, where the pre-trained prediction model is used to predict a prediction value of a masked amino acid in an antibody sequence.

In some embodiments, the variable region of the first antibody sequence has a higher learning frequency than the non-variable region of the first antibody sequence during pre-training of the predictive model.

In some embodiments, the training unit 22 is specifically configured to perform unsupervised pre-training on the prediction model by using the N first antibody sequences, so as to obtain a pre-trained prediction model.

In some embodiments, the training unit 22 is specifically configured to pre-train the prediction model by using the N first antibody sequences based on a MASK strategy, so as to obtain a pre-trained prediction model.

In some embodiments, the training unit 22 is specifically configured to mask, for each of the N first antibody sequences, amino acids in the variable region of the first antibody sequence according to a first masking frequency, and mask amino acids in the non-variable region of the first antibody sequence according to a second masking frequency, so as to obtain predicted values of the masked amino acids predicted by the prediction model; and pre-train the prediction model according to the loss between the predicted values and the true values of the masked amino acids to obtain the pre-trained prediction model.
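The dual-frequency masking strategy described here can be sketched as follows; the masking rates, the example sequence, and the variable-region span are assumptions for illustration, not values from the application.

```python
import random

MASK = "[MASK]"

def mask_sequence(seq, variable_region, p_var=0.3, p_const=0.1, rng=None):
    """Mask amino acids in the variable region at a higher frequency
    (p_var, the first masking frequency) than in the non-variable
    region (p_const, the second masking frequency)."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    out = []
    for i, aa in enumerate(seq):
        p = p_var if i in variable_region else p_const
        out.append(MASK if rng.random() < p else aa)
    return out

seq = "EVQLVESGGGLVQPGG"         # hypothetical first antibody sequence
variable = set(range(4, 10))     # hypothetical variable-region span
print(mask_sequence(seq, variable))
```

Because p_var > p_const, the variable region contributes more masked positions per pass, which is how the pre-training emphasizes learning the variable region.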

Optionally, the first masking frequency is greater than the second masking frequency.

In some embodiments, the obtaining unit 21 is further configured to obtain M second antibody sequences, where M is a positive integer, and the second antibody sequences are labeled with binding sites of the second antibody sequences to antigens;

the training unit 22 is further configured to use the M second antibody sequences to perform fine tuning on the pre-trained prediction model to obtain a target prediction model, where the target prediction model is used to predict binding sites of the antibody sequences and the antigens.

In some embodiments, the training unit 22 is specifically configured to, for each second antibody sequence in the M second antibody sequences, input the second antibody sequence into the pre-trained prediction model to obtain a predicted binding site of the second antibody sequence and an antigen predicted by the pre-trained prediction model; and finely adjusting the pre-trained prediction model according to the predicted loss between the binding site of the second antibody sequence and the antigen and the predicted actual value of the binding site of the second antibody sequence and the antigen to obtain a target prediction model.

In some embodiments, the second antibody sequence is labeled with a tag sequence at the binding site of the second antibody sequence to the antigen, wherein the tag sequence is of the same length as the second antibody sequence, and each value in the tag sequence indicates whether the amino acid corresponding to that value binds to the antigen.

Optionally, M is less than N.

It is to be understood that apparatus embodiments and method embodiments may correspond to one another and that similar descriptions may refer to method embodiments. To avoid repetition, further description is omitted here. Specifically, the training device shown in fig. 12 may correspond to a corresponding main body in executing the model training method according to the embodiment of the present application, and the foregoing and other operations and/or functions of each module in the training device are respectively for implementing corresponding processes in each method in the above model training, and are not described herein again for brevity.

Fig. 13 is a schematic structural diagram of an antibody modification apparatus provided in an embodiment of the present application. The antibody modification device 30 may be a computing device or a component of a computing device (e.g., an integrated circuit, a chip, etc.) for performing the antibody modification methods described above. As shown in fig. 13, the antibody modification device 30 includes:

An obtaining unit 31 for obtaining a target antibody sequence to be modified;

a receiving unit 32, configured to receive a masking operation of the target site amino acid to be modified in the target antibody sequence by the user;

a prediction unit 33, configured to, in response to the masking operation, input the target antibody sequence with the target site amino acid masked into a pre-trained prediction model, so as to obtain a prediction value of the target site amino acid output by the pre-trained prediction model;

the pre-trained prediction model is obtained by training a first antibody sequence, and the binding site of the first antibody sequence and an antigen is not marked in the first antibody sequence.

In some embodiments, the prediction unit 33 is specifically configured to input the target antibody sequence with the masked target site amino acid into the pre-trained prediction model, and obtain, for each of Q preset values, a probability value when the target site amino acid output by the pre-trained prediction model is replaced by the preset value, where Q is a positive integer; and displaying the first K preset values with the maximum probability value in the Q preset values as the predicted values of the target site amino acids, wherein K is a positive integer less than or equal to Q.

In some embodiments, the apparatus further includes a display unit 34, and the display unit 34 is configured to display a probability value of each of the first K preset values.

In some embodiments, the display unit 34 is also used to display an input box; a receiving unit 32, further configured to receive the target antibody sequence to be modified, which is input by the user in the input box.

In some embodiments, the variable region of the first antibody sequence has a higher learning frequency than the non-variable region of the first antibody sequence during training of the pre-trained predictive model.

In some embodiments, the prediction unit 33 is specifically configured to, when the prediction triggering operation of the user is detected, input the target antibody sequence with the target site amino acid masked into a pre-trained prediction model in response to the masking operation, and obtain a predicted value of the target site amino acid output by the pre-trained prediction model.

It is to be understood that apparatus embodiments and method embodiments may correspond to one another and that similar descriptions may refer to method embodiments. To avoid repetition, further description is omitted here. Specifically, the antibody modification apparatus shown in fig. 13 may correspond to a corresponding main body for performing the antibody modification method in the embodiment of the present application, and the foregoing and other operations and/or functions of each module in the antibody modification apparatus are respectively for realizing corresponding processes in each method in the antibody modification, and are not described herein again for brevity.

FIG. 14 is a schematic structural diagram of a prediction device for antibody binding sites provided in the embodiments of the present application. The prediction device 40 may be a computing device or a component of a computing device (e.g., an integrated circuit, a chip, etc.) for performing the above-described method for predicting antibody binding sites. As shown in fig. 14, the prediction device 40 includes:

An acquisition unit 41 for acquiring a target antibody sequence to be predicted;

a prediction unit 42, configured to input the target antibody sequence into a target prediction model, and predict a binding site of the target antibody sequence and an antigen;

the target prediction model is obtained by pre-training a first antibody sequence and then finely adjusting a second antibody sequence, wherein the first antibody sequence is not marked with a binding site of the first antibody sequence and an antigen, and the second antibody sequence is marked with a binding site of the second antibody sequence and the antigen.

In some embodiments, the prediction apparatus further comprises a display unit 43 for displaying the binding site of the target antibody sequence to an antigen predicted by the target prediction model.

In some embodiments, the prediction apparatus further comprises a receiving unit 44;

the display unit 43 is used for displaying the prediction frame;

and a receiving unit 44, configured to receive the target antibody sequence input by the user in the prediction box.

In some embodiments, the prediction unit 42 is specifically configured to, when the prediction triggering operation of the user is detected, input the target antibody sequence into a target prediction model to predict the binding site of the target antibody sequence to the antigen.

It is to be understood that apparatus embodiments and method embodiments may correspond to one another and that similar descriptions may refer to method embodiments. To avoid repetition, further description is omitted here. Specifically, the prediction apparatus shown in fig. 14 may correspond to the corresponding subject performing the prediction method of the embodiment of the present application, and the foregoing and other operations and/or functions of the respective modules in the prediction apparatus 40 are respectively for implementing the corresponding procedures in the respective methods in the antibody binding sites, and are not described herein again for brevity.

The apparatus of the embodiments of the present application is described above in connection with the drawings from the perspective of functional modules. It should be understood that the functional modules may be implemented by hardware, by instructions in software, or by a combination of hardware and software modules. Specifically, the steps of the method embodiments in the present application may be implemented by integrated logic circuits of hardware in a processor and/or instructions in the form of software, and the steps of the method disclosed in conjunction with the embodiments in the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in a processor. Alternatively, the software modules may be located in random access memory, flash memory, read only memory, programmable read only memory, electrically erasable programmable memory, registers, and the like, as is well known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps in the above method embodiments in combination with hardware thereof.

Fig. 15 is a block diagram of a computing device according to an embodiment of the present application, where the computing device may be the server shown in fig. 1, and is used to execute the method according to the foregoing embodiment, specifically referring to the description in the foregoing method embodiment.

The computing device 200 shown in fig. 15 includes a memory 201, a processor 202, and a communication interface 203, which are communicatively connected to each other. For example, the memory 201, the processor 202, and the communication interface 203 may be connected through a network. Alternatively, the computing device 200 may also include a bus 204, through which the memory 201, the processor 202, and the communication interface 203 are connected to each other.

The memory 201 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 201 may store a program, and when the program stored in the memory 201 is executed by the processor 202, the processor 202 and the communication interface 203 are used to perform the above-described methods.

The processor 202 may be implemented as a general purpose Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU), or one or more Integrated circuits.

The processor 202 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the method of the present application may be completed by an integrated logic circuit of hardware in the processor 202 or by instructions in the form of software. The processor 202 may also be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 201, and the processor 202 reads the information in the memory 201 and completes the method of the embodiment of the application in combination with its hardware.

The communication interface 203 enables communication between the computing device 200 and other devices or communication networks using transceiver modules such as, but not limited to, transceivers. For example, the data set may be acquired through the communication interface 203.

When computing device 200 includes bus 204, as described above, bus 204 may include a pathway to transfer information between various components of computing device 200 (e.g., memory 201, processor 202, communication interface 203).

There is also provided according to the present application a computer storage medium having stored thereon a computer program which, when executed by a computer, enables the computer to perform the method of the above-described method embodiments. In other words, the present application also provides a computer program product containing instructions, which when executed by a computer, cause the computer to execute the method of the above method embodiments.

There is also provided according to the present application a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method of the above-described method embodiment.

In other words, when the above embodiments are implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored on a computer-readable storage medium, or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via a wired connection (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or a wireless connection (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a Digital Video Disk (DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.

Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of modules is merely a logical function division, and other divisions may be used in practice; for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling, direct coupling, or communication connection may be an indirect coupling or communication connection through some interfaces, devices, or modules, and may be in an electrical, mechanical, or other form.

Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. For example, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses, and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In addition, the method embodiments and the apparatus embodiments may also refer to each other, and the same or corresponding contents in different embodiments may be cross-referenced without being described in detail.
