Method, device, medium and apparatus for generating protein sequence of high-thermal-stability enzyme

Document No.: 139174 | Publication date: 2021-10-22

Abstract: This technology, "Method, device, medium and apparatus for generating protein sequence of high-thermal-stability enzyme" (高热稳定性酶的蛋白序列生成方法、装置、介质和设备), was created by 罗小舟 and 余函 on 2021-06-29. The invention discloses a method, a device, a medium and equipment for generating a protein sequence of a high-thermal-stability enzyme. The method comprises: obtaining a training sample, wherein the training sample comprises protein sequence data of a specific class of enzymes whose tolerance temperature is greater than a predetermined value; training a pre-constructed generative adversarial network model with the training sample to obtain a protein sequence generation model; and generating a batch of protein sequence data using the protein sequence generation model. Because the generative adversarial model is trained on existing protein sequence data of the specific class of enzymes whose tolerance temperature is greater than the predetermined value, the resulting protein sequence generation model can generate sequences of that class of enzymes with high thermal stability in batches. The method is simple, starts only from protein sequences, and completes the design on a computer. Experimental verification can further be carried out according to the similarity distribution, so the results are more reliable and easier to analyze.

1. A method for producing a protein sequence of a thermostable enzyme, comprising:

obtaining a training sample, wherein the training sample comprises protein sequence data of a specific enzyme class with a tolerance temperature larger than a preset value;

training a pre-constructed generative adversarial network model by using the training sample to obtain a protein sequence generation model;

generating a batch of protein sequence data using the protein sequence generation model.

2. The method for generating protein sequences of enzymes with high thermal stability according to claim 1, wherein the generative adversarial network model comprises a generator and a discriminator, and the training of the pre-constructed generative adversarial network model by using the training sample to obtain the protein sequence generation model comprises:

inputting random noise into a generator, outputting generated data by the generator, and selecting partial data from the training samples as real data;

inputting the generated data and the real data into the discriminator together, and outputting a discrimination result by the discriminator;

adjusting network parameters of the generator and the discriminator according to a discrimination result to finish a round of training;

and repeating the training steps until a preset training condition is met to obtain the protein sequence generation model.

3. The method of generating a protein sequence of a thermostable enzyme according to claim 2, wherein the method of obtaining a training sample comprises:

obtaining proteome sequences of various microorganisms with tolerance temperature greater than a predetermined value;

determining target enzymes in the same category as the specific enzymes from the proteome sequences, and extracting target protein sequences of the target enzymes;

and clustering the target protein sequences by using a sequence clustering algorithm to obtain protein sequences of a plurality of clusters, and selecting the protein sequences of clusters whose size is larger than a threshold value as training samples.

4. The method of claim 1 or 3, wherein the training sample further comprises protein sequence data of an initial sample enzyme belonging to the same class as the specific class of enzyme, and before training the pre-constructed generative adversarial network model with the training sample, the method further comprises:

pre-training a generative adversarial network model by using the protein sequence data of the initial sample enzyme to obtain the pre-constructed generative adversarial network model.

5. The method of claim 4, wherein the specific class of enzyme and the initial sample enzyme both belong to the enzymes under any one fourth-level EC number in the Enzyme Commission nomenclature.

6. The method of claim 1, further comprising the steps of:

comparing each protein sequence in the batch of protein sequence data to partial protein sequence data, respectively, to determine a similarity of each protein sequence;

sorting the protein sequences in descending order of similarity, and dividing them into a plurality of intervals;

selecting a plurality of protein sequences from each interval, and synthesizing and expressing each selected protein sequence to produce the corresponding enzyme;

measuring the melting temperature of each enzyme, and screening out the protein sequences corresponding to enzymes whose melting temperature is greater than or equal to a preset temperature as new protein sequences with high thermal stability.

7. The method of claim 6, wherein the comparing each protein sequence in the batch of protein sequence data with the partial protein sequence data to determine the similarity of each protein sequence comprises:

respectively calculating the similarity between each protein sequence and each sequence in the partial protein sequence data to obtain a group of similarity data;

and taking the maximum similarity in the set of similarity data as the similarity of each protein sequence.

8. An apparatus for producing a protein sequence of a thermostable enzyme, comprising:

a sample acquisition unit that acquires a training sample including protein sequence data of a specific class of enzymes whose tolerance temperature is greater than a predetermined value;

a training unit configured to train a pre-constructed generative adversarial network model by using the training sample to obtain a protein sequence generation model;

and a batch generation unit configured to generate a batch of protein sequence data by using the protein sequence generation model.

9. A computer-readable storage medium, wherein a thermostable enzyme protein sequence generation program is stored, and when executed by a processor, the thermostable enzyme protein sequence generation program implements the thermostable enzyme protein sequence generation method according to any one of claims 1 to 7.

10. A computer device comprising a computer readable storage medium, a processor, and a thermostable enzyme protein sequence generation program stored in the computer readable storage medium, wherein the thermostable enzyme protein sequence generation program when executed by the processor implements the thermostable enzyme protein sequence generation method according to any one of claims 1 to 7.

Technical Field

The invention belongs to the technical field of biological medicines, and particularly relates to a method for generating a protein sequence of a high-thermal-stability enzyme, a protein sequence generating device, a computer readable storage medium and computer equipment.

Background

Enzymes with high thermal stability play an extremely important role in fields such as biofuels and biochemical engineering. Traditionally, such enzymes are isolated from thermophilic bacteria and put into industrial use after experimental verification, improvement and optimization. However, the number of enzymes that can be obtained by isolation is limited and cannot meet the needs of increasingly diverse industrial scenarios, so the de novo design of brand-new enzymes with high thermal stability has become more important. At present there are two main approaches: rational design, which mainly relies on structure-based modification, and directed evolution. Both have limitations, and the number of high-thermal-stability enzymes they yield is limited. Rational design requires detailed knowledge of the enzyme structure and related information as well as familiarity with existing modification methods; the modification process is complex and difficult to apply in batches. Directed evolution builds a random mutation library and screens the corresponding enzymes from it, but the success rate is low, the workload is huge, and batch generation is difficult. Moreover, neither approach allows systematic comparative analysis of the sequences from a theoretical point of view.

Therefore, there is a need for a method that can design novel enzymes with high thermal stability in batches.

Disclosure of Invention

(I) technical problems to be solved by the invention

The technical problem solved by the invention is as follows: how to rapidly generate protein sequences of enzymes with high thermal stability in batches.

(II) the technical scheme adopted by the invention

A method for producing a protein sequence of a thermostable enzyme, comprising:

obtaining a training sample, wherein the training sample comprises protein sequence data of a specific enzyme class with a tolerance temperature larger than a preset value;

training a pre-constructed generative adversarial network model by using the training sample to obtain a protein sequence generation model;

generating a batch of protein sequence data using the protein sequence generation model.

Preferably, the generative adversarial network model comprises a generator and a discriminator, and the specific method for training the pre-constructed generative adversarial network model by using the training sample to obtain the protein sequence generation model comprises the following steps:

inputting random noise into a generator, outputting generated data by the generator, and selecting partial data from the training samples as real data;

inputting the generated data and the real data into the discriminator together, and outputting a discrimination result by the discriminator;

adjusting network parameters of the generator and the discriminator according to a discrimination result to complete a round of training;

and repeating the training steps until a preset training condition is met to obtain the protein sequence generation model.

Preferably, the method of obtaining training samples comprises:

obtaining proteome sequences of various microorganisms with tolerance temperature greater than a predetermined value;

determining target enzymes in the same category as the specific enzymes from the proteome sequences, and extracting target protein sequences of the target enzymes;

and clustering the target protein sequences by using a sequence clustering algorithm to obtain protein sequences of a plurality of clusters, and selecting the protein sequences of clusters whose size is larger than a threshold value as training samples.

Preferably, the training sample further includes protein sequence data of an initial sample enzyme in the same class as the specific class of enzyme, and before the training of the pre-constructed generative adversarial network model by using the training sample, the protein sequence generation method further includes:

pre-training a generative adversarial network model by using the protein sequence data of the initial sample enzyme to obtain the pre-constructed generative adversarial network model.

Preferably, the specific class of enzyme and the initial sample enzyme both belong to the enzymes under any one fourth-level EC number in the Enzyme Commission nomenclature.

Preferably, the method for generating a protein sequence further comprises:

comparing each protein sequence in the batch of protein sequence data to the partial protein sequence data, respectively, to determine a similarity of each protein sequence;

sorting the protein sequences in descending order of similarity, and dividing them into a plurality of intervals;

selecting a plurality of protein sequences from each interval, and synthesizing and expressing each selected protein sequence to produce the corresponding enzyme;

measuring the melting temperature of each enzyme, and screening out the protein sequences corresponding to enzymes whose melting temperature is greater than or equal to a preset temperature as new protein sequences with high thermal stability.

Preferably, the method of comparing each protein sequence in the batch of protein sequence data with the partial protein sequence data to determine the similarity of each protein sequence comprises:

respectively calculating the similarity between each protein sequence and each sequence in the partial protein sequence data to obtain a group of similarity data;

and taking the maximum similarity in the set of similarity data as the similarity of each protein sequence.
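Illustratively, and only as a minimal sketch, the maximum-similarity rule above can be written as follows. The similarity measure itself is not specified here, so the sketch assumes a simple position-wise identity between sequences padded to a common length. The function names `identity` and `max_similarity` and the `-` padding character are hypothetical.

```python
def identity(a: str, b: str, pad: str = "-") -> float:
    """Fraction of identical positions after right-padding to equal length.

    Assumption: position-wise identity stands in for whatever
    alignment-based similarity measure is used in practice.
    """
    n = max(len(a), len(b))
    if n == 0:
        return 0.0
    a, b = a.ljust(n, pad), b.ljust(n, pad)
    # Padding characters never count as a match.
    matches = sum(1 for x, y in zip(a, b) if x == y and x != pad)
    return matches / n


def max_similarity(generated: str, references: list[str]) -> float:
    """Similarity of one generated sequence: the best score over all references."""
    return max(identity(generated, ref) for ref in references)
```

Taking the maximum means each generated sequence is characterized by its closest known sequence, which is what the subsequent division into similarity intervals relies on.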

The application also discloses a device for generating a protein sequence of a high-thermal-stability enzyme, which comprises:

a sample acquisition unit that acquires a training sample including protein sequence data of a specific class of enzymes whose tolerance temperature is greater than a predetermined value;

a training unit configured to train a pre-constructed generative adversarial network model by using the training sample to obtain a protein sequence generation model;

and a batch generation unit configured to generate a batch of protein sequence data by using the protein sequence generation model.

The application also discloses a computer readable storage medium, which stores a protein sequence generation program of the high heat stability enzyme, and the protein sequence generation program of the high heat stability enzyme realizes the protein sequence generation method of the high heat stability enzyme when being executed by a processor.

The present application also discloses a computer device comprising a computer readable storage medium, a processor, and a protein sequence generation program of a thermostable enzyme stored in the computer readable storage medium, the protein sequence generation program of a thermostable enzyme realizing the above-mentioned protein sequence generation method of a thermostable enzyme when executed by the processor.

(III) advantageous effects

The invention discloses a method for generating a protein sequence of a high-thermal-stability enzyme, which has the following technical effects compared with the traditional generation method:

Existing protein sequence data of a specific class of enzymes whose tolerance temperature is greater than a predetermined value is used to train a generative adversarial model and obtain a protein sequence generation model, which can generate sequences of the specific class of enzymes with high thermal stability in batches. The generation method is simple, starts only from protein sequences, and completes the design on a computer. Experimental verification can further be carried out according to the similarity distribution, so the results are more reliable and easier to analyze.

Drawings

FIG. 1 is a flowchart of a method for producing a protein sequence of a thermostable enzyme according to a first embodiment of the present invention;

FIG. 2 is another flowchart of the method for generating a protein sequence according to the first embodiment of the present invention;

FIG. 3 is a flowchart of a method for producing a protein sequence of a thermostable enzyme according to a second embodiment of the present invention;

FIG. 4 is a schematic view of a protein sequence generating apparatus for a thermostable enzyme according to a third embodiment of the present invention;

FIG. 5 is a schematic diagram of a computer device according to a fifth embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Before describing the embodiments of the present application in detail, the inventive concept is briefly described: existing approaches to designing high-thermal-stability enzymes require knowledge of the enzyme structure and related information, involve a complex modification process, and are difficult to apply in batches. In the method for generating a protein sequence of a high-thermal-stability enzyme provided here, protein sequence data of a specific class of enzymes whose tolerance temperature is greater than a predetermined value is screened out as a training sample, and a pre-constructed generative adversarial network model is trained with the training sample to obtain a protein sequence generation model, so that the model learns the basic features of high-thermal-stability enzymes. Finally, the trained protein sequence generation model is used to generate protein sequence data in batches.

Specifically, as shown in FIG. 1, the method for generating a protein sequence of a thermostable enzyme according to the first embodiment includes the following steps:

Step S10, obtaining a training sample, wherein the training sample comprises protein sequence data of a specific class of enzymes whose tolerance temperature is greater than a predetermined value;

Step S20, training a pre-constructed generative adversarial network model by using the training sample to obtain a protein sequence generation model;

Step S30, generating a batch of protein sequence data by using the protein sequence generation model.

Specifically, the main objective of step S10 is to construct a protein sequence database of thermostable enzymes to provide training samples. Step S10 includes: obtaining proteome sequences of microorganisms with a tolerance temperature greater than a predetermined value; determining, from the proteome sequences, target enzymes in the same class as the specific class of enzymes, and extracting the target protein sequences of the target enzymes; and clustering the target protein sequences by using a sequence clustering algorithm to obtain protein sequences of a plurality of clusters, and selecting the protein sequences of clusters whose size is larger than a threshold value as training samples.

Illustratively, the growth temperatures of a large number of microorganisms can be obtained from an existing database, such as the Martin KM Engqvist database or any other database containing microbial growth temperatures. A suitable predetermined value is then set, and microorganisms with a tolerance temperature above it, for example above 40 ℃, are selected. Next, the proteome sequences of these microorganisms are extracted from a proteome database, such as UniProt/Proteomes or another database containing proteomes, and the target protein sequences of target enzymes in the same class as the specific class of enzymes are extracted from all the proteomes. Finally, the target protein sequences are clustered with a sequence clustering algorithm, such as MMseqs2 or another suitable sequence clustering method, using a reasonable similarity threshold in the range 0.0-1.0, such as 0.5, and the protein sequences of clusters whose size is larger than the threshold value are selected from the target protein sequences as training samples.
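Illustratively, the two filtering operations in step S10 can be sketched as follows. This is a minimal sketch under stated assumptions: growth temperatures are already loaded into a dictionary, and the clustering result is available as (cluster representative, member) pairs, a layout that sequence clustering tools commonly emit. The function names, the 40 ℃ default and the input layouts are hypothetical.

```python
from collections import defaultdict


def select_thermophiles(growth_temp: dict[str, float], min_temp: float = 40.0) -> set[str]:
    """Keep organisms whose recorded growth temperature (deg C) exceeds min_temp."""
    return {organism for organism, t in growth_temp.items() if t > min_temp}


def select_by_cluster_size(assignments: list[tuple[str, str]], min_size: int) -> list[str]:
    """Return members of every cluster whose size is larger than min_size.

    assignments: (cluster_representative, member_id) pairs; this two-column
    layout is an assumption about how the clustering output is stored.
    """
    clusters: dict[str, list[str]] = defaultdict(list)
    for representative, member in assignments:
        clusters[representative].append(member)
    return [m for members in clusters.values() if len(members) > min_size for m in members]
```

Discarding small clusters keeps the training set focused on sequence families that are well represented among the selected microorganisms.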

Further, to improve the accuracy of the protein sequence generation model, the protein sequences used as training samples should be highly similar to one another. In this embodiment, the specific class of enzymes and the target enzymes both belong to the enzymes under any one fourth-level EC number in the Enzyme Commission nomenclature, for example both belong to EC 3.5.4.5, cytidine deaminase, so that large differences among training samples do not degrade the training effect. Many enzymes fall under a given fourth-level EC number; their protein sequences differ, but they catalyze the same type of reaction and are highly similar, so the protein sequence generation model obtained by training is highly accurate.
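Illustratively, the same-class condition can be checked by comparing complete four-level EC numbers. The following is a minimal sketch; the function names are hypothetical, and incomplete numbers (for example with a `-` placeholder) or preliminary numbers containing letters are simply rejected.

```python
def parse_ec(ec: str) -> tuple[int, ...]:
    """Parse a fully specified four-level EC number such as '3.5.4.5'."""
    parts = ec.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a complete four-level EC number: {ec!r}")
    return tuple(int(p) for p in parts)


def same_fourth_level(ec_a: str, ec_b: str) -> bool:
    """True if both enzymes carry the same complete four-level EC number."""
    return parse_ec(ec_a) == parse_ec(ec_b)
```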

Further, after step S10, the training samples are preprocessed: the protein sequence data of the specific class of enzymes is aligned to a uniform length by padding all protein sequences with a non-amino-acid character, and the aligned protein sequence data is converted into one-hot encoded form as input data suitable for model training.
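Illustratively, the padding and one-hot encoding can be sketched as follows using plain Python lists. The `-` padding character, the ordering of the dictionary and the function names are assumptions made for illustration only.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
PAD = "-"                             # assumed non-amino-acid padding character
CHARMAP = AMINO_ACIDS + PAD           # dictionary of 21 symbols
INDEX = {c: i for i, c in enumerate(CHARMAP)}


def pad_sequences(seqs, seq_len=None):
    """Right-pad every sequence to seq_len (default: the longest sequence).

    Sequences longer than seq_len are truncated in this sketch.
    """
    seq_len = seq_len or max(len(s) for s in seqs)
    return [s[:seq_len].ljust(seq_len, PAD) for s in seqs]


def one_hot(seqs):
    """Encode padded sequences as nested lists of shape [N, seq_len, len(CHARMAP)]."""
    encoded = []
    for s in seqs:
        rows = []
        for c in s:
            row = [0] * len(CHARMAP)
            row[INDEX[c]] = 1  # exactly one position is set per residue
            rows.append(row)
        encoded.append(rows)
    return encoded
```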

In step S20, the generative adversarial network model includes a generator and a discriminator, and the method for training the pre-constructed generative adversarial network model by using the training samples includes: inputting random noise into the generator, outputting generated data by the generator, and selecting partial data from the training samples as real data; inputting the generated data and the real data into the discriminator together, and outputting a discrimination result by the discriminator; adjusting network parameters of the generator and the discriminator according to the discrimination result to complete a round of training; and repeating the training steps until a preset training condition is met to obtain the protein sequence generation model. Illustratively, the generative adversarial network model employs a WGAN-GP network.
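For reference, the objectives of WGAN-GP as commonly stated in the literature are given below. The notation is the standard one and is not taken from this document: D is the discriminator (critic), G the generator, z the random noise, x a real sample, and λ the gradient penalty weight, for which no value is specified here.

```latex
L_D = \mathbb{E}_{z \sim p(z)}\big[D(G(z))\big]
      - \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[D(x)\big]
      + \lambda\, \mathbb{E}_{\hat{x}}\Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big],
\qquad \hat{x} = \epsilon x + (1-\epsilon)\, G(z),\quad \epsilon \sim U[0,1]

L_G = -\,\mathbb{E}_{z \sim p(z)}\big[D(G(z))\big]
```

In each round the discriminator parameters are adjusted to decrease L_D and the generator parameters to decrease L_G, which corresponds to adjusting the network parameters according to the discrimination result.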

During each round of training, the input format of the real data is [Batch_size, Seq_Len, Charmap], where Batch_size is the number of sequences input to the model each time, Seq_Len is the unified sequence length, and Charmap is the dimension of the dictionary, which comprises the 20 amino acids and the padding character. The input of the generator is random noise drawn from the standard normal distribution. After multiple rounds of iterative learning, the generator can produce data similar to the real protein sequences of the specific class of enzymes, which completes the extraction of the high-thermal-stability features of the specific class of enzymes and the training of the protein sequence generation model.
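Illustratively, the generator input and the conversion of generator output back into sequences can be sketched as follows. This is a minimal sketch, not the actual implementation: the noise dimension `noise_dim`, the `-` padding character, the dictionary ordering and the function names are assumptions, and the generator network itself is omitted.

```python
import random

CHARMAP = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus an assumed '-' padding character
PAD = "-"


def sample_noise(batch_size, noise_dim, rng=None):
    """Draw generator input of shape [batch_size, noise_dim] from N(0, 1)."""
    rng = rng or random.Random()
    return [[rng.gauss(0.0, 1.0) for _ in range(noise_dim)] for _ in range(batch_size)]


def decode(batch):
    """Convert scores of shape [Batch_size, Seq_Len, Charmap] into sequence strings.

    The highest-scoring symbol is taken at each position, and padding
    characters are removed from the result.
    """
    sequences = []
    for positions in batch:
        chars = [CHARMAP[max(range(len(scores)), key=scores.__getitem__)]
                 for scores in positions]
        sequences.append("".join(chars).replace(PAD, ""))
    return sequences
```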

As a preferred embodiment, as shown in FIG. 2, in order to further verify the reliability of the batch-generated protein sequences, the protein sequence generation method further comprises the following steps:

Step S40, comparing each protein sequence in the batch of protein sequence data with the partial protein sequence data, respectively, to determine a similarity of each protein sequence. Illustratively, the method for determining the similarity of each protein sequence comprises the following steps: respectively calculating the similarity between each protein sequence and each sequence in the partial protein sequence data to obtain a group of similarity data; and taking the maximum similarity in the group of similarity data as the similarity of each protein sequence.

Step S50, sorting the protein sequences in descending order of similarity, and dividing them into a plurality of intervals, for example an interval with a similarity of 90% or more, an interval with a similarity of 80% to 90%, an interval with a similarity of 70% to 80%, and so on.

Step S60, selecting a plurality of protein sequences from each interval, and synthesizing and expressing each selected protein sequence to produce the corresponding enzyme;

Step S70, measuring the melting temperature of each enzyme, and screening out the protein sequences corresponding to enzymes whose melting temperature is greater than or equal to a preset temperature as new protein sequences with high thermal stability.
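Illustratively, the computational parts of steps S50 to S70 can be sketched as follows. This is a minimal sketch under stated assumptions: similarities are given as fractions in [0, 1], candidates are drawn at random within each interval, and melting temperatures are supplied from experiment. The function names, the interval edges and the random selection are hypothetical.

```python
import random


def bin_by_similarity(scored, edges=(0.9, 0.8, 0.7)):
    """Sort (sequence, similarity) pairs from high to low and divide them into intervals.

    edges: descending lower bounds of the intervals; pairs below the last
    edge are collected under the key 0.0.
    """
    ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
    bins = {edge: [] for edge in edges}
    bins[0.0] = []
    for seq, sim in ordered:
        for edge in edges:
            if sim >= edge:
                bins[edge].append((seq, sim))
                break
        else:
            bins[0.0].append((seq, sim))
    return bins


def pick_per_bin(bins, k, rng=None):
    """Select up to k candidates from each interval (random choice is an assumption)."""
    rng = rng or random.Random()
    return {edge: rng.sample(items, min(k, len(items))) for edge, items in bins.items()}


def screen_by_tm(measured_tm, min_tm):
    """Keep sequences whose measured melting temperature (deg C) is at least min_tm."""
    return [seq for seq, tm in measured_tm.items() if tm >= min_tm]
```

Sampling from every interval rather than only from the most similar sequences allows the experimental results to be analyzed as a function of similarity to known sequences.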

The method for generating the protein sequence of the high-thermal-stability enzyme disclosed by the embodiment can be used for generating the sequence of the high-thermal-stability specific enzyme in batches, is simple, can be designed on a computer only from the protein sequence, can be used for further experimental verification according to similarity distribution, and is higher in result reliability and easy to analyze.

Further, the main difference between the method for generating a protein sequence of a high-thermal-stability enzyme in the second embodiment and that in the first embodiment is that the model is pre-trained, so that the model parameters are better initialized and the effect of having little protein sequence data for the specific class of enzymes whose tolerance temperature is greater than the predetermined value is compensated. Specifically, as shown in FIG. 3, the method for generating a protein sequence of a thermostable enzyme according to the second embodiment includes the following steps:

Step S10', obtaining a training sample, wherein the training sample comprises protein sequence data of a specific class of enzymes whose tolerance temperature is greater than a predetermined value and protein sequence data of an initial sample enzyme in the same class as the specific class of enzymes;

Step S20', pre-training a generative adversarial network model by using the protein sequence data of the initial sample enzyme to obtain the pre-constructed generative adversarial network model, and then training the pre-constructed generative adversarial network model by using the protein sequence data of the specific class of enzymes to obtain a protein sequence generation model;

Step S30', generating a batch of protein sequence data by using the protein sequence generation model.

Specifically, when little protein sequence data is available for the specific class of enzymes whose tolerance temperature is greater than the predetermined value, other enzymes in the same class as the specific class of enzymes may be selected as initial sample enzymes, and the generative adversarial network model is pre-trained with the protein sequence data of the initial sample enzymes so that the model first learns the basic features of the specific class of enzymes. As a preferred embodiment, to avoid large differences between the initial sample enzyme and the specific class of enzymes, both are required to belong to the enzymes under any one fourth-level EC number in the Enzyme Commission nomenclature, for example both belong to EC 3.5.4.5, cytidine deaminase. The tolerance temperature of the initial sample enzyme need not be considered: the protein sequences of any other enzymes under the same fourth-level EC number as the specific class of enzymes can be used for pre-training.

Illustratively, the protein sequences of the initial sample enzyme are obtained as follows: first, all naturally occurring sequences of the initial sample enzyme are downloaded from a protein database, such as UniProt or another database from which protein sequences can be obtained; then, using the classical five-kingdom classification (prokaryotes, protists, fungi, plants and animals) as the basis for division, the target sequences of the initial sample enzyme belonging to one of the kingdoms are extracted. In the second embodiment, the protein sequences of the initial sample enzyme in the prokaryote kingdom are preferably used for pre-training.

In step S20', the generative adversarial network model is pre-trained with the protein sequence data of the initial sample enzyme as follows: the generator of the generative adversarial network model produces generated data from random noise, and partial data is selected from the protein sequence data of the initial sample enzyme as real data; the generated data and the real data are input together into the discriminator of the generative adversarial network model to obtain a discrimination result; the network parameters of the discriminator and the generator are updated according to the discrimination result to complete a round of training; and the training steps are repeated until the initial training condition is met, yielding the pre-constructed generative adversarial network model. The pre-constructed generative adversarial network model is then further trained with the protein sequence data of the specific class of enzymes whose tolerance temperature is greater than the predetermined value to obtain the final protein sequence generation model. For details of the data preprocessing, the data input format and the like in the training process of step S20', refer to the training process of step S20 in the first embodiment, which is not repeated here.
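Illustratively, the two-stage schedule of step S20' can be sketched as follows. This is a minimal sketch under stated assumptions: a single adversarial update is abstracted into a hypothetical callback `train_round(params, real_batch)` that returns updated parameters, and a fixed number of rounds stands in for the initial and preset training conditions, which are not specified here.

```python
import random


def run_rounds(params, real_data, rounds, batch_size, train_round, rng=None):
    """Repeat training rounds on one dataset and return the updated parameters.

    train_round(params, real_batch) -> params is a hypothetical callback that
    performs one generator/discriminator update on a batch of real sequences.
    """
    rng = rng or random.Random()
    for _ in range(rounds):
        # Select partial data from the dataset as the real data for this round.
        real_batch = rng.sample(real_data, min(batch_size, len(real_data)))
        params = train_round(params, real_batch)
    return params


def pretrain_then_finetune(params, initial_enzyme_seqs, thermostable_seqs,
                           pre_rounds, fine_rounds, batch_size, train_round, rng=None):
    """Pre-train on initial sample enzyme data, then continue on thermostable data."""
    rng = rng or random.Random()
    params = run_rounds(params, initial_enzyme_seqs, pre_rounds, batch_size, train_round, rng)
    return run_rounds(params, thermostable_seqs, fine_rounds, batch_size, train_round, rng)
```

The parameters returned by the first stage are passed directly into the second stage, so the second stage starts from the pre-trained state rather than from a fresh initialization.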

Further, step S30 'in the second embodiment is the same as step S30 in the first embodiment, and after step S30', the method for generating a protein sequence further includes the steps S40 to S70 in the first embodiment, which are not repeated herein.

Compared with the method of the first embodiment, the method for generating a protein sequence of a high-thermal-stability enzyme disclosed in the second embodiment has the following advantage: the effect of having little protein sequence data for the specific class of enzymes whose tolerance temperature is greater than the predetermined value is compensated, so that the accuracy of the protein sequence generation model is improved and protein sequences with higher similarity are generated.

Further, as shown in FIG. 4, the third embodiment also discloses a protein sequence generating apparatus for a thermostable enzyme, which includes a sample acquiring unit 100, a training unit 200, and a batch generating unit 300. The sample acquiring unit 100 is configured to acquire a training sample including protein sequence data of a specific class of enzymes having a tolerance temperature greater than a predetermined value; the training unit 200 is configured to train a pre-constructed generative adversarial network model by using the training sample to obtain a protein sequence generation model; the batch generating unit 300 is configured to generate a batch of protein sequence data using the protein sequence generation model. For the specific working processes of the sample acquiring unit 100, the training unit 200, and the batch generating unit 300, refer to the description of the first embodiment, which is not repeated here.

The fourth embodiment also discloses a computer readable storage medium, wherein the computer readable storage medium stores a protein sequence generation program of a high thermal stability enzyme, and the protein sequence generation program of the high thermal stability enzyme realizes the protein sequence generation method of the high thermal stability enzyme when being executed by a processor.

The fifth embodiment further discloses a computer device. At the hardware level, as shown in FIG. 5, the computer device includes a processor 12, an internal bus 13, a network interface 14, and a computer-readable storage medium 11. The processor 12 reads the corresponding computer program from the computer-readable storage medium and runs it, forming the protein sequence generating apparatus at the logical level. Of course, besides a software implementation, the one or more embodiments in this specification do not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the processing flow is not limited to logic units and may also be hardware or logic devices. The computer-readable storage medium 11 stores a protein sequence generation program for a thermostable enzyme, which when executed by the processor implements the method for generating a protein sequence of a thermostable enzyme described above.

Computer-readable storage media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer-readable storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.

Although a few embodiments of the present invention have been shown and described, it would be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the claims and their equivalents, and that such changes and modifications are intended to be within the scope of the invention.
