Method, device, equipment and medium for predicting potential BGC in genome sequence

Document No.: 363813  Publication date: 2021-12-07

Reading note: this technology, "Method, device, equipment and medium for predicting potential BGC in genome sequence", was designed by 杨子翊, 廖奔犇, 张胜誉, 辛志伟 and 梁恒宇 on 2021-08-03. Abstract: The application discloses a method, a device, equipment and a medium for predicting potential BGCs in a genome sequence, and relates to the field of artificial intelligence. The method comprises: performing domain prediction on each gene in the genome sequence to obtain the Pfam domains contained in each gene; determining a Pfam score for each Pfam domain, the Pfam score characterizing the probability that the Pfam domain belongs to a BGC; determining candidate BGCs in the genome sequence based on the Pfam scores of the Pfam domains; and performing BGC category prediction on the candidate BGCs and determining the potential BGCs among the candidate BGCs based on the category prediction results. The embodiments of the application adopt a dual serial prediction mechanism: first-level filtering of BGCs is performed according to the Pfam scores, and second-level filtering is then performed through category prediction on the first-level results, which helps reduce the false positive rate of the BGC prediction results.

1. A method for predicting potential BGCs in a genomic sequence, the method comprising:

performing domain prediction on each gene in the genomic sequence to obtain the protein family database (Pfam) domains contained in each gene;

determining a Pfam score for each of the Pfam domains, the Pfam score characterizing the probability that the Pfam domain belongs to a biosynthetic gene cluster (BGC);

determining a candidate BGC in the genomic sequence based on the Pfam score for each of the Pfam domains, the candidate BGC consisting of at least one gene;

and performing BGC category prediction on the candidate BGCs, and determining potential BGCs in the candidate BGCs based on category prediction results.

2. The method of claim 1, wherein said determining a Pfam score for each of said Pfam domains comprises:

acquiring biological information of the Pfam structural domain, wherein the biological information comprises structural domain information, family description information and family identification;

and inputting the biological information into a Pfam scoring model to obtain the Pfam score output by the Pfam scoring model, wherein the Pfam scoring model is obtained by training based on a sample genome sequence containing a BGC label.

3. The method of claim 2, wherein the inputting the biological information into a Pfam scoring model to obtain the Pfam score output by the Pfam scoring model comprises:

processing the biological information through an embedding layer, a coding layer and a connecting layer of the Pfam scoring model to obtain a target vector of the Pfam structure domain, wherein the embedding layer is used for embedding the biological information to obtain an embedded vector, the coding layer is used for coding the embedded vector to obtain a coding vector, and the connecting layer is used for connecting the coding vector to obtain the target vector;

performing feature extraction on the target vector through a feature extraction layer of the Pfam scoring model to obtain structural domain features of the Pfam structural domain;

and performing pooling and full-connection processing on the structural domain features through a pooling layer and a full-connection layer of the Pfam scoring model to obtain the Pfam score.

4. The method of claim 3, wherein the feature extraction layer is composed of a bidirectional long short-term memory network (Bi-LSTM) and a unidirectional long short-term memory network (LSTM), and the pooling layer is used for performing temporal average pooling on the domain features.

5. The method of claim 2, further comprising:

constructing a sample genome sequence, wherein the sample genome sequence is obtained by splicing a positive sample and a negative sample, the positive sample belongs to a BGC data set, and the negative sample belongs to a non-BGC data set;

scoring each Pfam structure domain in the sample genome sequence through the Pfam scoring model to obtain a sample Pfam score;

determining a sample predicted BGC in the sample genomic sequence based on the sample Pfam score;

and training the Pfam scoring model by using the positive sample and the negative sample as supervision for the sample predicted BGC.

6. The method of any one of claims 1 to 5, wherein prior to said determining the Pfam score of each of said Pfam domains, said method further comprises:

dividing the genome sequence with a sliding window based on a target number and a target step length to obtain at least two sequence fragments, wherein each sequence fragment comprises the target number of Pfam domains, and the offset between adjacent sequence fragments is the target step length;

said determining a Pfam score for each of said Pfam domains further comprises:

determining the Pfam score of each of the Pfam domains in the sequence fragment in units of the sequence fragment;

in response to a Pfam domain belonging to at least two sequence fragments, averaging the Pfam scores of the Pfam domain across the at least two sequence fragments, and determining the average as the target Pfam score of the Pfam domain.

7. The method of any one of claims 1 to 5, wherein said determining candidate BGCs in said genomic sequence based on said Pfam score of each of said Pfam domains comprises:

determining the average value of the Pfam scores of each Pfam structural domain in the same gene as the gene score of the gene;

merging the genes with the gene scores higher than a score threshold value based on a merging rule to obtain merged genes;

in response to the number of nucleotides in the merged gene being greater than a number threshold and the merged gene not comprising a filtering domain, determining the merged gene as the candidate BGC, the filtering domain being a region known not to comprise a BGC.

8. The method according to any one of claims 1 to 5, wherein the performing BGC class prediction on the candidate BGCs and determining potential BGCs in the candidate BGCs based on the class prediction result comprises:

performing BGC category prediction on the candidate BGCs through a random forest classifier to obtain a category prediction result, wherein the classification categories of the random forest classifier comprise BGC categories and a non-BGC category;

in response to the category prediction result containing a BGC category identifier, determining the candidate BGC as the potential BGC;

the method further comprises:

in response to the category prediction result containing a non-BGC category identifier, filtering out the candidate BGC.

9. The method as claimed in claim 8, wherein the conducting BGC category prediction on the candidate BGCs through a random forest classifier to obtain the category prediction result comprises:

generating a domain statistical matrix based on the statistical information of the Pfam domain in the candidate BGC;

and inputting the structural domain statistical matrix into the random forest classifier to perform BGC class prediction to obtain the class prediction result.

10. The method of claim 8, wherein the Pfam score is derived from scoring the Pfam domain by a Pfam scoring model;

the method further comprises the following steps:

in response to completion of the training of the Pfam scoring model, training the random forest classifier based on positive samples, enhanced negative samples, and error negative samples that were wrongly predicted during the training of the Pfam scoring model, wherein the positive samples belong to a BGC data set, the error negative samples belong to a non-BGC data set, and the enhanced negative samples are generated based on negative samples in the non-BGC data set.

11. The method of claim 10, wherein the method further comprises:

obtaining the negative sample from the non-BGC dataset;

and replacing the Pfam structure domain in the negative sample based on the similar relation of the Pfam structure domain to obtain the enhanced negative sample.

12. An apparatus for predicting potential BGC in a genomic sequence, the apparatus comprising:

the first prediction module is used for performing structural domain prediction on each gene in a genome sequence to obtain a protein family database Pfam structural domain contained in each gene;

a scoring module for determining a Pfam score for each of the Pfam domains, the Pfam score being indicative of a probability that the Pfam domain belongs to a biosynthetic gene cluster, BGC;

a first determination module for determining a candidate BGC in the genomic sequence based on the Pfam score for each of the Pfam domains, the candidate BGC consisting of at least one gene;

and the second determining module is used for conducting BGC category prediction on the candidate BGCs and determining potential BGCs in the candidate BGCs based on category prediction results.

13. A computer device comprising a processor and a memory, wherein at least one instruction is stored in the memory and loaded and executed by the processor to implement a method of predicting potential BGCs in a genomic sequence as claimed in any one of claims 1 to 11.

14. A computer-readable storage medium having stored thereon at least one instruction which is loaded and executed by a processor to implement a method for predicting potential BGCs in a genomic sequence as claimed in any one of claims 1 to 11.

Technical Field

The embodiment of the application relates to the field of artificial intelligence, in particular to a method, a device, equipment and a medium for predicting potential BGC in a genome sequence.

Background

A biosynthetic gene cluster (BGC) is a group of genes with a biosynthetic function that together encode the synthesis of secondary metabolites (small molecule compounds). The secondary metabolites of microorganisms are an important source for drug development.

In the related art, drug developers use a machine learning method to detect the genome sequence of bacteria or fungi, so as to discover potential BGCs related to small molecule compounds with novel structures. In the subsequent research and development process, a targeted experiment can be performed based on the found potential BGCs.

However, when machine learning methods are currently used for BGC prediction, the false positive rate of the prediction result is high; that is, the prediction result contains a large number of non-BGCs, which hinders subsequent drug development.

Disclosure of Invention

The embodiment of the application provides a method, a device, equipment and a medium for predicting potential BGC in a genome sequence, which can reduce the false positive rate of BGC prediction and improve the accuracy of BGC prediction. The technical scheme is as follows:

in one aspect, the present embodiments provide a method for predicting potential BGCs in a genomic sequence, the method including:

performing domain prediction on each gene in the genome sequence to obtain a protein family database (Pfam) domain contained in each gene;

determining a Pfam score for each of the Pfam domains, the Pfam score characterizing the probability that the Pfam domain belongs to a BGC;

determining a candidate BGC in the genomic sequence based on the Pfam score for each of the Pfam domains, the candidate BGC consisting of at least one gene;

and performing BGC category prediction on the candidate BGCs, and determining potential BGCs in the candidate BGCs based on category prediction results.

In another aspect, the present embodiments provide an apparatus for predicting potential BGCs in a genomic sequence, the apparatus comprising:

the first prediction module is used for carrying out structural domain prediction on each gene in the genome sequence to obtain a Pfam structural domain contained in each gene;

a scoring module for determining a Pfam score for each of the Pfam domains, the Pfam score being indicative of a probability that the Pfam domain belongs to the BGC;

a first determination module for determining a candidate BGC in the genomic sequence based on the Pfam score for each of the Pfam domains, the candidate BGC consisting of at least one gene;

and the second determining module is used for conducting BGC category prediction on the candidate BGCs and determining potential BGCs in the candidate BGCs based on category prediction results.

In another aspect, the present application provides a computer device, which includes a processor and a memory, where the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the method for predicting potential BGCs in a genome sequence according to the above aspect.

In another aspect, the present application provides a computer-readable storage medium, where at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the method for predicting potential BGC in a genomic sequence according to the above aspect.

In another aspect, embodiments of the present application provide a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to make the computer device execute the method for predicting potential BGCs in genome sequences provided by the above aspects.

In the embodiment of the application, firstly, a Pfam score representing the probability that the Pfam domain belongs to BGC is obtained by scoring the Pfam domain contained in each gene in a genome sequence, so that candidate BGCs in the genome sequence are determined according to the Pfam score, then category prediction is further performed on the candidate BGCs, and finally potential BGCs are determined from the candidate BGCs; the scheme provided by the embodiment of the application adopts a dual serial prediction mechanism, first-stage filtering of BGC is realized according to Pfam score, and then second-stage filtering of BGC is realized through category prediction on the basis of a first-stage filtering result, so that the false positive rate of a BGC prediction result is reduced, and the accuracy of BGC prediction is improved.

Drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a BGC prediction process shown in an exemplary embodiment of the present application;

FIG. 2 is a schematic illustration of an implementation environment provided by an exemplary embodiment of the present application;

FIG. 3 is a flowchart of a method for predicting potential BGCs in a genomic sequence provided in an exemplary embodiment of the present application;

FIG. 4 is a flowchart of a method for predicting potential BGCs in a genomic sequence provided by another exemplary embodiment of the present application;

FIG. 5 is a schematic diagram illustrating the structure of a Pfam scoring model in accordance with an exemplary embodiment of the present application;

FIG. 6 is a schematic diagram of an implementation of a gene merging process shown in an exemplary embodiment of the present application;

FIG. 7 is a flowchart illustrating a Pfam scoring model training process, according to an exemplary embodiment of the present application;

FIG. 8 is a flowchart illustrating a Pfam score calculation process, shown in an exemplary embodiment of the present application;

FIG. 9 is a schematic diagram illustrating an implementation of a sliding window mechanism in accordance with an illustrative embodiment of the present application;

FIG. 10 is an implementation diagram of a dual model serial prediction process in accordance with an exemplary embodiment of the present application;

FIGS. 11 and 12 are graphs of the results of model performance verification experiments;

FIGS. 13 to 15 are schematic distribution diagrams of predicted BGCs and actual BGCs under different schemes;

FIG. 16 illustrates a schematic structural diagram of a computer device provided in an exemplary embodiment of the present application;

FIG. 17 is a block diagram of a device for predicting potential BGCs in a genome sequence according to an exemplary embodiment of the present application.

Detailed Description

To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

Artificial intelligence is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.

Machine Learning (ML) is a multidisciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specifically studies how a computer can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.

The scheme provided by the embodiments of the present application is an application of machine learning in the medical field: a genome sequence is analyzed by a machine learning method to screen out the potential BGCs in it, so that subsequent drug research and development can be carried out based on the screened potential BGCs.

To reduce the false positive rate when predicting potential BGCs, the scheme provided by the embodiments of the present application adopts a dual serial prediction mechanism. As shown in FIG. 1, under this mechanism, a computer device first performs gene prediction on a genome sequence 101 to obtain a plurality of genes 102 (the arrow structures in the figure), and then performs Pfam domain prediction on the genes 102 to obtain the Pfam domains 103 contained in the genes 102 (the patterns inside the arrows in the figure). The computer device then scores each Pfam domain 103 to obtain Pfam scores 104, and determines candidate BGCs 105 in the genome sequence 101 based on the Pfam scores 104 (the black arrow structures in the figure). At this point, the computer device has completed the first-level BGC prediction.

Based on the first-level prediction result, the computer device performs BGC category prediction on the screened candidate BGCs 105 to obtain the BGC category 106 of each candidate BGC 105 (different fill backgrounds in the figure correspond to different BGC categories), and filters out the non-BGCs among the candidate BGCs 105 based on the BGC categories 106, finally obtaining the potential BGCs 107 in the genome sequence 101. At this point, the computer device has completed the second, serial level of prediction and obtained the potential BGCs.
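The two serial levels described above can be outlined in code. The following is only a minimal Python sketch under stated assumptions: the function name `predict_potential_bgcs`, the three callables it receives, and the label string `"non-BGC"` are hypothetical names chosen for illustration and do not come from the application, which leaves the concrete models to the later embodiments.

```python
from typing import Callable, List, Sequence

NON_BGC = "non-BGC"  # assumed label for the negative class (hypothetical name)


def predict_potential_bgcs(
    domains: Sequence[str],
    score_domains: Callable[[Sequence[str]], List[float]],
    find_candidates: Callable[[Sequence[str], List[float]], List[List[str]]],
    classify: Callable[[List[str]], str],
) -> List[List[str]]:
    """Run the two serial prediction levels and return the surviving candidates."""
    # Level 1: score every Pfam domain, then derive candidate BGCs from the scores.
    scores = score_domains(domains)
    candidates = find_candidates(domains, scores)
    # Level 2: predict a category per candidate and drop non-BGC predictions.
    return [cand for cand in candidates if classify(cand) != NON_BGC]
```

Passing the models in as callables keeps the sketch independent of any particular scoring model or classifier; any pair of stand-ins with these signatures can be plugged in.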

FIG. 2 illustrates a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application. The implementation environment includes a terminal 210 and a server 220. The terminal 210 and the server 220 exchange data through a communication network. Optionally, the communication network may be a wired network or a wireless network, and may be at least one of a local area network, a metropolitan area network, and a wide area network.

The terminal 210 is an electronic device that requires BGC prediction. The electronic device may be a smart phone, a tablet computer, a personal computer, or the like, which is not limited in this embodiment. In FIG. 2, the terminal 210 is illustrated as a personal computer used by a drug developer.

In some embodiments, when BGC prediction is desired for a microorganism, a drug developer sequences the genome of the microorganism to obtain the genome sequence of the microorganism, thereby predicting potential BGC based on the genome sequence.

The server 220 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.

Optionally, the server 220 is configured to provide a BGC prediction service for the terminal 210, and performs BGC prediction through the dual serial prediction mechanism. In some embodiments, the server 220 is provided with a Pfam scoring model and a random forest classifier, both pre-trained on a sample data set. The Pfam scoring model scores the Pfam domains in a gene, and the resulting Pfam scores are used for the first-level BGC prediction. The random forest classifier performs category prediction on the candidate BGCs screened by the first-level prediction, so that the non-BGCs among the candidate BGCs are filtered out according to the category prediction results to obtain the potential BGCs in the genome sequence, which completes the second, serial level of BGC prediction.

Illustratively, as shown in FIG. 2, after receiving a genome sequence uploaded by the terminal 210, the server 220 performs domain prediction on each gene 211 in the genome sequence to obtain Pfam domains 222, scores the Pfam domains 222 with the Pfam scoring model to obtain the corresponding Pfam scores 223, and screens out candidate BGCs 224 based on the Pfam scores 223. The server 220 then performs category prediction on the candidate BGCs 224 with the random forest classifier to obtain the BGC category 225 of each candidate BGC 224, filters the candidate BGCs 224 based on the BGC categories 225 to determine the potential BGCs 226 in the genome sequence, and feeds the potential BGCs 226 back to the terminal 210.

In other possible embodiments, the dual serial prediction mechanism may also be deployed on the terminal side, in which case the terminal 210 performs BGC prediction on the input genome sequence locally without the aid of the server 220. Optionally, when the terminal 210 implements BGC prediction locally, a BGC prediction application is installed on it, and the application is provided with a pre-trained Pfam scoring model and random forest classifier.

For convenience of description, the following embodiments are described as examples in which the method for predicting potential BGCs in a genomic sequence is performed by a computer device.

FIG. 3 shows a flowchart of a method for predicting potential BGCs in a genomic sequence provided in an exemplary embodiment of the present application. This embodiment is described with the method applied to a computer device as an example, and the method comprises the following steps.

Step 301, performing domain prediction on each gene in the genome sequence to obtain a Pfam domain contained in each gene.

In some embodiments, the genomic sequence is obtained by gene sequencing and consists of the four letters A, C, G and T, which represent the four bases (adenine, cytosine, guanine and thymine) of the nucleotides that make up deoxyribonucleic acid (DNA).

Before performing domain prediction, a computer device first needs to perform gene prediction on a genome sequence to obtain a plurality of genes, wherein the computer device may perform gene prediction by using Prodigal, and the embodiment of the present application does not limit the specific manner of gene prediction.

Pfam is a database of protein families and functional domains. It contains annotations of protein families together with multiple sequence alignments built using hidden Markov models. A protein molecule contains several structurally specific and functionally distinct regions called domains, which can be regarded as the basic units of protein function; the function of a protein is determined by the domains it contains, so studying domains helps in understanding protein function. The database provides protein family information at the following two levels.

1. Family: each family is uniquely identified by a PF number (e.g., PF00001), and the Pfam summary information describes the function performed by the Pfam domain.

2. Clan: similarity analysis is performed across families, and families with similar three-dimensional structures or shared sequence motifs are grouped into a clan, which can be regarded as a superfamily concept; each clan is identified by a CL number (e.g., CL0063).
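The two-level family/clan relationship can be represented as a simple mapping. The sketch below is illustrative only: the identifier patterns are inferred from the examples in the text (PF plus five digits, CL plus four digits), the helper name `group_families_by_clan` is hypothetical, and the accessions `PF99998` and `PF99999` in the usage lines are placeholders rather than real Pfam entries.

```python
import re
from collections import defaultdict
from typing import Dict, List, Mapping

FAMILY_ID = re.compile(r"PF\d{5}")  # family accession, e.g. PF00001
CLAN_ID = re.compile(r"CL\d{4}")    # clan accession, e.g. CL0063


def group_families_by_clan(family_to_clan: Mapping[str, str]) -> Dict[str, List[str]]:
    """Invert a family -> clan mapping into clan -> sorted list of families."""
    clans: Dict[str, List[str]] = defaultdict(list)
    for family, clan in family_to_clan.items():
        if not FAMILY_ID.fullmatch(family):
            raise ValueError(f"not a family accession: {family!r}")
        if not CLAN_ID.fullmatch(clan):
            raise ValueError(f"not a clan accession: {clan!r}")
        clans[clan].append(family)
    return {clan: sorted(families) for clan, families in clans.items()}


# Usage: PF00001 -> CL0192 is the pairing used in this document's example;
# the other two entries are placeholders for illustration.
example = {"PF00001": "CL0192", "PF99998": "CL0192", "PF99999": "CL0063"}
grouped = group_families_by_clan(example)
```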

For each gene in the genome sequence, the computer device performs Pfam domain prediction on the gene to obtain a Pfam domain contained in each gene. In some embodiments, the computer device uses hmmscan to perform domain prediction on a gene, and the embodiments of the present application are not limited to the specific manner of domain prediction.
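As an illustration of how the domain prediction output might be consumed, the sketch below parses the per-domain table that hmmscan writes when run with `--domtblout`, and collects the Pfam accessions found in each gene. It is a sketch under assumptions: the function name `parse_domtblout` and the independent E-value cutoff of 1e-3 are choices made here, not taken from the application, and every value in the sample table is invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def parse_domtblout(text: str, max_ievalue: float = 1e-3) -> Dict[str, List[str]]:
    """Map each gene (query) to its Pfam accessions, ordered along the protein."""
    hits: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        cols = line.split(None, 22)  # 22 fixed columns, then a free-text description
        if len(cols) < 22:
            continue
        accession = cols[1].split(".")[0]  # drop the version suffix, e.g. ".1"
        gene = cols[3]                     # query name
        i_evalue = float(cols[12])         # independent E-value of this domain
        ali_from = int(cols[17])           # alignment start on the query
        if i_evalue <= max_ievalue:        # assumed cutoff
            hits[gene].append((ali_from, accession))
    return {gene: [acc for _, acc in sorted(h)] for gene, h in hits.items()}


# Invented sample rows (not real hmmscan output).
SAMPLE = """\
# target name  accession  tlen  query name ...
fam_b PF00002.1 200 gene_1 - 400 1e-30 100.0 0.1 1 1 1e-33 1e-30 99.5 0.1 1 200 210 390 205 395 0.95 made-up family B
fam_a PF00001.1 150 gene_1 - 400 1e-20 80.0 0.2 1 1 1e-23 1e-20 79.0 0.2 1 150 10 160 8 165 0.93 made-up family A
fam_c PF00003.1 120 gene_2 - 300 0.5 5.0 0.0 1 1 0.4 0.5 4.0 0.0 1 120 20 140 18 145 0.60 weak hit
"""
domains_per_gene = parse_domtblout(SAMPLE)
```

Sorting by alignment start keeps the domains in the order they occur along each protein, which matters if a downstream model treats the domains as a sequence.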

Step 302, determining a Pfam score of each Pfam domain, wherein the Pfam score is used for representing the probability that the Pfam domain belongs to the BGC.

Since whether a gene belongs to BGC is closely related to the Pfam domain contained therein, in the embodiment of the present application, the computer device obtains the Pfam score corresponding to each Pfam domain by predicting the probability that the Pfam domain belongs to BGC, wherein the higher the Pfam score is, the higher the probability that the Pfam domain belongs to BGC is.

In one possible implementation mode, the computer device learns the characteristics of the Pfam structure domain in the known BGC and the known non-BGC in a machine learning mode, and then scores the Pfam structure domain according to the learned characteristics in the actual BGC prediction process.

Step 303, determining candidate BGCs in the genome sequence based on the Pfam scores of the Pfam domains, wherein the candidate BGCs are composed of at least one gene.

In one possible embodiment, the computer device first determines candidate genes in the genomic sequence based on the Pfam scores of the Pfam domains, and then determines candidate BGCs based on the candidate genes, each candidate BGC being composed of a single gene or of several consecutive genes.

In some embodiments, because a higher Pfam score indicates a higher probability that the Pfam domain belongs to a BGC, the computer device determines the candidate genes based on gene scores derived from the Pfam scores, wherein the gene scores of the candidate genes are higher than those of the non-candidate genes.
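A rough sketch of this step follows. It takes the gene score to be the mean of the Pfam scores of the domains in the gene, as recited in claim 7. Two points are assumptions made only for illustration: the threshold value of 0.5, and the merging rule (runs of consecutive above-threshold genes form one candidate), since this passage does not spell out the actual merging rule. The function names are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def gene_scores(domain_scores_per_gene: Sequence[Sequence[float]]) -> List[float]:
    """Gene score = mean Pfam score of its domains (0.0 if the gene has none)."""
    return [mean(s) if s else 0.0 for s in domain_scores_per_gene]


def candidate_bgcs(scores: Sequence[float], threshold: float = 0.5) -> List[List[int]]:
    """Group indices of consecutive genes scoring above the threshold."""
    clusters: List[List[int]] = []
    current: List[int] = []
    for idx, score in enumerate(scores):
        if score > threshold:
            current.append(idx)       # extend the current run of candidate genes
        elif current:
            clusters.append(current)  # a low-scoring gene closes the run
            current = []
    if current:
        clusters.append(current)
    return clusters


# Usage with invented scores for five genes (the fourth has no Pfam domain).
per_gene = [[0.9, 0.7], [0.8], [0.1, 0.2], [], [0.6]]
clusters = candidate_bgcs(gene_scores(per_gene))
```

Claim 7 additionally filters merged genes by nucleotide count and by the absence of filtering domains; those checks are omitted here to keep the sketch short.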

Through the above steps, the computer device completes the first-level BGC prediction. To further improve prediction accuracy, when candidate BGCs exist, the computer device performs a second-level prediction on them through step 304 below; the purpose of the second-level prediction is to identify and filter out the non-BGCs among the candidate BGCs.

Step 304, performing BGC category prediction on the candidate BGCs, and determining potential BGCs in the candidate BGCs based on category prediction results.

Compared with the first-level prediction, the second-level prediction does not target all genes in the genome sequence but only the candidate BGCs obtained by the first-level prediction. In addition, its prediction categories are finer-grained: the first-level prediction distinguishes only two categories, BGC and non-BGC.

In some embodiments, the category prediction result obtained by performing BGC category prediction on a candidate BGC covers at least three categories: a non-BGC category and at least two BGC categories (fine-grained categories).

Optionally, when the category prediction result indicates that the candidate BGC belongs to a non-BGC, the computer device filters the candidate BGC to reduce the false positive rate of the BGC prediction result; when the category prediction result indicates that the candidate BGC belongs to a BGC, the computer device determines the candidate BGC as a potential BGC.
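The second-level decision can be sketched as follows, assuming the category of each candidate has already been predicted. The label `"non-BGC"`, the class names `"PKS"` and `"NRPS"` in the usage lines, and the function name `select_potential_bgcs` are assumptions for illustration; this passage does not name the fine-grained BGC categories.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

NON_BGC = "non-BGC"  # assumed name of the negative class


def select_potential_bgcs(
    candidates: Sequence[str], predicted: Sequence[str]
) -> Dict[str, List[str]]:
    """Group candidates by predicted BGC category, discarding non-BGC predictions."""
    if len(candidates) != len(predicted):
        raise ValueError("one predicted category is required per candidate")
    kept: Dict[str, List[str]] = defaultdict(list)
    for cand, label in zip(candidates, predicted):
        if label != NON_BGC:  # candidates predicted as non-BGC are filtered out
            kept[label].append(cand)
    return dict(kept)


# Usage with invented candidate names and category labels.
potential = select_potential_bgcs(["c1", "c2", "c3"], ["PKS", "non-BGC", "NRPS"])
```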

To sum up, in the embodiment of the application, firstly, the Pfam domain contained in each gene in the genome sequence is scored to obtain the Pfam score representing the probability that the Pfam domain belongs to the BGC, so that candidate BGCs in the genome sequence are determined according to the Pfam score, then category prediction is further performed on the candidate BGCs, and finally potential BGCs are determined from the candidate BGCs; the scheme provided by the embodiment of the application adopts a dual serial prediction mechanism, first-stage filtering of BGC is realized according to Pfam score, and then second-stage filtering of BGC is realized through category prediction on the basis of a first-stage filtering result, so that the false positive rate of a BGC prediction result is reduced, and the accuracy of BGC prediction is improved.

In one possible implementation, the scoring of the Pfam structural domain and the category prediction of the candidate BGC adopt a machine learning technology, wherein the Pfam score is obtained by scoring the Pfam structural domain through a Pfam scoring model (Deep-BGCpred) by a computer device, and the category prediction result of the candidate BGC is output by the computer device through a random forest classifier, that is, the computer device adopts a dual-model serial strategy to perform BGC prediction. The following description will be made using exemplary embodiments.

Fig. 4 shows a flowchart of a method for predicting potential BGCs in a genomic sequence provided in another exemplary embodiment of the present application. The embodiment is described by taking the method as an example for a computer device, and the method comprises the following steps.

Step 401, performing domain prediction on each gene in the genome sequence to obtain a Pfam domain contained in each gene.

For the implementation of this step, refer to step 301; details are not repeated here.

Step 402, obtaining biological information of the Pfam structural domain, wherein the biological information comprises structural domain information, family description information and family identification.

In order to score the Pfam domain based on more dimensional information to improve the accuracy of the obtained Pfam score, in the embodiment of the present application, the computer device uses the multidimensional biological information related to the Pfam domain as a scoring basis. In some embodiments, the biological information of the Pfam domain comprises domain information, family description information (Pfam summary information), and family identification of the family (clan) to which it belongs.

In an illustrative example, the computer device obtains the domain information "PF00001", the family description information "PF00001: 7 transmembrane receptor (rhodopsin family)", and the family identifier "CL0192" for a Pfam domain.

Of course, in addition to the above-mentioned biological information, the computer device may also use other biological features related to the Pfam domain as input of the Pfam scoring model, or use only part of the above-mentioned biological information as input of the model (for example, use only the domain information and the family description information as input of the model), which is not limited by the embodiment.

Step 403, inputting the biological information into a Pfam scoring model to obtain a Pfam score output by the Pfam scoring model, wherein the Pfam scoring model is obtained by training based on a sample genome sequence containing BGC labels.

Further, the computer device inputs the biological information into a Pfam scoring model, and the Pfam scoring model scores the Pfam structural domain based on the biological information to obtain a Pfam score. Due to differences in the content form of the input biological information, the biological information needs to be processed in the Pfam scoring model.

In one possible design, the Pfam scoring model is composed of an input layer, an embedding layer, an encoding layer, a connection layer, a feature extraction layer, a pooling layer, and a fully connected layer (also called a dense layer). The embedding layer embeds the input biological information to obtain the embedding vector corresponding to the biological information; the encoding layer encodes the embedding vector into an encoding vector; and the connection layer concatenates the encoding vectors corresponding to the biological information of different dimensions to obtain a target vector used for inference, which is input into the feature extraction layer. Accordingly, the process of scoring the Pfam domain using a Pfam scoring model with the above structure may include the following steps.

1. Processing the biological information through the embedding layer, the encoding layer and the connection layer of the Pfam scoring model to obtain a target vector of the Pfam domain, wherein the embedding layer embeds the biological information to obtain an embedding vector, the encoding layer encodes the embedding vector to obtain an encoding vector, and the connection layer concatenates the encoding vectors to obtain the target vector.

In a possible implementation manner, for biological information of different dimensions, the computer device performs embedding processing on the information through corresponding embedding units in the embedding layer to obtain corresponding embedding vectors.

Schematically, as shown in fig. 5, the computer device embeds the domain information into a 102-dimensional vector through the first embedding unit 511, the vector consisting of a 100-dimensional Pfam2vec (Pfam to vector) embedding and two binary flags marking the beginning and the end of a protein; the computer device embeds each character of the family description information (consisting of 64 characters, padded at the end if there are fewer than 64 characters) into a 32-dimensional vector through the second embedding unit 512; and the computer device embeds the family identifier into a 64-dimensional vector through the third embedding unit 513.

In a possible implementation manner, for the embedding vectors obtained from the family description information and the family identifier, the computer device further encodes the embedding vectors through encoding units to obtain the corresponding encoding vectors, so as to better capture the information characteristics of the family description information and the family identifier. Optionally, the embedding vector corresponding to the domain information is used directly as its encoding vector.

Illustratively, as shown in fig. 5, the computer device inputs the embedding vector output by the second embedding unit 512 into the first encoding unit 521 (a convolutional neural network may be used), resulting in a 960-dimensional encoding vector; the computer device inputs the embedding vector output by the third embedding unit 513 into the second encoding unit 522, resulting in a 64-dimensional encoding vector.

In an illustrative example, the first encoding unit and the second encoding unit have the architecture shown in Table 1.

Table 1

Furthermore, the encoding vectors corresponding to the biological information of each dimension are concatenated at the connection layer to obtain the target vector. In some embodiments, as shown in fig. 5, the connection layer concatenates the 102-dimensional encoding vector corresponding to the domain information, the 960-dimensional encoding vector corresponding to the family description information, and the 64-dimensional encoding vector corresponding to the family identifier to obtain a 1126-dimensional target vector.

2. Performing feature extraction on the target vector through the feature extraction layer of the Pfam scoring model to obtain the domain features of the Pfam domain.

In order to improve the expressive power of the domain features extracted by the feature extraction layer, and thereby the accuracy of subsequent scoring, in the embodiment of the present application the feature extraction layer adopts a stacked Bi-LSTM, which consists of one layer of bidirectional long short-term memory recurrent neural network (Bi-LSTM) followed by one layer of unidirectional long short-term memory recurrent network (LSTM). In one illustrative example, the stacked Bi-LSTM contains 128 hidden units and has a dropout rate of 0.2.

Schematically, as shown in fig. 5, the computer device inputs 1126-dimensional target vectors output by the connection layer into the feature extraction layer, performs feature extraction by the Bi-LSTM 541 and the LSTM 542 in sequence, and finally outputs domain features.

3. Performing pooling and fully connected processing on the domain features through the pooling layer and the fully connected layer of the Pfam scoring model to obtain the Pfam score.

In one possible embodiment, the pooling layer of the Pfam scoring model performs temporal mean pooling on the domain features to integrate the node information of the hidden layer. The fully connected layer consists of a time-distributed dense unit (including a sigmoid function) and an output unit, and the value between 0 and 1 output by the output unit is the Pfam score.

Schematically, as shown in fig. 5, after the computer device performs temporal mean pooling on the domain features output by the feature extraction layer, the pooling result is input into the fully connected layer, the time-distributed dense unit 561 performs fully connected processing on the pooling result, and the Pfam score is finally output through the output unit 562.

Step 404, determining the mean of the Pfam scores of the individual Pfam domains in the same gene as the gene score of the gene.

Through the above steps, the computer device obtains a Pfam score for each Pfam domain, and since BGCs are composed of genes, the computer device further determines a gene score characterizing the probability that a gene belongs to BGCs based on the Pfam score.

In one possible embodiment, for each gene, the computer device determines the mean of the Pfam scores of the Pfam domains comprised by the gene as the gene score, wherein a higher gene score indicates a higher probability that the gene belongs to the BGC.

In one illustrative example, a gene contains 5 Pfam domains and the corresponding Pfam scores are 0.3, 0.9, 0.96, 0.94, and 0.89, respectively, such that the gene score for the gene is 0.798.
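The gene-score calculation above can be reproduced with a minimal Python sketch. The function name `gene_score` is hypothetical and used only for illustration; the five scores are those of the example.

```python
from statistics import mean


def gene_score(pfam_scores):
    """Return the gene score as the mean of the Pfam scores of the gene's domains."""
    if not pfam_scores:
        raise ValueError("a gene needs at least one scored Pfam domain")
    return mean(pfam_scores)


# Pfam scores of the five domains in the illustrative example.
example_scores = [0.3, 0.9, 0.96, 0.94, 0.89]
example_gene_score = gene_score(example_scores)  # approximately 0.798
```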

Step 405, merging the genes whose gene scores are higher than a score threshold based on a merging rule to obtain merged genes.

The computer device detects whether the gene score of a gene is higher than the score threshold. If the gene score is higher than the score threshold, the gene is determined to be a candidate gene; if the gene score is lower than the score threshold, the gene is determined not to be a candidate gene. For example, the score threshold may be 0.7; the specific value of the score threshold is not limited in the embodiment of the present application.

When at least two candidate genes are consecutive, the computer device merges the consecutive candidate genes to obtain a merged gene.

Schematically, as shown in fig. 6, the computer device calculates gene scores of the first gene 61, the second gene 62, the third gene 63, the fourth gene 64, the fifth gene 65, and the sixth gene 66, respectively, and determines the first gene 61, the fourth gene 64, and the fifth gene 65 as candidate genes. Since the fourth gene 64 and the fifth gene 65 are continuous, the computer device merges the fourth gene 64 and the fifth gene 65 to obtain a merged gene 67.

If there is no adjacent candidate gene, the computer device determines the individual candidate gene as the combined gene.
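The merging rule can be sketched as follows. This is only an illustration under stated assumptions: genes are represented by their position in the genome, the function name `merge_candidate_genes` is hypothetical, and the six scores are invented so that the first, fourth and fifth genes exceed the example threshold of 0.7, as in fig. 6.

```python
def merge_candidate_genes(gene_scores, score_threshold=0.7):
    """Group consecutive genes whose score exceeds the threshold.

    Returns a list of merged genes, each given as a list of gene indices.
    A candidate gene without an adjacent candidate forms a merged gene on its own.
    """
    merged, current = [], []
    for index, score in enumerate(gene_scores):
        if score > score_threshold:
            current.append(index)      # extend the current run of candidate genes
        elif current:
            merged.append(current)     # a non-candidate gene closes the run
            current = []
    if current:
        merged.append(current)
    return merged


# Hypothetical scores for the six genes of fig. 6 (index 0 is the first gene).
fig6_scores = [0.85, 0.20, 0.35, 0.91, 0.88, 0.10]
fig6_merged = merge_candidate_genes(fig6_scores)  # [[0], [3, 4]]
```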

Step 406, in response to the number of nucleotides in the merged gene being greater than a number threshold and the merged gene not containing a filtering domain, determining the merged gene as a candidate BGC, the filtering domain being a region known not to contain a BGC.

Further, after gene merging is completed, the computer device filters out the merged genes that do not meet the requirements based on post-processing criteria to obtain the candidate BGCs. In some embodiments, the post-processing criteria may include: 1. filtering out merged genes whose number of nucleotides is less than the number threshold; 2. filtering out regions known not to contain BGCs.

In one illustrative example, the computer device sets the number threshold to 2000 and uses the 133 regions known not to contain BGCs published by antiSMASH and ClusterFinder as filtering domains.

Optionally, when determining the candidate BGCs, the computer device may further merge merged genes separated by at most one gene, so as to obtain the candidate BGCs.
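A minimal sketch of the two post-processing criteria follows. The data layout is an assumption made for illustration: each merged gene is a dictionary with a nucleotide count and a list of domain identifiers, and the filtering domains are given as a set of identifiers. The identifier "PF99999" and the region names are hypothetical placeholders, not entries of the published list.

```python
def select_candidate_bgcs(merged_genes, min_nucleotides=2000, filter_domains=frozenset()):
    """Keep merged genes that are long enough and contain no filtering domain."""
    kept = []
    for region in merged_genes:
        long_enough = region["nucleotides"] > min_nucleotides
        has_filter_domain = any(d in filter_domains for d in region["pfam_ids"])
        if long_enough and not has_filter_domain:
            kept.append(region)
    return kept


# Hypothetical merged genes; only region "A" passes both criteria.
regions = [
    {"name": "A", "nucleotides": 5400, "pfam_ids": ["PF00501", "PF00668"]},
    {"name": "B", "nucleotides": 1200, "pfam_ids": ["PF00109"]},             # too short
    {"name": "C", "nucleotides": 8000, "pfam_ids": ["PF00001", "PF99999"]},  # filtering domain
]
candidates = select_candidate_bgcs(regions, filter_domains={"PF99999"})
candidate_names = [region["name"] for region in candidates]
```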

Step 407, conducting BGC category prediction on the candidate BGCs through a random forest classifier to obtain a category prediction result, wherein the classifier categories of the random forest classifier comprise BGC categories and non-BGC categories.

In the above steps, scoring the Pfam domains with the Pfam scoring model can only distinguish BGCs from non-BGCs in the genome sequence, and the false positive rate is high. In order to further improve the prediction accuracy, the computer device further performs BGC category prediction on the screened candidate BGCs by using a pre-trained random forest classifier, and determines whether non-BGCs exist among the candidate BGCs.

Optionally, in addition to distinguishing BGCs from non-BGCs, the random forest classifier is used for classifying the specific categories of BGCs. In an illustrative example, the classifier classes (8 classes) of the random forest classifier and the number of training samples used to train the random forest classifier are shown in Table 2.

Table 2

| No. | Category | Number of training samples |
|---|---|---|
| 1 | Alkaloid | 54 |
| 2 | NRP (nonribosomal peptide) | 603 |
| 3 | Other | 247 |
| 4 | Polyketide | 849 |
| 5 | RiPP (ribosomally synthesized and post-translationally modified peptide) | 261 |
| 6 | Saccharide | 187 |
| 7 | Terpene | 167 |
| 8 | Non_BGC (non-BGC) | 2102 |

The above classifier classes are for illustration only; the number and specific types of the classifier classes are not limited.

Unlike the Pfam scoring model, whose input is biological information, the random forest classifier takes as input statistical information on the Pfam domains in the candidate BGC. In a possible implementation, the process by which the random forest classifier performs category prediction on the candidate BGCs may include the following steps:

1. Generating a domain statistical matrix based on the statistical information of the Pfam domains in the candidate BGC.

Optionally, the statistical information includes the frequency of occurrence of each Pfam domain in the candidate BGC. Correspondingly, different columns of the domain statistical matrix correspond to different Pfam domains, and each value in the matrix is the frequency of occurrence of the corresponding Pfam domain.

Optionally, the random forest classifier is obtained by training based on a sample structure domain statistical matrix corresponding to a training sample, where the training sample includes a specific class label.

2. Inputting the domain statistical matrix into the random forest classifier to perform BGC category prediction and obtain a category prediction result.

The computer device uses the domain statistical matrix as the input of the random forest classifier to obtain the output category prediction result. When the candidate BGC belongs to a BGC, the category prediction result contains a BGC category identifier; when the candidate BGC belongs to a non-BGC, the category prediction result contains a non-BGC category identifier.
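The construction of the domain statistical matrix in step 1 can be sketched as follows, assuming that each candidate BGC is given as a list of Pfam identifiers and that the column order is fixed by a vocabulary list. The function name, the vocabulary and the two candidates are illustrative assumptions. The resulting rows are the kind of feature vectors a random forest implementation would take as input.

```python
from collections import Counter


def domain_count_matrix(candidates, vocabulary):
    """Build one row per candidate BGC and one column per Pfam domain.

    Each cell holds how often the column's Pfam domain occurs in the candidate.
    Domains outside the vocabulary are ignored in this sketch.
    """
    column = {pfam_id: j for j, pfam_id in enumerate(vocabulary)}
    matrix = []
    for domains in candidates:
        row = [0] * len(vocabulary)
        for pfam_id, count in Counter(domains).items():
            if pfam_id in column:
                row[column[pfam_id]] = count
        matrix.append(row)
    return matrix


# Illustrative vocabulary and candidates (identifiers chosen only as placeholders).
vocabulary = ["PF00001", "PF00109", "PF00501", "PF00668"]
candidate_domains = [
    ["PF00501", "PF00668", "PF00501"],
    ["PF00109", "PF00001"],
]
count_matrix = domain_count_matrix(candidate_domains, vocabulary)
# count_matrix == [[0, 0, 2, 1], [1, 1, 0, 0]]
```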

Step 408, in response to the category prediction result containing the BGC category identifier, determining the candidate BGC as a potential BGC.

The computer device checks the category prediction result, and if the category prediction result contains a BGC category identifier, the candidate BGC is determined to be a potential BGC. For example, when the category prediction result contains "NRP", the computer device determines that the candidate BGC is a potential BGC whose category is "NRP".

Step 409, responding to the non-BGC category identification contained in the category prediction result, and filtering the candidate BGC.

If the category prediction result contains the non-BGC category identifier, the computer device determines that the candidate BGC is not a potential BGC and therefore filters out the candidate BGC. For example, when the category prediction result contains "Non_BGC", the computer device filters out the candidate BGC.

In the embodiment, serial BGC prediction is realized by computer equipment by adopting a Pfam scoring model and a random forest classifier, so that the false positive rate of predicted potential BGC is reduced. When the computer device scores the Pfam structure domain by using the Pfam scoring model, the multi-dimensional biological information (structure domain information, family description information and family identification) related to the Pfam structure domain is used as a scoring basis, so that the accuracy of scoring is improved.

In addition, after the computer device determines the candidate genes based on the Pfam score and merges the candidate genes, the computer device filters the merged genes based on the number of nucleotides and the filtering structural domain, and the false positive rate of the candidate BGCs after primary filtering is reduced.

In order to simulate a real genome environment and thus improve the quality of the trained Pfam scoring model, as shown in fig. 7, the training process of the Pfam scoring model includes the following steps.

Step 701, constructing a sample genome sequence, wherein the sample genome sequence is obtained by splicing positive samples and negative samples, the positive samples belong to a BGC data set, and the negative samples belong to a non-BGC data set.

In some embodiments, the computer device extracts positive samples and negative samples from the BGC data set (containing known BGCs) and the non-BGC data set (containing known non-BGCs), respectively, splices the positive samples and the negative samples, and simulates the situation that the BGCs are randomly distributed in the whole genome sequence and surrounded by the non-BGCs in a real environment to obtain a sample genome sequence.
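The splicing of positive and negative samples can be illustrated with the following sketch. It assumes that each sample is a list of Pfam identifiers, that the samples are placed in random order, and that the label of a sample (1 for BGC, 0 for non-BGC) is propagated to every domain of that sample. The function name and the example identifiers are hypothetical.

```python
import random


def build_sample_genome(positive_samples, negative_samples, seed=None):
    """Splice BGC and non-BGC samples in random order into one labelled sequence."""
    rng = random.Random(seed)
    labelled = [(s, 1) for s in positive_samples] + [(s, 0) for s in negative_samples]
    rng.shuffle(labelled)  # BGCs end up at random positions, surrounded by non-BGCs
    domains, labels = [], []
    for sample, label in labelled:
        domains.extend(sample)
        labels.extend([label] * len(sample))
    return domains, labels


# Hypothetical samples: one BGC with two domains, two non-BGCs with four domains in total.
positives = [["PF00501", "PF00668"]]
negatives = [["PF00001"], ["PF00002", "PF00003", "PF00004"]]
sample_domains, sample_labels = build_sample_genome(positives, negatives, seed=0)
```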

Optionally, the BGC data set and the non-BGC data set used by the computer device are shown in Table 3.

Table 3

Step 702, scoring each Pfam domain in the sample genome sequence through the Pfam scoring model to obtain sample Pfam scores.

Similar to the application process, the computer device scores each of the Pfam domains in the sample genome sequence through a Pfam scoring model to obtain a sample Pfam score of each of the sample Pfam domains in each of the sample genes.

Step 703, determining sample predicted BGCs in the sample genome sequence based on the sample Pfam scores.

Similar to the application process, the computer device determines the sample gene score of the sample gene based on the sample Pfam score, and further determines the sample prediction BGC based on the sample gene score, and correspondingly, the sample gene not belonging to the sample prediction BGC belongs to non-BGC.

Step 704, training the Pfam scoring model by using the positive samples and the negative samples as supervision for the sample predicted BGCs.

Since each sample gene constituting the sample genome sequence carries a label (i.e., it belongs to a BGC or to a non-BGC), the computer device can train the Pfam scoring model by using the sample labels of the positive samples and the negative samples as supervision for the sample predicted BGCs. The training target of the Pfam scoring model is that the sample predicted BGCs determined based on the sample Pfam scores approach the positive samples in the sample genome sequence.

In some embodiments, in each training round, the BGC sequences (positive samples) and the non-BGC sequences (negative samples) are randomly shuffled and then spliced to generate a sample genome sequence. During training, 256 time steps (timesteps) are configured, the batch size is 64, optimization is performed with the Adam optimizer at a learning rate of 1e-4, and weighted binary cross entropy is used as the loss function, with class weights inversely proportional to the numbers of positive and negative samples in the training data set (the weight of the positive samples is greater than that of the negative samples).
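The weighted binary cross entropy can be sketched in plain Python as follows. The normalisation `total / (2 * count)` is an assumption (a common "balanced" heuristic); the text only states that the class weights are inversely proportional to the sample counts. The function names are hypothetical.

```python
import math


def class_weights(num_positive, num_negative):
    """Class weights inversely proportional to the class counts (assumed normalisation)."""
    total = num_positive + num_negative
    return {1: total / (2 * num_positive), 0: total / (2 * num_negative)}


def weighted_bce(y_true, y_pred, weights, eps=1e-7):
    """Mean binary cross entropy in which each term is scaled by its class weight."""
    loss = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        loss += -weights[y] * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss / len(y_true)


# One positive and three negative labels: the positive class gets the larger weight.
weights = class_weights(num_positive=1, num_negative=3)
good_loss = weighted_bce([1, 0, 0, 0], [0.9, 0.1, 0.1, 0.1], weights)
bad_loss = weighted_bce([1, 0, 0, 0], [0.1, 0.1, 0.1, 0.1], weights)
```

With this weighting, missing the single positive label (`bad_loss`) is penalised more heavily than it would be under an unweighted loss.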

Because the sample genome sequence constructed in the training process usually contains a specified number of Pfam domains (for example, 256 domains), and the specified number is often much smaller than the number of Pfam domains in the real genome sequence (the real genome sequence usually contains tens of thousands of Pfam domains), the problem of inconsistency between the training scenario and the actual application scenario may be caused, and the accuracy of prediction by using the trained model is further reduced.

To further improve the quality of the prediction, the computer device applies a sliding window mechanism to the actual prediction process, i.e., sequence segments are cut from the genomic sequence through a sliding window, and the Pfam domain in each sequence segment is scored using a Pfam scoring model.

In one possible embodiment, as shown in FIG. 8, the process of scoring the Pfam domain may include the following steps.

Step 801, dividing the genome sequence by adopting a sliding window based on the target number and the target step length to obtain at least two sequence segments, wherein the sequence segments comprise the Pfam structure domains in the target number, and the offset between adjacent sequence segments is the target step length.

The computer device sets the target number of the Pfam structural domains in the sliding window, moves the sliding window according to the target step length, intercepts the sequence segments in the sliding window after each movement, and correspondingly, the position deviation of the starting points (or the end points) of the sequence segments intercepted twice is the target step length.

When the target step length is greater than or equal to the target number, the Pfam domains contained in adjacent sequence fragments do not overlap; when the target step length is less than the target number, adjacent sequence fragments share partially overlapping Pfam domains.

Schematically, as shown in fig. 9, the computer device obtains sequence fragments within sliding windows w1, w2, and w3, respectively, with overlapping Pfam domains between adjacent sequence fragments.
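A minimal sketch of the sliding-window division, expressed on Pfam domain indices. The function name is hypothetical, and the genome length of nine domains is an assumption chosen so that three windows of five domains with a step of two cover it exactly. Trailing domains that do not fill a complete window are not handled in this sketch.

```python
def sliding_windows(num_domains, window_size, step):
    """Return (start, end) index pairs, end exclusive, of each sequence fragment."""
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    last_start = max(num_domains - window_size, 0)
    return [(s, min(s + window_size, num_domains)) for s in range(0, last_start + 1, step)]


# Target number 5 and target step length 2 over nine domains (0-based indices).
windows = sliding_windows(num_domains=9, window_size=5, step=2)
# windows == [(0, 5), (2, 7), (4, 9)]; adjacent windows share 5 - 2 = 3 domains.
```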

Step 802, determining the Pfam score of each Pfam domain in the sequence fragment by taking the sequence fragment as a unit.

When scoring the Pfam domains, the computer device scores the Pfam domains in each sequence segment. Optionally, the computer device inputs biological information of the Pfam domains in the sequence segment into a Pfam scoring model to obtain a Pfam score of each Pfam domain in the sequence segment.

Schematically, as shown in FIG. 9, each sequence fragment contains 5 Pfam domains (for exemplary purposes only), and there are 3 overlapping Pfam domains between adjacent sequence fragments. And inputting each sequence fragment into a Pfam scoring model by using computer equipment to obtain the Pfam score of each Pfam structure domain in the sequence fragment.

Step 803, in response to a Pfam domain belonging to at least two sequence fragments, averaging the Pfam scores of the Pfam domain in the at least two sequence fragments, and determining the average as the target Pfam score of the Pfam domain.

When a Pfam domain belongs to several sequence fragments at the same time, the computer device calculates the average of the Pfam scores of the Pfam domain in the different sequence fragments and determines the result as the target Pfam score of the Pfam domain; the computer device subsequently screens the candidate BGCs based on the target Pfam scores.

Schematically, as shown in fig. 9, the 3rd and 4th Pfam domains in the genome sequence belong to both w1 and w2, so the Pfam score of the 3rd Pfam domain is (0.96+0.91)/2 = 0.935, and the Pfam score of the 4th Pfam domain is (0.94+0.95)/2 = 0.945; the 5th Pfam domain in the genome sequence belongs to w1, w2 and w3, so the Pfam score of the 5th Pfam domain is (0.89+0.92+0.91)/3 ≈ 0.907; the 6th and 7th Pfam domains in the genome sequence belong to both w2 and w3, so the Pfam score of the 6th Pfam domain is (0.2+0.15)/2 = 0.175, and the Pfam score of the 7th Pfam domain is (0.9+0.94)/2 = 0.92.
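The averaging over overlapping sequence fragments can be checked with a short sketch. Only the scores quoted in the example are used; the assignment of each score to w1, w2 or w3 follows the order of the terms in the sums and is an assumption, as is the function name.

```python
from collections import defaultdict


def average_window_scores(window_scores):
    """Average the scores each Pfam domain received across all windows containing it.

    window_scores: list of dicts mapping domain position -> Pfam score in that window.
    """
    collected = defaultdict(list)
    for scores in window_scores:
        for position, score in scores.items():
            collected[position].append(score)
    return {pos: sum(vals) / len(vals) for pos, vals in sorted(collected.items())}


# Scores of the 3rd to 7th Pfam domains in the windows of fig. 9 (1-based positions).
w1 = {3: 0.96, 4: 0.94, 5: 0.89}
w2 = {3: 0.91, 4: 0.95, 5: 0.92, 6: 0.2, 7: 0.9}
w3 = {5: 0.91, 6: 0.15, 7: 0.94}
target_scores = average_window_scores([w1, w2, w3])
# Approximately {3: 0.935, 4: 0.945, 5: 0.907, 6: 0.175, 7: 0.92}
```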

In the embodiment, a sliding window mechanism is applied in the using process of the model, so that the model application scene is similar to the model training scene, the scoring accuracy of the Pfam scoring model is improved, and the accuracy of the subsequent BGC prediction is improved.

In one illustrative example, a process for serially predicting BGC using a Pfam scoring model and a random forest classifier is shown in FIG. 10. The method comprises the steps that firstly, a plurality of sequence segments are obtained through a sliding window, and a Pfam structure domain in each sequence segment is scored by using a Pfam scoring model, so that a target Pfam score of each Pfam structure domain in a genome sequence is obtained.

Further, the computer device determines a first candidate BGC and a second candidate BGC from the genome sequence based on the target Pfam scores, and performs BGC category prediction on the first candidate BGC and the second candidate BGC through the random forest classifier. Since the first candidate BGC is predicted to belong to NRP and the second candidate BGC is predicted to belong to Non_BGC, the computer device finally determines the first candidate BGC as a potential BGC.

In addition, in order for the random forest classifier to identify BGCs that are mispredicted during the first-level filtering (i.e., non-BGCs predicted as BGCs), the computer device further trains the random forest classifier after completing the training of the Pfam scoring model. When the random forest classifier is trained, the training samples include positive samples in the BGC data set, negative samples that were mispredicted during the training of the Pfam scoring model, and enhanced negative samples generated based on the negative samples in the non-BGC data set.

The negative samples mispredicted during the training of the Pfam scoring model are non-BGCs that were identified as candidate BGCs. Training the random forest classifier with these negative samples increases the probability that the random forest classifier identifies non-BGCs among the candidate BGCs, and ultimately reduces the false positive rate of the final BGC prediction result.

Regarding the generation method of the enhanced negative examples, in a possible implementation manner, referring to synonym replacement in the natural language processing process, the computer device acquires the negative examples from the non-BGC data set, and replaces the Pfam domain in the negative examples based on the Pfam domain similarity relationship to obtain the enhanced negative examples.

In one illustrative example, based on the Pfam domain similarity network PF00001: {PF05296, PF10320, PF10323, PF10324, PF10328, PF13853}, the computer device replaces the Pfam domain "PF00001" in a non-BGC with "PF10324", resulting in a new non-BGC (i.e., an enhanced negative sample).

In some embodiments, the probability that a Pfam domain in the negative sample is replaced is max(2/negative sample length, 0.02), so that on average at least two Pfam domains in the negative sample are replaced.
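The replacement-based augmentation can be sketched as follows. The similarity network is the one from the illustrative example; the function name, the uniform choice among similar domains, and the rule that domains without an entry in the network are left unchanged are assumptions of this sketch.

```python
import random


def augment_negative_sample(domains, similar_domains, seed=None):
    """Replace each Pfam domain by a similar one with probability max(2 / length, 0.02)."""
    if not domains:
        return []
    rng = random.Random(seed)
    replace_prob = max(2 / len(domains), 0.02)
    augmented = []
    for pfam_id in domains:
        neighbours = similar_domains.get(pfam_id)
        if neighbours and rng.random() < replace_prob:
            augmented.append(rng.choice(neighbours))  # analogous to synonym replacement
        else:
            augmented.append(pfam_id)                 # keep the original domain
    return augmented


similarity_network = {
    "PF00001": ["PF05296", "PF10320", "PF10323", "PF10324", "PF10328", "PF13853"],
}
# For a sample of length 2 the replacement probability is 1, so both domains are replaced.
short_sample = ["PF00001", "PF00001"]
augmented_sample = augment_negative_sample(short_sample, similarity_network, seed=0)
```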

Optionally, a random forest classifier is trained based on the positive samples, the enhanced negative samples and the negative samples with wrong predictions in the process of training the Pfam scoring model, and the computer device trains the random forest classifier by taking class labels of the samples as supervision of output results of the random forest classifier.

In order to verify the effect of the above scheme on improving the BGC prediction accuracy, the performance of each model was tested on reference genome sequences of 12 BGC-annotated real strains (256 BGC annotation information included in 12 reference genome sequences), and the test results are shown in fig. 11 and 12.

FIG. 11 shows the results of the ROC curve, and it can be seen that the area under the curve (AUC) at the level of the Pfam domain is the largest and the performance is the best with the approach provided by the examples of the present application. The Precision-Recall curve shown in FIG. 12 can better reflect the prediction capability of the model under the condition of class imbalance, and it can be seen that the scheme provided by the embodiment of the application has obvious advantages compared with other methods.

Subsequently, by setting a threshold value (threshold value of 0.9) at the Pfam domain level, the performance of the model was evaluated using three evaluation indices of Precision, Recall, and F1, and the evaluation results obtained are shown in table four. It can be seen that the solution provided by the embodiments of the present application still has significant advantages over other approaches.

Table 4

| No. | Model | Precision | Recall | F1 |
|---|---|---|---|---|
| 1 | clusterfinder_original | 19.71% | 81.19% | 31.71% |
| 2 | clusterfinder_retrained | 35.30% | 77.97% | 48.60% |
| 3 | DeepBGC | 49.65% | 77.83% | 60.63% |
| 4 | Deep-BGCpred (this application) | 55.50% | 80.23% | 65.62% |
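As a consistency check on the evaluation table, F1 is the harmonic mean of Precision and Recall. The sketch below recomputes F1 from the rounded Precision and Recall values in the table; small deviations from the reported F1 are expected because the inputs are rounded. The variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as fractions, taken from the table above.
reported = {
    "clusterfinder_original": (0.1971, 0.8119, 0.3171),
    "clusterfinder_retrained": (0.3530, 0.7797, 0.4860),
    "DeepBGC": (0.4965, 0.7783, 0.6063),
    "Deep-BGCpred": (0.5550, 0.8023, 0.6562),
}
recomputed = {name: f1_score(p, r) for name, (p, r, _) in reported.items()}
max_deviation = max(abs(recomputed[name] - f1) for name, (_, _, f1) in reported.items())
```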

Figs. 13 to 15 show the arrangement in the genome sequence of the predicted BGCs and the real BGCs obtained when BGC prediction was performed on the 12 real strains. The abscissa in the figures represents the genome coordinate, and the ordinate represents the scheme used for BGC prediction. antiSMASH 6.0 and PRISM 4 are rule-based methods; the rest are machine learning-based methods. It can be seen from the figures that ClusterFinder predicts the largest number of BGCs, but its false positive rate is high. PRISM 4 has the lowest false positive rate, but it predicts the fewest BGCs and misses a large number of real BGCs. Compared with antiSMASH, the scheme provided by the embodiment of the present application can predict BGCs that rule-based methods cannot predict and is more likely to discover previously unknown BGCs, while its false positive rate is significantly lower than that of the other machine learning schemes.

Referring to fig. 16, a schematic structural diagram of a computer device according to an exemplary embodiment of the present application is shown. Specifically, the computer device 1300 includes a Central Processing Unit (CPU) 1301, a system memory 1304 including a random access memory 1302 and a read only memory 1303, and a system bus 1305 connecting the system memory 1304 and the CPU 1301. The computer device 1300 also includes a basic Input/Output system (I/O system) 1306, which facilitates information transfer between devices within the computer, and a mass storage device 1307 for storing an operating system 1313, application programs 1314, and other program modules 1315.

The basic input/output system 1306 includes a display 1308 for displaying information and an input device 1309, such as a mouse, keyboard, etc., for a user to input information. Wherein the display 1308 and input device 1309 are connected to the central processing unit 1301 through an input-output controller 1310 connected to the system bus 1305. The basic input/output system 1306 may also include an input/output controller 1310 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, input-output controller 1310 also provides output to a display screen, a printer, or other type of output device.

The mass storage device 1307 is connected to the central processing unit 1301 through a mass storage controller (not shown) connected to the system bus 1305. The mass storage device 1307 and its associated computer-readable media provide non-volatile storage for the computer device 1300. That is, the mass storage device 1307 may include a computer-readable medium (not shown), such as a hard disk or drive.

Without loss of generality, the computer-readable media may comprise computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media include Random Access Memory (RAM), Read-Only Memory (ROM), flash memory or other solid-state memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that the computer storage media are not limited to the foregoing. The system memory 1304 and the mass storage device 1307 described above may be collectively referred to as memory.

The memory stores one or more programs configured to be executed by the one or more central processing units 1301, the one or more programs containing instructions for implementing the methods described above, and the central processing unit 1301 executes the one or more programs to implement the methods provided by the various method embodiments described above.

According to various embodiments of the present application, the computer device 1300 may also run by connecting, through a network such as the Internet, to a remote computer on the network. That is, the computer device 1300 may be connected to the network 1312 through the network interface unit 1311, which is connected to the system bus 1305, or the network interface unit 1311 may be used to connect to other types of networks or remote computer systems (not shown).

The memory also includes one or more programs, stored in the memory, that include instructions for performing the steps performed by the computer device in the methods provided by the embodiments of the present application.

Fig. 17 is a block diagram of an apparatus for predicting potential BGCs in a genomic sequence according to an exemplary embodiment of the present application, the apparatus including:

a first prediction module 1701 for performing domain prediction on each gene in the genome sequence to obtain a protein family database Pfam domain contained in each gene;

a scoring module 1702 for determining a Pfam score for each of the Pfam domains, the Pfam score being indicative of a probability that the Pfam domain belongs to the biosynthetic gene cluster BGC;

a first determining module 1703 for determining a candidate BGC in the genomic sequence based on the Pfam score of each of the Pfam domains, the candidate BGC consisting of at least one gene;

a second determining module 1704, configured to perform BGC category prediction on the candidate BGCs, and determine a potential BGC of the candidate BGCs based on a category prediction result.
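The way the four modules chain together can be sketched as follows. This is only a minimal illustration of the serial two-stage filtering, not the implementation of the embodiments: the function name `predict_potential_bgcs`, the representation of a gene as a list of Pfam identifiers, the three injected callables and the label string "non-BGC" are all assumptions introduced for the example.

```python
from typing import Callable, List, Sequence

Gene = List[str]  # assumption: a gene is represented by the Pfam identifiers it contains


def predict_potential_bgcs(
    genes: Sequence[Gene],
    score_domains: Callable[[Sequence[Gene]], List[List[float]]],
    find_candidates: Callable[[Sequence[Gene], List[List[float]]], List[List[int]]],
    classify: Callable[[List[Gene]], str],
) -> List[List[int]]:
    """Run the two serial filters: Pfam scoring first, category prediction second."""
    # Stage 1: score every Pfam domain and derive candidate BGCs (lists of gene indices).
    scores = score_domains(genes)
    candidates = find_candidates(genes, scores)

    # Stage 2: keep only candidates that the classifier does not label as non-BGC.
    potential = []
    for gene_indices in candidates:
        label = classify([genes[i] for i in gene_indices])
        if label != "non-BGC":
            potential.append(gene_indices)
    return potential
```

Passing the scoring model, the candidate search and the classifier in as callables keeps the sketch independent of any particular model; each can be replaced by a concrete component without changing the control flow.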

Optionally, the scoring module 1702 includes:

an information acquisition unit, configured to acquire biological information of the Pfam domain, where the biological information includes domain information, family description information, and a family identifier;

a scoring unit, configured to input the biological information into a Pfam scoring model to obtain the Pfam score output by the Pfam scoring model, where the Pfam scoring model is trained on sample genome sequences containing BGC labels.

Optionally, the scoring unit is specifically configured to:

processing the biological information through an embedding layer, an encoding layer and a connecting layer of the Pfam scoring model to obtain a target vector of the Pfam domain, wherein the embedding layer is used for embedding the biological information to obtain embedded vectors, the encoding layer is used for encoding the embedded vectors to obtain encoded vectors, and the connecting layer is used for connecting the encoded vectors to obtain the target vector;

performing feature extraction on the target vector through a feature extraction layer of the Pfam scoring model to obtain domain features of the Pfam domain;

and performing pooling and full-connection processing on the domain features through a pooling layer and a fully connected layer of the Pfam scoring model to obtain the Pfam score.

Optionally, the feature extraction layer is composed of a bidirectional long short-term memory network (Bi-LSTM) and a unidirectional long short-term memory network (LSTM), and the pooling layer is configured to perform time-sequence average pooling on the domain features.
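As an illustration of the pooling and full-connection step only, the following minimal sketch averages a sequence of feature vectors over the time axis and passes the result through a single dense unit. The function names are hypothetical, the sigmoid activation is an assumption made so that the output falls between 0 and 1, and the LSTM layers themselves are not reproduced here.

```python
import math
from typing import List, Sequence


def temporal_average_pool(features: Sequence[Sequence[float]]) -> List[float]:
    """Average a (time steps x feature size) sequence over the time axis."""
    if not features:
        raise ValueError("features must contain at least one time step")
    steps = len(features)
    # zip(*features) iterates over feature columns, one column per feature dimension.
    return [sum(column) / steps for column in zip(*features)]


def dense_sigmoid(vector: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Single fully connected unit; the sigmoid is an assumed activation."""
    z = sum(w * x for w, x in zip(weights, vector)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```

For example, pooling the two time steps `[1.0, 2.0]` and `[3.0, 4.0]` yields `[2.0, 3.0]`, which `dense_sigmoid` then maps to a single value in the open interval (0, 1).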

Optionally, the apparatus further comprises a first training module, configured to:

constructing a sample genome sequence, wherein the sample genome sequence is obtained by splicing a positive sample and a negative sample, the positive sample belongs to a BGC data set, and the negative sample belongs to a non-BGC data set;

scoring each Pfam structure domain in the sample genome sequence through the Pfam scoring model to obtain a sample Pfam score;

determining a sample predicted BGC in the sample genomic sequence based on the sample Pfam score;

and training the Pfam scoring model, using the positive sample and the negative sample as supervision for the sample predicted BGC.
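The construction of a sample genome sequence can be sketched as follows, under stated assumptions: each sample is a list of Pfam identifiers, the samples are spliced in a random order (the order is not specified above), and a per-domain label of 1 or 0 records whether a domain came from a positive or a negative sample so that it can later serve as supervision. The function name `build_sample_genome` and the `seed` parameter are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def build_sample_genome(
    positives: Sequence[Sequence[str]],
    negatives: Sequence[Sequence[str]],
    seed: int = 0,
) -> Tuple[List[str], List[int]]:
    """Splice positive (BGC) and negative (non-BGC) samples into one labelled sequence."""
    rng = random.Random(seed)
    tagged = [(sample, 1) for sample in positives] + [(sample, 0) for sample in negatives]
    rng.shuffle(tagged)  # assumed: random interleaving of positive and negative samples

    domains: List[str] = []
    labels: List[int] = []
    for sample, label in tagged:
        domains.extend(sample)
        labels.extend([label] * len(sample))  # one label per Pfam domain
    return domains, labels
```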

Optionally, the apparatus further comprises:

a dividing module, configured to divide the genome sequence by using a sliding window based on a target quantity and a target step length to obtain at least two sequence fragments, wherein each sequence fragment comprises the target quantity of Pfam domains, and the offset between adjacent sequence fragments is the target step length;

the scoring module 1702 is configured to:

determining the Pfam score of each of the Pfam domains in the sequence fragment in units of the sequence fragment;

in response to a Pfam domain belonging to at least two sequence fragments, averaging the Pfam scores of that Pfam domain in the at least two sequence fragments, and determining the average as the target Pfam score of the Pfam domain.

Optionally, the first determining module 1703 includes:

a score determining unit for determining a mean value of the Pfam scores of the respective Pfam domains in the same gene as a gene score of the gene;

the merging unit is used for merging the genes with the gene scores higher than the score threshold value based on a merging rule to obtain merged genes;

a candidate BGC determining unit, configured to determine the merged gene as the candidate BGC in response to the number of nucleotides in the merged gene being greater than a number threshold and the merged gene not comprising a filtering domain, the filtering domain being a region known to not comprise BGC.
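The three units above can be sketched as follows. The merging rule is not spelled out in this passage, so the sketch assumes that genes scoring above the threshold are merged when at most `max_gap` lower-scoring genes lie between them; `max_gap`, the `Gene` data class and the function names are hypothetical, and the number of nucleotides is approximated by the span from the start of the first gene to the end of the last.

```python
from dataclasses import dataclass
from typing import List, Sequence, Set


@dataclass
class Gene:
    start: int                # genome coordinate of the first nucleotide
    end: int                  # genome coordinate just past the last nucleotide
    pfam_ids: List[str]
    pfam_scores: List[float]


def gene_score(gene: Gene) -> float:
    """Mean Pfam score of the domains in one gene (0.0 if the gene has no domains)."""
    if not gene.pfam_scores:
        return 0.0
    return sum(gene.pfam_scores) / len(gene.pfam_scores)


def candidate_bgcs(
    genes: Sequence[Gene],
    score_threshold: float,
    min_nucleotides: int,
    filter_domains: Set[str],
    max_gap: int = 0,
) -> List[List[int]]:
    """Merge high-scoring genes, then drop short regions and regions with filtering domains."""
    spans: List[List[int]] = []  # each span is [first_gene_index, last_gene_index]
    for i, gene in enumerate(genes):
        if gene_score(gene) <= score_threshold:
            continue
        if spans and i - spans[-1][1] - 1 <= max_gap:
            spans[-1][1] = i     # extend the current merged region
        else:
            spans.append([i, i])

    candidates = []
    for first, last in spans:
        members = list(range(first, last + 1))
        length = genes[last].end - genes[first].start
        domains = {p for i in members for p in genes[i].pfam_ids}
        if length > min_nucleotides and not (domains & set(filter_domains)):
            candidates.append(members)
    return candidates
```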

Optionally, the second determining module 1704 includes:

the category prediction unit is used for conducting BGC category prediction on the candidate BGCs through a random forest classifier to obtain a category prediction result, and the classifier categories of the random forest classifier comprise BGC categories and non-BGC categories;

a potential BGC determining unit, configured to determine that the candidate BGC is the potential BGC in response to the category prediction result containing a BGC category identifier;

the apparatus further comprises:

a filtering module, configured to filter out the candidate BGC in response to the category prediction result containing a non-BGC category identifier.

Optionally, the category prediction unit is configured to:

generating a domain statistical matrix based on the statistical information of the Pfam domain in the candidate BGC;

and inputting the structural domain statistical matrix into the random forest classifier to perform BGC class prediction to obtain the class prediction result.
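One possible form of the domain statistical matrix is sketched below. The statistical information is not specified above, so the sketch assumes it is the number of occurrences of each Pfam domain in a candidate BGC, taken over a fixed vocabulary; the function name `domain_count_matrix` and the `vocabulary` parameter are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def domain_count_matrix(
    candidates: Sequence[Sequence[str]],
    vocabulary: Sequence[str],
) -> List[List[int]]:
    """Build one row per candidate BGC and one column per Pfam identifier in the vocabulary."""
    column = {pfam: idx for idx, pfam in enumerate(vocabulary)}
    matrix = []
    for domains in candidates:
        row = [0] * len(vocabulary)
        for pfam, count in Counter(domains).items():
            idx = column.get(pfam)
            if idx is not None:  # domains outside the vocabulary are ignored
                row[idx] = count
        matrix.append(row)
    return matrix
```

Each row of the resulting matrix is a fixed-length feature vector, which is the kind of input a random forest classifier (for example scikit-learn's `RandomForestClassifier`) accepts.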

Optionally, the Pfam score is obtained by scoring the Pfam domain by a Pfam scoring model;

the device further comprises:

a second training module, configured to, in response to completion of the training of the Pfam scoring model, train the random forest classifier based on a positive sample, an enhanced negative sample, and a negative sample that was mispredicted during the training of the Pfam scoring model, wherein the positive sample belongs to a BGC data set, the mispredicted negative sample belongs to a non-BGC data set, and the enhanced negative sample is generated based on the negative sample in the non-BGC data set.

Optionally, the apparatus includes:

a negative sample acquisition module for acquiring the negative sample from the non-BGC dataset;

and the enhancement module is used for replacing the Pfam structure domain in the negative sample based on the similar relation of the Pfam structure domain to obtain the enhanced negative sample.
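The replacement-based enhancement can be sketched as follows, assuming the similarity relationship is available as a mapping from a Pfam identifier to a list of similar identifiers. The function name `augment_negative`, the `replace_prob` parameter and the `seed` parameter are hypothetical; how the similarity relationship is obtained and how many domains are replaced are not specified above.

```python
import random
from typing import Dict, List, Sequence


def augment_negative(
    sample: Sequence[str],
    similar: Dict[str, List[str]],
    replace_prob: float = 0.5,
    seed: int = 0,
) -> List[str]:
    """Replace Pfam domains in a negative sample with similar domains to get an enhanced sample."""
    rng = random.Random(seed)
    enhanced = []
    for pfam in sample:
        options = similar.get(pfam, [])
        # Replace only when a similar domain is known and the random draw allows it.
        if options and rng.random() < replace_prob:
            enhanced.append(rng.choice(options))
        else:
            enhanced.append(pfam)
    return enhanced
```

The enhanced sample has the same length and gene order as the original negative sample; only the identities of some domains change.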

In summary, in the embodiments of the present application, the Pfam domains contained in each gene of the genome sequence are first scored to obtain Pfam scores representing the probability that each Pfam domain belongs to a BGC, and candidate BGCs in the genome sequence are determined according to the Pfam scores; category prediction is then performed on the candidate BGCs, and the potential BGCs are finally determined from among the candidate BGCs. The scheme provided by the embodiments of the present application thus adopts a dual serial prediction mechanism: first-level filtering of BGCs is performed according to the Pfam scores, and second-level filtering of BGCs is performed through category prediction on the basis of the first-level filtering result, which helps reduce the false positive rate of the BGC prediction results and improve the accuracy of BGC prediction.

In this embodiment, the computer device performs serial BGC prediction by using a Pfam scoring model and a random forest classifier, which reduces the false positive rate of the predicted potential BGCs. When the computer device scores a Pfam domain by using the Pfam scoring model, the multi-dimensional biological information related to the Pfam domain (domain information, family description information and family identifier) is used as the scoring basis, which improves the accuracy of the scoring.

In addition, after the computer device determines the candidate genes based on the Pfam score and merges the candidate genes, the computer device filters the merged genes based on the number of nucleotides and the filtering structural domain, and the false positive rate of the candidate BGCs after primary filtering is reduced.

It should be noted that the apparatus provided in the above embodiments is illustrated only by way of the above division into functional modules. In practical applications, the functions may be assigned to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept; for details of the implementation process, refer to the method embodiments, which are not repeated here.

The present application further provides a computer-readable storage medium, where at least one instruction is stored in the computer-readable storage medium, and the at least one instruction is loaded and executed by a processor to implement the method for predicting potential BGC in a genome sequence according to any of the above embodiments.

Optionally, the computer-readable storage medium may include: ROM, RAM, Solid State Drives (SSD), or optical disks, etc. The RAM may include a resistive Random Access Memory (ReRAM) and a Dynamic Random Access Memory (DRAM), among others.

Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to make the computer device execute the method for predicting the potential BGC in the genome sequence according to the above embodiments.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is intended to be exemplary only, and not to limit the present application, and any modifications, equivalents, improvements, etc. made within the spirit and scope of the present application are intended to be included therein.
