Molecular virtual screening method based on protein sequence comparison

Document No.: 570134. Publication date: 2021-05-18.

Abstract: The technology "Molecular virtual screening method based on protein sequence alignment" (一种基于蛋白质序列比对的分子虚拟筛选方法) was designed by Hu Jun (胡俊), Zheng Linlin (郑琳琳), Dong Shijian (董世建), Bai Yansong (白岩松), Fan Xueqiang (樊学强) and Zhang Guijun (张贵军) on 2020-12-16. Its main content is a molecular virtual screening method based on protein sequence alignment. For an input protein sequence to be screened, the HHblits program is used to obtain the multiple sequence alignment of the protein. The frequency with which each residue type appears at each position of the alignment is calculated and recorded as a position-specific frequency matrix (PSFM). The same procedure generates a PSFM for every protein sequence in the protein-ligand interaction database BioLiP. Residue alignment scores and a similarity matching quality are calculated between the query protein and each BioLiP protein, and a set of potential seed molecules is obtained according to the matching quality. For each molecule in the molecular database DrugBank, the sum of its two-dimensional fingerprint similarity values against all molecules in the seed set is calculated; all DrugBank molecules are ranked by this score, and the top x·N_DrugBank molecules form the screening set for the query protein sequence. The invention can be used in any screening scenario.

1. A virtual molecular screening method based on protein sequence alignment is characterized by comprising the following steps:

1) inputting a protein sequence P with the residue number L to be subjected to molecular screening;

2) for a protein sequence P, searching a protein sequence database UniRef90 by using a HHblits program to generate multi-sequence alignment information containing M sequences, and recording the multi-sequence alignment information as MSA;

3) for the MSA, calculating a position-specific frequency matrix of size $L \times 20$, recorded as PSFM:

$$PSFM_{i,j} = \frac{1}{M}\sum_{m=1}^{M}\delta\left(Res_i^{m}, Res_j\right)$$

wherein $PSFM_{i,j}$ denotes the element in the $i$-th row and $j$-th column of PSFM, $i = 1, 2, \ldots, L$, $j = 1, 2, \ldots, 20$; $Res_j$ denotes the $j$-th of the 20 residue types (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V); $Res_i^{m}$ denotes the residue type at the $i$-th position of the $m$-th sequence in the MSA; and $\delta(Res_i^{m}, Res_j)$ outputs 1 if $Res_i^{m}$ and $Res_j$ are the same and 0 otherwise;

4) generating a PSFM matrix of corresponding proteins for each protein sequence in a protein-ligand interaction database BioLiP according to the steps 2) to 3);

5) according to the PSFM of the protein sequence P and the PSFM of each protein sequence T in the BioLiP library, calculating a similarity matrix of P and T, recorded as S (the defining formula is not reproduced in this text);

wherein $S_{i,j}$ denotes the alignment score of the $i$-th residue in P with the $j$-th residue in T, $i = 1, 2, \ldots, L$, $j = 1, 2, \ldots, L_T$, and $L_T$ denotes the number of residues of the protein sequence T; $PSFM^{P}_{i,k}$ denotes the element in the $i$-th row and $k$-th column of the PSFM of P, and $PSFM^{T}_{j,k}$ denotes the element in the $j$-th row and $k$-th column of the PSFM of T; a binding-site term takes one value when the $j$-th residue of T is a ligand-binding site and a different value otherwise; $Res^{P}_{i}$ denotes the residue type at the $i$-th position of P, $Res^{T}_{j}$ denotes the residue type at the $j$-th position of T, and the substitution score for $Res^{P}_{i}$ and $Res^{T}_{j}$ is looked up in the BLOSUM62 scoring matrix; $w_1$ and $w_2$ are two constants giving the weights of the PSFM term and the ligand-site term, respectively; an alignment score is calculated for every residue pair formed from a residue in P and a residue in T;

6) according to the similarity matrix of P and T obtained in step 5), calculating the alignment of the residues in P and T using the Needleman-Wunsch dynamic programming algorithm, recorded as $(a_i^{P}, a_i^{T})$, $i = 1, 2, \ldots, L_{ali}$, wherein $L_{ali}$ is the number of residue pairs in which a residue in P is aligned with a residue in T, $a_i^{P}$ denotes the position in P of the residue from P in the $i$-th pair, and $a_i^{T}$ denotes the position in T of the residue from T in the $i$-th pair;

7) calculating the similarity matching quality of the protein sequences P and T, recorded as $Q_{LBS}$ (the defining formula is not reproduced in this text);

wherein $S_{a_i^{P}, a_i^{T}}$ denotes the score for aligning the $a_i^{P}$-th residue of P with the $a_i^{T}$-th residue of T;

8) calculating, according to steps 5) to 7), the similarity matching quality $Q_{LBS}$ between each protein sequence T in BioLiP and the input protein sequence P; selecting from BioLiP all protein sequences with $Q_{LBS} \geq 0.5$, and selecting from BioLiP the small-molecule ligands that interact with those protein sequences, to obtain a molecule subset recorded as $TPD = \{M_i^{TPD}\}$, wherein $N_{TPD}$ is the number of molecules in TPD and $M_i^{TPD}$ is the $i$-th molecule in TPD, $i = 1, 2, \ldots, N_{TPD}$; each molecule $M_i^{TPD}$ in TPD can be understood as a molecule that potentially interacts with P;

9) for each molecule $M_i^{TPD}$ in TPD, $i = 1, 2, \ldots, N_{TPD}$, generating a 1024-bit molecular fingerprint $FP_i^{TPD}$ using the OpenBabel software, wherein each bit has a value of 0 or 1;

10) for each molecule $M_j^{DrugBank}$ in the molecular library to be screened, DrugBank, $j = 1, 2, \ldots, N_{DrugBank}$, likewise generating a 1024-bit molecular fingerprint $FP_j^{DrugBank}$ using the OpenBabel software, wherein $N_{DrugBank}$ is the total number of molecules in the molecular library DrugBank;

11) calculating the similarity value $TaniCoeff_{i,j}$ between the molecular fingerprint $FP_i^{TPD}$ of each molecule in TPD, $i = 1, 2, \ldots, N_{TPD}$, and the molecular fingerprint $FP_j^{DrugBank}$ of each molecule in DrugBank, $j = 1, 2, \ldots, N_{DrugBank}$:

$$TaniCoeff_{i,j} = \frac{\sum_{k=1}^{1024} FP_{i,k}^{TPD}\, FP_{j,k}^{DrugBank}}{\sum_{k=1}^{1024} \left(FP_{i,k}^{TPD}\right)^{2} + \sum_{k=1}^{1024} \left(FP_{j,k}^{DrugBank}\right)^{2} - \sum_{k=1}^{1024} FP_{i,k}^{TPD}\, FP_{j,k}^{DrugBank}}$$

wherein $FP_{i,k}^{TPD}$ is the value of the $k$-th element of $FP_i^{TPD}$ and $FP_{j,k}^{DrugBank}$ is the value of the $k$-th element of $FP_j^{DrugBank}$, $k = 1, 2, \ldots, 1024$;

12) from all the values calculated in step 11), calculating for each molecule $M_j^{DrugBank}$ in DrugBank the probability value $VSsco_j$ of a possible interaction with the input protein sequence P:

$$VSsco_j = \sum_{i=1}^{N_{TPD}} TaniCoeff_{i,j}$$

wherein $TaniCoeff_{i,j}$ denotes the similarity value between the molecular fingerprint $FP_i^{TPD}$ of the $i$-th molecule in TPD and the molecular fingerprint $FP_j^{DrugBank}$ of the $j$-th molecule in DrugBank;

13) sorting all molecules in DrugBank from high to low by $VSsco_j$ and returning the top $x \cdot N_{DrugBank}$ molecules as the final virtual screening result, wherein $x$ is the proportion of the molecular database DrugBank to be retained, with a value between 0 and 1.

Technical Field

The invention relates to the fields of bioinformatics and computer application, in particular to a molecular virtual screening method based on protein sequence comparison.

Background

Identifying lead molecules that interact with a given protein and appropriately modify its biological behavior is a fundamental challenge in pharmaceutical research. Virtual screening uses molecular docking software to simulate, on a computer, the interaction between a target and candidate drugs and to calculate their affinity, thereby reducing the number of compounds that must actually be screened and improving the efficiency of lead-compound discovery. A rapid and accurate virtual screening method is therefore of great value in guiding the design and development of drug molecules.

A survey of the literature shows that many virtual screening methods have been proposed, such as LncLocator (Cao Zhen, Pan Xiaoyong, Yang Yang, Huang Yan, Shen Hong-Bin. The lncLocator: a subcellular localization predictor for long non-coding RNAs based on a stacked ensemble classifier. Bioinformatics, 2018, 34(13): 2185-2194) and AutoDock Vina (Trott O, Olson AJ. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of Computational Chemistry, 2010, 31(2): 455-461). Although existing methods can be used for virtual screening of drug molecules, they generally require either the three-dimensional structure of the given protein or at least one known binding molecule, so they do not work well when neither the three-dimensional structure of the protein nor a binding molecule is known.

In summary, existing molecular virtual screening methods fall well short of the requirements of practical application in both the screening scenarios they support and their screening performance, and improvement is urgently needed.

Disclosure of Invention

To overcome the shortcomings of existing molecular virtual screening methods in both supported screening scenarios and screening performance, the invention provides a molecular virtual screening method based on protein sequence alignment that can be used in any screening scenario.

The technical scheme adopted by the invention for solving the technical problems is as follows:

a method for virtual screening of molecules based on protein sequence alignment, the method comprising the steps of:

1) inputting a protein sequence P with the residue number L to be subjected to molecular screening;

2) for the protein sequence P, using the HHblits program (https://toolkit.tuebingen.mpg.de/#/HHblits) to search the database UniRef90 (ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/), generating a multiple sequence alignment containing M sequences, recorded as MSA;

3) for the MSA, calculating a position-specific frequency matrix of size $L \times 20$, recorded as PSFM:

$$PSFM_{i,j} = \frac{1}{M}\sum_{m=1}^{M}\delta\left(Res_i^{m}, Res_j\right)$$

wherein $PSFM_{i,j}$ denotes the element in the $i$-th row and $j$-th column of PSFM, $i = 1, 2, \ldots, L$, $j = 1, 2, \ldots, 20$; $Res_j$ denotes the $j$-th of the 20 residue types (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V); $Res_i^{m}$ denotes the residue type at the $i$-th position of the $m$-th sequence in the MSA; and $\delta(Res_i^{m}, Res_j)$ outputs 1 if $Res_i^{m}$ and $Res_j$ are the same and 0 otherwise;

4) generating, according to steps 2) to 3), the PSFM of each protein sequence in the protein-ligand interaction database BioLiP (http://biolipip2018.chem.uoa.gr/);

5) according to the PSFM of the protein sequence P and the PSFM of each protein sequence T in the BioLiP library, calculating a similarity matrix of P and T, recorded as S (the defining formula is not reproduced in this text);

wherein $S_{i,j}$ denotes the alignment score of the $i$-th residue in P with the $j$-th residue in T, $i = 1, 2, \ldots, L$, $j = 1, 2, \ldots, L_T$, and $L_T$ denotes the number of residues of the protein sequence T; $PSFM^{P}_{i,k}$ denotes the element in the $i$-th row and $k$-th column of the PSFM of P, and $PSFM^{T}_{j,k}$ denotes the element in the $j$-th row and $k$-th column of the PSFM of T; a binding-site term takes one value when the $j$-th residue of T is a ligand-binding site and a different value otherwise; $Res^{P}_{i}$ denotes the residue type at the $i$-th position of P, $Res^{T}_{j}$ denotes the residue type at the $j$-th position of T, and the substitution score for $Res^{P}_{i}$ and $Res^{T}_{j}$ is looked up in the BLOSUM62 scoring matrix; $w_1$ and $w_2$ are two constants giving the weights of the PSFM term and the ligand-site term, respectively; an alignment score is calculated for every residue pair formed from a residue in P and a residue in T;

6) according to the similarity matrix of P and T obtained in step 5), calculating the alignment of the residues in P and T using the Needleman-Wunsch dynamic programming algorithm, recorded as $(a_i^{P}, a_i^{T})$, $i = 1, 2, \ldots, L_{ali}$, wherein $L_{ali}$ is the number of residue pairs in which a residue in P is aligned with a residue in T, $a_i^{P}$ denotes the position in P of the residue from P in the $i$-th pair, and $a_i^{T}$ denotes the position in T of the residue from T in the $i$-th pair;

7) calculating the similarity matching quality of the protein sequence T and the protein sequence P, recorded as $Q_{LBS}$ (the defining formula is not reproduced in this text);

wherein $S_{a_i^{P}, a_i^{T}}$ denotes the score for aligning the $a_i^{P}$-th residue of P with the $a_i^{T}$-th residue of T;

8) calculating, according to steps 5) to 7), the similarity matching quality $Q_{LBS}$ between each protein sequence T in BioLiP and the input protein sequence P; selecting from BioLiP all protein sequences with $Q_{LBS} \geq 0.5$, and selecting from BioLiP the small-molecule ligands that interact with those protein sequences, to obtain a molecule subset recorded as $TPD = \{M_i^{TPD}\}$, wherein $N_{TPD}$ is the number of molecules in TPD and $M_i^{TPD}$ is the $i$-th molecule in TPD, $i = 1, 2, \ldots, N_{TPD}$; each molecule $M_i^{TPD}$ in TPD can be understood as a molecule that potentially interacts with P;

9) for each molecule $M_i^{TPD}$ in TPD, $i = 1, 2, \ldots, N_{TPD}$, generating a 1024-bit molecular fingerprint $FP_i^{TPD}$ using the OpenBabel software (http://openbabel.org/wiki/Main_Page), wherein each bit has a value of 0 or 1;

10) for each molecule $M_j^{DrugBank}$ in the molecular library to be screened, DrugBank (https://go.drugbank.com/), $j = 1, 2, \ldots, N_{DrugBank}$, likewise generating a 1024-bit molecular fingerprint $FP_j^{DrugBank}$ using the OpenBabel software, wherein $N_{DrugBank}$ is the total number of molecules in the molecular library DrugBank;

11) calculating the similarity value $TaniCoeff_{i,j}$ between the molecular fingerprint $FP_i^{TPD}$ of each molecule in TPD, $i = 1, 2, \ldots, N_{TPD}$, and the molecular fingerprint $FP_j^{DrugBank}$ of each molecule in DrugBank, $j = 1, 2, \ldots, N_{DrugBank}$:

$$TaniCoeff_{i,j} = \frac{\sum_{k=1}^{1024} FP_{i,k}^{TPD}\, FP_{j,k}^{DrugBank}}{\sum_{k=1}^{1024} \left(FP_{i,k}^{TPD}\right)^{2} + \sum_{k=1}^{1024} \left(FP_{j,k}^{DrugBank}\right)^{2} - \sum_{k=1}^{1024} FP_{i,k}^{TPD}\, FP_{j,k}^{DrugBank}}$$

wherein $FP_{i,k}^{TPD}$ is the value of the $k$-th element of $FP_i^{TPD}$ and $FP_{j,k}^{DrugBank}$ is the value of the $k$-th element of $FP_j^{DrugBank}$, $k = 1, 2, \ldots, 1024$;

12) from all the values calculated in step 11), calculating for each molecule $M_j^{DrugBank}$ in DrugBank the probability value $VSsco_j$ of a possible interaction with the input protein sequence P:

$$VSsco_j = \sum_{i=1}^{N_{TPD}} TaniCoeff_{i,j}$$

wherein $TaniCoeff_{i,j}$ denotes the similarity value between the molecular fingerprint $FP_i^{TPD}$ of the $i$-th molecule in TPD and the molecular fingerprint $FP_j^{DrugBank}$ of the $j$-th molecule in DrugBank;

13) sorting all molecules in DrugBank from high to low by $VSsco_j$ and returning the top $x \cdot N_{DrugBank}$ molecules as the final virtual screening result, wherein $x$ is the proportion of the molecular database DrugBank to be retained, with a value between 0 and 1.
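Steps 2) and 3) can be illustrated with a minimal Python sketch. It assumes the alignment has already been produced (for example by HHblits) and loaded as equal-length strings, and it assumes that the frequency in step 3) is the match count divided by the number of sequences M. The function name `compute_psfm` and the toy alignment are hypothetical and serve only as an illustration.

```python
from typing import List

# Residue order as listed in step 3) of the method.
RESIDUES = "ARNDCQEGHILKMFPSTWYV"


def compute_psfm(msa: List[str]) -> List[List[float]]:
    """Return an L x 20 matrix of per-position residue frequencies.

    Assumptions (not stated in the source): every aligned sequence has the
    same length L, and counts are normalised by the number of sequences M.
    Gap or unknown characters match none of the 20 types and contribute 0.
    """
    if not msa:
        raise ValueError("MSA must contain at least one sequence")
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")

    m = len(msa)
    column_of = {res: j for j, res in enumerate(RESIDUES)}
    psfm = [[0.0] * len(RESIDUES) for _ in range(length)]
    for seq in msa:
        for i, res in enumerate(seq.upper()):
            j = column_of.get(res)
            if j is not None:  # the indicator is 1 only for a matching type
                psfm[i][j] += 1.0 / m
    return psfm


# Toy alignment with M = 4 sequences and L = 3 positions.
toy_msa = ["ARN", "ARD", "A-N", "CRN"]
psfm = compute_psfm(toy_msa)
print(psfm[0][RESIDUES.index("A")])  # frequency of A at position 1
```

Because gap characters are skipped, a row of the matrix may sum to less than 1 in this sketch; whether the original method treats gaps this way is not stated in the text.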

The technical conception of the invention is as follows. First, the HHblits program is used to obtain the multiple sequence alignment of the input protein sequence to be screened. Next, the frequency with which each residue type appears at each position of the alignment is calculated and recorded as the PSFM, and the same method is used to generate the PSFM of each protein sequence in the protein-ligand interaction database BioLiP. Then, residue alignment scores and the similarity matching quality between the query protein and each protein in BioLiP are calculated, and a set of potential seed molecules is obtained according to the matching quality. Finally, for each molecule in the molecular database, the sum of its two-dimensional fingerprint similarity values against all molecules in the seed set is calculated; all molecules in DrugBank are ranked by this score, and the top $x \cdot N_{DrugBank}$ molecules form the screening set for the protein sequence to be screened. The invention thus provides a molecular virtual screening method based on protein sequence alignment that can be used in any screening scenario.
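The final scoring and ranking stage described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: toy 8-bit lists stand in for the 1024-bit OpenBabel fingerprints, the score is taken as the plain sum of Tanimoto coefficients over the seed set, two all-zero fingerprints are given similarity 0, and x·N is rounded down. The names `tanimoto`, `vs_scores` and `top_fraction` are hypothetical.

```python
from typing import Dict, List, Sequence


def tanimoto(a: Sequence[int], b: Sequence[int]) -> float:
    """Tanimoto coefficient of two equal-length 0/1 fingerprints."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same number of bits")
    shared = sum(x & y for x, y in zip(a, b))  # bits set in both
    union = sum(a) + sum(b) - shared           # bits set in either
    return shared / union if union else 0.0    # assumption: 0 for empty pair


def vs_scores(seeds: List[Sequence[int]],
              library: Dict[str, Sequence[int]]) -> Dict[str, float]:
    """Score each library molecule by its summed similarity to all seeds."""
    return {name: sum(tanimoto(seed, fp) for seed in seeds)
            for name, fp in library.items()}


def top_fraction(scores: Dict[str, float], x: float) -> List[str]:
    """Return the highest-scoring fraction x of molecules, best first."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie between 0 and 1")
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    return ranked[: int(x * len(ranked))]      # assumption: round down


# Hypothetical seed set (TPD) and screening library.
seeds = [[1, 1, 0, 0, 1, 0, 0, 0], [1, 0, 0, 0, 1, 1, 0, 0]]
library = {
    "mol_a": [1, 1, 0, 0, 1, 0, 0, 0],
    "mol_b": [0, 0, 1, 1, 0, 0, 1, 0],
    "mol_c": [1, 0, 0, 0, 1, 0, 0, 0],
    "mol_d": [0, 0, 0, 0, 0, 1, 1, 1],
}
scores = vs_scores(seeds, library)
selected = top_fraction(scores, x=0.5)
print(selected)  # ['mol_a', 'mol_c']
```

Dividing the sum by the number of seed molecules would change the scale of the score but not the resulting ranking.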

The beneficial effects of the invention are as follows. On the one hand, constructing the set of potential seed molecules avoids the situation in which structure-based and ligand-based virtual screening methods cannot work because neither a protein structure nor a binding molecule is available. On the other hand, scoring and ranking all molecules in DrugBank by similarity takes into account more molecules that are not yet known to bind the protein, which helps in screening potential molecules.

Drawings

FIG. 1 is a schematic diagram of a virtual molecular screening method based on protein sequence alignment.

FIG. 2 shows the results of molecular screening of protein 5FQ9 using a molecular virtual screening method based on protein sequence alignment.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Referring to FIG. 1 and FIG. 2, the molecular virtual screening method based on protein sequence alignment comprises steps 1) to 13) as set out under Disclosure of Invention above.

In this embodiment, a virtual molecular screening method based on protein sequence alignment, which takes the virtual molecular screening of protein sequence 5FQ9 as an example, includes the following steps:

1) inputting a protein sequence 5FQ9 with the residue number 249 to be subjected to molecular screening;

2) for the protein sequence 5FQ9, using the HHblits program (https://toolkit.tuebingen.mpg.de/#/HHblits) to search the database UniRef90 (ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/), generating a multiple sequence alignment containing 381 sequences, recorded as MSA;

3) for the MSA, calculating a position-specific frequency matrix of size $249 \times 20$, recorded as PSFM:

$$PSFM_{i,j} = \frac{1}{M}\sum_{m=1}^{M}\delta\left(Res_i^{m}, Res_j\right)$$

wherein $PSFM_{i,j}$ denotes the element in the $i$-th row and $j$-th column of PSFM, $i = 1, 2, \ldots, 249$, $j = 1, 2, \ldots, 20$; $M = 381$ is the number of sequences in the MSA; $Res_j$ denotes the $j$-th of the 20 residue types (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V); $Res_i^{m}$ denotes the residue type at the $i$-th position of the $m$-th sequence in the MSA; and $\delta(Res_i^{m}, Res_j)$ outputs 1 if $Res_i^{m}$ and $Res_j$ are the same and 0 otherwise;

4) generating, according to steps 2) to 3), the PSFM of each protein sequence in the protein-ligand interaction database BioLiP (http://biolipip2018.chem.uoa.gr/);

5) according to the PSFM of the protein sequence 5FQ9 and the PSFM of each protein sequence T in the BioLiP library, calculating a similarity matrix of 5FQ9 and T, recorded as S (the defining formula is not reproduced in this text);

wherein $S_{i,j}$ denotes the alignment score of the $i$-th residue in 5FQ9 with the $j$-th residue in T, $i = 1, 2, \ldots, 249$, $j = 1, 2, \ldots, L_T$, and $L_T$ denotes the number of residues of the protein sequence T; $PSFM^{P}_{i,k}$ denotes the element in the $i$-th row and $k$-th column of the PSFM of 5FQ9, and $PSFM^{T}_{j,k}$ denotes the element in the $j$-th row and $k$-th column of the PSFM of T; a binding-site term takes one value when the $j$-th residue of T is a ligand-binding site and a different value otherwise; $Res^{P}_{i}$ denotes the residue type at the $i$-th position of 5FQ9, $Res^{T}_{j}$ denotes the residue type at the $j$-th position of T, and the substitution score for $Res^{P}_{i}$ and $Res^{T}_{j}$ is looked up in the BLOSUM62 scoring matrix; $w_1$ and $w_2$ are two constants giving the weights of the PSFM term and the ligand-site term, respectively; an alignment score is calculated for every residue pair formed from a residue in 5FQ9 and a residue in T;

6) according to the similarity matrix of 5FQ9 and T obtained in step 5), calculating the alignment of the residues in 5FQ9 and T using the Needleman-Wunsch dynamic programming algorithm, recorded as $(a_i^{P}, a_i^{T})$, $i = 1, 2, \ldots, L_{ali}$, wherein $L_{ali}$ is the number of residue pairs in which a residue in 5FQ9 is aligned with a residue in T, $a_i^{P}$ denotes the position in 5FQ9 of the residue from 5FQ9 in the $i$-th pair, and $a_i^{T}$ denotes the position in T of the residue from T in the $i$-th pair;

7) calculating the similarity matching quality of the protein sequence T and 5FQ9, recorded as $Q_{LBS}$ (the defining formula is not reproduced in this text);

wherein $S_{a_i^{P}, a_i^{T}}$ denotes the score for aligning the $a_i^{P}$-th residue of 5FQ9 with the $a_i^{T}$-th residue of T;

8) calculating, according to steps 5) to 7), the similarity matching quality $Q_{LBS}$ between each protein sequence T in BioLiP and the input protein sequence 5FQ9; selecting from BioLiP all protein sequences with $Q_{LBS} \geq 0.5$, and selecting from BioLiP the small-molecule ligands that interact with those protein sequences, to obtain a molecule subset recorded as $TPD = \{M_i^{TPD}\}$, wherein $N_{TPD}$ is the number of molecules in TPD and $M_i^{TPD}$ is the $i$-th molecule in TPD, $i = 1, 2, \ldots, N_{TPD}$; each molecule $M_i^{TPD}$ in TPD can be understood as a molecule that potentially interacts with 5FQ9;

9) for each molecule $M_i^{TPD}$ in TPD, $i = 1, 2, \ldots, N_{TPD}$, generating a 1024-bit molecular fingerprint $FP_i^{TPD}$ using the OpenBabel software (http://openbabel.org/wiki/Main_Page), wherein each bit has a value of 0 or 1;

10) for each molecule $M_j^{DrugBank}$ in the molecular library to be screened, DrugBank (https://go.drugbank.com/), $j = 1, 2, \ldots, N_{DrugBank}$, likewise generating a 1024-bit molecular fingerprint $FP_j^{DrugBank}$ using the OpenBabel software, wherein $N_{DrugBank}$ is the total number of molecules in the molecular library DrugBank;

11) calculating the similarity value $TaniCoeff_{i,j}$ between the molecular fingerprint $FP_i^{TPD}$ of each molecule in TPD, $i = 1, 2, \ldots, N_{TPD}$, and the molecular fingerprint $FP_j^{DrugBank}$ of each molecule in DrugBank, $j = 1, 2, \ldots, N_{DrugBank}$:

$$TaniCoeff_{i,j} = \frac{\sum_{k=1}^{1024} FP_{i,k}^{TPD}\, FP_{j,k}^{DrugBank}}{\sum_{k=1}^{1024} \left(FP_{i,k}^{TPD}\right)^{2} + \sum_{k=1}^{1024} \left(FP_{j,k}^{DrugBank}\right)^{2} - \sum_{k=1}^{1024} FP_{i,k}^{TPD}\, FP_{j,k}^{DrugBank}}$$

wherein $FP_{i,k}^{TPD}$ is the value of the $k$-th element of $FP_i^{TPD}$ and $FP_{j,k}^{DrugBank}$ is the value of the $k$-th element of $FP_j^{DrugBank}$, $k = 1, 2, \ldots, 1024$;

12) from all the values calculated in step 11), calculating for each molecule $M_j^{DrugBank}$ in DrugBank the probability value $VSsco_j$ of a possible interaction with the input protein sequence 5FQ9:

$$VSsco_j = \sum_{i=1}^{N_{TPD}} TaniCoeff_{i,j}$$

wherein $TaniCoeff_{i,j}$ denotes the similarity value between the molecular fingerprint $FP_i^{TPD}$ of the $i$-th molecule in TPD and the molecular fingerprint $FP_j^{DrugBank}$ of the $j$-th molecule in DrugBank;

13) sorting all molecules in DrugBank from high to low by $VSsco_j$ and returning the top $0.1 \cdot N_{DrugBank}$ molecules as the final virtual screening result, wherein $x$, the proportion of the molecular database DrugBank to be retained, is 0.1.
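Step 6) names the Needleman-Wunsch algorithm but leaves the routine itself implicit. The following Python sketch shows a global alignment computed directly from a precomputed similarity matrix S. It assumes a linear gap penalty of -1 per gap, which the source does not specify, and it reports 0-based positions. The function name `needleman_wunsch` and the toy matrix are hypothetical.

```python
from typing import List, Tuple


def needleman_wunsch(S: List[List[float]],
                     gap: float = -1.0) -> Tuple[List[Tuple[int, int]], float]:
    """Globally align two sequences given their pairwise score matrix S.

    S[i][j] is the score for aligning residue i of P with residue j of T.
    Returns the aligned (position in P, position in T) pairs, 0-based,
    and the total alignment score. The linear gap penalty is an assumption.
    """
    n = len(S)
    m = len(S[0]) if n else 0

    # dp[i][j]: best score aligning the first i residues of P
    # with the first j residues of T.
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + S[i - 1][j - 1],  # align pair
                           dp[i - 1][j] + gap,                  # gap in T
                           dp[i][j - 1] + gap)                  # gap in P

    # Trace back from the bottom-right corner to recover aligned pairs.
    pairs: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j - 1] + S[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs, dp[n][m]


# Toy 3 x 4 similarity matrix: P has 3 residues, T has 4.
toy_S = [
    [2, -1, -1, -1],
    [-1, -1, 2, -1],
    [-1, -1, -1, 2],
]
pairs, score = needleman_wunsch(toy_S)
print(pairs, score)  # [(0, 0), (1, 2), (2, 3)] 5.0
```

In the toy case the second residue of T is left unaligned, so the number of aligned pairs is 3 while T has 4 residues.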

FIG. 2 shows the molecular virtual screening result obtained for protein 5FQ9 with the above method.

The above describes the prediction result obtained by applying the molecular virtual screening of the invention to protein 5FQ9. It is not intended to limit the scope of the invention, and various modifications and improvements can be made without departing from that scope.
