HIV subtype classification system and classification method

Document No.: 9888 Published: 2021-09-17

Reading note: This technology, "HIV subtype classification system and classification method" (一种hiv亚型分类系统及分类方法), was designed by 于斌 and 姜淼 on 2021-06-23. Main content: The invention relates to the field of bioinformatics, in particular to an HIV subtype classification system and a classification method. The HIV subtype classification system constructed by the invention comprises a database pool, a typing module and a database management module. The database pool covers HIV sequences of all known genotypes and subtypes. The database management module periodically and automatically downloads data from public databases, and automatically builds and expands the alignment database and integrates it into the database pool. Introducing the database pool and three typing submodules greatly improves the accuracy and efficiency of HIV typing. The user only needs to input HIV sequencing results; the system automatically standardizes the data and types the sequences, and the user can then add the newly obtained standardized sequences to the database pool as needed.

1. An HIV subtype classification system, characterized in that it comprises:

a database pool comprising first-generation sequencing sequences and second-generation sequencing data of HIV from an open public database;

a database management module comprising a database pool construction and integration module and a data update module, wherein,

the database pool construction and integration module processes the input second-generation sequencing BAM file into a consensus sequence Reads.fasta, records the HIV sequences that pass quality checking into the database pool, and records sequences newly added to an open public database into the database pool,

the data updating module is used for automatically downloading public database sequences at regular intervals;

a typing module comprising the following typing sub-modules:

an HIV second-generation sequencing data typing submodule used for counting the breakpoint coverage of the consensus sequences in the database pool across all HIV subtypes, calculating and comparing the breakpoint coverage rates corresponding to the different HIV subtypes, comparing them with the breakpoint coverage rates of the sample to be typed, and outputting the typing result of the sample to be typed,

an HIV first-generation sequencing data typing submodule used for directly BLAST-aligning the consensus sequence of the sample to be typed against the HIV first-generation sequencing sequences in the database pool and outputting a sequence similarity comparison result,

and a recombinant and mixed subtype HIV typing submodule used for aligning second-generation sequencing reads of the sample to be tested against the second-generation sequencing data of HIV in the database pool, counting and comparing the proportions of reads assigned to the different subtypes, and assisting in judging recombinant and mixed subtypes.

2. The HIV subtype classification system according to claim 1, wherein the HIV second-generation sequencing data typing submodule performs typing by the following steps:

S1, inputting a sequence: inputting the constructed consensus sequence;

S2, performing a multiple sequence alignment against the existing HIV-1 Subtype Database typing list to obtain a preliminary uncorrected difference value;

S3, screening out amino acids associated with published surveillance drug-resistance sites to obtain a corrected difference value;

S4, typing by combining the corrected difference value with the upper limit of the breakpoint coverage rate, wherein,

S4.1, when the corrected difference value is greater than 11%, the sample to be typed is defined as an unknown subtype,

S4.2, when the corrected difference value is less than or equal to 11%, wherein,

when the best match is a simple subtype sequence, the difference value from the simple subtype sequence is compared with the upper limit of the typing breakpoint coverage rate, wherein if the difference value is less than or equal to the upper limit, the sample is judged to be that simple subtype, and if the difference value is greater than the upper limit, the sample is still judged to be that simple subtype and a warning is reported,

and when the best match is a complex subtype sequence, the difference value from the complex subtype sequence is compared with the upper limit of the typing breakpoint coverage rate, wherein if the difference value is less than or equal to the upper limit, the sample is judged to be that complex subtype, and if the difference value is greater than the upper limit, the best-matching parent is scored, wherein if the difference between the sample's best-matching parent score and the circulating recombinant subtype difference value is less than or equal to 1%, the parent subtype is reported, and otherwise the sample is judged to be a unique recombinant subtype.
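The branching in S4 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: all difference values and the upper limit are assumed to be percentages on the same scale, the "difference between the parent score and the recombinant subtype difference value" is assumed to be an absolute difference, and the names `classify`, `Call`, `upper_limit` and `parent_diff` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

UNKNOWN_CUTOFF = 11.0  # percent; threshold stated in S4.1 and S4.2
PARENT_MARGIN = 1.0    # percent; margin stated for the parent comparison


@dataclass
class Call:
    subtype: str
    warning: Optional[str] = None


def classify(corrected_diff: float, best_match: str, is_complex: bool,
             upper_limit: float, parent: Optional[str] = None,
             parent_diff: Optional[float] = None) -> Call:
    """Apply the S4 rules; every numeric argument is a percentage (assumption)."""
    # S4.1: too divergent from every reference, so the subtype is unknown.
    if corrected_diff > UNKNOWN_CUTOFF:
        return Call("unknown")
    # S4.2: within the upper limit, the best match is accepted as-is.
    if corrected_diff <= upper_limit:
        return Call(best_match)
    # Simple subtype above the limit: same call, but flagged with a warning.
    if not is_complex:
        return Call(best_match, warning="difference exceeds the typing upper limit")
    # Complex subtype above the limit: score the best-matching parent.
    if parent is None or parent_diff is None:
        raise ValueError("parent and parent_diff are required to score a complex match")
    if abs(parent_diff - corrected_diff) <= PARENT_MARGIN:
        return Call(parent)
    return Call("unique recombinant")
```

For example, `classify(7.0, "CRF01_AE", True, 5.0, parent="A", parent_diff=7.5)` reports the parent subtype because the two difference values lie within 1% of each other.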

3. The HIV subtype classification system according to claim 1, wherein the recombinant and mixed subtype HIV classification submodule performs the following steps to assist in the determination of recombinant and mixed subtypes:

aligning the second-generation sequencing reads of the sample to be tested against the second-generation sequencing data of HIV in the database pool, and counting and comparing the proportions of reads assigned to the different subtypes, wherein,

if the typing result of the HIV second-generation sequencing data typing submodule is a non-URF pure subtype, the read-alignment typing result in the recombinant and mixed subtype HIV typing submodule must be the same as the result of the HIV second-generation sequencing data typing submodule, with a proportion not lower than 60%;

if the typing result of the HIV second-generation sequencing data typing submodule is a URF pure subtype, the top-10 read-alignment typing results in the recombinant and mixed subtype HIV typing submodule have parental subtypes different from the typing result of the HIV second-generation sequencing data typing submodule, and the proportions of all top-10 results are not higher than 60%;

if the typing result of the recombinant and mixed subtype HIV typing submodule is a mixed subtype, the top-10 read-alignment typing results in the recombinant and mixed subtype HIV typing submodule have the same parental subtypes as the typing result of the HIV second-generation sequencing data typing submodule, and the proportions of all top-10 results are not higher than 60%.
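The read-proportion checks can be illustrated with a short Python sketch. It assumes each read has already been assigned the subtype label of its best alignment, and it reads "not higher than 60%" as applying to each top-10 result individually; both are assumptions, and the function names are hypothetical. The parental-subtype comparison is not modeled, because the text does not say how parents are annotated.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def top10_proportions(read_subtypes: Sequence[str]) -> List[Tuple[str, float]]:
    """Return up to ten (subtype, share of reads) pairs, largest share first."""
    counts = Counter(read_subtypes)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(subtype, n / total) for subtype, n in counts.most_common(10)]


def supports_pure_call(proportions: List[Tuple[str, float]],
                       consensus_call: str, min_share: float = 0.60) -> bool:
    """Pure-subtype rule: the top read result matches the consensus call at >= 60%."""
    if not proportions:
        return False
    top_subtype, top_share = proportions[0]
    return top_subtype == consensus_call and top_share >= min_share


def no_dominant_subtype(proportions: List[Tuple[str, float]],
                        max_share: float = 0.60) -> bool:
    """URF and mixed rules: no top-10 result holds more than 60% of the reads."""
    return all(share <= max_share for _, share in proportions)
```

With seven of ten reads assigned to subtype B, `supports_pure_call` accepts a consensus call of B, while an even split between B and C satisfies `no_dominant_subtype` instead.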

4. A method for classifying HIV subtypes, the method comprising the steps of:

sequencing a section of sequence spanning PR and RT regions on the pol gene of the HIV sample to be typed;

collecting HIV sequences from a public database and newly added sequence data of the HIV pol region, collecting second-generation sequencing data of HIV, and constructing a database pool;

processing data, namely processing the second-generation sequencing files in the database pool into a consensus sequence Reads.fasta, recording the HIV sequences that pass quality checking into the database pool, and recording sequences newly added to a public database into the database pool; and

counting the breakpoint coverage, across HIV subtypes, of the consensus sequences converted from the second-generation sequencing data of HIV in the database pool, calculating the breakpoint coverage rates corresponding to the different HIV subtypes, comparing the breakpoint coverage rates of the sample to be tested with those of the known HIV subtypes, determining the type of the sample to be tested, and outputting its typing result.

5. The HIV subtype classification method according to claim 4, characterized in that it further comprises the step of directly BLAST-aligning the consensus sequence against the HIV first-generation sequencing sequences in the database pool and outputting a sequence similarity comparison result.

6. The HIV subtype classification method according to claim 4, further comprising a step of judging recombinant and mixed subtypes of HIV, wherein second-generation sequencing reads of the sample to be tested are aligned against the second-generation sequencing data of HIV in the database pool, and the proportions of reads assigned to the different subtypes are counted and compared.

7. The HIV subtype classification method according to claim 4, characterized in that the classification of the sample to be tested is determined by the following steps:

S1, inputting a sequence: inputting the constructed consensus sequence;

S2, performing a multiple sequence alignment against the existing HIV-1 Subtype Database typing list to obtain a preliminary uncorrected difference value;

S3, screening out amino acids associated with published surveillance drug-resistance sites to obtain a corrected difference value;

S4, typing by combining the corrected difference value with the upper limit of the breakpoint coverage rate, wherein,

S4.1, when the corrected difference value is greater than 11%, the sample to be typed is defined as an unknown subtype,

S4.2, when the corrected difference value is less than or equal to 11%, wherein,

8. The HIV subtype classification method according to claim 4, characterized in that a 1 kb sequence spanning the PR and RT regions of the pol gene of the HIV sample to be typed is sequenced.

9. The HIV subtype classification method according to claim 4, characterized in that the typing results of the samples to be tested and the nucleic acid level similarity of the samples to be tested and the optimal typing results are outputted.

10. The HIV subtype classification method according to claim 4, characterized in that the typing results of the samples to be tested and the similarity of the amino acid levels of the samples to be tested and the optimal typing results are outputted.

Technical Field

The invention relates to the field of bioinformatics, in particular to an HIV subtype classification system and a classification method.

Background

HIV includes subtypes A, B, C, D, F, G, H, J, K, and the overall proportion of recombinant forms continues to increase over time. HIV diversity is complex and evolving, and it is a major challenge in HIV vaccine development. Monitoring the global molecular epidemiology of HIV subtypes remains crucial to the design, testing and implementation of HIV vaccines.

HIV typing guides the interpretation of drug resistance test results and the formulation of individualized treatment regimens for infected patients. Subtype-specific genetic barriers can play a role in the emergence and development of drug-resistance mutations, and the influence of other resistance sites on the major resistance sites differs between subtypes; both affect the direction and speed of evolution of the different subtypes. Drug-resistance mutation sites and their frequencies differ among subtypes, new resistance mutation sites continue to be reported, and some unexplained drug sensitivity also complicates the interpretation of genotypic resistance test results. Evaluating the subtype specificity of drug-resistance mutations and the differences in their characteristics therefore has important reference value when designing an ART regimen for a patient.

While existing HIV bioinformatics databases have made it easier for researchers and medical personnel to carry out relevant work, using them still involves some difficulties and risks, as follows:

1. The existing public databases draw on scattered information sources, and their HIV sequence information is mostly based on first-generation sequencing results, so sequence quality cannot be guaranteed.

2. Public-database genotyping annotation based on HIV second-generation sequencing results is still in the Beta testing stage, for example HGS-Beta of HIVDB. Moreover, only a few databases offer second-generation sequencing annotation tools that can integrate a user's own data, and those tools integrate it with limited flexibility and efficiency.

3. Most existing public database annotation tools execute tasks in a single thread and are poorly suited to mainstream data analysis workloads built on computer clusters and big data.

Disclosure of Invention

In view of the above problems, it is an object of the present invention to provide an HIV subtype classification system.

It is yet another object of the present invention to provide a method for classifying HIV subtypes.

The HIV subtype classification system according to the present invention comprises:

a database pool comprising first-generation sequencing sequences and second-generation sequencing data of HIV from an open public database;

a database management module comprising a database pool construction and integration module and a data update module, wherein,

the database pool construction and integration module processes the input second-generation sequencing BAM file into a consensus sequence Reads.fasta, records the HIV sequences that pass quality checking into the database pool, and records sequences newly added to the open public database into the database pool,

the data updating module is used for automatically downloading public database sequences at regular intervals;

a typing module comprising the following typing sub-modules:

The HIV subtype classification system according to the present invention, wherein the HIV second-generation sequencing data typing submodule performs typing by the following steps:

S1, inputting a sequence: inputting the constructed consensus sequence;

S2, performing a multiple sequence alignment against the existing HIV-1 Subtype Database typing list to obtain a preliminary uncorrected difference value;

S3, screening out amino acids associated with published surveillance drug-resistance sites to obtain a corrected difference value;

S4, typing by combining the corrected difference value with the upper limit of the breakpoint coverage rate, wherein,

S4.1, when the corrected difference value is greater than 11%, the sample to be typed is defined as an unknown subtype;

S4.2, when the corrected difference value is less than or equal to 11%, wherein,

The HIV subtype classification system according to the present invention, wherein the recombinant and mixed subtype HIV classification submodule performs the following steps to assist in the determination of recombinant and mixed subtypes:

The HIV subtype classification method according to the present invention comprises the following steps:

sequencing a section of sequence spanning PR and RT regions on the pol gene of the HIV sample to be typed;

collecting HIV sequences from a public database and newly added sequence data of the HIV pol region, collecting second-generation sequencing data of HIV, and constructing a database pool;

The HIV subtype classification method according to the present invention, wherein said method further comprises the step of directly BLAST-aligning the consensus sequence against the HIV first-generation sequencing sequences in the database pool and outputting a sequence similarity comparison result.

The HIV subtype classification method according to the present invention, wherein said method further comprises the step of judging recombinant and mixed subtypes of HIV, wherein second-generation sequencing reads of a sample to be tested are compared with second-generation sequencing data of HIV in a database pool, and the ratio of reads of different subtypes is statistically compared, wherein,

According to the HIV subtype classification method of the present invention, in step S4, the classification of the sample to be tested is determined by:

S1, inputting a sequence: inputting the constructed consensus sequence;

S2, performing a multiple sequence alignment against the existing HIV-1 Subtype Database typing list to obtain a preliminary uncorrected difference value;

S3, screening out amino acids associated with published surveillance drug-resistance sites to obtain a corrected difference value;

S4, typing by combining the corrected difference value with the upper limit of the breakpoint coverage rate, wherein,

S4.1, when the corrected difference value is greater than 11%, the sample to be typed is defined as an unknown subtype;

S4.2, when the corrected difference value is less than or equal to 11%, wherein,

According to the HIV subtype classification method of the present invention, a 1 kb sequence spanning the PR and RT regions of the pol gene of the HIV sample to be typed is sequenced.

According to the HIV subtype classification method, the classification result of the sample to be detected and the nucleic acid level similarity of the sample to be detected and the optimal classification result are output.

According to the HIV subtype classification method of the present invention, the classification result of the sample to be tested and the similarity of the amino acid levels of the sample to be tested and the optimal classification result are output.

The HIV subtype classification method according to the present invention further comprises the step of directly BLAST-aligning the consensus sequence of the sample to be tested to obtain the ten alignment results ranked highest by sequence similarity, for auxiliary typing judgment.
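As an illustration of selecting the ten most similar hits, the Python sketch below parses BLAST tabular output (the standard 12-column `-outfmt 6` layout) and ranks hits by percent identity, then bit score. The source does not state which output format or ranking key is used, so both are assumptions, and `top_hits` is a hypothetical helper.

```python
from typing import Dict, List


def top_hits(blast_tabular: str, n: int = 10) -> List[Dict[str, object]]:
    """Return the n best hits from BLAST tabular (outfmt 6) text.

    Columns: qseqid sseqid pident length mismatch gapopen
             qstart qend sstart send evalue bitscore
    """
    hits = []
    for line in blast_tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        hits.append({
            "subject": fields[1],
            "identity": float(fields[2]),
            "bitscore": float(fields[11]),
        })
    # Rank by percent identity first, bit score as tie-breaker (assumption).
    hits.sort(key=lambda h: (h["identity"], h["bitscore"]), reverse=True)
    return hits[:n]
```

The subject identifiers of the returned hits can then be looked up in the database pool to recover their recorded subtypes.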

According to the HIV subtype classification method of the invention, HIV sequences newly added to the public database NCBI are used to supplement the database pool at regular intervals.

According to the HIV subtype classification method, if the sample to be tested is judged after typing to be a new subtype, its data are added to the database pool. After the database construction scripts collect the sequence data, the sequences are stored in the database pool according to their source, thereby completing database expansion and generating a database sample information hash table that the typing module calls when working.
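A sample information hash table of this kind might look like the following Python sketch. The record layout (sample identifier, source, subtype) and the function name `build_sample_index` are assumptions made for illustration; the source does not describe the table's fields.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, str]  # (sample_id, source, subtype); hypothetical layout


def build_sample_index(records: Iterable[Record]):
    """Build a lookup by sample ID plus a grouping of sample IDs by source."""
    index: Dict[str, Dict[str, str]] = {}
    by_source: Dict[str, List[str]] = defaultdict(list)
    for sample_id, source, subtype in records:
        index[sample_id] = {"source": source, "subtype": subtype}
        by_source[source].append(sample_id)
    return index, dict(by_source)
```

A typing step can then resolve an alignment hit to its recorded subtype with a constant-time dictionary lookup, for example `index["S2"]["subtype"]`.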

The embodiment of the application adopts at least one technical scheme which can achieve the following beneficial effects:

1. The HIV subtype classification system constructed by the invention comprises a database pool, a typing module and a database management module, wherein the database pool covers HIV sequences of all known genotypes and subtypes. The database management module periodically and automatically downloads data from public databases, and automatically builds and expands the alignment database and integrates it into the database pool.

2. Introducing the database pool and the three typing submodules greatly improves the accuracy and efficiency of HIV typing. In addition, the typing module uses parallel computing packages in R, which greatly improves the performance of the typing tool.

3. The user only needs to input the HIV sequencing result; the database system automatically standardizes the data and types the sequences, and the user can then add the newly obtained standardized sequences to the database pool as needed. Through development and testing, the genotyping function of this database has become more mature and complete than that of the public databases. Each time a user uploads sequence information for analysis, the database can screen, integrate and record it, thereby expanding the database. The public databases require uploaded data to be in a specified format, such as a .codfreq or .aavf file. However, most second-generation sequencing data produced by the sequencing platforms on the market are in BAM or FASTA format, so users can use the public databases for genotyping only after manually converting the data with third-party software, which greatly limits the efficiency of data analysis. This database can genotype the BAM or FASTA files submitted by the user directly, without manual preprocessing by the user, which improves the efficiency of data analysis and gives the database greater flexibility.

4. The sequences recorded in the existing public databases are of uneven quality, and many of them contain degenerate bases. The HIV typing reference sequences adopted by the invention have been screened and contain no degenerate bases. They are derived from second-generation sequencing data with a sequencing depth greater than 1000×, so the data quality is good.

5. Most existing databases were set up to serve scientific research and are affected by their geographic and temporal distribution and by network conditions, so they are ill-suited to centralized analysis of massive data. The annotation tools on these databases mostly run in a single thread when performing data analysis tasks, that is, one sample is analyzed before the next is started. The typing tool of this database executes analysis tasks in multiple threads and can analyze up to 10 samples at the same time. When faced with data analysis tasks involving a large number of samples, it greatly improves working efficiency and saves time compared with the existing databases. This is one of the advantages of the database of the present invention.
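The idea of analyzing up to 10 samples concurrently can be sketched with a bounded worker pool. The source states that the typing module relies on parallel computing packages in R; the Python version below is a swapped-in illustration of the same pattern, and `type_sample` is a hypothetical placeholder for the real per-sample typing work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Tuple

MAX_PARALLEL_SAMPLES = 10  # concurrency limit taken from the description


def type_sample(sample_id: str) -> Tuple[str, str]:
    """Hypothetical stand-in for typing one sample; returns (sample_id, result)."""
    return sample_id, "untyped"


def type_batch(sample_ids: Iterable[str],
               worker: Callable[[str], Tuple[str, str]] = type_sample) -> Dict[str, str]:
    """Type many samples, running at most MAX_PARALLEL_SAMPLES at any one time."""
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_SAMPLES) as pool:
        # pool.map preserves input order and queues work beyond the limit.
        return dict(pool.map(worker, sample_ids))
```

Calling `type_batch` on a list of 25 sample identifiers processes them in overlapping groups of at most ten rather than one after another.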

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1 is a flow chart of a method for HIV subtype classification according to an embodiment of the present application;

FIG. 2 is a schematic diagram of an HIV subtype classification system architecture according to an embodiment of the present application;

FIG. 3 shows the typing principle of the HIV second-generation sequencing data typing submodule;

FIG. 4 is an output page of the typing results of sample HIV-ZD-6 i-2;

FIG. 5 is an output page of the results of reads vs. typing top10 for sample HIV-ZD-6 i-2;

FIG. 6 is an output page of statistical distribution of reads vs. typing top10 results for sample HIV-ZD-6 i-2;

FIG. 7 is an output page of the results of reads vs. typing top10 for sample 65;

FIG. 8 is an output page of the results of the typing of sample 65;

FIG. 9 is an output page of statistical distribution of reads versus typing top10 results for sample 65;

FIG. 10 is an output page of the sample typing end result;

FIG. 11 is a page showing the output of the 10 sequences with the highest similarity and their typing results from the test sample and the database;

FIG. 12 is an output page of 10 optimal alignment results obtained from the comparison of consensus sequences to public database;

FIG. 13 is an output page of the results of comparing sequencing reads to HIVdb and statistically comparing reads of different subtypes.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.