Variant sequence annotation method

Document No.: 1143093. Published: 2020-09-11.

Reading note: this technology, "Variant sequence annotation method", was designed by Wen Wen, Wang Hongyang, Zhu Ying, Chen Shuzhen, He Huisi, Gao Yong and Wang Depeng on 2020-05-25. Abstract: The invention belongs to the technical field of bioinformatics and relates to a variant sequence annotation method comprising: (1) determining variant sequence information: obtaining variant sequences, integrating reference sequence information, and normalizing the variant information; (2) annotating variants, where the annotation result includes the functional region, the variant type, the nucleic acid sequence change and the amino acid sequence change. The method reproduces the existing functions of ANNOVAR, the industry gold standard, and addresses its shortcomings: it distinguishes splice-site variants from splice-region variants, annotates CDS-edge variants, and reports frameshift together with stoploss/stopgain. It also uses a standardized notation and adds the gene identifier Entrez ID, and therefore has better application value.

1. A method for annotating a variant sequence, comprising the steps of:

(1) determining variant sequence information

(1.1) obtaining variant sequences:

comparing the sequence to be analyzed with a reference genome by using variation analysis software to obtain variation information;

(1.2) integrating reference sequence information:

acquiring a reference genome sequence and a reference genome annotation file; extracting a reference genome transcript and a CDS sequence from the reference genome sequence according to the annotation file;

acquiring Entrez ID information corresponding to the genes in the reference genome according to the description information of the genes;

integrating the reference genome transcript, the CDS sequence, the reference genome annotation file and the Entrez ID information to obtain integrated reference genome information;

(1.3) normalizing the variant information

Extracting chromosome information, reference genome physical position, reference genome sequence and variant sequence information of each variant from each variant information obtained in the step (1.1), and carrying out standardization treatment to obtain standardized variant information;

the normalized variation information includes: chromosomal information, starting position, ending position, normalized reference genomic sequence, normalized variant sequence;

(2) variant annotation

(2.1) annotating functional regions

The position of the variant relative to an element is determined from the normalized variant information and falls into one of two cases: the variant is located at the element edge, or the variant is located deep within the element; element edge means that the start position or the end position of the variant is at most x bp from the adjacent edge of the element, and element deep means that the start position or the end position of the variant is more than x bp from the adjacent edge of the element;

when the variant is located at the element edge, an edge site is further distinguished from an edge region; edge site means that the start position or the end position is within ±y bp of the adjacent edge of the element, edge region means that the start position or the end position is within -x bp to -y bp or +y bp to +x bp of the adjacent edge of the element, and y is less than x;

the elements include UTR, CDS and Intron;

(2.2) annotating variant types

If the starting and ending positions of a mutation are both in the non-CDS region, the annotation is empty;

if the start position and/or the end position of one variation is located in the CDS region, translating the reference cDNA sequence into a reference amino acid sequence, replacing the base in the reference cDNA with a variation base to obtain a variation cDNA sequence, and translating into a variation amino acid sequence; then, by comparing the reference cDNA sequence with the variant cDNA sequence, the reference amino acid sequence with the variant amino acid sequence, classifying and annotating the variant types according to single base variation, insertion variation and deletion variation;

(2.3) annotation of nucleic acid sequence variations

Comparing the reference cDNA sequence with the variant cDNA sequence, and annotating the nucleic acid variation information of the variant cDNA sequence according to the HGVS rule;

(2.4) annotation of amino acid sequence variations

Amino acid variation information for the variant amino acid sequence is annotated according to the HGVS rule by comparing the reference amino acid sequence to the variant amino acid sequence, wherein the amino acids are represented using three-letter abbreviations.

2. The method for annotating variant sequences according to claim 1, wherein said method for extracting reference genomic transcripts and CDS sequences of step (1.2) is:

extracting all reference genome transcripts and CDS sequences from the reference genome sequences in units of chromosomes according to the reference genome annotation file;

or reading all reference genome sequences at once, and then extracting reference genome transcripts and CDS sequences according to the reference genome annotation file.

3. The method for annotating a variant sequence according to claim 1, wherein an information index is established for the integrated reference genomic information of step (1.2) as follows: the reference genome is cut, chromosome by chromosome, into a plurality of windows with a given step size, and the transcript information contained in each window is obtained according to the reference genome annotation file.

4. The method for annotating a variant sequence according to claim 3, wherein said step size is 300 kb.

5. The method for annotating variant sequences according to claim 1, wherein said normalization process of step (1.3) is as follows:

when the reference genome sequence and the variant sequence both have a length of 1, the start position and the end position are both equal to the reference genome physical position;

when the reference genome sequence and the variant sequence differ in length, or have the same length other than 1, the bases shared by both are removed, the number of bases removed from the left of the reference genome sequence is recorded as LEN, and the start position and the end position are determined as follows:

when the normalized reference genomic sequence length is 0, the starting position is the reference genomic physical position + LEN-1; when the normalized reference genomic sequence length is greater than 0, the starting position is the reference genomic physical position + LEN;

when the normalized reference genomic sequence length is less than or equal to 1, the end position is the start position; when the normalized reference genomic sequence length is greater than 1, the end position is the starting position + the normalized reference genomic sequence length-1.

6. The method for annotating a variant sequence according to claim 1, wherein the method for annotating a functional region in step (2.1) is specifically as follows:

a. the variant is located upstream or downstream, annotated as UpStream or DownStream;

b. the variant is located in a UTR element,

the variant is located deep within the UTR, annotated as UTR3 or UTR5;

the variant is located at the edge of the UTR and the element adjacent to the edge is a non-Intron region, annotated as UTR3 or UTR5;

the variant is located at the edge of the UTR and the element adjacent to the edge is an Intron: if located at the edge site, annotated as UTR3_splicing_site or UTR5_splicing_site; if located in the edge region, annotated as UTR3_splicing_region or UTR5_splicing_region;

c. the variant is located in a CDS element,

the variant is located deep within the CDS, annotated as exonic;

the variant is located at the edge of the CDS and the element adjacent to the edge is a non-Intron region, annotated as exonic;

the variant is located at the edge of the CDS and the element adjacent to the edge is an Intron: if located at the edge site, annotated as CDS_splicing_site; if located in the edge region, annotated as CDS_splicing_region;

d. the variant is located in an Intron element,

the variant is located deep within the Intron, annotated as Intron;

the variant is located at the edge of the Intron: if located at the edge site, annotated as splicing_site; if located in the edge region, annotated as splicing_region;

the variant spans the junction between the Intron and the neighboring element, annotated as splicing_site;

here, the variant refers to the start position or the end position of the variant in the normalized variant information.

7. The method of annotating a variant sequence according to claim 6, wherein x is 10 and y is 2.

8. The method for annotating variant sequences according to claim 1, wherein said step (2.2) of annotating variant types comprises:

a. for single-base variants

if the reference amino acid sequence is identical to the variant amino acid sequence, annotated as synonymous_snv;

if the reference amino acid sequence differs from the variant amino acid sequence, annotated as nonsynonymous_snv;

b. for insertion variants

when the difference between the length of the variant cDNA sequence and the length of the reference cDNA sequence is a multiple of 3, the position of the stop codon in the variant cDNA sequence is compared with that in the reference cDNA sequence: if a stop codon appears prematurely in the variant cDNA sequence, annotated as ins_nonframeshift_stopgain; if the stop codon disappears from the variant cDNA sequence, annotated as ins_nonframeshift_stoploss; if the stop codon of the variant cDNA sequence appears normally at the end, annotated as ins_nonframeshift;

when the difference between the length of the variant cDNA sequence and the length of the reference cDNA sequence is not a multiple of 3, the position of the stop codon in the variant cDNA sequence is compared with that in the reference cDNA sequence: if a stop codon appears prematurely in the variant cDNA sequence, annotated as ins_frameshift_stopgain; if the stop codon disappears from the variant cDNA sequence, annotated as ins_frameshift_stoploss; if the stop codon of the variant cDNA sequence appears normally at the end, annotated as ins_frameshift;

c. for deletion variants

when the difference between the length of the variant cDNA sequence and the length of the reference cDNA sequence is a multiple of 3, the position of the stop codon in the variant cDNA sequence is compared with that in the reference cDNA sequence: if a stop codon appears prematurely in the variant cDNA sequence, annotated as del_nonframeshift_stopgain; if the stop codon disappears from the variant cDNA sequence, annotated as del_nonframeshift_stoploss; if the stop codon of the variant cDNA sequence appears normally at the end, annotated as del_nonframeshift;

when the difference between the length of the variant cDNA sequence and the length of the reference cDNA sequence is not a multiple of 3, the position of the stop codon in the variant cDNA sequence is compared with that in the reference cDNA sequence: if a stop codon appears prematurely in the variant cDNA sequence, annotated as del_frameshift_stopgain; if the stop codon disappears from the variant cDNA sequence, annotated as del_frameshift_stoploss; if the stop codon of the variant cDNA sequence appears normally at the end, annotated as del_frameshift.

Technical Field

The invention belongs to the technical field of biological information, and particularly relates to a method for annotating a variant sequence.

Background

With the development of sequencing technology, sequencing throughput keeps increasing and sequencing cost keeps falling, and genome and transcriptome information has been obtained for more and more species. In more specialized fields, there is growing interest in variation between varieties or populations of the same species, and even between individuals, in order to find the phenotypic differences that arise from variation in individual genetic information against a shared genetic background. This poses challenges for finding and annotating variant sequences.

Taking human as an example, ANNOVAR is the mainstream software for variant annotation and is regarded as the industry gold standard. In practical use, however, the inventors found that ANNOVAR does not solve the following problems:

During transcript formation, different splice sites in the mRNA precursor are selected and combined under different splicing patterns, producing different splice isoforms; the splice sites are the edges of the corresponding elements. It is well recognized in the field that variants within ±2 bp of a splice site affect gene splicing. However, many studies have shown that variants in the region adjacent to the splice site, outside the ±2 bp window, also affect gene splicing. It is therefore more scientific and reasonable to distinguish, in the annotation, variants at the splice site from variants near the splice site. ANNOVAR, however, annotates the splicing region only generically and does not make this distinction.

In addition, studies have shown that, besides variants in the splicing region, variants at the CDS edge also affect gene splicing. ANNOVAR does not specifically annotate or flag such sites.

For some InDel variants whose variant type is both frameshift and stoploss/stopgain, ANNOVAR reports only one of the two, so annotation information is lost.

In downstream gene research, the name (symbol) of a given gene often changes because of how gene naming rules work, so the same variant can be annotated with different gene names under different database versions. Many authoritative databases, such as NCBI and OMIM, have begun to use the gene Entrez ID to label genes in order to keep annotation results unique.

The Human Genome Variation Society (HGVS) has set out the currently accepted rules for variant nomenclature (http://varnomen.hgvs.org/), but ANNOVAR does not use the HGVS standard nomenclature by default. Moreover, for protein-level naming HGVS recommends three-letter amino acid abbreviations, such as p.Arg727Ser, whereas ANNOVAR uses one-letter abbreviations, such as p.R727S, which do not follow the recommendation.

Disclosure of Invention

In view of the above, the present invention provides a method for annotating variant sequences that not only implements the existing functions of ANNOVAR but also distinguishes splice-site variants from splice-region variants, adds annotation of CDS-edge variants, reports frameshift together with stoploss/stopgain information in full, and produces output in a more standardized form.

In order to achieve the above purpose, the technical scheme of the invention is as follows:

a method of sequence variation annotation comprising the steps of:

(1) determining variant sequence information

(1.1) obtaining variant sequences:

comparing the sequence to be analyzed with a reference genome by using variation analysis software to obtain variation information;

(1.2) integrating reference sequence information:

acquiring a reference genome sequence and a reference genome annotation file; extracting all reference genome transcripts and CDS sequences from the reference genome sequence according to the annotation file;

acquiring Entrez ID information corresponding to the genes in the reference genome according to the description information of the genes;

integrating the reference genome transcript, the CDS sequence, the reference genome annotation file and the Entrez ID information to obtain integrated reference genome information;

(1.3) normalizing the variant information

Extracting chromosome information, reference genome physical position, reference genome sequence and variant sequence information of each variant from each variant information obtained in the step (1.1), and carrying out standardization treatment to obtain standardized variant information;

the normalized variation information includes: chromosomal information, starting position, ending position, normalized reference genomic sequence, normalized variant sequence;

(2) variant annotation

(2.1) annotating functional regions

The position of the variant relative to an element is determined from the normalized variant information and falls into one of two cases: the variant is located at the element edge, or the variant is located deep within the element; element edge means that the start position or the end position is at most x bp from the adjacent edge of the element, and element deep means that the start position or the end position is more than x bp from the adjacent edge of the element; it should be noted that, since each element has two edges (one at the start of the element and one at its end), the adjacent edge is whichever of the two is closer to the position being compared.

When the start position or the end position is located at the element edge, an edge site is further distinguished from an edge region; edge site means that the start position or the end position is within ±y bp of the adjacent edge of the element, edge region means that the start position or the end position is within -x bp to -y bp or +y bp to +x bp of the adjacent edge of the element, and y is less than x;

the elements include UTR, CDS and Intron;

(2.2) annotating variant types

If the starting and ending positions of a mutation are both in the non-CDS region, the annotation is empty;

if the start position and/or the end position of one variation is located in the CDS region, translating the reference cDNA sequence into a reference amino acid sequence, replacing the base in the reference cDNA with a variation base to obtain a variation cDNA sequence, and translating into a variation amino acid sequence; then, by comparing the reference cDNA sequence with the variant cDNA sequence, the reference amino acid sequence with the variant amino acid sequence, classifying and annotating the variant types according to single base variation, insertion variation and deletion variation;

(2.3) annotation of nucleic acid sequence variations

Comparing the reference cDNA sequence with the variant cDNA sequence, and annotating the nucleic acid variation information of the variant cDNA sequence according to the HGVS rule;

(2.4) annotation of amino acid sequence variations

Amino acid variation information for the variant amino acid sequence is annotated according to the HGVS rule by comparing the reference amino acid sequence to the variant amino acid sequence, wherein the amino acids are represented using three-letter abbreviations.
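As an illustration of the three-letter convention in step (2.4), the following minimal sketch converts a simple one-letter substitution such as p.R727S into the three-letter form p.Arg727Ser. The function name `to_three_letter` is hypothetical, and the sketch only handles single-residue substitutions of the form p.<aa><position><aa>; it is not a general HGVS formatter.

```python
import re

# One-letter to three-letter amino acid codes; "*" (stop) is written as "Ter".
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "Ter",
}


def to_three_letter(change):
    """Rewrite e.g. 'p.R727S' as 'p.Arg727Ser' (substitutions only)."""
    match = re.fullmatch(r"p\.([A-Z*])(\d+)([A-Z*])", change)
    if match is None:
        raise ValueError(f"unsupported protein change: {change!r}")
    ref, pos, alt = match.groups()
    if ref not in THREE_LETTER or alt not in THREE_LETTER:
        raise ValueError(f"unknown amino acid code in {change!r}")
    return f"p.{THREE_LETTER[ref]}{pos}{THREE_LETTER[alt]}"


print(to_three_letter("p.R727S"))
```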

In the above technical solution, the method for extracting the reference genome transcripts and CDS sequences in step (1.2) is: extracting all reference genome transcripts and CDS sequences from the reference genome sequence chromosome by chromosome, according to the physical position of each transcript in the reference genome annotation file; or reading the entire reference genome sequence at once and then extracting the reference genome transcripts and CDS sequences according to the physical position of each transcript in the reference genome annotation file. Of the two schemes, the first uses less memory and runs faster.

In the above technical solution, for the integrated reference genome information described in step (1.2), an information index is established, and the specific method is as follows: cutting a reference genome into a plurality of windows by taking a chromosome as a unit and a certain step length, and acquiring transcript information contained in each window according to physical position information in a reference genome annotation file; further, the step size is 300 kb. The index is established to facilitate fast information retrieval, and the step size directly affects the number of indexes, the computer operation speed and the memory.
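The windowed index described above can be sketched as follows. This is a minimal illustration, assuming transcripts are given as (chromosome, start, end, transcript ID) tuples with 1-based inclusive coordinates; the names `build_window_index` and `lookup` and the example records are hypothetical and not taken from the source.

```python
from collections import defaultdict

WINDOW = 300_000  # step size of 300 kb, as stated in the text


def build_window_index(transcripts, window=WINDOW):
    """Map (chromosome, window number) to the transcripts overlapping that window."""
    index = defaultdict(list)
    for chrom, start, end, tx_id in transcripts:
        # A transcript is registered in every window it touches.
        for w in range((start - 1) // window, (end - 1) // window + 1):
            index[(chrom, w)].append((start, end, tx_id))
    return index


def lookup(index, chrom, pos, window=WINDOW):
    """Return the IDs of transcripts whose span contains pos."""
    candidates = index.get((chrom, (pos - 1) // window), [])
    return [tx_id for start, end, tx_id in candidates if start <= pos <= end]


# Hypothetical transcript records for illustration only.
transcripts = [
    ("chr1", 100, 5_000, "TX_A"),
    ("chr1", 299_000, 301_000, "TX_B"),  # spans two windows
    ("chr2", 10, 20, "TX_C"),
]
index = build_window_index(transcripts)
print(lookup(index, "chr1", 300_500))
```

A lookup only scans the transcripts registered in one window, so a smaller step size means shorter candidate lists but more index entries, which is the trade-off between speed and memory mentioned in the text.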

In the above technical solution, the standardization processing method in step (1.3) is as follows:

when the reference genome sequence and the variant sequence both have a length of 1, the start position and the end position are both equal to the reference genome physical position;

when the reference genome sequence and the variant sequence differ in length, or have the same length other than 1, the bases shared by both are removed, the number of bases removed from the left of the reference genome sequence is recorded as LEN, and the start position and the end position are determined as follows:

when the normalized reference genomic sequence length is 0, the starting position is the reference genomic physical position + LEN-1; when the normalized reference genomic sequence length is greater than 0, the starting position is the reference genomic physical position + LEN;

when the normalized reference genomic sequence length is less than or equal to 1, the end position is the start position; when the normalized reference genomic sequence length is greater than 1, the end position is the starting position + the normalized reference genomic sequence length-1.
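The normalization rules above can be sketched as a short function. This is a minimal illustration under stated assumptions: the input is a VCF-style record (position, reference allele, variant allele); "the same bases in both" is interpreted as the shared prefix followed by the shared suffix, which the text does not spell out; and an empty string stands for an empty normalized allele. The function name `normalize_variant` is hypothetical.

```python
def normalize_variant(chrom, pos, ref, alt):
    """Return (chrom, start, end, normalized_ref, normalized_alt)."""
    # Both alleles of length 1: start and end equal the physical position.
    if len(ref) == 1 and len(alt) == 1:
        return chrom, pos, pos, ref, alt

    # Remove the shared prefix; LEN is the number of bases removed on the left.
    length_removed = 0
    while (length_removed < len(ref) and length_removed < len(alt)
           and ref[length_removed] == alt[length_removed]):
        length_removed += 1
    ref, alt = ref[length_removed:], alt[length_removed:]

    # Remove the shared suffix as well (an assumption of this sketch).
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]

    # Start position: pos + LEN - 1 for an empty reference, else pos + LEN.
    start = pos + length_removed - 1 if len(ref) == 0 else pos + length_removed
    # End position: equal to start unless the normalized reference is longer than 1.
    end = start if len(ref) <= 1 else start + len(ref) - 1
    return chrom, start, end, ref, alt


print(normalize_variant("chr1", 100, "ATG", "A"))
```

For the deletion in the usage line, one shared base is removed on the left (LEN = 1), so the normalized reference is TG and the variant covers positions 101 to 102.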

In the above technical solution, the functional region is annotated for the purpose of determining which functional region of the gene the variant sequence is located in, and the method for annotating the functional region in step (2.1) specifically comprises:

a. the variant is located upstream or downstream, annotated as UpStream or DownStream;

b. the variant is located in a UTR element,

the variant is located deep within the UTR, annotated as UTR3 or UTR5;

the variant is located at the edge of the UTR and the element adjacent to the edge is a non-Intron region, annotated as UTR3 or UTR5;

the variant is located at the edge of the UTR and the element adjacent to the edge is an Intron: if the distance to the edge is less than or equal to y, annotated as UTR3_splicing_site or UTR5_splicing_site; if the distance to the edge is greater than y and less than x, annotated as UTR3_splicing_region or UTR5_splicing_region;

c. the variant is located in a CDS element,

the variant is located deep within the CDS, annotated as exonic;

the variant is located at the edge of the CDS and the element adjacent to the edge is a non-Intron region, annotated as exonic;

the variant is located at the edge of the CDS and the element adjacent to the edge is an Intron: if the distance to the edge is less than or equal to y, annotated as CDS_splicing_site; if the distance to the edge is greater than y and less than x, annotated as CDS_splicing_region;

d. the variant is located in an Intron element,

the variant is located deep within the Intron, annotated as Intron;

the variant is located at the edge of the Intron: if the distance to the edge is less than or equal to y, annotated as splicing_site; if the distance to the edge is greater than y and less than x, annotated as splicing_region;

the variant spans the junction between the Intron and the neighboring element, annotated as splicing_site;

here, the variant refers to the start position or the end position of the variant in the normalized variant information.

In the above technical solution, x is 10, and y is 2.
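The edge-site, edge-region and deep cases with x = 10 and y = 2 can be sketched as follows. This is a minimal illustration, not the inventors' implementation. It assumes a single position inside a single element with 1-based inclusive coordinates, counts the edge base itself as distance 1, and treats a distance of exactly x as edge region, following the definition of the element edge as "at most x bp". The function names and the `neighbour_is_intron` flag are hypothetical; upstream/downstream variants and variants spanning a junction are not handled.

```python
X_BP = 10  # edge threshold x from the text
Y_BP = 2   # edge-site threshold y from the text


def edge_class(pos, elem_start, elem_end, x=X_BP, y=Y_BP):
    """Classify a position inside an element as 'site', 'region' or 'deep'."""
    # Distance to the nearer edge, counting the edge base itself as 1.
    # This counting convention is an assumption of the sketch.
    dist = min(pos - elem_start, elem_end - pos) + 1
    if dist <= y:
        return "site"
    if dist <= x:
        return "region"
    return "deep"


def annotate_functional_region(element, pos, elem_start, elem_end,
                               neighbour_is_intron=True):
    """element is 'UTR3', 'UTR5', 'CDS' or 'Intron' (hypothetical encoding)."""
    deep_label = {"UTR3": "UTR3", "UTR5": "UTR5",
                  "CDS": "exonic", "Intron": "Intron"}[element]
    cls = edge_class(pos, elem_start, elem_end)
    if cls == "deep":
        return deep_label
    if element == "Intron":
        return f"splicing_{cls}"
    # UTR or CDS edge next to a non-Intron element keeps the plain label.
    if not neighbour_is_intron:
        return deep_label
    return f"{element}_splicing_{cls}"


print(annotate_functional_region("CDS", 998, 500, 999))
```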

In the above technical solution, the method for annotating the mutation type in step (2.2) specifically includes:

a. for single-base variants

if the reference amino acid sequence is identical to the variant amino acid sequence, annotated as synonymous_snv;

if the reference amino acid sequence differs from the variant amino acid sequence, annotated as nonsynonymous_snv;

b. for insertion variants

when the difference between the length of the variant cDNA sequence and the length of the reference cDNA sequence is a multiple of 3, the position of the stop codon in the variant cDNA sequence is compared with that in the reference cDNA sequence: if a stop codon appears prematurely in the variant cDNA sequence, annotated as ins_nonframeshift_stopgain; if the stop codon disappears from the variant cDNA sequence, annotated as ins_nonframeshift_stoploss; if the stop codon of the variant cDNA sequence appears normally at the end, annotated as ins_nonframeshift;

when the difference between the length of the variant cDNA sequence and the length of the reference cDNA sequence is not a multiple of 3, the position of the stop codon in the variant cDNA sequence is compared with that in the reference cDNA sequence: if a stop codon appears prematurely in the variant cDNA sequence, annotated as ins_frameshift_stopgain; if the stop codon disappears from the variant cDNA sequence, annotated as ins_frameshift_stoploss; if the stop codon of the variant cDNA sequence appears normally at the end, annotated as ins_frameshift;

c. for deletion variants

when the difference between the length of the variant cDNA sequence and the length of the reference cDNA sequence is a multiple of 3, the position of the stop codon in the variant cDNA sequence is compared with that in the reference cDNA sequence: if a stop codon appears prematurely in the variant cDNA sequence, annotated as del_nonframeshift_stopgain; if the stop codon disappears from the variant cDNA sequence, annotated as del_nonframeshift_stoploss; if the stop codon of the variant cDNA sequence appears normally at the end, annotated as del_nonframeshift;

when the difference between the length of the variant cDNA sequence and the length of the reference cDNA sequence is not a multiple of 3, the position of the stop codon in the variant cDNA sequence is compared with that in the reference cDNA sequence: if a stop codon appears prematurely in the variant cDNA sequence, annotated as del_frameshift_stopgain; if the stop codon disappears from the variant cDNA sequence, annotated as del_frameshift_stoploss; if the stop codon of the variant cDNA sequence appears normally at the end, annotated as del_frameshift.
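The classification rules above can be sketched as follows. This is a minimal illustration under stated assumptions: the inputs are the reference and variant CDS sequences as DNA strings starting at the first codon; equal-length inputs are treated as single-base variants; "appears normally at the end" is interpreted as the first stop codon being the last complete codon of the variant sequence; and label strings are the annotation names written without spaces. The names `translate` and `classify_variant_type` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cdna):
    """Translate all complete codons of an uppercase DNA string."""
    usable = len(cdna) - len(cdna) % 3
    return "".join(CODON_TABLE[cdna[i:i + 3]] for i in range(0, usable, 3))


def classify_variant_type(ref_cdna, var_cdna):
    ref_aa, var_aa = translate(ref_cdna), translate(var_cdna)
    diff = len(var_cdna) - len(ref_cdna)
    if diff == 0:  # treated as a single-base variant in this sketch
        return "synonymous_snv" if ref_aa == var_aa else "nonsynonymous_snv"

    kind = "ins" if diff > 0 else "del"
    frame = "nonframeshift" if diff % 3 == 0 else "frameshift"
    first_stop = var_aa.find("*")
    if first_stop == -1:                 # stop codon disappeared
        return f"{kind}_{frame}_stoploss"
    if first_stop < len(var_aa) - 1:     # stop codon appears prematurely
        return f"{kind}_{frame}_stopgain"
    return f"{kind}_{frame}"             # stop codon at the end as usual


print(classify_variant_type("ATGTGCGCTTAA", "ATGTGACGCTTAA"))
```

In the usage line, a one-base insertion shifts the reading frame and creates a TGA codon in second position, so both pieces of information (frameshift and stopgain) are kept in a single label.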

The beneficial effects of the invention are as follows: compared with ANNOVAR, the industry gold standard, the method yields the same number of annotations and the same major categories, while overcoming shortcomings of ANNOVAR: it classifies variants in a more scientific and detailed way by distinguishing splice-site variants from splice-region variants, annotating CDS-edge variants, and avoiding the loss of frameshift and stoploss/stopgain information; it also uses a standardized notation and adds the gene identifier Entrez ID.

Detailed Description

In order that the invention may be better understood, further details of the invention are set forth in the following examples.
