Method for filtering sequence contamination among immune repertoire high-throughput sequencing samples

Document No.: 1044854 Publication date: 2020-10-09

Abstract: This technology, "Method for filtering sequence contamination among immune repertoire high-throughput sequencing samples", was created by Zhang Wei (张伟), Luo Lihua (罗礼华) and Liu Xiao (刘晓) on 2019-03-28. The invention discloses a method for filtering sequence contamination among immune repertoire high-throughput sequencing samples. The method comprises performing intra-lane inter-sample low-frequency filtering, inter-lane inter-sample low-frequency filtering and nucleotide sequence diversity filtering on the valid data obtained by high-throughput sequencing of the immune repertoire. Sequencing contamination has always been an unavoidable problem in immune repertoire (TCR and BCR) library construction and sequencing, and it undermines the credibility of the data. However, in the computational analysis of immune repertoires, a systematic and general contamination filtering method has been lacking. The invention fills this gap by accurately filtering the likely sources of contamination in the sequencing process, ensuring the accuracy of downstream data analysis.

1. A method of filtering sequence contamination among immune repertoire high-throughput sequencing samples, comprising the steps of:

(A) carrying out intra-lane inter-sample low-frequency filtering on valid data obtained by high-throughput sequencing of the immune repertoire, and outputting filtered data;

(B) performing inter-lane inter-sample low-frequency filtering on the filtered data output in the step (A), and outputting the filtered data;

(C) carrying out nucleotide sequence diversity filtering on the filtered data output in the step (B), and outputting final valid data.

2. The method of claim 1, wherein: in the step (A), the valid data is subjected to intra-lane inter-sample low-frequency filtering according to a method comprising the following steps:

(A1) merging clones of all samples in the same lane, and statistically calculating the frequency of each clone in each sample;

(A2) if a clone A is present in two samples in the same lane, and the ratio of the frequencies of the clone A in the two samples is larger than a threshold value alpha, filtering out the clone A from the sample in which its frequency is lower.

3. The method according to claim 1 or 2, characterized in that: in the step (B), inter-lane inter-sample low-frequency filtering is carried out on the filtered data output by the step (A) according to a method comprising the following steps:

(B1) merging clones of all samples, and calculating the sample number ratio of each clone in each lane; the sample number ratio is the ratio of the number of samples in a certain lane a in which a certain clone B appears to the total number of samples in the lane a;

(B2) if the sample number ratio of a certain clone C in a certain lane b is higher than a threshold value beta, filtering the clone C in the lane b according to the step (B3);

(B3) sorting the samples in the lane b from low to high according to the frequency of the clone C in each sample of the lane b, and filtering the clone C out of the samples one by one, starting from the sample with the lowest frequency, until the sample number ratio of the clone C in the lane b is less than or equal to the threshold value beta.

4. The method of claim 3, wherein: in the step (A), the threshold α is determined by computing, from the sequencing results of a set of samples of the same phenotype, the frequency distributions of clones within a lane and across lanes, and selecting as the threshold α a value that discriminates between the frequency of the same clone within a lane and its frequency in other lanes;

further, the threshold α is 2000:1; and/or

in the step (B), the threshold β is 5 times the average of the sample number ratios of the clone C in the lanes other than the lane b.

5. The method according to any one of claims 1-4, wherein: in the step (C), the filtered data output from the step (B) is subjected to nucleotide sequence diversity filtering according to a method comprising the following steps:

(C1) translating the nucleotide sequences of all clones of all samples into the corresponding amino acid sequences, and identifying the shared amino acid sequences that appear in at least N samples;

(C2) if, in all samples in which an amino acid sequence M is present, the amino acid sequence M is translated from one and the same nucleotide sequence m, the nucleotide sequence m is filtered out from all samples.

6. The method of claim 5, wherein: in the step (C), the N samples are 8-12 samples.

7. The method according to any one of claims 1-6, wherein: the method further comprises, after step (C), the step (D) of:

(D) calculating, for each sample, the percentage of nucleotide sequences filtered out in each of the steps (A), (B) and (C) relative to the total nucleotide sequences of that sample; if, for a sample X, the percentage of nucleotide sequences filtered out in any one step is above a threshold value γ, all data for that sample X are filtered out.

8. The method of claim 7, wherein: the threshold value gamma is 20%.

9. A system for filtering sequence contamination between immune repertoire high-throughput sequencing samples, system I or system II;

the system I comprises a device A, a device B and a device C;

the system II comprises a device A, a device B, a device C and a device D;

said device A being capable of carrying out step (A) as defined in any one of claims 1 to 8; said device B being capable of carrying out step (B) of any one of claims 1 to 8; said device C being capable of carrying out step (C) of any one of claims 1 to 8; said device D being capable of carrying out step (D) as claimed in claim 7 or 8.

10. Use of the system of claim 9 for filtering sequence contamination among immune repertoire high-throughput sequencing samples.

Technical Field

The invention relates to the field of bioinformatics, in particular to a method for filtering sequence contamination among immune repertoire high-throughput sequencing samples.

Background

The immune repertoire (IR) refers to the sum of all functionally diverse T lymphocytes and B lymphocytes in an individual at a given time. The TCR (T cell receptor) is a receptor located on the surface of T cells, and the BCR (B cell receptor) is an immunoglobulin located on the surface of B cells; both recognize antigens and, upon antigen stimulation, initiate an immune response. The TCR and the BCR are each composed of two chains (a heavy chain and a light chain for the BCR, α and β chains for the TCR), and each chain contains 450-550 nucleotides or 211-217 amino acid residues. Some regions of each chain are highly polymorphic and are called variable regions; within them, the complementarity-determining regions (CDRs) show the highest diversity and can be spatially complementary to an epitope. The β chain of the TCR and the heavy chain of the BCR are encoded by V, D and J gene clusters, while the α chain of the TCR and the light chain of the BCR are encoded by V and J gene clusters. A large number of V(D)J genes are arranged in tandem on the same chromosome, separated from one another by introns. During the development of T and B lymphocytes, the V(D)J genes undergo rearrangement. Random combination of the V(D)J genes generates a large amount of diversity, and random insertion or deletion of nucleotides at the VD or DJ junctions further enriches receptor diversity, yielding an estimated 10^18 unique TCRs and 2 × 10^12 unique BCRs. Together these constitute a vast library of antigen recognition receptors, i.e., the immune repertoire.

To capture such a highly diverse immune repertoire, the TCR and BCR gene regions are generally amplified with specific primers and then subjected to high-throughput sequencing. Three experimental capture methods are currently in use: multiplex PCR (polymerase chain reaction), 5' RACE (rapid amplification of cDNA ends), and UID (unique molecular identifier) technology based on randomly synthesized tag sequences. Over the last decade, immune repertoire technology has been used in many studies and applications, including detection of pathogenic clones in leukemia and monitoring of immune recovery after treatment, the tumor immune microenvironment and immunotherapy, evaluation of immune responses before and after vaccination and of the effects of different vaccines, rapid screening of monoclonal antibodies, and identification of neutralizing antibodies against HIV infection. In particular, for monitoring recovery after leukemia treatment, immune repertoire technology offers clear advantages: better sensitivity and a more systematic immune assessment. Research on the tumor immune microenvironment has developed rapidly in recent years, and immune repertoire technology plays an important role there as well: using the TCR as an identifying marker of T lymphocytes, the evolution and differentiation of lymphocytes can be analyzed accurately. The TCR also plays a decisive role in immunotherapy.

In the immune repertoire field, earlier studies used only small numbers of samples. With few samples, the contamination rate during library construction and sequencing is very low, and even if contaminating sequences are present they do not affect the analysis. The small sample sizes were also due to cost considerations and to the limited understanding of the field at the time. Large-scale immune repertoire sequencing has begun only in recent years, and the contamination problem was discovered only recently; consequently, no analytical method or approach for filtering contaminating sequences between samples has been available.

In the experimental capture and amplification of TCRs and BCRs for immune repertoire analysis, PCR amplification is usually performed in a 96-well plate, with one sample in each well. Because an immune repertoire sample does not require a large amount of sequencing, multiple samples are typically mixed (pooled) in one sequencing lane, for example 48 or 96 samples per lane, and a tag sequence is usually added to each sample to tell them apart. Amplifying and sequencing many samples together introduces cross-contamination between samples: sequences that belong to one sample end up, in small amounts, in another sample. Such sequence contamination greatly interferes with the assessment of TCR and BCR diversity and with the analysis of clones shared between samples. For example, a TCR may truly be present in only one sample, yet because of sample-to-sample contamination the clone is detected in all 10 samples, and it is then wrongly concluded that the clone is enriched in this group of samples or that it is an antigen-associated clone.

Disclosure of Invention

In view of the above problems, the present invention provides a method that uses bioinformatic analysis to remove contaminating sequences among immune repertoire high-throughput sequencing samples.

In a first aspect, the invention claims a method of filtering sequence contamination among immune repertoire high-throughput sequencing samples.

The method for filtering sequence contamination among immune repertoire high-throughput sequencing samples provided by the invention comprises the following steps:

(A) carrying out intra-lane inter-sample low-frequency filtering on valid data obtained by high-throughput sequencing of the immune repertoire, and outputting filtered data;

(B) performing inter-lane inter-sample low-frequency filtering on the filtered data output in the step (A), and outputting the filtered data;

(C) carrying out nucleotide sequence diversity filtering on the filtered data output in the step (B), and outputting final valid data.

In step (A), the valid data can be obtained by performing basic processing and conventional bioinformatic analysis on the raw (off-machine) data produced by high-throughput sequencing of the immune repertoire, using conventional immune repertoire analysis software such as IMonitor with the parameters a-b-A1-A2-o-n-t-Rs.

Briefly, this comprises basic data processing, V(D)J assignment, sequence structure analysis, and data statistics and visualization. First, low-quality reads are filtered out and paired reads are merged (yielding longer sequences). Second, the merged sequences are aligned to the existing V/D/J reference sequences, and the corresponding V/D/J genes are assigned according to the alignment scores. Third, PCR and sequencing errors are corrected, the sequence structure is determined and the sequences are translated, and sequences that do not align to V and J sequences, and those without a CDR3 region, are filtered out. Finally, statistics and graphical displays are generated for the valid sequences.

In multi-sample sequencing, a clone present at high frequency in one sample ("clone" is an immunological term denoting one TCR (T cell receptor) or BCR (B cell receptor) nucleotide sequence) readily contaminates other samples in the same lane, so that the high-frequency clone also appears in those other samples at a much lower frequency. Therefore, in step (A), the valid data may be subjected to intra-lane inter-sample low-frequency filtering according to a method comprising the following steps:

(A1) merging clones of all samples in the same lane, and statistically calculating the frequency of each clone in each sample;

(A2) if a clone A (identical nucleotide sequence) is present in two samples in the same lane and the ratio of the frequencies of the clone A in the two samples is greater than a threshold value alpha, the clone A is filtered out of the sample in which its frequency is lower. For example, if clone A is present in both sample 1 and sample 2 in the same lane, and the ratio of the frequency of clone A in sample 1 to its frequency in sample 2 is greater than the threshold α (so that clone A is more frequent in sample 1 than in sample 2), then clone A is filtered out of sample 2.

Further, the threshold α can be obtained as follows: from the sequencing results of a set of samples of the same phenotype (the same phenotype means the same disease, or all healthy individuals), the frequency distributions of clones within a lane and across lanes are computed, and a value that discriminates between the frequency of the same clone within one lane and its frequency in other lanes is selected as the threshold α.

Further, in the present invention, the threshold α is specifically 2000:1.
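Steps (A1) and (A2) can be illustrated with a minimal Python sketch. It is not the inventors' implementation: the function name `intra_lane_filter`, the nested-dictionary layout (sample name, then clone nucleotide sequence, then frequency) and the example sequences are assumptions made for illustration, and all ratios are evaluated on the original frequencies rather than updated after each removal. A clone is removed from a sample whenever some other sample in the lane carries it at more than α times the frequency, so comparing each sample with the lane-wide maximum frequency gives the same result as comparing every pair of samples.

```python
ALPHA = 2000.0  # frequency-ratio threshold alpha (2000:1 in the text)


def intra_lane_filter(lane, alpha=ALPHA):
    """Remove low-frequency copies of clones shared within one lane.

    lane: dict mapping sample name -> {clone nucleotide sequence: frequency},
          with every stored frequency > 0 (hypothetical layout).
    Returns (filtered_lane, removed), where removed is a set of
    (sample, clone) pairs that were filtered out.
    """
    # (A1) pool the clones of all samples in the lane and record, for each
    # clone, the highest frequency observed in any sample.
    max_freq = {}
    for clones in lane.values():
        for clone, freq in clones.items():
            max_freq[clone] = max(max_freq.get(clone, 0.0), freq)

    # (A2) drop a clone from a sample when another sample in the same lane
    # carries it at more than alpha times the frequency.
    filtered, removed = {}, set()
    for sample, clones in lane.items():
        kept = {}
        for clone, freq in clones.items():
            if max_freq[clone] / freq > alpha:
                removed.add((sample, clone))
            else:
                kept[clone] = freq
        filtered[sample] = kept
    return filtered, removed


# Toy lane with two samples and two made-up clone sequences.
lane = {
    "sample1": {"TGTGCCAGCAGT": 0.3, "TGTGCCAGCTCA": 0.001},
    "sample2": {"TGTGCCAGCAGT": 0.0001, "TGTGCCAGCTCA": 0.002},
}
filtered, removed = intra_lane_filter(lane)
print(removed)
```

In this toy lane the first clone has a frequency ratio of 3000:1 between sample1 and sample2, so it is removed from sample2; the second clone has a ratio of 2:1 and both copies are kept.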

For a batch of samples of the same phenotype, the number of samples in which any given clone appears by chance should be comparable from lane to lane; if a particular clone is found in markedly more samples of one lane, that lane has probably been contaminated. Therefore, in step (B), the filtered data output from step (A) may be subjected to inter-lane inter-sample low-frequency filtering according to a method comprising the following steps:

(B1) merging clones of all samples, and calculating the sample number ratio of each clone in each lane; the sample number ratio is the ratio of the number of samples in a certain lane a in which a certain clone B appears to the total number of samples in the lane a;

(B2) if the sample number ratio of a certain clone C in a certain lane b is higher than a threshold value beta, filtering the clone C in the lane b according to the step (B3);

(B3) sorting the samples in the lane b from low to high according to the frequency of the clone C in each sample of the lane b, and filtering the clone C out of the samples one by one, starting from the sample with the lowest non-zero frequency, until the sample number ratio of the clone C in the lane b is less than or equal to the threshold value beta.

Further, the threshold β may be 5 times the average of the sample number ratios of the clone C in the lanes other than the lane b.
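Steps (B1) to (B3) can be sketched in Python as follows. This is an illustration under stated assumptions, not the inventors' implementation: the function name `inter_lane_filter`, the nested-dictionary layout (lane, then sample, then clone, then frequency) and the placeholder clone names `X` and `Y` are hypothetical. The threshold β for a clone in a lane is taken as 5 times the mean sample number ratio of that clone in the other lanes, computed from the counts before any removal. The text does not say what to do when a clone is absent from every other lane (β would be 0), so the sketch simply leaves such clones untouched.

```python
import copy


def inter_lane_filter(lanes, fold=5.0):
    """lanes: lane name -> sample name -> {clone: frequency}, frequency > 0.

    Every lane is assumed to contain at least one sample.
    Returns (filtered_lanes, removed) with removed a list of
    (lane, sample, clone) tuples in order of removal.
    """
    lanes = copy.deepcopy(lanes)  # leave the caller's data untouched

    # (B1) count, per clone and per lane, the samples containing the clone.
    counts = {}
    for lane, samples in lanes.items():
        for clones in samples.values():
            for clone in clones:
                per_lane = counts.setdefault(clone, {})
                per_lane[lane] = per_lane.get(lane, 0) + 1

    removed = []
    for clone, per_lane in counts.items():
        for lane in per_lane:
            others = [name for name in lanes if name != lane]
            if not others:
                continue  # a single lane offers nothing to compare against
            other_mean = sum(
                per_lane.get(o, 0) / len(lanes[o]) for o in others
            ) / len(others)
            if other_mean == 0:
                continue  # assumption: skip clones seen in no other lane
            beta = fold * other_mean  # (B2) lane-specific threshold

            # (B3) remove the clone from the lowest-frequency samples first
            # until the sample number ratio is at or below beta.
            samples = lanes[lane]
            carriers = sorted(
                (s for s in samples if clone in samples[s]),
                key=lambda s: samples[s][clone],
            )
            present = len(carriers)
            for s in carriers:
                if present / len(samples) <= beta:
                    break
                del samples[s][clone]
                removed.append((lane, s, clone))
                present -= 1
    return lanes, removed


# Toy data: clone X is in all 4 samples of lane1 but in 1 of 8 in lane2.
lanes = {
    "lane1": {"a1": {"X": 0.5}, "a2": {"X": 0.001},
              "a3": {"X": 0.002}, "a4": {"X": 0.003}},
    "lane2": {f"b{i}": {"Y": 0.1} for i in range(8)},
}
lanes["lane2"]["b0"]["X"] = 0.05
filtered, removed = inter_lane_filter(lanes)
print(removed)
```

Here the sample number ratio of X is 1/8 in lane2, so β for lane1 is 5 × 0.125 = 0.625. The ratio in lane1 starts at 4/4, and X is removed from the two lowest-frequency samples (a2, then a3) until the ratio reaches 2/4.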

Owing to codon degeneracy, the same amino acid sequence can be translated from several different nucleotide sequences. An amino acid sequence that arises independently in many samples is therefore expected to be encoded by more than one nucleotide sequence, whereas one that is always encoded by the identical nucleotide sequence is more likely to have spread by contamination. Thus, in step (C), the filtered data output from step (B) may be subjected to nucleotide sequence diversity filtering according to a method comprising the following steps:

(C1) translating the nucleotide sequences of all clones of all samples into the corresponding amino acid sequences, and identifying the shared amino acid sequences that appear in at least N samples;

(C2) if, in all samples in which a certain amino acid sequence M appears, the amino acid sequence M is translated from one and the same nucleotide sequence m, the nucleotide sequence m is considered a contaminating sequence and is filtered out of all samples.

Further, the N samples may be 8-12 samples, such as 10 samples.
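Steps (C1) and (C2) can be sketched in Python as shown below. This is an illustration, not the inventors' implementation: the names `translate` and `diversity_filter`, the dictionary layout (sample, then clone nucleotide sequence, then frequency) and the example sequences are assumptions. The sketch also assumes that clone sequences are in frame, contain only A, C, G and T, and are translated with the standard genetic code; trailing bases that do not fill a codon are ignored.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nt):
    """Translate an in-frame nucleotide sequence ('*' marks a stop codon)."""
    nt = nt.upper()
    usable = len(nt) - len(nt) % 3
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, usable, 3))


def diversity_filter(samples, min_samples=10):
    """samples: sample name -> {clone nucleotide sequence: frequency}.

    Returns (filtered_samples, contaminants), where contaminants is the
    set of nucleotide sequences removed from every sample.
    """
    # (C1) for each amino acid sequence, collect the samples it appears in
    # and the distinct nucleotide sequences that encode it.
    aa_samples, aa_nts = {}, {}
    for sample, clones in samples.items():
        for nt in clones:
            aa = translate(nt)
            aa_samples.setdefault(aa, set()).add(sample)
            aa_nts.setdefault(aa, set()).add(nt)

    # (C2) an amino acid sequence shared by at least min_samples samples
    # but always encoded by one nucleotide sequence marks a contaminant.
    contaminants = {
        next(iter(nts))
        for aa, nts in aa_nts.items()
        if len(aa_samples[aa]) >= min_samples and len(nts) == 1
    }
    filtered = {
        sample: {nt: f for nt, f in clones.items() if nt not in contaminants}
        for sample, clones in samples.items()
    }
    return filtered, contaminants


# Toy data with N lowered to 3 so that three samples suffice.
samples = {
    "s1": {"TGTGCCAGC": 0.01, "TGTAGCGCC": 0.02},
    "s2": {"TGTGCCAGC": 0.02, "TGCTCAGCT": 0.01},
    "s3": {"TGTGCCAGC": 0.03, "TGTTCTGCA": 0.05},
}
filtered, contaminants = diversity_filter(samples, min_samples=3)
print(contaminants)
```

In the toy data, the amino acid sequence CAS is encoded by the identical nucleotide sequence in all three samples and is flagged, while CSA is encoded by three different nucleotide sequences and is kept.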

The following step (D) may be further included after step (C):

(D) calculating, for each sample, the percentage of nucleotide sequences filtered out in each of the steps (A), (B) and (C) relative to the total nucleotide sequences of that sample; if, for a sample X, the percentage of nucleotide sequences filtered out in any one step is higher than the threshold value gamma, the data of the sample X is considered too heavily contaminated for further use, and all the data of the sample X are filtered out.

Further, the threshold γ may be 20%.
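Step (D) amounts to a per-step quality check on each sample, which can be sketched in Python as follows. The function name `drop_contaminated_samples` and the input layout are hypothetical. The text does not state whether the percentages are computed over distinct sequences or over reads, nor which total is used as the denominator; the sketch assumes simple counts of sequences and the per-sample total before any filtering.

```python
GAMMA = 0.20  # threshold gamma (20% in the text)


def drop_contaminated_samples(total, removed_by_step, gamma=GAMMA):
    """Return the set of samples to discard entirely.

    total: sample name -> number of nucleotide sequences before filtering.
    removed_by_step: step label -> {sample name: sequences removed there}.
    A sample is dropped when the fraction removed in any single step
    exceeds gamma.
    """
    dropped = set()
    for step, removed in removed_by_step.items():
        for sample, n_removed in removed.items():
            if total[sample] and n_removed / total[sample] > gamma:
                dropped.add(sample)
    return dropped


# Made-up counts for two samples across the three filtering steps.
total = {"s1": 1000, "s2": 1000}
removed_by_step = {
    "A": {"s1": 50, "s2": 250},
    "B": {"s1": 100, "s2": 10},
    "C": {"s1": 150, "s2": 0},
}
print(drop_contaminated_samples(total, removed_by_step))
```

With these counts, s2 loses 25% of its sequences in step (A) and is discarded. s1 loses 30% in total, but no single step removes more than 20%, so it is kept under the per-step reading used here.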

In a second aspect, the invention claims a system for filtering sequence contamination among immune repertoire high-throughput sequencing samples.

The system for filtering sequence contamination among immune repertoire high-throughput sequencing samples provided by the invention can be a system I or a system II;

the system I comprises a device A, a device B and a device C;

the system II comprises a device A, a device B, a device C and a device D;

said device A being capable of carrying out step (A) as set forth in the preceding first aspect; said device B being capable of carrying out step (B) as described in the preceding first aspect; said device C being capable of carrying out step (C) as described in the preceding first aspect; said device D being capable of carrying out step (D) as described in the preceding first aspect.

If necessary, the system may further comprise a high-throughput sequencer and/or an instrument capable of performing basic processing and conventional bioinformatic analysis (which may be performed by conventional immune repertoire analysis software such as IMonitor) on the raw (off-machine) data obtained by immune repertoire high-throughput sequencing, to obtain the valid data in step (A).

In a third aspect, the invention claims the use of the system of the second aspect for filtering sequence contamination among immune repertoire high-throughput sequencing samples.

In the above three aspects, the sample may be DNA or RNA extracted from blood or tissue. The high-throughput sequencing is multi-sample mixed (pooled) high-throughput sequencing. The sequencing platform is not limited and includes Illumina, BGISEQ and the like; the sequencing type is not limited either and may be single-end or paired-end sequencing. In one embodiment of the invention, single-end 200 bp sequencing is used.

The TCRs of T cells or the BCRs of B cells in the sample are captured and amplified by an immune repertoire experimental capture technology (such as multiplex PCR, 5' RACE or UID technology). The capture can amplify the entire TCR or BCR sequence, or only the most diverse variable region (the CDR3 region). The amplification products are then subjected to multi-sample mixed (pooled) high-throughput sequencing, which finally yields the raw (off-machine) data of immune repertoire high-throughput sequencing.

Sequencing contamination has always been an unavoidable problem in immune repertoire (TCR and BCR) library construction and sequencing, and it undermines the credibility of the data. However, in the computational analysis of immune repertoires, a systematic and general contamination filtering method has been lacking. The invention fills this gap by accurately filtering the likely sources of contamination in the sequencing process, ensuring the accuracy of downstream data analysis.

Drawings

FIG. 1 is a flow chart of sequence contamination filtration among immune repertoire samples.

FIG. 2 is a comparison of the distribution of clones in a population before and after filtration. The left panel is before filtration, the right panel is after filtration, with the abscissa being the sample, the ordinate being the clone, the black indicating the presence of the clone in the sample, and the white indicating the absence of the clone in the sample.

Detailed Description

The experimental procedures used in the following examples are all conventional procedures unless otherwise specified.

Materials, reagents and the like used in the following examples are commercially available unless otherwise specified.
