Position-anchored barcode system for nanopore sequencing library construction

Document No. 1350626. Published 2020-07-24.

Reading note: This technology, a position-anchored barcode system for nanopore sequencing library construction, was designed by 戴岩, 胡龙, 张烨, 肖念清 and 任用 on 2020-04-09. The application relates to a position-anchored barcode system for nanopore sequencing library construction, a preparation method thereof, and applications thereof. The position-anchored barcode system has higher resolution and higher classification accuracy, can significantly reduce the false-positive rate of identification, improves overall nanopore sequencing accuracy, and reduces sequencing cost.

1. A position-anchored barcode system for nanopore sequencing library construction, the system comprising the structure:

[BARCODE-ANCHOR]n-BARCODEn+1

wherein n is more than or equal to 1,

the BARCODE is a bar code sequence,

the ANCHOR is an anchor sequence.

2. The position-anchored barcode system of claim 1, wherein 1 ≦ n ≦ 10; preferably, n is 1, 2 or 3.

3. The position-anchored barcode system of claim 2, wherein the structure is

FLANK1-[BARCODE-ANCHOR]n-BARCODEn+1-FLANK2,

wherein the FLANK is a flanking sequence.

4. The position-anchored BARCODE system of any of claims 2 or 3, wherein the BARCODE sequences are the same or different; preferably, the BARCODE sequences are different.

5. The position-anchored barcode system of any of claims 2-4, wherein the ANCHOR sequences are the same or different; preferably, the ANCHOR sequences are different.

6. The position-anchored barcode system of any of claims 2-5, wherein the ANCHOR sequence is 5-50bp in length; preferably, the ANCHOR sequence is 10-35bp in length.

7. The position-anchored BARCODE system of any of claims 2-6, wherein the ANCHOR sequence has < 70% homology to the BARCODE sequence; preferably, the ANCHOR sequence has < 50% homology to the BARCODE sequence.

8. The position-anchored barcode system of any of claims 2-6, wherein the structure is any of:

FLANK1-BARCODE1-ANCHOR1-BARCODE2-FLANK2;

FLANK1-BARCODE1-ANCHOR1-BARCODE2-ANCHOR2-BARCODE3-FLANK2;

or

FLANK1-BARCODE1-ANCHOR1-BARCODE2-ANCHOR2-BARCODE3-ANCHOR3-BARCODE4-FLANK2.

9. A method of making a position-anchored barcode system of any of claims 1 to 8, wherein: the method comprises directly synthesizing the nucleotide sequence of the position anchoring bar code system, or preparing the position anchoring bar code system by connecting after segmented synthesis.

10. A method of sequencing library construction, wherein a position-anchored barcode system according to any one of claims 1 to 8 is used to construct a sequencing library.

11. A sequencing adaptor comprising a position-anchored barcode system of any one of claims 1 to 8.

12. A complex attached to the position-anchored barcode system of any one of claims 1 to 8.

13. A composition comprising the position-anchored barcode system of any one of claims 1 to 8.

14. A kit for nanopore sequencing library construction comprising the position-anchored barcode system of any one of claims 1-8, or the sequencing adaptor of claim 11.

15. Use of the position-anchored bar code system according to any of claims 1 to 8, wherein said use is any of the following:

1) the application in improving the classification accuracy of sequencing samples;

2) use in reducing false positives for sequencing sample classification;

3) the application in the construction of sequencing libraries;

4) application in sequencing.

Technical Field

The invention relates to the field of gene sequencing, in particular to a position anchoring bar code system for nanopore sequencing library building.

Background

At present, clinical infections are numerous and their sources are diverse worldwide; in China, infectious diseases account for as much as 49 percent of the total across all diseases. Conventional clinical diagnosis determines the source of infection from the physician's empirical judgment together with microscopy, biochemical analysis and similar tests, but human factors, long turnaround times and a limited detection range readily lead to misdiagnosis and missed diagnosis, which is particularly unfavorable for the diagnosis and treatment of acute infections. With the rapid development of high-throughput sequencing and genomics, metagenomic sequencing can identify the microbial composition of a sample rapidly, comprehensively and objectively. It is increasingly used to detect infectious pathogenic microorganisms, providing a more accurate diagnostic basis for clinical decisions and subsequent medication.

Illumina second-generation sequencing is well established in China, but it has the following problems when applied to microbial detection. First, second-generation read lengths are at most a few hundred bp, and highly homologous sequences exist among different microbial species, so metagenomic species assignment is inaccurate, irrelevant microorganisms appear in the data report, and the physician's diagnosis is disturbed. Second, identifying pathogenicity genes and drug-resistance genes in greater depth requires assembling the sequencing reads, so the more complex analysis costs additional time and money to compensate for the short read length. In addition, second-generation sequencing instruments are expensive and complex to operate, require a high up-front investment, and have long overall run times, making it difficult to meet the needs of acute infection. The third-generation PacBio technology greatly improves read length and can produce long fragments of 8-12 kb, or even 40-70 kb, but its library construction process is complex. Moreover, like second-generation sequencing, it has a long sequencing cycle: a single sequencing run needs tens of hours before data are available, and with the subsequent analysis time it is difficult to identify pathogenic microorganisms quickly.

Nanopore sequencing makes up for these shortcomings of other sequencing platforms: reads are long, and both library construction and sequencing are fast. In addition, the device is small and portable, and data generation and bioinformatic analysis can proceed in real time, which removes constraints on the sequencing site and delays in report feedback. The technology is therefore well suited to the analysis and identification of clinical infectious pathogens. However, the flow cell (chip) used for nanopore sequencing is very expensive. Using barcode sequences to resolve multiple samples is a common cost-saving strategy in high-throughput DNA sequencing: a unique barcode sequence is introduced into each DNA sample during library construction, the barcoded samples are sequenced together on the same flow cell, and the reads are then assigned to their samples according to the barcode sequences. Because nanopore flow cells are expensive, multiplexing has a clear economic advantage, allowing users to amortize the fixed cost of one flow cell. A series of kits released by Oxford Nanopore provides 12 different barcodes, each 24 bp long; the barcodes are attached to both ends of the sample DNA during library construction and then sequenced, so that one flow cell can yield sequence information for 12 different samples simultaneously. However, when samples are subsequently distinguished by barcode, the 24 bp barcodes supplied with the library construction kit are frequently confused with one another. The reason is that the per-base error rate in reads can reach 10-15% when the current signal is converted into bases (basecalling), so when reads are classified by barcode in downstream analysis, barcode misidentification causes cross-contamination of data between samples, leading to false-positive identification of microorganisms and seriously complicating clinical decisions.
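To put the 10-15% per-base error rate in context, the short calculation below estimates how often a 24 bp barcode would be read with no error at all. It is a back-of-envelope sketch that assumes errors are independent and uniform across positions, which the source does not state; the function name is illustrative.

```python
def error_free_probability(length_bp: int, per_base_error: float) -> float:
    """Probability that every base of a barcode is called correctly.

    Assumes independent, uniform per-base errors (a simplifying assumption,
    not a statement from the source).
    """
    if not 0.0 <= per_base_error <= 1.0:
        raise ValueError("per_base_error must be between 0 and 1")
    return (1.0 - per_base_error) ** length_bp


# The two error rates quoted in the text, applied to a 24 bp barcode.
for rate in (0.10, 0.15):
    p = error_free_probability(24, rate)
    print(f"error rate {rate:.0%}: P(error-free 24 bp barcode) = {p:.3f}")
```

Under these assumptions only about 8% (at a 10% error rate) or 2% (at 15%) of 24 bp barcodes would be read perfectly, so any classifier has to tolerate errors within the barcode itself.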

The present invention has been made based on this.

Disclosure of Invention

The technical problem addressed by the invention is to improve the accuracy of sample barcode alignment for existing nanopore sequencing data.

Sample barcode alignment on the nanopore sequencing platform is error-prone, which strongly affects downstream data processing. In the present application, a large amount of data was mined in depth, the errors that occur during sequence alignment on the nanopore sequencing platform were classified and statistically analyzed, and the influence of different error types on sequence identification was quantified. It was surprisingly found that insertion-deletion (Indel) sequencing errors greatly increase the error rate of sequence identification, whereas base mismatch (Mismatch) errors have a smaller effect. Consequently, in sample barcode design, merely increasing barcode length yields a limited gain in accuracy, while adding a position-anchoring sequence to a barcode of the same length yields a larger gain. Based on this finding, the invention constructs a position-anchored barcode system containing a position-anchoring sequence and validates it with the library construction kit SQK-PBK004 from Oxford Nanopore: libraries were built and sequenced for 10 pure bacterial cultures, and the output data were classified with both the original barcode system and the position-anchored barcode system. The results show that the position-anchored barcode system gives better sample classification accuracy, an improvement of more than 3 orders of magnitude over the original barcode system.

Therefore, the first objective of the present invention is to provide a position-anchored barcode system for improving the accuracy of sample nanopore sequencing resolution.

The second purpose of the invention is to provide a preparation method and application of the position anchoring bar code system.

In order to achieve the purpose, the invention provides the following technical scheme:

the invention provides a position anchoring bar code system for nanopore sequencing library building, which is characterized by comprising the following structures:

[BARCODE-ANCHOR]n-BARCODEn+1

wherein n is more than or equal to 1,

the BARCODE is a bar code sequence,

the ANCHOR is an anchor sequence.

In some embodiments, the system includes the structure FLANK1-[BARCODE-ANCHOR]n-BARCODEn+1-FLANK2, wherein the FLANK is a flanking sequence.

In some embodiments, 1 ≦ n ≦ 10; preferably, n is 1, 2 or 3.

In some embodiments, the BARCODE sequences are the same or different; preferably, the BARCODE sequences are different.

In some embodiments, the ANCHOR sequences are the same or different; preferably, the ANCHOR sequences are different.

In some embodiments, the ANCHOR sequence is 5-50 bp in length; preferably, the ANCHOR sequence is 10-35 bp in length.

In some embodiments, the ANCHOR sequence has < 70% homology to the BARCODE sequence; preferably < 50%.

In some embodiments, the FLANK sequence is 10-30 bp in length; preferably, the FLANK sequence is 15-25 bp in length.

In some embodiments, the position-anchored barcode system for nanopore sequencing library construction comprises any one of the following structures:

FLANK1-BARCODE1-ANCHOR1-BARCODE2-FLANK2;

FLANK1-BARCODE1-ANCHOR1-BARCODE2-ANCHOR2-BARCODE3-FLANK2;

FLANK1-BARCODE1-ANCHOR1-BARCODE2-ANCHOR2-BARCODE3-ANCHOR3-BARCODE4-FLANK2;

in some embodiments, the ANCHOR sequences are different or the same, preferably the ANCHOR sequences are different;

in some embodiments, the BARCODE sequences are different or the same, preferably the BARCODE sequences are different.

The invention also provides a preparation method of the position anchoring bar code system for nanopore sequencing library building, which is characterized by comprising the following steps of: the method comprises directly synthesizing the nucleotide sequence of the position anchoring bar code system, or preparing the position anchoring bar code system by connecting after segmented synthesis.

In some embodiments of the invention, a position-anchored barcode system comprising an existing barcode linker is prepared as follows: starting from an existing nanopore library-construction barcode, a bridging primer is used to join the existing barcode linker in series with the designed barcode linker; preferably, the bridging primer sequence serves as the ANCHOR in the position-anchored barcode system structure.

In some embodiments of the invention, the existing barcode linker is derived from the original barcode of the SQK-PBK004 kit from ONT corporation.

The invention also provides application of the position anchoring bar code system for nanopore sequencing library building in improving sequencing sample classification accuracy.

The invention also provides application of the position anchoring bar code system for nanopore sequencing library building in reduction of false positive of sequencing sample classification.

The invention also provides application of the position anchoring bar code system for nanopore sequencing library construction in sequencing library construction.

The invention also provides application of the position anchoring bar code system for nanopore sequencing library building in sequencing.

The invention also provides a method for constructing the sequencing library, which is characterized in that the sequencing library is constructed by utilizing the position anchoring bar code system for constructing the nanopore sequencing library.

The invention also provides a sequencing adaptor, characterized in that the sequencing adaptor sequence comprises the position-anchored barcode system described above.

The invention also provides a complex, wherein the complex is attached to the position-anchored barcode system described above.

The invention also provides a composition, characterized in that it comprises the position-anchored barcode system described above.

The invention also provides a kit for nanopore sequencing library construction, characterized in that it comprises the position-anchored barcode system or the sequencing adaptor described above.

The invention has the beneficial technical effects that:

1) The invention demonstrates for the first time that insertion-deletion errors are the main cause of overall sequence alignment errors, whereas base-mismatch errors contribute comparatively little. In practice, by introducing an anchor sequence into the barcode system, the invention limits the propagation of indel errors through the overall alignment result, greatly reduces the drop in alignment score caused by insertions and deletions, screens out interference from distant barcodes, and achieves accurate barcode resolution. By comparison, simply increasing barcode length can modestly reduce sample classification errors caused by base mismatches but improves overall alignment accuracy only slightly, whereas the position-anchored barcode system improves accuracy very markedly.

2) Building on the SQK-PBK004 library construction workflow of the nanopore platform, the invention joins the kit's own barcode to an independently developed barcode sequence and uses the junction sequence as the anchor sequence, yielding a position-anchored barcode system of the type FLANK1-BARCODE1-ANCHOR2-BARCODE2-FLANK2, which improves classification accuracy from 0.999 to 0.999999 when resolving different samples.

3) In practical applications, the position-anchored barcode system allows barcodes of different lengths and different numbers of anchor sequences to be designed as required, balancing classification accuracy against microorganism detection rate for different needs.

4) The position-anchored barcode system offers better resolution and higher accuracy, reduces false-positive identifications, improves nanopore sequencing accuracy, and lowers sequencing cost, making it suitable for wide adoption.
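The idea in effect 1), that an anchor keeps an indel in one barcode from disturbing the comparison of the next, can be illustrated with a small demultiplexing sketch. This is a hypothetical illustration, not the classification pipeline used in the application: the function names, the toy sequences and the scoring by plain edit distance are all assumptions. The read is split at the best approximate match to the anchor, and each barcode segment is then scored independently.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (substitution, insertion, deletion each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion from a
                            curr[j - 1] + 1,            # insertion into a
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def find_anchor(read: str, anchor: str) -> tuple[int, int, int]:
    """Return (distance, start, end) of the read window most similar to the anchor."""
    best = None
    for start in range(len(read)):
        # Allow the window to be one base shorter or longer than the anchor.
        for length in (len(anchor) - 1, len(anchor), len(anchor) + 1):
            end = start + length
            if length < 1 or end > len(read):
                continue
            dist = edit_distance(read[start:end], anchor)
            if best is None or dist < best[0]:
                best = (dist, start, end)
    if best is None:
        raise ValueError("read is too short to contain the anchor")
    return best


def classify_read(read: str, anchor: str, samples: dict[str, tuple[str, str]]) -> str:
    """Assign a BARCODE1-ANCHOR-BARCODE2 read to the closest sample."""
    _, start, end = find_anchor(read, anchor)
    left, right = read[:start], read[end:]  # segments on either side of the anchor
    scores = {
        name: edit_distance(left, bc1) + edit_distance(right, bc2)
        for name, (bc1, bc2) in samples.items()
    }
    return min(scores, key=scores.get)


# Toy sequences for illustration only; none of them come from the source.
ANCHOR = "TTGTTCTTAT"
SAMPLES = {
    "sample_1": ("ACGACGACGA", "CAGCAGCAGC"),
    "sample_2": ("GACCAGGACC", "AGGAGCAACA"),
}
# A read from sample_1 in which one base of the first barcode was deleted.
read = "ACGACACGA" + ANCHOR + "CAGCAGCAGC"
print(classify_read(read, ANCHOR, SAMPLES))
```

Because the second barcode is located relative to the anchor and not by its offset from the start of the read, the deletion in the first barcode costs one edit in the left segment and nothing in the right segment.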

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1. Error-rate statistics based on sequencing data from the kit barcode system. Panel A shows the mean and median error rate at each site in actual sequencing of the 10 kit barcode linker sequences. Panel B shows the mean and median error rates of the three error types (insertion, deletion and mismatch) at each aligned site in actual sequencing of the 10 kit barcode linker sequences. Panel C shows the correspondence between kit barcode linker sites and alignment errors. Panel D summarizes the alignment error subtypes of the kit barcode linker, showing the distribution of alignment error types across sites: the abscissa is the base position in the barcode sequence, the ordinate is the error type, and the shade of each cell indicates the error rate of that error type at that site, with darker shades indicating higher error rates; the color blocks in the Block annotation area indicate the different elements of the linker sequence, and error types are clustered by Euclidean distance.

FIG. 2. Effect of different alignment error types on overall sequence classification accuracy. Panel A shows a total error rate of 8%; panel B shows a total error rate of 16%.

FIG. 3. Effect of containing no anchor sequence (ANCHOR 0), 1 anchor sequence (ANCHOR 1) or 2 anchor sequences (ANCHOR 2) on the overall accuracy of barcode alignment.

FIG. 4. Schematic diagram of the original and optimized library construction workflows.

FIG. 5. Comparison of sample classification accuracy between the original barcode system and the position-anchored barcode system; shaded entries indicate correct classifications and all other entries indicate misclassifications.

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Definition of partial terms

Unless defined otherwise below, all technical and scientific terms used in the detailed description of the present invention are intended to have the same meaning as commonly understood by one of ordinary skill in the art. While the following terms are believed to be well understood by those skilled in the art, the following definitions are set forth to better explain the present invention.

As used herein, the terms "comprising," "including," "having," "containing," or "involving" are inclusive or open-ended and do not exclude additional unrecited elements or method steps. The term "consisting of …" is considered to be a preferred embodiment of the term "comprising". If in the following a certain group is defined to comprise at least a certain number of embodiments, this should also be understood as disclosing a group which preferably only consists of these embodiments.

The terms "about" and "substantially" in the present invention denote an interval of accuracy that can be understood by a person skilled in the art, which still guarantees the technical effect of the feature in question. The term generally denotes a deviation of ± 10%, preferably ± 5%, from the indicated value.

Where an indefinite or definite article is used when referring to a singular noun e.g. "a" or "an", "the", this includes a plural of that noun.

Furthermore, the terms first, second, third, (a), (b), (c), and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the invention described herein are capable of operation in other sequences than described or illustrated herein.

The following terms or definitions are provided only to aid in understanding the present invention. These definitions should not be construed to have a scope less than understood by those skilled in the art.

Some technical terms in the present invention are explained as follows:

The position-anchored barcode system is a barcode sequencing-tag system in which two or more barcode sequences are connected in series and anchored by specific ANCHOR sequences. The system can be used to construct sequencing libraries for nanopore sequencing, to improve the classification accuracy of sequencing samples, and to reduce false positives in sequencing sample classification. Its specific structure can be described as [BARCODE-ANCHOR]n-BARCODEn+1, wherein n is greater than or equal to 1, the BARCODE is a barcode sequence, and the ANCHOR is an anchor sequence. It is understood that any composition, complex, system or the like comprising the above structure is within the scope of the present invention. Although the invention is explained using the barcode of the existing SQK-PBK004 kit as an example, this is only illustrative and does not limit the invention. Theoretical bioinformatic analysis and wet-lab experiments have verified that any barcode system comprising the [BARCODE-ANCHOR]n-BARCODEn+1 structure can be used to construct sequencing libraries, improve the classification accuracy of sequencing samples, and reduce false positives in sequencing sample classification. In some preferred embodiments of the invention, the structure of the position-anchored barcode system may be FLANK1-[BARCODE-ANCHOR]n-BARCODEn+1-FLANK2, wherein the FLANK is a linker sequence. The linker sequence is a conventional module in sequencing library construction whose addition is well understood in the art, and FLANK1 and FLANK2 may be identical or different in sequence according to actual needs.

By mining a large amount of data in depth, the errors that occur during sequence alignment on the nanopore sequencing platform were classified and statistically analyzed, and the influence of different error types on sequence identification was quantified. Insertion-deletion (indel) sequencing errors were found to greatly increase the error rate of sequence identification, whereas base mismatch errors have a smaller effect, so increasing barcode length in sample barcode design yields only a limited gain in alignment accuracy. The effect of sequence length is also confirmed in some embodiments of the invention; for example, Example 2 notes that "when a base-mismatch error rate of 0.16 is introduced, the barcode length only needs to reach 40 bp to achieve an overall alignment accuracy of 99.99%, whereas when an insertion-deletion error rate of 0.16 is introduced, the barcode length needs to reach 80 bp". The length of the position-anchored barcode system of the invention can therefore be chosen according to practical needs in the art; in some embodiments 1 ≦ n ≦ 10, for example n = 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10; preferably, n is 1, 2 or 3.
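One intuition for why an indel is more disruptive than a mismatch is that it shifts every downstream base out of register. The toy example below compares a barcode with two damaged copies base by base, with no gaps allowed. It is only an illustration of the frame-shift effect under that simplifying assumption: real demultiplexers use gapped alignment, and the sequences and function name here are made up for the example.

```python
def positional_mismatches(reference: str, observed: str) -> int:
    """Count positions that differ in a base-by-base, gap-free comparison.

    Any difference in length is counted as additional mismatches.
    """
    differing = sum(r != o for r, o in zip(reference, observed))
    return differing + abs(len(reference) - len(observed))


barcode = "ACGTTGCAAGCT"                       # toy 12-nt barcode, not from the source
substituted = barcode[:7] + "T" + barcode[8:]  # one substitution at index 7 (A -> T)
deleted = barcode[:4] + barcode[5:]            # one deletion at index 4

print("substitution:", positional_mismatches(barcode, substituted))
print("deletion:    ", positional_mismatches(barcode, deleted))
```

The single substitution changes exactly one position, whereas the single deletion misaligns most of the bases after it. A known sequence at a fixed place in the construct gives the comparison a point at which to re-register.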

It is understood that, in the position-anchored barcode system of the invention, the BARCODE sequences, which serve as tag sequences in sequencing, may be the same or different; in some preferred embodiments, the BARCODE sequences are different. Likewise, the ANCHOR sequences, which serve as anchoring components, may be the same or different; in some preferred embodiments, the ANCHOR sequences are different. In addition, the ANCHOR sequence may be of any length known in the art, for example 5-50 bp; in some preferred embodiments, it is 10-35 bp.

As the anchoring component for the BARCODE, the ANCHOR sequence should be distinguishable from the BARCODE sequence. Without particular limitation, the homology of the ANCHOR sequence to the BARCODE sequence may be < 80%, < 70%, < 60%, < 50%, < 40%, < 30%, < 20% or < 10%; in some preferred embodiments, the homology is < 50%.
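A design-time check of the homology limit might look like the sketch below. The source does not say how homology is computed, so this example substitutes a simple stand-in, identity = 1 - (edit distance / length of the longer sequence); that definition, the function names and the toy sequences are all assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (substitution, insertion, deletion each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def identity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]; 1.0 means identical sequences."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def anchor_is_distinct(anchor: str, barcodes: list[str], max_identity: float = 0.5) -> bool:
    """True if the anchor scores below max_identity against every barcode."""
    return all(identity(anchor, bc) < max_identity for bc in barcodes)


# Toy sequences for illustration only; none of them come from the source.
candidate_anchor = "TTGTTCTTAT"
toy_barcodes = ["ACGACGACGA", "CAGCAGCAGC"]
for bc in toy_barcodes:
    print(bc, round(identity(candidate_anchor, bc), 2))
print("distinct at < 50%:", anchor_is_distinct(candidate_anchor, toy_barcodes))
```

The `max_identity` default of 0.5 mirrors the preferred < 50% limit; passing 0.7 would mirror the broader < 70% limit.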

It will be appreciated that as some exemplary position-anchored barcode systems of the present invention, the structure thereof may be embodied as follows:

FLANK1-BARCODE1-ANCHOR1-BARCODE2-FLANK2;

FLANK1-BARCODE1-ANCHOR1-BARCODE2-ANCHOR2-BARCODE3-FLANK2;

FLANK1-BARCODE1-ANCHOR1-BARCODE2-ANCHOR2-BARCODE3-ANCHOR3-BARCODE4-FLANK2;

FLANK1-BARCODE1-ANCHOR1-BARCODE2-ANCHOR2-BARCODE3-ANCHOR3-BARCODE4-ANCHOR4-BARCODE5-FLANK2;

……
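The pattern in the structures listed above can be generated for any n. The following is a minimal sketch with hypothetical helper names: `build_layout` returns the element order for FLANK1-[BARCODE-ANCHOR]n-BARCODEn+1-FLANK2, and `assemble` concatenates actual sequences in that order. The short sequences in the usage lines are placeholders, not sequences from the application.

```python
def build_layout(n: int, with_flanks: bool = True) -> str:
    """Return the element order of [BARCODE-ANCHOR]n-BARCODEn+1 for a given n."""
    if n < 1:
        raise ValueError("n must be >= 1")
    parts = []
    for i in range(1, n + 1):
        parts.append(f"BARCODE{i}")
        parts.append(f"ANCHOR{i}")
    parts.append(f"BARCODE{n + 1}")
    if with_flanks:
        parts = ["FLANK1", *parts, "FLANK2"]
    return "-".join(parts)


def assemble(barcodes: list[str], anchors: list[str],
             flank1: str = "", flank2: str = "") -> str:
    """Concatenate sequences as FLANK1, then alternating BARCODE/ANCHOR, then FLANK2."""
    if not anchors or len(barcodes) != len(anchors) + 1:
        raise ValueError("need n >= 1 anchors and exactly n + 1 barcodes")
    parts = [flank1]
    for barcode, anchor in zip(barcodes, anchors):
        parts += [barcode, anchor]
    parts.append(barcodes[-1])  # the final BARCODEn+1 has no anchor after it
    parts.append(flank2)
    return "".join(parts)


for n in (1, 2, 3):
    print(build_layout(n))
print(assemble(["AAA", "CCC"], ["GG"], flank1="T", flank2="T"))
```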

The "barcode linker" of the invention refers to a complete segment containing a barcode sequence together with the flanking sequences at both ends. For example, in the embodiments of the invention, the independently designed barcode linker is defined as the BBRCD linker, and the barcode linker in the original kit SQK-PBK004 is defined as the ABRCD linker.

The "barcode sequence" refers to the specific sequence of a barcode, which is contained in the barcode linker and forms part of it. For example, in the embodiments of the invention, the independently designed barcode sequence is defined as BBRCD, and the barcode sequence in the original kit SQK-PBK004 is defined as ABRCD.

"Anchor sequence" (ANCHOR) as used herein refers to a nucleotide sequence for anchoring the BARCODE. It may be of any length known in the art, such as 5-50 bp, and in some preferred embodiments 10-35 bp. Its sequence should be distinguishable from the BARCODE sequences; without particular limitation, the homology of the ANCHOR sequence to the BARCODE sequence may be < 80%, < 70%, < 60%, < 50%, < 40%, < 30%, < 20% or < 10%, and in some preferred embodiments the homology is < 50%. Examples include the ANCHOR sequences mentioned in the examples of the invention, such as SEQ ID NO.50, SEQ ID NO.51 and SEQ ID NO.13.

The term "FLANK" as used herein refers to the flanking sequences at both ends of a barcode system, which are conventional components of sequencing barcode linkers. For a nanopore sequencing platform, for example, FLANK1 is a Y-type sequencing linker attached to a motor protein, which ensures that DNA can pass through the nanopore and be sequenced normally, and FLANK2 is used to attach the sequencing sample sequence. The FLANK sequence may be of any validated length known in the art, such as 10-30 bp, and in some preferred embodiments 15-25 bp. Examples include the FLANK sequences SEQ ID NO.16 and SEQ ID NO.26 mentioned in the examples of the invention.

The invention is further described by the accompanying drawings and the following examples, which are intended to illustrate specific embodiments of the invention and are not to be construed as limiting the scope of the invention in any way. Unless otherwise indicated, the experimental procedures disclosed in the present invention are performed by conventional techniques in the art, and the reagents and raw materials used in the examples are commercially available.
