Decoding method for protein identification

Document No.: 1078410 Published: 2020-10-16

Summary: The present technology, "Decoding method for protein identification," was devised by Sujal M. Patel, Parag Mallick, and Jarrett D. Egertson on 2018-12-28. Methods and systems for accurate and efficient identification and quantification of proteins are provided. In one aspect, disclosed herein is a method of identifying a protein in an unknown protein sample, comprising: receiving information of a plurality of empirical measurements performed on the unknown protein; comparing the information of the empirical measurements to a database comprising a plurality of protein sequences, each protein sequence corresponding to a candidate protein of a plurality of candidate proteins; and, for each of one or more candidate proteins in the plurality of candidate proteins, generating, based on the comparison of the information of the empirical measurements to the database, a probability that the candidate protein generated the information of the empirical measurements, a probability that the plurality of empirical measurements was not observed assuming the candidate protein was present in the sample, or a probability that the candidate protein was present in the sample.

1. A computer-implemented method of identifying a protein in an unknown protein sample, the method comprising:

(a) receiving, by the computer, information of a plurality of empirical measurements made on the unknown protein in the sample;

(b) comparing, by the computer, at least a portion of the information of the plurality of the empirical measurements to a database comprising a plurality of protein sequences, each protein sequence corresponding to a candidate protein of a plurality of candidate proteins; and

(c) for each of one or more candidate proteins in the plurality of candidate proteins, generating, by the computer, based on the comparison of the at least a portion of the information of the plurality of the empirical measurements to the database comprising the plurality of protein sequences, one or more of:

(i) a probability that said candidate protein generates said information of said plurality of empirical measurements,

(ii) a probability that the plurality of empirical measurements is not observed, assuming that the candidate protein is present in the sample, and

(iii) a probability of the candidate protein being present in the sample.

2. The method of claim 1, wherein two or more of the plurality of empirical measurements are selected from the group consisting of:

(i) a measure of binding of each of one or more affinity reagent probes to the unknown protein in the sample, each affinity reagent probe configured to selectively bind to one or more candidate proteins in the plurality of candidate proteins;

(ii) the length of one or more of the unknown proteins in the sample;

(iii) hydrophobicity of one or more of the unknown proteins in the sample; and

(iv) the isoelectric point of one or more of the unknown proteins in the sample.

3. The method of claim 1, wherein generating the plurality of probabilities further comprises receiving additional information of a binding measurement for each of a plurality of additional affinity reagent probes, each additional affinity reagent probe configured to selectively bind to one or more candidate proteins of the plurality of candidate proteins.

4. The method of claim 1, further comprising generating, for each of the one or more candidate proteins, a confidence level that the candidate protein matches one of the unknown proteins in the sample.

5. The method of claim 1, wherein the plurality of affinity reagent probes comprises no more than 50 affinity reagent probes.

6. The method of claim 1, wherein the plurality of affinity reagent probes comprises no more than 100 affinity reagent probes.

7. The method of claim 1, wherein the plurality of affinity reagent probes comprises no more than 200 affinity reagent probes.

8. The method of claim 1, wherein the plurality of affinity reagent probes comprises no more than 300 affinity reagent probes.

9. The method of claim 1, wherein the plurality of affinity reagent probes comprises no more than 500 affinity reagent probes.

10. The method of claim 1, wherein the plurality of affinity reagent probes comprises more than 500 affinity reagent probes.

11. The method of claim 1, further comprising generating a paper or electronic report identifying the protein in the sample.

12. The method of claim 1, wherein the sample comprises a biological sample.

13. The method of claim 12, wherein the biological sample is obtained from a subject.

14. The method of claim 13, further comprising determining a disease state in the subject based at least on the plurality of probabilities.

15. The method of claim 1, wherein (c) comprises, for each of one or more candidate proteins in the plurality of candidate proteins, generating by the computer (i) the probability that the candidate protein generated the plurality of empirically measured information.

16. The method of claim 1, wherein (c) comprises generating, by the computer, for each of one or more candidate proteins in the plurality of candidate proteins, (ii) the probability that the plurality of empirical measurements was not observed given the presence of the candidate protein in the sample.

17. The method of claim 1, wherein (c) comprises, for each of one or more candidate proteins in the plurality of candidate proteins, (iii) generating, by the computer, the probability that the candidate protein is present in the sample.

18. The method of claim 15, wherein the measurement comprises binding of an affinity reagent probe.

19. The method of claim 15, wherein the measurement comprises non-specific binding of an affinity reagent probe.

20. The method of claim 16, wherein the measurement comprises binding of an affinity reagent probe.

21. The method of claim 16, wherein the measurement comprises non-specific binding of an affinity reagent probe.

22. The method of claim 17, wherein the empirical measurement comprises binding of affinity reagent probes.

23. The method of claim 17, wherein the empirical measurement comprises non-specific binding of affinity reagent probes.

24. The method of claim 1, further comprising generating a sensitivity of protein identification with a predetermined threshold.

25. The method of claim 24, wherein the predetermined threshold is less than 1% incorrect.

26. The method of claim 1, wherein the protein in the sample is truncated or degraded.

27. The method of claim 1, wherein the proteins in the sample are not derived from protein termini.

28. The method of any one of claims 15-17, wherein the empirical measurements comprise the length of one or more of the unknown proteins in the sample.

29. The method of any one of claims 15-17, wherein the empirical measurement comprises hydrophobicity of one or more of the unknown proteins in the sample.

30. The method of any one of claims 15-17, wherein the empirical measurement comprises the isoelectric point of one or more of the unknown proteins in the sample.

31. The method of claim 1, wherein the empirical measurements comprise measurements made on a mixture of antibodies.

32. The method of claim 1, wherein the empirical measurements comprise measurements made on samples obtained from a plurality of species.

33. The method of claim 1, wherein the empirical measurements comprise measurements taken on a sample in the presence of single amino acid variations (SAV) caused by non-synonymous Single Nucleotide Polymorphisms (SNPs).

Background

Current techniques for protein identification typically rely on the binding and subsequent readout of highly specific and sensitive affinity reagents (e.g., antibodies), or on peptide reads (typically about 12-30 amino acids in length) from a mass spectrometer. Such techniques can be applied to unknown proteins in a sample to determine the presence, absence or amount of a candidate protein based on analysis of binding measurements of highly specific and sensitive affinity reagents to the protein of interest.

Disclosure of Invention

There is recognized herein a need to improve the identification and quantification of proteins in unknown protein samples. The methods and systems provided herein can significantly reduce or eliminate errors in identifying proteins in a sample, thereby improving the quantification of the proteins. Such methods and systems can enable accurate and efficient identification of candidate proteins within unknown protein samples. Such identification may be based on calculations using information such as binding measurements of affinity reagent probes configured to selectively bind to one or more candidate proteins, protein length, protein hydrophobicity, and isoelectric point. In some embodiments, a sample of an unknown protein may be exposed to individual affinity reagent probes, pooled affinity reagent probes, or a combination of individual affinity reagent probes and pooled affinity reagent probes. The identifying may comprise estimating a confidence level for the presence of each of the one or more candidate proteins in the sample.

The methods and systems provided herein can include algorithms for identifying proteins based on a series of experiments performed on a completely intact protein or protein fragment. Each experiment may be an empirical measurement of a protein and may provide information that may be used to identify the protein. Examples of experiments include the measurement of binding to affinity reagents (e.g., antibodies or aptamers), protein length, protein hydrophobicity, and isoelectric point. Information about the experimental results may be used to calculate the probability or likelihood of a protein candidate and/or to infer the identity of the protein by selecting a protein from a list of protein candidates that maximizes the likelihood of an observed experimental result. The methods and systems provided herein can also include a collection of protein candidates, as well as algorithms to calculate the probability that the experimental result is from each of these protein candidates.
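The selection of the candidate that maximizes the likelihood of the observed results can be sketched as follows. This is only a minimal illustration and not the disclosed algorithm: it assumes that each probe targets a single trimer epitope, that probes bind independently, and that the binding probabilities `p_on` and `p_off` are fixed. The function names, candidate sequences, and numeric values are all hypothetical.

```python
import math

def binding_log_likelihood(sequence, probes, outcomes, p_on=0.9, p_off=0.05):
    """Log-probability of the observed binding outcomes given one candidate.

    Assumptions (not from the source): each probe targets one trimer, a probe
    binds with probability p_on if the trimer occurs in the sequence and with
    probability p_off otherwise, and probes are independent.
    """
    log_l = 0.0
    for epitope, bound in zip(probes, outcomes):
        p_bind = p_on if epitope in sequence else p_off
        log_l += math.log(p_bind if bound else 1.0 - p_bind)
    return log_l

def most_likely_candidate(candidates, probes, outcomes):
    """Return the candidate name with the highest log-likelihood, plus all scores."""
    scores = {
        name: binding_log_likelihood(seq, probes, outcomes)
        for name, seq in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical toy database and probe set.
candidates = {"A": "MKTAYIAKQR", "B": "MGSSHHHHHH", "C": "MKWVTFISLL"}
probes = ["KTA", "HHH", "WVT"]
outcomes = [False, True, False]  # only the HHH probe was observed to bind

best, scores = most_likely_candidate(candidates, probes, outcomes)
```

Working in log space avoids numerical underflow when hundreds of probe outcomes are multiplied together.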

In one aspect, the present disclosure provides a computer-implemented method of identifying a protein in an unknown protein sample, the method comprising: (a) receiving, by the computer, information of a plurality of empirical measurements made on the unknown protein in the sample; (b) comparing, by the computer, at least a portion of the information of the plurality of the empirical measurements to a database comprising a plurality of protein sequences, each protein sequence corresponding to a candidate protein of a plurality of candidate proteins; and (c) for each of one or more candidate proteins in the plurality of candidate proteins, generating, by the computer, based on the comparison of the at least a portion of the information of the plurality of the empirical measurements to the database comprising the plurality of protein sequences, one or more of: (i) a probability that the candidate protein generates the information for the plurality of empirical measurements, (ii) a probability that the plurality of empirical measurements was not observed assuming the candidate protein was present in the sample, and (iii) a probability that the candidate protein was present in the sample.

In some embodiments, two or more of the plurality of empirical measurements are selected from: (i) a measure of binding of each of one or more affinity reagent probes to the unknown protein in the sample, each affinity reagent probe configured to selectively bind to one or more candidate proteins in the plurality of candidate proteins; (ii) the length of one or more of the unknown proteins in the sample; (iii) hydrophobicity of one or more of the unknown proteins in the sample; and (iv) the isoelectric point of one or more of the unknown proteins in the sample.
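For the non-binding measurements listed above, the expected value for each candidate can be derived from its database sequence. The sketch below computes length and a mean hydrophobicity score. Using the Kyte-Doolittle hydropathy scale (the GRAVY score) is an assumption made for illustration, since the source does not say how hydrophobicity is quantified. The isoelectric point is omitted because it depends on a choice of pKa table. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def gravy(sequence):
    """Mean Kyte-Doolittle hydropathy over all residues of the sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[residue] for residue in sequence) / len(sequence)

def candidate_features(sequence):
    """Expected length and hydrophobicity for one candidate sequence."""
    return {"length": len(sequence), "gravy": gravy(sequence)}

features = candidate_features("MKTAYIAKQR")  # hypothetical candidate sequence
```

Features computed this way for every database entry could then be compared with the corresponding empirical measurement of an unknown protein.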

In some embodiments, generating the plurality of probabilities further comprises receiving additional information of a binding measurement for each of a plurality of additional affinity reagent probes, each additional affinity reagent probe configured to selectively bind to one or more candidate proteins of the plurality of candidate proteins. In some embodiments, the method further comprises generating, for each of the one or more candidate proteins, a confidence level that the candidate protein matches one of the unknown proteins in the sample.
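One way to turn per-candidate likelihoods into a confidence level is to normalize them across the candidate list with Bayes' rule. The sketch below assumes a uniform prior over candidates unless one is supplied. The source does not state that confidence is computed this way, so this is an illustrative assumption, and the function name and input values are hypothetical.

```python
import math

def posterior_confidence(log_likelihoods, priors=None):
    """Normalize per-candidate log-likelihoods into posterior probabilities.

    log_likelihoods: dict mapping candidate name -> log P(measurements | candidate)
    priors: optional dict mapping candidate name -> prior probability
            (assumed uniform when omitted)
    """
    names = list(log_likelihoods)
    if priors is None:
        priors = {name: 1.0 / len(names) for name in names}
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_likelihoods.values())
    weights = {
        name: math.exp(log_likelihoods[name] - peak) * priors[name]
        for name in names
    }
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

# Hypothetical log-likelihoods for three candidates.
posterior = posterior_confidence({"A": -6.0, "B": -0.2, "C": -5.3})
```

Adding measurements from further probes simply adds terms to each log-likelihood before this normalization step is repeated.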

In some embodiments, the plurality of affinity reagent probes comprises no more than 50 affinity reagent probes. In some embodiments, the plurality of affinity reagent probes comprises no more than 100 affinity reagent probes. In some embodiments, the plurality of affinity reagent probes comprises no more than 200 affinity reagent probes. In some embodiments, the plurality of affinity reagent probes comprises no more than 300 affinity reagent probes. In some embodiments, the plurality of affinity reagent probes comprises no more than 500 affinity reagent probes. In some embodiments, the plurality of affinity reagent probes comprises more than 500 affinity reagent probes. In some embodiments, the method further comprises generating a paper or electronic report identifying the protein in the sample.

In some embodiments, the sample comprises a biological sample. In some embodiments, the biological sample is obtained from a subject. In some embodiments, the method further comprises determining a disease state in the subject based at least on the plurality of probabilities.

In some embodiments, (c) comprises, for each of one or more candidate proteins in the plurality of candidate proteins, generating, by the computer, (i) the probability that the candidate protein generated the information of the plurality of empirical measurements. In some embodiments, (c) comprises, for each of one or more candidate proteins in the plurality of candidate proteins, generating, by the computer, (ii) the probability that the plurality of empirical measurements was not observed given the presence of the candidate protein in the sample. In some embodiments, (c) comprises, for each of one or more candidate proteins in the plurality of candidate proteins, generating, by the computer, (iii) the probability that the candidate protein is present in the sample. In any of these embodiments, the empirical measurement may comprise binding of an affinity reagent probe or non-specific binding of an affinity reagent probe.

In some embodiments, the method further comprises generating a sensitivity of protein identification with a predetermined threshold. In some embodiments, the predetermined threshold is less than 1% incorrect. In some embodiments, the protein in the sample is truncated or degraded. In some embodiments, the protein in the sample is not derived from protein termini.
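Sensitivity at a predetermined error threshold can be estimated in a simulation where the true identity of each protein is known. The sketch below ranks identifications by confidence and reports the fraction of all proteins correctly identified at the deepest cutoff whose false identification rate stays below the threshold. The function name, the input format, and the example values are assumptions made for illustration.

```python
def sensitivity_at_threshold(identifications, max_error_rate=0.01):
    """Fraction of proteins correctly identified at an error rate below the threshold.

    identifications: list of (confidence, is_correct) pairs, one per unknown
    protein, where is_correct is known only because the data are simulated.
    """
    ranked = sorted(identifications, key=lambda item: item[0], reverse=True)
    best_correct = 0
    correct = 0
    for accepted, (_, is_correct) in enumerate(ranked, start=1):
        correct += 1 if is_correct else 0
        wrong = accepted - correct
        # Keep the deepest cutoff whose incorrect fraction is below the threshold.
        if wrong / accepted < max_error_rate:
            best_correct = correct
    return best_correct / len(ranked) if ranked else 0.0

# Hypothetical simulated results: (confidence, whether the call was correct).
results = [(0.99, True), (0.98, True), (0.90, False), (0.80, True)]
strict = sensitivity_at_threshold(results, max_error_rate=0.01)
relaxed = sensitivity_at_threshold(results, max_error_rate=0.30)
```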

In some embodiments, the empirical measurement comprises the length of one or more of the unknown proteins in the sample. In some embodiments, the empirical measurement comprises the hydrophobicity of one or more of the unknown proteins in the sample. In some embodiments, the empirical measurement comprises the isoelectric point of one or more of the unknown proteins in the sample. In some embodiments, the empirical measurements comprise measurements performed on a mixture of antibodies. In some embodiments, the empirical measurements comprise measurements performed on samples obtained from a plurality of species. In some embodiments, the empirical measurements comprise measurements taken on a sample in the presence of single amino acid variations (SAV) caused by non-synonymous Single Nucleotide Polymorphisms (SNPs).

Other aspects and advantages of the present disclosure will become apparent to those skilled in the art from the following detailed description, which shows and describes only illustrative embodiments of the disclosure. As will be realized, the disclosure is capable of other and different embodiments and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.

Incorporation by Reference

All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference. If publications and patents or patent applications incorporated by reference contradict the disclosure contained in the specification, the specification is intended to supersede and/or take precedence over any such contradictory material.

Brief Description of the Drawings

The novel features believed characteristic of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings (also referred to herein as "figures"), of which:

FIG. 1 shows an exemplary flow diagram for protein identification of unknown proteins in a biological sample, according to the disclosed embodiments.

FIG. 2 shows the sensitivity (e.g., percentage of substrate identified at a False Detection Rate (FDR) of less than 1%) of affinity reagent probes plotted against the number of probe recognition sites (e.g., trimer binding epitopes) in the affinity reagent probes (ranging up to 100 probe recognition sites or trimer binding epitopes) for three different experimental scenarios (using 50, 100, and 200 probes, respectively, represented by gray, black, and white circles) according to the disclosed embodiments.

FIG. 3 shows the sensitivity (e.g., percentage of substrate identified at a False Detection Rate (FDR) of less than 1%) of affinity reagent probes plotted against the number of probe recognition sites (e.g., trimer binding epitope) in the affinity reagent probe (ranging up to 700 probe recognition sites or trimer binding epitopes) for three different experimental scenarios (using 50, 100, and 200 probes, respectively, represented by gray, black, and white circles) according to the disclosed embodiments.

Figure 4 shows a graph showing the sensitivity of protein identification for experiments using 100 (left), 200 (center), or 300 probes (right) according to the disclosed embodiments.

Figure 5 shows a graph showing the sensitivity of protein identification for experiments using various protein fragmentation methods. In each of the top and bottom rows, protein identification performance is shown according to the disclosed embodiments as measured with 50, 100, 200, and 300 affinity reagents (in 4 panels from left to right), with maximum fragment length values of 50, 100, 200, 300, 400, and 500 (represented by hexagons, downward-facing triangles, upward-facing triangles, diamonds, rectangles, and circles, respectively).

Figure 6 shows a graph showing the sensitivity of human protein identification (percentage of substrate identified at less than 1% FDR) for experiments using various combinations of measurement types, according to the disclosed embodiments.

FIG. 7 shows a graph of the sensitivity of protein identification for experiments using 50, 100, 200, or 300 affinity reagent probes (represented by circles, triangles, and squares, respectively) for unknown proteins from E. coli, yeast, or human, according to the disclosed embodiments.

Fig. 8 shows a graph showing binding probability (y-axis, left) and sensitivity of protein identification (y-axis, right) with respect to iteration (x-axis) according to the disclosed embodiments.

Fig. 9 shows that a comparison of the estimated false positive rate to the actual false positive rate for a simulated 200 probe experiment demonstrates an accurate false positive rate estimation according to the disclosed embodiments.

FIG. 10 illustrates a computer control system programmed or otherwise configured to implement the methods provided herein.

FIG. 11 shows the performance of the truncated (censored) protein identification method versus the non-truncated (uncensored) protein identification method.

FIG. 12 shows the tolerance of the truncated protein identification and the non-truncated protein identification methods to random "false negative" binding results.

Figure 13 shows the tolerance of the truncated protein identification and the non-truncated protein identification methods to random "false positive" binding results.

FIG. 14 shows the performance of truncated protein identification and non-truncated protein identification methods using either overestimated or underestimated affinity reagent binding probabilities.

Figure 15 shows the performance of truncated protein identification and non-truncated protein identification methods using affinity reagents with unknown binding epitopes.

Figure 16 shows the performance of truncated protein identification and non-truncated protein identification methods using affinity reagents with missing binding epitopes.

Figure 17 shows the performance of truncated protein identification and non-truncated protein identification methods using affinity reagents for the 300 most abundant trimers in the proteome, 300 randomly selected trimers in the proteome, or the 300 least abundant trimers in the proteome.

FIG. 18 shows the performance of truncated protein identification and non-truncated protein identification methods using affinity reagents with random or biologically similar off-target sites.

Figure 19 shows the performance of the truncated protein identification and non-truncated protein identification methods using a set of optimal affinity reagents (probes).

FIG. 20 shows the performance of truncated protein identification and non-truncated protein identification methods using an unmixed candidate affinity reagent and a mixture of candidate affinity reagents.

Figure 21 shows the enhancement of binding between an affinity reagent and a protein by two hybridization steps, according to some embodiments.

Figure 22 shows protein identification performance using a set of reagents for selective modification and detection of 4 amino acids (K, D, C and W), according to some embodiments.

Figure 23 shows protein identification performance using a set of reagents for selective modification and detection of 20 amino acids (R, H, K, D, E, S, T, N, Q, C, G, P, A, V, I, L, M, F, Y and W), according to some embodiments.

Figure 24 shows performance of protein identification using measurement of amino acid sequences, according to some embodiments, where all amino acids are measured with the probability of detection shown on the x-axis (equal to reaction efficiency) and the y-axis represents the percentage of protein in a sample identified at a false discovery rate of less than 1%.

Detailed Description

While various embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed.

The term "sample" as used herein generally refers to a biological sample (e.g., a sample containing proteins). The sample may be taken from a tissue or cell, or from the environment of a tissue or cell. In some examples, the sample may comprise or be derived from a tissue biopsy, blood, plasma, extracellular fluid, dried blood spots, cultured cells, culture medium, discarded tissue, plant matter, synthetic proteins, bacterial and/or viral samples, fungal tissue, archaea, or protozoa. The sample may have been isolated from the source prior to collection. The sample may contain forensic evidence. Non-limiting examples include fingerprints, saliva, urine, blood, feces, semen, or other bodily fluids that are isolated from a primary source prior to collection. In some examples, proteins are isolated from their primary source (cells, tissues, bodily fluids such as blood, environmental samples, etc.) during sample preparation. The sample may be derived from an extinct species, including but not limited to a fossil-derived sample. The protein may or may not be purified or otherwise enriched from its primary source. In some cases, the primary source is homogenized prior to further processing. In some cases, the cells are lysed using a buffer, such as RIPA buffer. Denaturing buffers may also be used at this stage. The sample may be filtered or centrifuged to remove lipids and particulate matter. The sample may also be purified to remove nucleic acids, or may be treated with RNases and DNases. The sample may contain intact proteins, denatured proteins, protein fragments, or partially degraded proteins.

The sample may be taken from a subject having a disease or disorder. The disease or disorder can be an infectious disease, an immune disorder or disease, a cancer, a genetic disease, a degenerative disease, a lifestyle disease, an injury, a rare disease, or an age-related disease. The infectious disease may be caused by bacteria, viruses, fungi, and/or parasites. Non-limiting examples of cancer include bladder cancer, lung cancer, brain cancer, melanoma, breast cancer, non-Hodgkin's lymphoma, cervical cancer, ovarian cancer, colorectal cancer, pancreatic cancer, esophageal cancer, prostate cancer, renal cancer, skin cancer, leukemia, thyroid cancer, liver cancer, and uterine cancer. Some examples of genetic diseases or disorders include, but are not limited to, multiple sclerosis (MS), cystic fibrosis, Charcot-Marie-Tooth disease, Huntington's disease, Peutz-Jeghers syndrome, Down syndrome, rheumatoid arthritis, and Tay-Sachs disease. Non-limiting examples of lifestyle diseases include obesity, diabetes, arteriosclerosis, heart disease, stroke, hypertension, cirrhosis, nephritis, cancer, chronic obstructive pulmonary disease (COPD), hearing problems, and chronic back pain. Some examples of injuries include, but are not limited to, abrasions, brain injuries, bruises, burns, concussions, congestive heart failure, construction injuries, dislocations, flail chest, bone fractures, hemothorax, herniated discs, hip pointers, hypothermia, lacerations, pinched nerves, pneumothorax, rib fractures, sciatica, spinal cord injuries, tendon, ligament, and fascia injuries, traumatic brain injuries, and whiplash injuries. The sample can be taken before and/or after treatment of a subject having a disease or disorder. Samples may be taken during a treatment or treatment regimen. Multiple samples may be taken from a subject to monitor the effect of treatment over time. Samples can be taken from subjects known or suspected to have infectious diseases for which diagnostic antibodies are not available.

A sample may be taken from a subject suspected of having a disease or disorder. Samples may be taken from subjects experiencing symptoms of unknown origin such as fatigue, nausea, weight loss, soreness and pain, weakness or memory loss. Samples may be taken from subjects with symptoms of definite cause. A sample may be taken from a subject at risk for developing a disease or disorder due to factors such as family history, age, environmental exposure, lifestyle risk factors, or the presence of other known risk factors.

The sample may be taken from an embryo, fetus or pregnant woman. In some examples, the sample may comprise proteins isolated from maternal plasma. In some examples, the protein is isolated from circulating fetal cells in maternal blood.

The sample may be taken from a healthy individual. In some cases, samples may be taken longitudinally from the same individual. In some cases, longitudinally taken samples may be analyzed for the purpose of monitoring the health status of an individual and early detection of health problems. In some embodiments, the sample may be collected in a home environment or point of care environment and subsequently transported by mail, courier, or other transport method prior to analysis. For example, a home user may collect a blood spot sample by finger prick, which may be dried and then shipped by mail before analysis. In some cases, longitudinally taken samples may be used to monitor responses to stimuli that are expected to affect health, motor performance, or cognitive performance. Non-limiting examples include response to drugs, diet, or exercise regimens.

The proteins of the sample may be treated to remove modifications that may interfere with epitope binding. For example, the protein may be subjected to an enzymatic treatment, such as a glycosidase treatment to remove post-translational glycosylation. Proteins may be treated with reducing agents to reduce disulfide bonds within the protein. The protein may be treated with a phosphatase to remove phosphate groups. Other non-limiting examples of post-translational modifications that can be removed include acetate groups, amide groups, methyl groups, lipids, ubiquitin, myristoylation, palmitoylation, isoprenylation or prenylation (e.g., farnesol and geranylgeraniol), farnesylation, geranylgeranylation, glypiation (glycosylphosphatidylinositol anchor attachment), lipidation, flavin moiety attachment, phosphopantetheinylation, and retinylidene Schiff base formation.

The proteins of the sample may be manipulated by modifying one or more residues to make them more readily bound or detected by the affinity reagent. In some cases, the proteins of the sample may be treated to retain post-translational protein modifications that may facilitate or enhance epitope binding. In some examples, a phosphatase inhibitor may be added to the sample. In some examples, an oxidizing agent may be added to protect disulfide bonds.

The proteins of the sample may be completely or partially denatured. In some embodiments, the protein may be completely denatured. Proteins can be denatured by applying external stress such as detergents, strong acids or bases, concentrated inorganic salts, organic solvents (e.g., alcohols or chloroform), radiation, or heat. Proteins can be denatured by the addition of a denaturation buffer. Proteins may also be precipitated, lyophilized, and suspended in denaturing buffers. Proteins can be denatured by heating. Denaturation methods that are less likely to cause chemical modifications to the protein may be preferred.

The proteins of the sample may be treated to produce shorter polypeptides, either before or after conjugation. The remaining protein may be partially digested with an enzyme such as proteinase K to generate fragments, or may remain intact. In a further example, the protein may be exposed to a protease such as trypsin. Additional examples of proteases may include serine proteases, cysteine proteases, threonine proteases, aspartic proteases, glutamic proteases, metalloproteases, and asparagine peptide lyases.
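The effect of protease treatment on a candidate sequence can be simulated in silico. The sketch below applies the commonly used trypsin rule (cleavage on the C-terminal side of lysine or arginine, except when the next residue is proline) and assumes complete digestion. Partial digestion and other proteases such as proteinase K are not modeled, and the function name and example sequence are hypothetical.

```python
def trypsin_digest(sequence):
    """Split a protein sequence into fragments at idealized trypsin sites.

    Assumed rule: cleave after K or R unless the following residue is P.
    Every site is cleaved (complete digestion, no missed cleavages).
    """
    fragments = []
    start = 0
    for index, residue in enumerate(sequence):
        next_residue = sequence[index + 1] if index + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:index + 1])
            start = index + 1
    if start < len(sequence):
        # Keep the C-terminal remainder that follows the last cleavage site.
        fragments.append(sequence[start:])
    return fragments

fragments = trypsin_digest("MKTAYRPGKLL")  # hypothetical sequence
```

The fragments always concatenate back to the input sequence, which gives a simple sanity check on the cleavage logic.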

In some cases, it may be useful to remove very large and very small proteins (e.g., adiponectin); such proteins may be removed by filtration or other suitable methods. In some examples, very large proteins may include proteins of at least about 400 kilodaltons (kD), 450 kD, 500 kD, 600 kD, 650 kD, 700 kD, 750 kD, 800 kD, or 850 kD. In some examples, very large proteins may include proteins of at least about 8,000 amino acids, about 8,500 amino acids, about 9,000 amino acids, about 9,500 amino acids, about 10,000 amino acids, about 10,500 amino acids, about 11,000 amino acids, or about 15,000 amino acids. In some examples, small proteins may include proteins of less than about 10 kD, 9 kD, 8 kD, 7 kD, 6 kD, 5 kD, 4 kD, 3 kD, 2 kD, or 1 kD. In some examples, small proteins may include proteins of less than about 50 amino acids, 45 amino acids, 40 amino acids, 35 amino acids, or about 30 amino acids. Very large or small proteins can be removed by size exclusion chromatography. Very large proteins can be separated by size exclusion chromatography, treated with a protease to produce medium-sized polypeptides, and recombined with the medium-sized proteins of the sample.

The proteins of the sample may be labeled, for example with distinguishable labels, to allow multiplexing of samples. Some non-limiting examples of distinguishable labels include fluorophores, fluorescent nanoparticles, quantum dots, magnetic nanoparticles, and DNA-barcoded base linkers. Fluorophores used can include fluorescent proteins such as GFP, YFP, RFP, eGFP, mCherry, and tdTomato, and other fluorophores such as FITC, Alexa Fluor 350, Alexa Fluor 405, Alexa Fluor 488, Alexa Fluor 532, Alexa Fluor 546, Alexa Fluor 555, Alexa Fluor 568, Alexa Fluor 594, Alexa Fluor 647, Alexa Fluor 680, Alexa Fluor 750, Pacific Blue, coumarin, BODIPY FL, Pacific Green, Oregon Green, Cy3, Cy5, Pacific Orange, TRITC, Texas Red, phycoerythrin, and allophycocyanin.

Any number of protein samples can be multiplexed. For example, a multiplexed reaction may contain proteins from 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, about 20, about 25, about 30, about 35, about 40, about 45, about 50, about 55, about 60, about 65, about 70, about 75, about 80, about 85, about 90, about 95, about 100, or more than about 100 initial samples. Distinguishable labels may provide a means of interrogating the sample from which each protein originated, or may direct the isolation of proteins from different samples to different areas on the solid support. In some embodiments, the proteins are subsequently applied to the functionalized substrate, thereby chemically attaching the proteins to the substrate.

Any number of protein samples can be mixed prior to analysis without labeling or multiplexing. For example, a pooled mixture may contain proteins from 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, about 20, about 25, about 30, about 35, about 40, about 45, about 50, about 55, about 60, about 65, about 70, about 75, about 80, about 85, about 90, about 95, about 100, or more than about 100 initial samples. For example, a diagnostic test for a rare condition can be performed on the pooled samples. Individual samples may then be analyzed only from those pools that test positive in the diagnostic test. Samples can also be multiplexed without labeling using a combinatorial pooling design, in which samples are mixed into pools in a manner that allows signals from individual samples to be resolved from the analyzed pools using computational demultiplexing.
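By way of non-limiting illustration, one possible combinatorial pooling design can be sketched as follows. The binary encoding, the function names `pool_membership` and `decode_single_positive`, and the assumption that at most one sample in the set is positive are illustrative choices and are not taken from the disclosure.

```python
from typing import List, Optional


def pool_membership(n_samples: int) -> List[List[int]]:
    """Assign samples to pools using the bits of (sample index + 1).

    Hypothetical design: sample s goes into pool p when bit p of (s + 1)
    is set. Codes start at 1 so that "no pool positive" means "no sample
    positive".
    """
    n_pools = max(1, n_samples.bit_length())  # bits needed to write n_samples
    pools: List[List[int]] = [[] for _ in range(n_pools)]
    for s in range(n_samples):
        code = s + 1
        for p in range(n_pools):
            if (code >> p) & 1:
                pools[p].append(s)
    return pools


def decode_single_positive(pool_results: List[bool]) -> Optional[int]:
    """Recover the positive sample index, assuming at most one positive."""
    code = sum(1 << p for p, hit in enumerate(pool_results) if hit)
    return code - 1 if code else None


pools = pool_membership(7)                       # 7 samples fit in 3 pools
positive = decode_single_positive([True, False, True])
```

With seven samples, three pooled measurements are enough in this sketch. If more than one sample may be positive, a design with greater redundancy would be needed.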

The term "substrate" as used herein generally refers to a substrate capable of forming a solid support. Substrate or solid substrate may refer to any solid surface to which a protein may be covalently or non-covalently attached. Non-limiting examples of solid substrates include particles, beads, slides, surfaces of device elements, membranes, flow cells, wells, chambers, macrofluidic chambers, microfluidic chambers, channels, microfluidic channels, or any other surface. The substrate surface may be flat or curved, or may have other shapes, and may be smooth or textured. The substrate surface may contain micropores. In some embodiments, the substrate may be composed of glass, carbohydrates such as dextran, plastics such as polystyrene or polypropylene, polyacrylamides, latex, silicon, metals such as gold, or cellulose, and may be further modified to allow or enhance covalent or non-covalent attachment of proteins. For example, the substrate surface may be functionalized by modification with specific functional groups such as maleic or succinic moieties, or derivatized by modification with chemically reactive groups such as amino, mercapto or acrylic groups (e.g., by silanization). Suitable silane agents include aminopropyltrimethoxysilane, aminopropyltriethoxysilane and 4-aminobutyltriethoxysilane. The substrate may be functionalized with N-hydroxysuccinimide (NHS) functional groups. The glass surface may also be derivatized with other reactive groups such as acrylic or epoxy groups using, for example, epoxy silane, acrylic silane, or acrylamide silane. The substrate and method for protein attachment is preferably stable to repeated binding, washing, imaging and elution steps. In some examples, the substrate can be a glass slide, a flow cell, or a micro-scale or nano-scale structure (e.g., an ordered structure such as a microwell, a micropillar, a single molecule array, a nanosphere, a nanopillar, or a nanowire).

The spacing of the functional groups on the substrate can be ordered or random. The ordered array of functional groups can be created by, for example, photolithography, dip-pen nanolithography, nanoimprint lithography, nanosphere lithography, nanoball lithography, nanopillar arrays, nanowire lithography, scanning probe lithography, thermochemical lithography, thermal scanning probe lithography, local oxidation nanolithography, molecular self-assembly, stencil lithography, or electron beam lithography. The functional groups in the ordered array may be positioned such that each functional group is less than 200 nanometers (nm) from any other functional group, or about 200 nm, about 225 nm, about 250 nm, about 275 nm, about 300 nm, about 325 nm, about 350 nm, about 375 nm, about 400 nm, about 425 nm, about 450 nm, about 475 nm, about 500 nm, about 525 nm, about 550 nm, about 575 nm, about 600 nm, about 625 nm, about 650 nm, about 675 nm, about 700 nm, about 725 nm, about 750 nm, about 775 nm, about 800 nm, about 825 nm, about 850 nm, about 875 nm, about 900 nm, about 925 nm, about 950 nm, about 975 nm, about 1000 nm, about 1025 nm, about 1050 nm, about 1075 nm, about 1100 nm, about 1125 nm, about 1150 nm, about 1175 nm, about 1200 nm, about 1225 nm, about 1250 nm, about 1275 nm, about 1300 nm, about 1325 nm, about 1350 nm, about 1375 nm, about 1400 nm, about 1425 nm, about 1450 nm, about 1475 nm, about 1500 nm, about 1525 nm, about 1550 nm, about 1575 nm, about 1600 nm, about 1625 nm, about 1650 nm, about 1675 nm, about 1700 nm, about 1725 nm, about 1750 nm, about 1775 nm, about 1800 nm, about 1825 nm, about 1850 nm, about 1875 nm, about 1900 nm, about 1925 nm, about 1950 nm, about 1975 nm, about 2000 nm, or more than 2000 nm from any other functional group.
The randomly spaced functional groups may be provided at a concentration such that the functional groups are on average at least about 50 nm, about 100 nm, about 150 nm, about 200 nm, about 250 nm, about 300 nm, about 350 nm, about 400 nm, about 450 nm, about 500 nm, about 550 nm, about 600 nm, about 650 nm, about 700 nm, about 750 nm, about 800 nm, about 850 nm, about 900 nm, about 950 nm, about 1000 nm, or more than 1000 nm away from any other functional group.
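As a rough, non-limiting illustration of how an average spacing relates to a surface density, the sketch below assumes that randomly spaced functional groups are placed uniformly at random on a flat surface (a two-dimensional Poisson process), for which the mean nearest-neighbor distance is 1/(2·sqrt(density)). This placement model and the function names are assumptions made for illustration only.

```python
import math


def density_for_mean_spacing(mean_nn_nm: float) -> float:
    """Sites per square micrometre giving the requested mean
    nearest-neighbour distance (nm), under the Poisson assumption."""
    sites_per_nm2 = 1.0 / (4.0 * mean_nn_nm ** 2)
    return sites_per_nm2 * 1e6  # 1 um^2 = 1e6 nm^2


def mean_spacing_for_density(sites_per_um2: float) -> float:
    """Mean nearest-neighbour distance (nm) for a given site density."""
    sites_per_nm2 = sites_per_um2 / 1e6
    return 1.0 / (2.0 * math.sqrt(sites_per_nm2))


density = density_for_mean_spacing(500.0)   # about one site per um^2
```

Under this model a mean spacing of about 500 nm corresponds to roughly one functional group per square micrometre. Many individual pairs would still be closer than the mean, which is one reason an ordered array may be preferred when every site must be optically resolvable.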

The substrate may be indirectly functionalized. For example, the substrate can be PEGylated, and functional groups can be applied to all or a subset of the PEG molecules. The substrate may be functionalized using techniques suitable for micro-scale or nano-scale structures (e.g., ordered structures such as microwells, micropillars, single molecule arrays, nanospheres, nanopillars, or nanowires).

The substrate may comprise any material, including metal, glass, plastic, ceramic, or a combination thereof. In some preferred embodiments, the solid substrate may be a flow cell. The flow cell may be composed of a single layer or multiple layers. For example, the flow cell may comprise a base layer (e.g., a borosilicate glass layer), a channel layer (e.g., an etched silicon layer) overlying the base layer, and a capping or top layer. When the layers are assembled together, a closed channel may be formed with an inlet/outlet through the cover layer at either end. The thickness of each layer may vary, but is preferably less than about 1700 μm. These layers may be composed of suitable materials such as photosensitive glass, borosilicate glass, fused silicate, PDMS, or silicon. The different layers may be composed of the same material or different materials.

In some embodiments, the flow cell may comprise a channel opening on the bottom of the flow cell. The flow cell may contain millions of attached target conjugation sites in positions that can be discretely visualized. In some embodiments, various flow cells used with embodiments of the invention may comprise different numbers of channels (e.g., 1 channel, 2 or more channels, 3 or more channels, 4 or more channels, 6 or more channels, 8 or more channels, 10 or more channels, 12 or more channels, 16 or more channels, or more than 16 channels). The various flow cells may contain channels of different depths or widths, which may differ between channels within a single flow cell, or between channels of different flow cells. The depth and/or width of the individual channels may also vary. For example, at one or more points within a channel, the channel may be less than about 50 μm deep, less than about 100 μm deep, about 100 μm to about 500 μm deep, or more than about 500 μm deep. The channels may have any cross-sectional shape including, but not limited to, circular, semi-circular, rectangular, trapezoidal, triangular, or oval cross-sections.

The proteins may be spotted, dropped, pipetted, flowed, washed, or otherwise applied to the substrate. Where the substrate has been functionalized with moieties such as NHS esters, no modification of the protein is required. Where the substrate has been functionalized with an alternative moiety (e.g., thiol, amine, or linker nucleic acid), a crosslinking reagent (e.g., disuccinimidyl suberate, NHS, sulfonamide) may be used. In case the substrate has been functionalized with a linker nucleic acid, the proteins of the sample may be modified with complementary nucleic acid tags.

Photoactivatable crosslinkers can be used to induce crosslinking of the sample with specific regions on the substrate. Photoactivatable crosslinkers may be used to allow multiplexing of protein samples by attaching each sample in a known region of the substrate. Photoactivatable crosslinkers can allow for specific attachment of proteins that have been successfully labeled, for example, by detecting fluorescent tags prior to protein crosslinking. Examples of photoactivatable crosslinkers include, but are not limited to, N-5-azido-2-nitrobenzoyloxysuccinimide, sulfosuccinimidyl 6-(4'-azido-2'-nitrophenylamino)hexanoate, succinimidyl 4,4'-azipentanoate, sulfosuccinimidyl 4,4'-azipentanoate, succinimidyl 6-(4,4'-azipentanamido)hexanoate, sulfosuccinimidyl 6-(4,4'-azipentanamido)hexanoate, succinimidyl 2-((4,4'-azipentanamido)ethyl)-1,3'-dithiopropionate, and sulfosuccinimidyl 2-((4,4'-azipentanamido)ethyl)-1,3'-dithiopropionate.

The polypeptide may be attached to the substrate by one or more residues. In some examples, the polypeptide may be attached via the N-terminus, C-terminus, both termini, or via internal residues.

In addition to permanent cross-linkers, the use of photocleavable linkers may also be suitable for some applications, and in doing so enables the selective extraction of proteins from the substrate after analysis. In some cases, photocleavable crosslinkers can be used for several different multiplexed samples. In some cases, photocleavable crosslinkers can be used for one or more samples in a multiplexing reaction. In some cases, the multiplexing reaction may comprise a control sample crosslinked to the substrate via a permanent crosslinker and an experimental sample crosslinked to the substrate via a photocleavable crosslinker.

Each conjugated protein may be spatially separated from every other conjugated protein such that each conjugated protein is optically resolvable. Thus, proteins can be individually tagged with unique spatial addresses. In some embodiments, this may be achieved by using a low concentration of protein and a low density of attachment sites on the substrate for conjugation, such that each protein molecule is spatially separated from the others. In examples where photoactivatable crosslinking agents are used, a light pattern may be used such that the proteins are attached at predetermined locations.

In some embodiments, each protein may be associated with a unique spatial address. For example, once proteins are attached to a substrate at spatially separated locations, each protein may be assigned an indexed address, e.g., by coordinates. In some instances, the grid of pre-assigned unique spatial addresses may be predetermined. In some embodiments, the substrate may contain readily identifiable immobilized tags such that the placement of each protein can be determined relative to the immobilized tags of the substrate. In some examples, the substrate may have grid lines and/or "origins" or other reference points permanently marked on the surface. In some examples, the surface of the substrate may be permanently or semi-permanently labeled to provide a reference to locate the crosslinked protein. The patterned shape itself, such as the outer boundary of the conjugated polypeptide, may also be used as a reference point for determining the unique location of each spot.
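By way of non-limiting illustration, assigning indexed addresses relative to a reference point on the substrate can be sketched as follows. The regular square grid, the pitch value, the coordinate units, and the function name `assign_addresses` are hypothetical and are used only to make the idea concrete.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Address = Tuple[int, int]


def assign_addresses(spots: List[Point], origin: Point,
                     pitch_nm: float) -> Dict[Address, List[Point]]:
    """Map each detected spot to the nearest (row, col) grid address,
    measured from a fiducial origin marked on the substrate."""
    addresses: Dict[Address, List[Point]] = {}
    for x, y in spots:
        col = round((x - origin[0]) / pitch_nm)
        row = round((y - origin[1]) / pitch_nm)
        addresses.setdefault((row, col), []).append((x, y))
    return addresses


spots = [(1010.0, 995.0), (1500.0, 1002.0), (1490.0, 1510.0)]
table = assign_addresses(spots, origin=(1000.0, 1000.0), pitch_nm=500.0)
# Addresses holding more than one spot are not individually resolvable.
crowded = [addr for addr, pts in table.items() if len(pts) > 1]
```

Because each address is computed relative to the fiducial origin, the same protein can in principle be located again in later imaging cycles even if the field of view shifts, provided the origin is re-detected.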

The substrate may also contain conjugated protein standards and controls. The conjugated protein standards and controls may be peptides or proteins of known sequence that have been conjugated at known positions. In some examples, the conjugated protein standards and controls may serve as internal controls in the assay. Proteins can be applied to the substrate from a purified protein stock, or can be synthesized on the substrate by a process such as Nucleic Acid-Programmable Protein Array (NAPPA).

In some examples, the substrate may comprise a fluorescent standard. These fluorescent standards can be used to calibrate the fluorescence signal intensity between assays. These fluorescence standards can also be used to correlate fluorescence signal intensity with the number of fluorophores present in the region. The fluorescent standard may comprise some or all of the different types of fluorophores used in the assay.
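By way of non-limiting illustration, relating signal intensity to fluorophore count using fluorescent standards can be sketched as a straight-line calibration. The linear response, the example numbers, and the function names are assumptions for illustration. Real detectors may saturate or respond nonlinearly.

```python
from typing import Sequence, Tuple


def fit_line(counts: Sequence[float],
             intensities: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit: intensity = slope * count + intercept."""
    n = len(counts)
    mean_x = sum(counts) / n
    mean_y = sum(intensities) / n
    sxx = sum((x - mean_x) ** 2 for x in counts)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(counts, intensities))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_fluorophores(intensity: float, slope: float,
                          intercept: float) -> float:
    """Invert the calibration line to estimate a fluorophore count."""
    return (intensity - intercept) / slope


# Hypothetical standards: known fluorophore counts and measured intensities.
slope, intercept = fit_line([1, 2, 4, 8], [150.0, 250.0, 450.0, 850.0])
n_est = estimate_fluorophores(650.0, slope, intercept)
```

Refitting the line for each assay run is one way to calibrate intensities between assays, since the slope and intercept absorb run-to-run changes in illumination and background.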

Once the substrate is conjugated with the proteins from the sample, multi-affinity reagent measurements can be performed. The measurement process described herein may employ various affinity reagents. In some embodiments, multiple affinity reagents can be mixed together, and the binding of the affinity reagent mixture to the protein-substrate conjugate can be measured. In some cases, the measurement of binding of the mixture of affinity reagents may vary in different solvent conditions and/or protein folding conditions; thus, repeated measurements may be made on the same affinity reagent or a set of affinity reagents under such varying solvent conditions and/or protein folding conditions to obtain different sets of binding measurements. In some cases, a different set of binding measurements may be obtained by performing repeated measurements on a sample in which the protein has been treated with or without an enzyme (e.g., treated with a glycosidase, a phosphorylase, or a phosphatase).

As used herein, the term "affinity reagent" generally refers to a reagent that binds to a protein or peptide with reproducible specificity. For example, the affinity reagent may be an antibody, an antibody fragment, an aptamer, a micro-protein conjugate, or a peptide. In some embodiments, the micro-protein conjugates may include protein conjugates that may be between 30 and 210 amino acids in length. In some embodiments, a micro-protein conjugate may be designed. For example, protein conjugates can include peptidic macrocycles (e.g., as described in [Hosseinzadeh et al., "Comprehensive computational design of ordered peptide macrocycles," Science, Dec. 15, 2017; 358(6369): 1461-]). In some embodiments, monoclonal antibodies may be preferred. In some embodiments, antibody fragments such as Fab fragments may be preferred. In some embodiments, the affinity reagent can be a commercially available affinity reagent, such as a commercially available antibody. In some embodiments, the desired affinity reagents can be selected by screening commercially available affinity reagents to identify affinity reagents having useful properties.

The affinity reagent may have high, medium or low specificity. In some examples, affinity reagents can recognize several different epitopes. In some examples, the affinity reagent can recognize an epitope present in two or more different proteins. In some examples, affinity reagents can recognize epitopes present in a variety of different proteins. In some cases, affinity reagents used in the methods of the present disclosure can be highly specific for a single epitope. In some cases, affinity reagents used in the methods of the present disclosure can be highly specific for a single epitope containing post-translational modifications. In some cases, affinity reagents may have highly similar epitope specificities. In some cases, affinity reagents with highly similar epitope specificities can be specifically designed to resolve highly similar protein candidate sequences (e.g., candidates with single amino acid variants or isoforms). In some cases, affinity reagents may have a high diversity of epitope specificity to maximize coverage of protein sequences. In some embodiments, due to the random nature of the binding of the probe to the protein-substrate, experiments can be repeated with the same affinity probe, with the expectation that the results may differ, thereby providing additional information for protein identification.

In some cases, one or more specific epitopes recognized by an affinity reagent may not be completely known. For example, affinity reagents can be designed or selected for specific binding to one or more intact proteins, protein complexes, or protein fragments without knowledge of the particular binding epitope. Through a characterization process, the binding profile of such a reagent may already be known in detail. Binding measurements using the affinity reagents can be used to determine protein identity even if the specific binding epitope is unknown. For example, commercially available antibodies or aptamers designed for binding to protein targets may be used as affinity reagents. After characterization under assay conditions (e.g., complete folding, partial denaturation, or complete denaturation), binding of the affinity reagent to the unknown protein can provide information about the identity of the unknown protein. In some cases, a collection of protein-specific affinity reagents (e.g., commercially available antibodies or aptamers) can be used to generate protein identifications, with or without knowledge of the particular epitope they target. In some cases, the collection of protein-specific affinity reagents may comprise about 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000, 10000, 20000, or more than 20000 affinity reagents. In some cases, the collection of affinity reagents may comprise all commercially available affinity reagents that demonstrate target reactivity in a particular organism. For example, a collection of protein-specific affinity reagents may be assayed in series, with binding measurements being performed for each affinity reagent individually. In some cases, a subset of protein-specific affinity reagents may be mixed prior to the binding measurement. 
For example, for each binding measurement run, a new mixture of affinity reagents may be selected that contains a subset of the affinity reagents randomly selected from the complete set. For example, each subsequent mixture may be generated in the same random manner, with the expectation that many affinity reagents will be present in more than one mixture. In some cases, protein identification can be generated more rapidly using a mixture of protein-specific affinity reagents. In some cases, such mixtures of protein-specific affinity reagents may increase the percentage of unknown protein that the affinity reagents bind in any individual run. The mixture of affinity reagents may comprise about 1%, 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% or more of all available affinity reagents. Mixtures of affinity reagents evaluated in a single experiment may or may not share individual affinity reagents. In some cases, there may be multiple different affinity reagents within a collection that bind to the same protein. In some cases, each affinity reagent in the collection can bind to a different protein. In the case where multiple affinity reagents with affinity for the same protein bind to a single unknown protein, the confidence that the identity of the unknown protein is a common target for the affinity reagents may be increased. In some cases, where multiple affinity reagents bind different epitopes on the same protein, the use of multiple protein affinity reagents targeting the same protein may provide redundancy, and binding of only a subset of affinity reagents targeting the protein may be interfered with by post-translational modifications or other steric hindrance of the binding epitope. In some cases, binding of an affinity reagent that binds an unknown epitope can be used in conjunction with a binding measurement of an affinity reagent that binds a known epitope to generate a protein identification.
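By way of non-limiting illustration, drawing a fresh random mixture of affinity reagents for each binding measurement run can be sketched as follows. The reagent identifiers, the mixture fraction, the fixed seed, and the function name `random_mixtures` are hypothetical.

```python
import random
from collections import Counter
from typing import List


def random_mixtures(reagents: List[str], n_runs: int, fraction: float,
                    seed: int = 0) -> List[List[str]]:
    """Draw one random subset of the full reagent set per run.

    Each subset is sampled independently, so a given reagent is expected
    to appear in more than one mixture across many runs.
    """
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    k = max(1, round(fraction * len(reagents)))
    return [sorted(rng.sample(reagents, k)) for _ in range(n_runs)]


reagents = [f"AR{i:03d}" for i in range(100)]      # hypothetical identifiers
mixtures = random_mixtures(reagents, n_runs=5, fraction=0.2)
# How many mixtures each reagent ended up in.
appearances = Counter(r for mix in mixtures for r in mix)
```

Recording which reagents were present in each run, as in `mixtures` above, is what would later allow a binding event observed for a mixture to be attributed to its candidate reagents.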

In some examples, one or more affinity reagents may be selected to bind an amino acid motif of a given length, such as 2, 3, 4, 5, 6, 7, 8, 9, 10, or more than 10 amino acids. In some examples, one or more affinity reagents may be selected to bind a series of amino acid motifs of varying lengths ranging from 2 amino acids to 40 amino acids.

In some cases, the affinity reagents may be labeled with a nucleic acid barcode. In some examples, nucleic acid barcodes can be used to purify affinity reagents after use. In some examples, nucleic acid barcodes can be used to sort affinity reagents for reuse. In some cases, the affinity reagents may be labeled with fluorophores that can be used to sort the affinity reagents after use.

The family of affinity reagents may comprise one or more types of affinity reagents. For example, the methods of the present disclosure may use a family of affinity reagents comprising one or more of antibodies, antibody fragments, Fab fragments, aptamers, peptides, and proteins.

Affinity reagents may be modified. Examples of modifications include, but are not limited to, attachment of a detection moiety. The detection moiety may be attached directly or indirectly. For example, the detection moiety may be covalently attached directly to the affinity reagent, or may be attached via a linker, or may be attached via an affinity reaction, such as a complementary nucleic acid tag or a biotin streptavidin pair. Attachment methods that are capable of withstanding mild washing and elution of affinity reagents may be preferred.

The affinity reagents may be labeled, for example, with a distinguishable label, to allow identification or quantification of the binding event (e.g., using fluorescent detection of the binding event). Some non-limiting examples of distinguishable labels include fluorophores, magnetic nanoparticles, and nucleic acid-barcoded base linkers. Fluorophores used can include fluorescent proteins such as GFP, YFP, RFP, eGFP, mCherry, and tdTomato, and other fluorophores such as FITC, Alexa Fluor 350, Alexa Fluor 405, Alexa Fluor 488, Alexa Fluor 532, Alexa Fluor 546, Alexa Fluor 555, Alexa Fluor 568, Alexa Fluor 594, Alexa Fluor 647, Alexa Fluor 680, Alexa Fluor 750, Pacific Blue, coumarin, BODIPY FL, Pacific Green, Oregon Green, Cy3, Cy5, Pacific Orange, TRITC, Texas Red, phycoerythrin, and allophycocyanin. Alternatively, the affinity reagents may be unlabeled, for example when the binding event is detected directly, such as by Surface Plasmon Resonance (SPR) detection of the binding event.

Examples of detection moieties include, but are not limited to, fluorophores, bioluminescent proteins, nucleic acid segments comprising a constant region and a barcode region, or chemical tethers for attachment to nanoparticles such as magnetic particles. For example, affinity reagents can be labeled with DNA barcodes, which can then be specifically sequenced at their locations. As another example, a set of different fluorophores can be used as the detection moiety by a Fluorescence Resonance Energy Transfer (FRET) detection method. The detection moiety may comprise several different fluorophores with different excitation or emission patterns.

The detection moiety may be cleavable from the affinity reagent. This may allow the step of removing the detection moiety from the affinity reagent that is no longer of interest to reduce signal contamination.

In some cases, the affinity reagent is unmodified. For example, if the affinity reagent is an antibody, the presence of the antibody can be detected by atomic force microscopy. The affinity reagents may be unmodified and may be detected, for example, by antibodies specific for one or more affinity reagents. For example, if the affinity reagent is a mouse antibody, the mouse antibody can be detected by using an anti-mouse secondary antibody. Alternatively, the affinity reagent may be an aptamer that is detected by an antibody specific for the aptamer. The second antibody may be modified with a detection moiety as described above. In some cases, the presence of the second antibody can be detected by atomic force microscopy.

In some examples, the affinity reagents may comprise the same modification, such as conjugated green fluorescent protein, or may comprise two or more different types of modifications. For example, each affinity reagent may be conjugated to one of several different fluorescent moieties each having a different excitation or emission wavelength. This may allow multiplexing of affinity reagents, as several different affinity reagents may be combined and/or distinguished. In one example, a first affinity reagent may be conjugated to green fluorescent protein, a second affinity reagent may be conjugated to yellow fluorescent protein, and a third affinity reagent may be conjugated to red fluorescent protein, so the three affinity reagents may be multiplexed and identified by their fluorescence. In another example, the first, fourth, and seventh affinity reagents can be conjugated to green fluorescent protein, the second, fifth, and eighth affinity reagents can be conjugated to yellow fluorescent protein, and the third, sixth, and ninth affinity reagents can be conjugated to red fluorescent protein; in this case, the first, second, and third affinity reagents may be multiplexed together, while the fourth, fifth, and sixth affinity reagents and the seventh, eighth, and ninth affinity reagents form two further multiplexing reactions, so that no two affinity reagents in the same reaction carry the same fluorescent protein. The number of affinity reagents that can be multiplexed together may depend on the detection moiety used to distinguish them. For example, multiplexing of fluorophore-labeled affinity reagents may be limited by the number of unique fluorophores available. As a further example, multiplexing of affinity reagents labeled with nucleic acid tags may be limited by the length of the nucleic acid barcode. The nucleic acid may be deoxyribonucleic acid (DNA) or ribonucleic acid (RNA).
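By way of non-limiting illustration, one way to group fluorophore-labeled affinity reagents into multiplexing reactions, such that no two reagents in the same reaction share a fluorophore, can be sketched as follows. The reagent numbering, the label names, and the function name `multiplex_rounds` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def multiplex_rounds(reagent_labels: Dict[int, str]) -> List[Dict[str, int]]:
    """Group reagents into rounds with at most one reagent per fluorophore.

    reagent_labels maps a reagent number to its fluorophore and must be
    non-empty. Round i takes the i-th reagent carrying each fluorophore,
    so the number of rounds equals the largest number of reagents that
    share a single label.
    """
    by_label: Dict[str, List[int]] = defaultdict(list)
    for reagent, label in reagent_labels.items():
        by_label[label].append(reagent)
    n_rounds = max(len(rs) for rs in by_label.values())
    return [
        {label: rs[i] for label, rs in by_label.items() if i < len(rs)}
        for i in range(n_rounds)
    ]


labels = {1: "GFP", 2: "YFP", 3: "RFP",
          4: "GFP", 5: "YFP", 6: "RFP",
          7: "GFP", 8: "YFP", 9: "RFP"}
rounds = multiplex_rounds(labels)
```

In this sketch the number of reagents per reaction is capped by the number of distinct fluorophores, which mirrors the limitation on fluorophore-based multiplexing noted above.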

The specificity of each affinity reagent can be determined prior to use in the assay. The binding specificity of an affinity reagent can be determined in a control experiment using known proteins. Any suitable assay method may be used to determine the specificity of the affinity reagent. In one example, the substrate can be loaded with known protein standards at known locations and used to assess the specificity of various affinity reagents. In another example, the substrate may comprise a panel of test samples and controls and standards, such that the specificity of each affinity reagent can be calculated from the binding to the controls and standards and then used to identify the test samples. In some cases, affinity reagents of unknown specificity may be included, and data from affinity reagents of known specificity may be used to identify a protein, while the pattern of binding of an affinity reagent of unknown specificity to the identified protein may be used to determine its binding specificity. The specificity of any individual affinity reagent can also be reconfirmed by using the known binding data of other affinity reagents to assess which proteins the individual affinity reagent binds to. In some cases, the frequency of binding of an affinity reagent to each known protein conjugated to a substrate can be used to derive the probability of binding to any protein on the substrate. In some cases, the frequency of binding to a known protein comprising an epitope (e.g., an amino acid sequence or post-translational modification) can be used to determine the probability that an affinity reagent binds to a particular epitope. Thus, by multiple uses of the set of affinity reagents, the specificity of the affinity reagents can be gradually improved with each iteration. Although affinity reagents having unique specificity for a particular protein may be used, they may not be required for the methods described herein. 
In addition, the method may be effective for a range of specificities. In some examples, the methods described herein may be particularly effective when the affinity reagent is not specific for any particular protein, but specific for an amino acid motif (e.g., the tripeptide AAA).
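By way of non-limiting illustration, deriving a binding probability from the observed frequency of binding to known protein standards can be sketched as follows. The pseudocount smoothing, the example counts, and the function name `binding_probabilities` are assumptions for illustration and are not stated in the disclosure.

```python
from typing import Dict, Tuple


def binding_probabilities(observations: Dict[str, Tuple[int, int]],
                          pseudocount: float = 1.0) -> Dict[str, float]:
    """Estimate P(reagent binds | protein) from control measurements.

    observations maps a known protein to (times bound, times tested).
    The pseudocount keeps estimates away from exactly 0 or 1 when few
    control spots were measured; set it to 0 for the raw frequency.
    """
    return {
        protein: (bound + pseudocount) / (total + 2.0 * pseudocount)
        for protein, (bound, total) in observations.items()
    }


# Hypothetical control data for a single affinity reagent.
controls = {"standard_A": (90, 100), "standard_B": (0, 10)}
probs = binding_probabilities(controls)
raw = binding_probabilities(controls, pseudocount=0.0)
```

Repeating such an estimate each time the affinity reagent set is used is one way the specificity estimates could be refined with each iteration.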

In some examples, affinity reagents with high, medium, or low binding affinities may be selected. In some cases, affinity reagents with low or medium binding affinity may be preferred. In some cases, the affinity reagent may have a dissociation constant of about 10⁻³ M, 10⁻⁴ M, 10⁻⁵ M, 10⁻⁶ M, 10⁻⁷ M, 10⁻⁸ M, 10⁻⁹ M, 10⁻¹⁰ M, or less than about 10⁻¹⁰ M. In some cases, the affinity reagent can have a dissociation constant of greater than about 10⁻¹⁰ M, 10⁻⁹ M, 10⁻⁸ M, 10⁻⁷ M, 10⁻⁶ M, 10⁻⁵ M, 10⁻⁴ M, 10⁻³ M, 10⁻² M, or greater than 10⁻² M. In some cases, affinity reagents with a low or medium k_off rate or a medium or high k_on rate may be preferred.

Some affinity reagents may be selected to bind to modified amino acid sequences, such as phosphorylated or ubiquitinated amino acid sequences. In some examples, one or more affinity reagents may be selected that have broad specificity for a family of epitopes that may be comprised by one or more proteins. In some examples, one or more affinity reagents may bind two or more different proteins. In some examples, one or more affinity reagents may bind weakly to one or more of its targets. For example, an affinity reagent may bind less than 10%, less than 15%, less than 20%, less than 25%, less than 30%, or less than 35% of one or more of its targets. In some examples, one or more affinity reagents may bind moderately or strongly to one or more targets thereof. For example, an affinity reagent may bind to more than 35%, more than 40%, more than 45%, more than 60%, more than 65%, more than 70%, more than 75%, more than 80%, more than 85%, more than 90%, more than 91%, more than 92%, more than 93%, more than 94%, more than 95%, more than 96%, more than 97%, more than 98%, or more than 99% of one or more of its targets.

To compensate for weak binding, an excess of affinity reagent may be applied to the substrate. The affinity reagent may be applied in an excess of about 1:1, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:1, or 10:1 relative to the sample protein. The affinity reagent may be applied in an excess of about 1:1, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:1, or 10:1 relative to the expected occurrence of an epitope in the sample protein.

To compensate for high affinity reagent off-rates, a linker moiety can be attached to each affinity reagent and used to reversibly link the bound affinity reagent to the substrate or to the unknown protein to which it binds. For example, DNA tags may be attached to the end of each affinity reagent, while different DNA tags are attached to the substrate or to each unknown protein. After the affinity reagent has hybridized to the unknown protein, a linker DNA that is complementary to the affinity reagent-associated DNA tag at one end and to the substrate-associated tag at the other end can be washed over the chip to link the affinity reagent to the substrate and prevent dissociation of the affinity reagent prior to measurement. After binding, the linked affinity reagents can be released by washing in the presence of heat or high salt concentrations that disrupt the DNA linkages.

FIG. 21 shows the enhancement of binding between an affinity reagent and a protein by two hybridization steps, according to some embodiments. In particular, step 1 of FIG. 21 shows affinity reagent hybridization. As seen in step 1, affinity reagent 2110 hybridizes to protein 2130. Protein 2130 is bound to slide 2105. As seen in step 1, affinity reagent 2110 has an attached DNA tag 2120. In some embodiments, the affinity reagent may be attached to more than one DNA tag. In some embodiments, the affinity reagent may have attached 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more than 20 DNA tags. DNA tag 2120 includes a single-stranded DNA (ssDNA) tag having a recognition sequence 2125. In addition, protein 2130 comprises two DNA tags 2140. In some embodiments, the DNA tag may be added using a chemical method that reacts with cysteine in the protein. In some embodiments, the protein may have more than one DNA tag attached. In some embodiments, the protein may have attached 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, or more than 100 DNA tags. Each DNA tag 2140 comprises an ssDNA tag having a recognition sequence 2145.

As seen in step 2, DNA linker 2150 hybridizes to DNA tags 2120 and 2140 attached to affinity reagent 2110 and protein 2130, respectively. The DNA linker 2150 comprises ssDNA having sequences complementary to the recognition sequences 2125 and 2145, respectively. In addition, recognition sequences 2125 and 2145 are located on DNA linker 2150 to allow DNA linker 2150 to bind to both DNA tags 2120 and 2140 simultaneously, as shown in step 2. In particular, the first region 2152 of the DNA linker 2150 selectively hybridizes to the recognition sequence 2125, and the second region 2154 of the DNA linker 2150 selectively hybridizes to the recognition sequence 2145. In some embodiments, the first region 2152 and the second region 2154 can be spaced apart from each other on the DNA linker. In particular, in some embodiments, the first region of the DNA linker and the second region of the DNA linker may be separated between the first region and the second region with a non-hybridizing spacer sequence. Furthermore, in some embodiments, the sequence of the recognition sequence may be less than fully complementary to the DNA linker and may still bind to the DNA linker sequence. In some embodiments, the recognition sequence can be less than 5 nucleotides, 6 nucleotides, 7 nucleotides, 8 nucleotides, 9 nucleotides, 10 nucleotides, 11 nucleotides, 12 nucleotides, 13 nucleotides, 14 nucleotides, 15 nucleotides, 16 nucleotides, 17 nucleotides, 18 nucleotides, 19 nucleotides, 20 nucleotides, 21 nucleotides, 22 nucleotides, 23 nucleotides, 24 nucleotides, 25 nucleotides, 26 nucleotides, 27 nucleotides, 28 nucleotides, 29 nucleotides, or 30 nucleotides in length or more than 30 nucleotides in length. In some embodiments, the recognition sequence may have one or more mismatches with a complementary DNA tag sequence. 
In some embodiments, about one tenth of the nucleotides of the recognition sequence may be mismatched with the complementary DNA tag sequence and still hybridize to the complementary DNA tag sequence. In some embodiments, less than one tenth of the nucleotides of the recognition sequence may be mismatched with the complementary DNA tag sequence and still hybridize to the complementary DNA tag sequence. In some embodiments, about two tenths of the nucleotides of the recognition sequence may be mismatched with the complementary DNA tag sequence and still hybridize to the complementary DNA tag sequence. In some embodiments, more than two tenths of the nucleotides of the recognition sequence may be mismatched with the complementary DNA tag sequence and still hybridize to the complementary DNA tag sequence.

The affinity reagent may further comprise a magnetic component. The magnetic component may be used to manipulate some or all of the bound affinity reagents into the same imaging plane or z-stack (stack). Manipulating some or all of the affinity reagents into the same imaging plane can improve the quality of the imaging data and reduce noise in the system.

As used herein, the term "detector" generally refers to a device capable of detecting a signal, including a signal indicative of the presence or absence of a binding event of an affinity reagent to a protein. The signal may be a direct signal indicative of the presence or absence of a binding event, such as a Surface Plasmon Resonance (SPR) signal. The signal may be an indirect signal, such as a fluorescent signal, indicative of the presence or absence of a binding event. In some cases, the detector may include optical and/or electronic components that can detect the signal. A detector may be used in a detection method. Non-limiting examples of detection methods include optical detection, spectroscopic detection, electrostatic detection, electrochemical detection, magnetic detection, fluorescent detection, Surface Plasmon Resonance (SPR), and the like. Examples of optical detection methods include, but are not limited to, fluorimetry and ultraviolet-visible light absorption. Examples of spectroscopic detection methods include, but are not limited to, mass spectrometry, Nuclear Magnetic Resonance (NMR) spectroscopy, and infrared spectroscopy. Examples of electrostatic detection methods include, but are not limited to, gel-based techniques such as gel electrophoresis. Examples of electrochemical detection methods include, but are not limited to, electrochemical detection of amplification products after separation of the amplification products by high performance liquid chromatography.

Identification of proteins in samples

Proteins are important building blocks of cells and tissues of living organisms. A given organism produces a large set of different proteins, commonly referred to as a proteome. Proteomes can vary over time and with various stages (e.g., cell cycle stages or disease states) that a cell or organism undergoes. Large-scale studies or measurements of proteomes (e.g., experimental analysis) can be referred to as proteomics. In proteomics, there are a variety of methods for identifying proteins, including immunoassays (e.g., enzyme linked immunosorbent assay (ELISA) and Western blotting), mass spectrometry-based methods (e.g., matrix-assisted laser desorption/ionization (MALDI) and electrospray ionization (ESI)), hybrid methods (e.g., Mass Spectrometry Immunoassay (MSIA)), and protein microarrays. For example, single molecule proteomics approaches can attempt to infer the identity of protein molecules in a sample by a variety of methods, ranging from direct functionalization of amino acids to the use of affinity reagents. The information or measurements collected from such methods are typically analyzed by a suitable algorithm to identify the proteins present in the sample.

Accurate quantification of proteins can also be challenging due to lack of sensitivity, lack of specificity, and detector noise. In particular, accurate quantification of proteins in a sample may be challenging due to random and unpredictable systematic variations in detector signal levels, which may lead to errors in protein identification and quantification. In some cases, instrument and detection systematics may be calibrated and removed by monitoring instrument diagnostics and common-mode behavior. However, binding of proteins (e.g., by affinity reagent probes) is inherently a probabilistic process, and binding sensitivity and specificity may be less than ideal.

The present disclosure provides methods and systems for accurate and efficient identification of proteins. The methods and systems provided herein can significantly reduce or eliminate errors in identifying proteins in a sample. Such methods and systems can enable accurate and efficient identification of candidate proteins within unknown protein samples. The protein identification may be based on calculations using empirically measured information of unknown proteins in the sample. For example, empirical measurements may include binding information for affinity reagent probes configured to selectively bind to one or more candidate proteins, protein length, protein hydrophobicity, and/or isoelectric point. Protein identification can be optimized to be calculable with minimal memory usage. The protein identification may include estimating a confidence level for the presence of each of the one or more candidate proteins in the sample.

In one aspect, disclosed herein is a computer-implemented method 100 (e.g., as shown in FIG. 1) of identifying proteins within an unknown protein sample. The method can be applied independently to each unknown protein in the sample to generate a collection of proteins identified in the sample. The amount of each protein can be calculated by counting the number of identifications of each candidate protein. The method of identifying a protein can include receiving, by a computer, information of a plurality of empirical measurements of an unknown protein in a sample (e.g., step 105). Empirical measurements may include (i) a measure of binding of each of one or more affinity reagent probes to one or more unknown proteins in the sample; (ii) a length of the one or more unknown proteins; (iii) hydrophobicity of the one or more unknown proteins; and/or (iv) the isoelectric point of the one or more unknown proteins. In some embodiments, the plurality of affinity reagent probes may comprise a pool of a plurality of individual affinity reagent probes. For example, the pool of affinity reagent probes may comprise 2, 3, 4, 5, 6, 7, 8, 9, 10, or more than 10 types of affinity reagent probes. In some embodiments, the affinity reagent probe pool may comprise 2 types of affinity reagent probes, the combination of which constitutes a majority of the composition of the affinity reagent probes in the affinity reagent probe pool. In some embodiments, the affinity reagent probe pool may comprise 3 types of affinity reagent probes, the combination of which constitutes a majority of the composition of the affinity reagent probes in the affinity reagent probe pool. In some embodiments, the affinity reagent probe pool may comprise 4 types of affinity reagent probes, the combination of which constitutes a majority of the composition of the affinity reagent probes in the affinity reagent probe pool. 
In some embodiments, the affinity reagent probe pool may comprise 5 types of affinity reagent probes, the combination of which constitutes a majority of the composition of the affinity reagent probes in the affinity reagent probe pool. In some embodiments, the pool of affinity reagent probes may comprise more than 5 types of affinity reagent probes, the combination of which constitutes a majority of the composition of the affinity reagent probes in the pool of affinity reagent probes. Each affinity reagent probe may be configured to selectively bind to one or more candidate proteins of the plurality of candidate proteins. The affinity reagent probes may be k-mer affinity reagent probes. In some embodiments, each k-mer affinity reagent probe is configured to selectively bind to one or more candidate proteins of the plurality of candidate proteins. The empirically measured information may comprise binding measurements for a set of probes that are believed to have bound to an unknown protein.

Next, at least a portion of the empirically measured information for the unknown protein can be compared by the computer to a database comprising a plurality of protein sequences (e.g., step 110). Each protein sequence may correspond to a candidate protein of the plurality of candidate proteins. The plurality of candidate proteins may comprise at least 10, at least 20, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, at least 100, at least 150, at least 200, at least 250, at least 300, at least 350, at least 400, at least 450, at least 500, at least 600, at least 700, at least 800, at least 900, at least 1000, or more than 1000 different candidate proteins.

Next, for each of one or more candidate proteins in the plurality of candidate proteins, a probability that an empirical measurement of the candidate protein will generate an observed measurement may be calculated or generated by the computer (e.g., in step 115). As used herein, the term "measurement" refers to information observed when a measurement is taken. For example, the measurement of an affinity reagent binding assay may be a positive or negative result, such as binding or non-binding of the reagent. As another example, the measurement result of an experiment measuring the length of a protein may be 417 amino acids. Additionally or alternatively, for each of one or more candidate proteins in the plurality of candidate proteins, a probability that an empirical measurement of the candidate protein will not generate an observed measurement may be calculated or generated by the computer. Additionally or alternatively, the probability that an empirical measurement of the candidate protein will yield an unobserved measurement may be calculated or generated by a computer. Additionally or alternatively, the probability that a series of empirical measurements on the candidate protein will generate a result set may be calculated or generated by a computer.

As used herein, a "result set" refers to a plurality of independent measurements of a protein. For example, a series of empirical affinity reagent binding measurements may be performed on unknown proteins. The binding measurement of each individual affinity reagent comprises a measurement result, and the set of all measurement results is a result set. In some cases, the result set may be a subset of all observed results. In some cases, the result set may consist of measurements that are not observed empirically. Additionally or alternatively, for each of one or more candidate proteins in the plurality of candidate proteins, the probability that the unknown protein is a candidate protein may be calculated or generated by a computer. The calculation or generation of steps 115 and/or 120 may be performed iteratively or non-iteratively. The probabilities in step 115 may be generated based on a comparison of empirical measurements of unknown proteins to a database containing multiple protein sequences for all candidate proteins. Thus, the input to the algorithm may include a database of candidate protein sequences and a set of empirical measurements of unknown proteins (e.g., probes believed to have bound to the unknown protein, the length of the unknown protein, the hydrophobicity of the unknown protein, and/or the isoelectric point of the unknown protein). In some cases, the input to the algorithm may include a parameter related to estimating the probability (e.g., trimer-level binding probability of each affinity agent) that any affinity agent generates any binding measure for any candidate protein. 
The output of the algorithm may include (i) the probability of observing a measurement or result set given a hypothesized candidate protein identity, (ii) the most likely identity for the unknown protein, selected from the set of candidate proteins, and the probability that this identification is correct given a measurement or result set (e.g., in step 120), and/or (iii) a set of high-probability candidate protein identities and the associated probability that the unknown protein is one of the proteins in the set. The probability of observing a measurement, assuming that the candidate protein is the protein being measured, can be expressed as P(measurement | protein).

In some embodiments, P(measurement | protein) is calculated entirely by computer (in silico). In some embodiments, P(measurement | protein) is calculated based on, or derived from, features of the amino acid sequence of the protein. In some embodiments, P(measurement | protein) is calculated independently of knowledge of the amino acid sequence of the protein. For example, P(measurement | protein) can be determined empirically by taking measurements in repeated experiments on isolates of the candidate protein and calculating P(measurement | protein) from the frequency (number of measurements with the result) / (total number of measurements). In some embodiments, P(measurement | protein) is derived from a database of past measurements of proteins. In some embodiments, P(measurement | protein) is calculated by generating a set of confident protein identifications from a collection of unknown proteins with the measurement in question withheld (censored), and then calculating the frequency of the measurement among the unknown proteins confidently identified as the candidate protein. In some embodiments, a seed value for P(measurement | protein) may be used to identify a set of unknown proteins, and the seed value is refined based on the frequency of the measurement among unknown proteins that are confidently matched to the candidate protein. In some embodiments, the process is repeated, wherein new identifications are generated based on the updated measurement probabilities, and new measurement probabilities are then generated from the updated set of confident identifications.
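The frequency-based estimate and the refinement from confident identifications described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names, the tuple layout of `identifications`, and the 0.99 confidence cutoff are all assumptions made for the example.

```python
def outcome_frequency(outcomes):
    """Estimate P(measurement | protein) as
    (number of measurements with the result) / (total number of measurements)."""
    if not outcomes:
        raise ValueError("need at least one measurement")
    return sum(1 for o in outcomes if o) / len(outcomes)


def refine_from_identifications(identifications, min_confidence=0.99):
    """Re-estimate P(bind | protein) from confidently identified unknowns.

    identifications -- list of (candidate_name, confidence, bound) tuples,
                       one per unknown protein (hypothetical layout)
    min_confidence  -- assumed cutoff for a "confident" identification
    """
    by_candidate = {}
    for name, confidence, bound in identifications:
        if confidence >= min_confidence:
            by_candidate.setdefault(name, []).append(bound)
    return {name: outcome_frequency(obs) for name, obs in by_candidate.items()}


# Repeated experiments on an isolate: the probe bound in 3 of 4 runs.
p_bind = outcome_frequency([True, True, False, True])

# One refinement pass over identifications made with seed probabilities.
updated = refine_from_identifications([
    ("P1", 0.999, True),
    ("P1", 0.995, False),
    ("P1", 0.500, True),   # not confident, ignored
    ("P2", 0.999, True),
])
```

In an iterative scheme, the `updated` probabilities would be fed back into the identification step and the pass repeated until the estimates stop changing.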

Assuming that the candidate protein is the protein being measured, the probability of not observing a given measurement can be expressed as:

P(measurement not observed | protein) = 1 - P(measurement | protein).

Assuming that the candidate protein is the protein being measured, the probability of observing a measurement set consisting of N individual measurements can be expressed as the product of the probabilities of each individual measurement:

P(result set | protein) = P(measurement 1 | protein) × P(measurement 2 | protein) × … × P(measurement N | protein)

The probability that the unknown protein is a given candidate protein (protein_i) may be calculated based on the probabilities of the result set for each possible candidate protein.
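One way to carry out this calculation is to normalize the result-set likelihood of each candidate by the sum over all candidates. The sketch below assumes that every candidate is equally likely a priori and that the measurements are independent binary binding results; neither the uniform prior nor the names `result_set_likelihood` and `candidate_posteriors` come from the text above.

```python
def result_set_likelihood(p_bind, observed):
    """P(result set | protein) as a product over independent measurements.

    p_bind[k]   -- assumed P(probe k binds | protein)
    observed[k] -- True if probe k was observed to bind, else False
    """
    likelihood = 1.0
    for p, hit in zip(p_bind, observed):
        # A non-binding result contributes 1 - P(measurement | protein).
        likelihood *= p if hit else (1.0 - p)
    return likelihood


def candidate_posteriors(binding_table, observed):
    """P(protein_i | result set) under an assumed uniform prior."""
    likelihoods = {
        name: result_set_likelihood(p_bind, observed)
        for name, p_bind in binding_table.items()
    }
    total = sum(likelihoods.values())
    if total == 0.0:
        raise ValueError("no candidate can explain the result set")
    return {name: value / total for name, value in likelihoods.items()}


# Two hypothetical candidates measured with two probes.
table = {"A": [0.9, 0.1], "B": [0.5, 0.5]}
posteriors = candidate_posteriors(table, [True, False])
```

For databases with many probes, the product is usually accumulated as a sum of logarithms to avoid numerical underflow; the plain product is kept here for readability.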

In some embodiments, the measurement result set comprises binding of affinity reagent probes. In some embodiments, the measurement set comprises non-specific binding of affinity reagent probes.

In some embodiments, the protein in the sample is truncated or degraded. In some embodiments, the protein in the sample does not contain the C-terminus of the original protein. In some embodiments, the protein in the sample does not contain the N-terminus of the original protein. In some embodiments, the protein in the sample does not contain the N-terminus and does not contain the C-terminus of the original protein.

In some embodiments, the empirical measurements comprise measurements performed on a mixture of antibodies. In some embodiments, the empirical measurements comprise measurements performed on protein-containing samples from a plurality of species. In some embodiments, the empirical measurements comprise measurements performed on samples derived from humans. In some embodiments, the empirical measurements comprise measurements performed on samples derived from species other than human. In some embodiments, the empirical measurements comprise measurements taken on a sample in the presence of single amino acid variations (SAVs) caused by non-synonymous single nucleotide polymorphisms (SNPs). In some embodiments, the empirical measurements include measurements of the sample in the presence of genomic structural variations, such as insertions, deletions, translocations, inversions, segmental duplications, or copy number variations (CNVs), that affect the sequence of the proteins in the sample.

In some embodiments, the method further comprises applying the method to all unknown proteins measured in the sample. In some embodiments, the method further comprises generating, for each of the one or more candidate proteins, a confidence level that the candidate protein matches the unknown protein measured in the sample. The confidence level may include a probability value. Alternatively, the confidence level may include a probability value with an error. Alternatively, the confidence level may include a range of probability values, optionally with a confidence (e.g., about 90%, about 95%, about 96%, about 97%, about 98%, about 99%, about 99.9%, about 99.99%, about 99.999%, about 99.9999%, about 99.99999%, about 99.999999%, about 99.9999999%, about 99.99999999%, about 99.999999999%, about 99.9999999999%, about 99.99999999999%, about 99.999999999999%, about 99.9999999999999% confidence, or above 99.9999999999999% confidence).

In some embodiments, the method further comprises generating a probability that the candidate protein is present in the sample.

In some embodiments, the method further comprises generating protein identifications and associated probabilities independently for each unknown protein in the sample and generating a list of all unique proteins identified in the sample. In some embodiments, the method further comprises counting the number of identifications generated for each unique candidate protein to determine the amount of each candidate protein in the sample. In some embodiments, the set of protein identifications and associated probabilities may be filtered to include only high-scoring, high-confidence, and/or low false discovery rate identifications.

In some embodiments, the probability of binding of an affinity reagent to the full-length candidate protein can be generated. In some embodiments, the probability of binding of an affinity reagent to a protein fragment (e.g., a subsequence of the complete protein sequence) can be generated. For example, if unknown proteins are treated and conjugated to a substrate in a manner such that only the first 100 amino acids of each unknown protein are conjugated, then a binding probability can be generated for each protein candidate such that all binding probabilities that bind epitopes other than the first 100 amino acids are set to zero or to a very low probability that represents an error rate. Similar methods can be used if the first 10, 20, 50, 100, 150, 200, 300, 400, or more than 400 amino acids of each protein are conjugated to the substrate. Similar methods can be used if the last 10, 20, 50, 100, 150, 200, 300, 400, or more than 400 amino acids are conjugated to the substrate.
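The fragment case can be illustrated with a short sketch. It assumes a simple model that is not specified in this passage: the probe recognizes one amino acid motif, each accessible occurrence of the motif binds independently with probability `p_site`, and a small `floor` value stands in for the error rate when no epitope is accessible. All names and default values are hypothetical.

```python
def count_occurrences(sequence, motif):
    """Count (possibly overlapping) occurrences of motif in sequence."""
    width = len(motif)
    return sum(
        1 for i in range(len(sequence) - width + 1)
        if sequence[i:i + width] == motif
    )


def fragment_binding_probability(sequence, motif, p_site,
                                 n_accessible=100, floor=0.001):
    """P(probe binds | candidate) when only the first n_accessible
    amino acids of the candidate are available for binding."""
    accessible = sequence[:n_accessible]
    n_sites = count_occurrences(accessible, motif)
    if n_sites == 0:
        # Epitopes outside the accessible region contribute only the
        # assumed error-rate floor.
        return floor
    # Probability that at least one of n_sites independent sites binds.
    return max(floor, 1.0 - (1.0 - p_site) ** n_sites)


# Toy candidate: one AAA epitope near the N-terminus, one near the C-terminus.
candidate = "AAAG" + "G" * 100 + "AAA"
p_fragment = fragment_binding_probability(candidate, "AAA", 0.25)
p_full = fragment_binding_probability(candidate, "AAA", 0.25,
                                      n_accessible=len(candidate))
```

Restricting binding to the last residues instead would only change the slice, for example `sequence[-n_accessible:]`.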

In some embodiments, a set of potential protein candidate matches may be assigned to an unknown protein in the event that a single protein candidate match cannot be assigned to the unknown protein. A confidence level may be assigned to the unknown protein being one of the protein candidates in the set. The confidence level may include a probability value. Alternatively, the confidence level may include a probability value with an error. Alternatively, the confidence level may include a range of probability values, optionally with a confidence (e.g., about 90%, about 95%, about 96%, about 97%, about 98%, about 99%, about 99.9%, about 99.99%, about 99.999%, about 99.9999%, about 99.99999%, about 99.999999%, about 99.9999999%, about 99.99999999%, about 99.999999999%, about 99.9999999999%, about 99.99999999999%, about 99.999999999999%, about 99.9999999999999% confidence, or above 99.9999999999999% confidence). For example, an unknown protein may be a strong match to two protein candidates. The two protein candidates may have a high degree of sequence similarity to each other (e.g., two protein isoforms, such as proteins having a single amino acid variant as compared to a canonical sequence). In these cases, no individual protein candidate may be assigned a high confidence, but a high confidence may be assigned to the unknown protein matching a single, but unidentified, member of the "protein group" that contains the two strongly matching protein candidates.
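A protein group of this kind could be formed by accumulating candidates in order of decreasing probability until their combined probability reaches a target confidence. The sketch below assumes the candidates are mutually exclusive, so that their probabilities add; the function name and the 0.95 target are illustrative assumptions.

```python
def protein_group(posteriors, min_confidence=0.95):
    """Smallest top-ranked set of candidates whose summed probability
    reaches min_confidence. Returns (group, group_probability)."""
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    group, total = [], 0.0
    for name, probability in ranked:
        group.append(name)
        total += probability
        if total >= min_confidence:
            break
    return group, total


# Two near-identical isoforms split the probability between them.
group, confidence = protein_group({"ISO1": 0.49, "ISO2": 0.48, "OTHER": 0.03})
```

Here neither isoform alone reaches a high confidence, but the two-member group does.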

In some embodiments, efforts may be made to detect cases in which unknown proteins are not optically resolved. For example, in rare cases, two or more proteins may bind in the same "well" or location of the substrate, despite efforts to prevent this. In some cases, the conjugated proteins may be treated with a non-specific dye and the signal from the dye measured. If two or more proteins are not optically resolved, the signal generated by the dye may be higher than that from a location containing a single protein, and may be used to flag locations with multiple bound proteins.

In some embodiments, the plurality of candidate proteins is generated or modified by sequencing or analyzing DNA or RNA of a human or organism from which the unknown protein sample is obtained or derived.

In some embodiments, the method further comprises obtaining information about post-translational modifications (PTMs) of the unknown protein. Information about a post-translational modification can include the presence of the post-translational modification without knowledge of the nature of the particular modification. The database can be considered an exponential product of PTMs. For example, once a protein candidate sequence has been assigned to an unknown protein, the pattern of affinity reagent binding for the assayed protein may be compared to a database containing binding measurements of the affinity reagents from previous experiments with the same candidate. For example, a database of binding measurements can be derived from binding to a Nucleic Acid-Programmable Protein Array (NAPPA) containing unmodified proteins of known sequence at known positions.

Additionally or alternatively, a database of binding measurements can be obtained from previous experiments in which protein candidate sequences are confidently assigned to unknown proteins. The difference in binding measurements between the protein being tested and the existing measurement database can provide information about the likelihood of post-translational modification. For example, if an affinity agent has a high binding frequency to a candidate protein in the database, but does not bind to the protein being assayed, there is a high likelihood that a post-translational modification will be present somewhere on the protein. If the binding epitope of the affinity reagent is known for which there is a binding difference, the location of the post-translational modification can be located at or near the binding epitope of the affinity reagent. In some embodiments, information about a particular post-translational modification can be derived by performing repeated affinity reagent measurements before and after treating the protein-substrate conjugate with an enzyme that specifically removes the particular post-translational modification. For example, a series of binding measurements of the affinity reagent can be taken prior to treating the substrate with the phosphatase enzyme, and then repeated after treating the substrate with the phosphatase enzyme. Affinity reagents that bind to an unknown protein prior to phosphatase treatment but do not bind (differentially bind) after phosphatase treatment may provide evidence of phosphorylation. If the epitope recognized by the differentially binding affinity reagent is known, phosphorylation can be located at or near the binding epitope of the affinity reagent.
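Both comparisons described above reduce to simple set-style filters. The sketch below is illustrative only: the probe names, the dictionary layouts, and the 0.9 frequency cutoff are assumptions, and a real analysis would weigh the evidence probabilistically rather than apply a hard threshold.

```python
def missing_expected_binders(db_frequency, observed, min_frequency=0.9):
    """Probes that usually bind the assigned candidate (per the database)
    but did not bind the protein being assayed.

    db_frequency -- probe name -> historical binding frequency
    observed     -- probe name -> True/False binding for this protein
    """
    return sorted(
        probe for probe, freq in db_frequency.items()
        if freq >= min_frequency and probe in observed and not observed[probe]
    )


def lost_after_enzyme(before, after):
    """Probes that bound before enzyme treatment (e.g., phosphatase)
    but no longer bind afterwards (differential binders)."""
    return sorted(
        probe for probe, bound in before.items()
        if bound and probe in after and not after[probe]
    )


# Hypothetical probes; "anti-pS" stands in for a phospho-specific reagent.
database = {"anti-AAA": 0.95, "anti-GSG": 0.20, "anti-LKE": 0.97}
assay = {"anti-AAA": True, "anti-GSG": False, "anti-LKE": False}
suspects = missing_expected_binders(database, assay)

differential = lost_after_enzyme(
    before={"anti-pS": True, "anti-AAA": True},
    after={"anti-pS": False, "anti-AAA": True},
)
```

If the epitope of a flagged probe is known, its position in the candidate sequence gives a first estimate of where the modification lies.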

In some cases, a count of a particular post-translational modification can be determined using binding measurements of an affinity reagent specific for that post-translational modification. For example, antibodies that recognize phosphorylation events can be used as affinity reagents. Binding of the reagent may indicate the presence of at least one phosphorylation on the unknown protein. In some cases, the number of discrete post-translational modifications of a particular type on an unknown protein can be determined by counting the number of binding events measured for an affinity reagent specific for that post-translational modification. For example, a phosphorylation-specific antibody can be conjugated to a fluorescent reporter. In this case, the intensity of the fluorescent signal can be used to determine the number of phosphorylation-specific affinity reagents bound to the unknown protein. The number of phosphorylation-specific affinity reagents bound to the unknown protein can then be used to determine the number of phosphorylation sites on the unknown protein. In some embodiments, evidence from affinity reagent binding experiments can be combined with prior knowledge of amino acid sequence motifs or specific protein positions that are likely to be post-translationally modified (e.g., from dbPTM, PhosphoSitePlus, or UniProt) to arrive at a more accurate count, identification, or localization of post-translational modifications. For example, if the position of a post-translational modification cannot be accurately determined from affinity measurements alone, positions that contain amino acid sequence motifs often associated with the post-translational modification of interest may be favored.
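As a rough sketch of the intensity-based count, the background-corrected signal can be divided by the signal expected from a single bound reporter. This assumes a linear detector response and a known per-reporter intensity, neither of which is stated above; the function name and parameters are hypothetical.

```python
def estimate_ptm_count(signal, background, per_reporter_signal):
    """Estimate the number of bound PTM-specific affinity reagents from
    fluorescence intensity, assuming signal scales linearly with count."""
    if per_reporter_signal <= 0:
        raise ValueError("per_reporter_signal must be positive")
    count = round((signal - background) / per_reporter_signal)
    # Noise can push the corrected signal below zero; clamp at zero.
    return max(0, count)


# Hypothetical intensities in arbitrary units.
n_sites = estimate_ptm_count(signal=3100.0, background=100.0,
                             per_reporter_signal=1000.0)
```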

In some embodiments, the probabilities are iteratively generated until a predetermined condition is satisfied. In some embodiments, the predetermined condition comprises generating each of the plurality of probabilities with a confidence of at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 91%, at least 92%, at least 93%, at least 94%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, at least 99.9%, at least 99.99%, at least 99.999%, at least 99.9999%, at least 99.99999%, at least 99.999999%, at least 99.9999999%, at least 99.99999999%, at least 99.999999999%, at least 99.9999999999%, at least 99.99999999999%, at least 99.999999999999%, at least 99.9999999999999%, or above 99.9999999999999%.

In some embodiments, the method further comprises generating a paper or electronic report identifying one or more unknown proteins in the sample. The paper or electronic report may further indicate, for each candidate protein, a confidence level that the candidate protein is present in the sample. The confidence level may include a probability value. Alternatively, the confidence level may include a probability value with an error. Alternatively, the confidence level may include a range of probability values, optionally with a confidence level (e.g., about 90%, about 95%, about 96%, about 97%, about 98%, about 99%, about 99.9%, about 99.99%, about 99.999%, about 99.9999%, about 99.99999%, about 99.999999%, about 99.9999999%, about 99.999999%, about 99.999999999%, about 99.99999999%, about 99.9999999%, about 99.999999%, about 99.99999999999%, about 99.999999999999%, about 99.9999999999999% confidence or above 99.9999999999999% confidence). The paper or electronic report may further indicate a list of protein candidates identified below an expected false discovery rate threshold (e.g., a false discovery rate of less than 10%, 9%, 8%, 7%, 6%, 5%, 4%, 3%, 2%, 1%, 0.5%, 0.4%, 0.3%, 0.2%, or 0.1%). The false discovery rate can be estimated as follows: protein identifications were first ranked in descending order of confidence. The estimated false discovery rate for any point in the sorted list can then be calculated as 1-avg _ c _ prob, where avg _ c _ prob is the average candidate probability for all proteins at or before (e.g., with a higher confidence than) the current point in the list. A list of protein identifications below the desired threshold of false discovery rate can then be generated by returning all protein identifications in the sorted list prior to the earliest point in the list at which the false discovery rate is above the threshold. 
Alternatively, a list of protein identifications below the desired false discovery rate threshold may be generated by returning all proteins in the sorted list up to and including the last point at which the false discovery rate is less than or equal to the desired threshold.
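The false discovery rate estimate and the two list-selection rules described above can be sketched as follows. This is a minimal illustration, not the implementation of the disclosure: the function names, the tuple layout (name, confidence, candidate probability), and the use of a ranking confidence that is stored separately from the candidate probability are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical record layout: (protein name, ranking confidence, candidate probability).
Identification = Tuple[str, float, float]


def fdr_curve(ids: List[Identification]) -> List[Tuple[str, float]]:
    """Rank by descending confidence and return (name, estimated FDR) per list position.

    The estimated FDR at a position is 1 - avg_c_prob, where avg_c_prob is the
    mean candidate probability of all identifications at or before that position.
    """
    ranked = sorted(ids, key=lambda rec: rec[1], reverse=True)
    curve = []
    running_sum = 0.0
    for position, (name, _confidence, c_prob) in enumerate(ranked, start=1):
        running_sum += c_prob
        curve.append((name, 1.0 - running_sum / position))
    return curve


def select_before_first_exceedance(ids: List[Identification], threshold: float) -> List[str]:
    """First rule: return everything ranked before the earliest point whose FDR exceeds the threshold."""
    selected = []
    for name, fdr in fdr_curve(ids):
        if fdr > threshold:
            break
        selected.append(name)
    return selected


def select_through_last_passing(ids: List[Identification], threshold: float) -> List[str]:
    """Second rule: return everything up to and including the last point whose FDR is at or below the threshold."""
    curve = fdr_curve(ids)
    last_passing = -1
    for index, (_name, fdr) in enumerate(curve):
        if fdr <= threshold:
            last_passing = index
    return [name for name, _fdr in curve[: last_passing + 1]]
```

When the list is ranked by the candidate probability itself, the running average can only decrease down the list, so the estimated false discovery rate never decreases and the two rules return the same list. They can differ when the ranking confidence is not identical to the candidate probability, because the estimate may then rise above the threshold and later fall back below it.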

In some embodiments, the sample comprises a biological sample. The biological sample may be obtained from a subject. In some embodiments, the method further comprises determining a disease state or condition in the subject based at least on the plurality of probabilities. In some embodiments, the method further comprises quantifying the proteins by counting the number of identifications made for each protein candidate. For example, the absolute amount of a protein (e.g., the number of protein molecules) present in a sample can be calculated by counting the number of confident identifications generated for the protein candidate. In some embodiments, the amount may be calculated as a percentage of the total number of unknown proteins identified. In some embodiments, the raw identification counts may be calibrated to remove systematic errors from the instrument and detection system. In some embodiments, the amount may be calibrated to remove quantification bias caused by differences in the detectability of the protein candidates. The detectability of proteins can be assessed by empirical measurements or computer simulations.
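As a minimal sketch of the counting-based quantification described above, the following example tallies confident identifications per protein candidate and expresses each as a fraction of the total. The function name, the input layout, and the calibration step (dividing each raw count by a per-protein detectability between 0 and 1) are assumptions chosen for illustration; the text does not specify a particular calibration formula.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def quantify(
    confident_ids: Iterable[str],
    detectability: Optional[Dict[str, float]] = None,
) -> Dict[str, Tuple[float, float]]:
    """Return {protein: (calibrated count, fraction of total)}.

    confident_ids holds one protein name per confident identification
    (for example, one per substrate location). detectability is a hypothetical
    per-protein value in (0, 1] estimating the fraction of molecules of that
    protein that yield a confident identification; proteins missing from the
    mapping are treated as fully detectable.
    """
    raw_counts = Counter(confident_ids)
    calibrated = {}
    for protein, count in raw_counts.items():
        d = 1.0 if detectability is None else detectability.get(protein, 1.0)
        if not 0.0 < d <= 1.0:
            raise ValueError(f"detectability for {protein} must be in (0, 1]")
        # Assumed calibration: scale the raw count up by the inverse detectability.
        calibrated[protein] = count / d
    total = sum(calibrated.values())
    return {
        protein: (value, value / total if total else 0.0)
        for protein, value in calibrated.items()
    }
```

For instance, six identifications of a protein with detectability 0.75 and two identifications of a protein with detectability 0.5 calibrate to 8 and 4 molecules, so the less detectable protein accounts for one third of the total rather than the one quarter suggested by the raw counts.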

The disease or disorder can be an infectious disease, an immune disorder or disease, a cancer, a genetic disease, a degenerative disease, a lifestyle disease, an injury, a rare disease, or an age-related disease. The infectious disease may be caused by bacteria, viruses, fungi and/or parasites. Non-limiting examples of cancer include bladder cancer, lung cancer, brain cancer, melanoma, breast cancer, non-Hodgkin's lymphoma, cervical cancer, ovarian cancer, colorectal cancer, pancreatic cancer, esophageal cancer, prostate cancer, renal cancer, skin cancer, leukemia, thyroid cancer, liver cancer, and uterine cancer. Some examples of genetic diseases or disorders include, but are not limited to, multiple sclerosis (MS), cystic fibrosis, Charcot-Marie-Tooth disease, Huntington's disease, Peutz-Jeghers syndrome, Down syndrome, rheumatoid arthritis, and Tay-Sachs disease. Non-limiting examples of lifestyle diseases include obesity, diabetes, arteriosclerosis, heart disease, stroke, hypertension, cirrhosis, nephritis, cancer, chronic obstructive pulmonary disease (COPD), hearing problems, and chronic back pain. Some examples of injuries include, but are not limited to, abrasions, brain injuries, bruises, burns, concussions, congestive heart failure, construction injuries, dislocations, flail chest, bone fractures, hemothorax, herniated discs, hip pointers, hypothermia, lacerations, pinched nerves, pneumothorax, rib fractures, sciatica, spinal cord injuries, tendon, ligament, and fascia injuries, traumatic brain injuries, and whiplash injuries.

In some embodiments, the method comprises identifying and quantifying a small molecule (e.g., a metabolite) or glycan instead of or in addition to a protein. For example, glycans can be identified using affinity reagents, such as lectins or antibodies, that bind to sugars or combinations of sugars with differing propensities. The propensity of an affinity reagent to bind to various sugars or combinations of sugars can be characterized by analyzing its binding to commercially available glycan arrays. For example, unknown glycans can be conjugated to functionalized substrates using hydroxyl-reactive chemistry, and binding measurements can be obtained using glycan-binding affinity reagents. The binding measurements of the affinity reagents to unknown glycans on the substrate can be used directly to quantify the number of glycans with a particular saccharide or combination of saccharides. Alternatively, one or more binding measurements can be compared to binding measurements predicted from a database of candidate glycan structures using the methods described herein to identify the structure of each unknown glycan. In some embodiments, a protein is bound to a substrate and a binding measurement is performed with a glycan affinity reagent to identify glycans attached to the protein. In addition, binding measurements can be performed in a single experiment using both glycan and protein affinity reagents to generate both protein backbone sequence identifications and conjugated glycan identifications. As another example, metabolites may be conjugated to functionalized substrates using chemistry directed to coupling groups common in metabolites, such as sulfhydryl, carbonyl, amine, or active hydrogen groups. Binding measurements can be performed using affinity reagents having different propensities for particular functional groups, structural motifs, or metabolites.
The resulting binding measurements can be compared to predicted binding measurements of a database of candidate small molecules, and metabolites at each location on the substrate can be identified using the methods described herein.
