Machine learning enabled biopolymer assembly

Document No.: 927740    Publication date: 2021-03-02

Note: This document, Machine learning enabled biopolymer assembly, was created by 明·迪克·曹 on 2019-05-13. Abstract: Machine learning techniques for generating macromolecular biopolymer assemblies are described. For example, a system may use machine learning techniques to generate a genomic assembly of an organism's DNA, a gene sequence of a portion of an organism's DNA, or an amino acid sequence of a protein. The system can access biopolymer sequences generated by a sequencing device and an assembly generated from the sequences. The system may generate an input to a machine learning model using the sequences and the assembly, and provide the input to the machine learning model to obtain a corresponding output. The system can use the corresponding output to identify biopolymers at positions in the assembly, and then update the assembly to indicate the identified biopolymers at those positions to obtain an updated assembly.

1. A method of generating a macromolecular biopolymer assembly, the method comprising:

performing, using at least one computer hardware processor:

accessing a plurality of biopolymer sequences and an assembly, the assembly indicating biopolymers present at respective assembly locations;

generating a first input to be provided to a trained deep learning model using the plurality of biopolymer sequences and the assembly;

providing the first input to the trained deep learning model to obtain a corresponding first output, wherein the first output indicates, for each assembly location of a plurality of first assembly locations, one or more likelihoods that each of one or more respective biopolymers is present at the location;

identifying biopolymers at the plurality of first assembly locations using the first output of the trained deep learning model; and

updating the assembly to indicate the identified biopolymers at the plurality of first assembly locations to obtain an updated assembly.

2. The method of claim 1, wherein the macromolecule comprises a protein, the plurality of biopolymer sequences comprises a plurality of amino acid sequences, and the assembly indicates amino acids at corresponding assembly positions.

3. The method of claim 1 or any other preceding claim, wherein the macromolecule comprises a nucleic acid, the plurality of biopolymer sequences comprises a plurality of nucleotide sequences, and the assembly indicates nucleotides at respective assembly positions.

4. The method of claim 3 or any other preceding claim, wherein:

the assembly indicates a first nucleotide at a first assembly position of the plurality of first assembly positions;

identifying biopolymers at the plurality of first assembly positions comprises identifying a second nucleotide at the first assembly position; and

updating the assembly comprises updating the assembly to indicate the second nucleotide at the first assembly position.

5. The method of claim 3 or any other preceding claim, wherein after updating the assembly to obtain an updated assembly, the method further comprises:

aligning the plurality of nucleotide sequences to the updated assembly;

generating a second input to be provided to the trained deep learning model using the plurality of nucleotide sequences and the updated assembly;

providing the second input to the trained deep learning model to obtain a respective second output, wherein the second output indicates, for each assembly position of a plurality of second assembly positions, one or more likelihoods that each of one or more respective nucleotides is present at the position;

identifying nucleotides at the plurality of second assembly positions based on the second output of the trained deep learning model; and

updating the updated assembly to indicate the identified nucleotides at the plurality of second assembly positions to obtain a second updated assembly.

6. The method of claim 3 or any other preceding claim, further comprising aligning the plurality of nucleotide sequences to the assembly.

7. The method of claim 6 or any other preceding claim, wherein the plurality of nucleotide sequences comprises at least 9 nucleotide sequences.

8. The method of claim 3 or any other preceding claim, wherein generating the first input of the trained deep learning model comprises:

selecting the plurality of first assembly locations; and

generating the first input based on the selected plurality of first assembly locations.

9. The method of claim 8 or any other preceding claim, wherein selecting the plurality of first assembly locations comprises:

determining, for each of the plurality of first assembly locations, a likelihood that the assembly incorrectly indicates a nucleotide at the location; and

selecting the plurality of first assembly locations using the determined likelihoods.

10. The method of claim 3 or any other preceding claim, wherein generating a first input to be provided to the trained deep learning model comprises: comparing a corresponding nucleotide sequence of the plurality of nucleotide sequences to the assembly.

11. The method of claim 3 or any other preceding claim, wherein generating a first input to be provided to the trained deep learning model to identify a nucleotide at a first assembly position of the plurality of first assembly positions comprises:

for each of a plurality of nucleotides located at each of one or more assembly positions in the neighborhood of the first assembly position:

determining a count indicative of a number of nucleotide sequences, of the plurality of nucleotide sequences, that indicate the nucleotide at the position;

determining a reference value based on whether the assembly indicates the nucleotide at the position;

determining an error value indicative of a difference between the count and the reference value; and

including the reference value and the error value in the first input.

12. The method of claim 11 or any other preceding claim, wherein determining a reference value based on whether the assembly indicates the nucleotide at the location comprises:

determining the reference value as a first value when the assembly indicates the nucleotide at the position; and

determining the reference value as a second value when the assembly does not indicate the nucleotide at the location.

13. The method of claim 12 or any other preceding claim, wherein:

the first value is the number of the plurality of nucleotide sequences; and

the second value is 0.

14. The method of claim 11 or any other preceding claim, wherein generating the first input to be provided to the trained deep learning model comprises arranging values into a data structure having columns, wherein:

a first column holds reference values and error values determined for the plurality of nucleotides at the first assembly position; and

a second column holds reference values and error values for the plurality of nucleotides at a second assembly position of the one or more assembly positions in the neighborhood of the first assembly position.

15. The method of claim 11 or any other preceding claim, wherein the one or more assembly positions in the neighborhood of the first assembly position comprise at least two assembly positions separate from the first assembly position.

16. The method of claim 3 or any other preceding claim, wherein the one or more likelihoods that each of one or more respective biopolymers is present at the assembly location comprise, for each of a plurality of nucleotides, a likelihood that the nucleotide is present at the assembly location; and

identifying biopolymers at the plurality of first assembly locations comprises: identifying a nucleotide at a first assembly location of the plurality of first assembly locations as a first nucleotide of the plurality of nucleotides by determining that the likelihood that the first nucleotide is present at the first assembly location is greater than the likelihood that a second nucleotide of the plurality of nucleotides is present at the first assembly location.

17. The method of claim 3 or any other preceding claim, further comprising generating the assembly from the plurality of nucleotide sequences.

18. The method of claim 17 or any other preceding claim, wherein generating the assembly from the plurality of nucleotide sequences comprises: determining a consensus sequence from the plurality of nucleotide sequences as the assembly.

19. The method of claim 17 or any other preceding claim, wherein generating the assembly from the plurality of nucleotide sequences comprises: applying an overlap-and-extend algorithm to the plurality of nucleotide sequences.

20. The method of claim 1 or any other preceding claim, further comprising:

accessing training data comprising biopolymer sequences obtained from sequencing a reference macromolecule and a predetermined assembly of the reference macromolecule; and

training a deep learning model using the training data to obtain the trained deep learning model.

21. The method of claim 20 or any other preceding claim, wherein the reference macromolecule is different from the macromolecule.

22. The method of claim 1 or any other preceding claim, wherein the deep learning model comprises a convolutional neural network.

23. A system for generating a macromolecular biopolymer assembly, the system comprising:

at least one computer hardware processor; and

at least one non-transitory computer-readable storage medium storing instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to:

access a plurality of biopolymer sequences and an assembly, the assembly indicating biopolymers present at respective assembly locations;

generate a first input to be provided to a trained deep learning model using the plurality of biopolymer sequences and the assembly;

provide the first input to the trained deep learning model to obtain a corresponding first output, wherein the first output indicates, for each assembly location of a plurality of first assembly locations, one or more likelihoods that each of one or more respective biopolymers is present at the location;

identify biopolymers at the plurality of first assembly locations using the first output of the trained deep learning model; and

update the assembly to indicate the identified biopolymers at the plurality of first assembly locations to obtain an updated assembly.

24. The system of claim 23, wherein the macromolecule comprises a protein, the plurality of biopolymer sequences comprises a plurality of amino acid sequences, and the assembly indicates amino acids at corresponding assembly positions.

25. The system of claim 23 or any other preceding claim, wherein the macromolecule comprises a nucleic acid, the plurality of biopolymer sequences comprises a plurality of nucleotide sequences, and the assembly indicates nucleotides at respective assembly positions.

26. The system of claim 25 or any other preceding claim, wherein:

the assembly indicates a first nucleotide at a first assembly position of the plurality of first assembly positions;

identifying biopolymers at the plurality of first assembly positions comprises identifying a second nucleotide at the first assembly position; and

updating the assembly comprises updating the assembly to indicate the second nucleotide at the first assembly position.

27. The system of claim 25 or any other preceding claim, wherein, after updating the assembly to obtain the updated assembly, the instructions further cause the at least one computer hardware processor to:

align the plurality of nucleotide sequences to the updated assembly;

generate a second input to be provided to the trained deep learning model using the plurality of nucleotide sequences and the updated assembly;

provide the second input to the trained deep learning model to obtain a respective second output, wherein the second output indicates, for each assembly position of a plurality of second assembly positions, one or more likelihoods that each of one or more respective nucleotides is present at the position;

identify nucleotides at the plurality of second assembly positions based on the second output of the trained deep learning model; and

update the updated assembly to indicate the identified nucleotides at the plurality of second assembly positions to obtain a second updated assembly.

28. The system of claim 25 or any other preceding claim, wherein the instructions further cause the at least one computer hardware processor to align the plurality of nucleotide sequences to the assembly.

29. The system of claim 28 or any other preceding claim, wherein the plurality of nucleotide sequences comprises at least 9 nucleotide sequences.

30. The system of claim 25 or any other preceding claim, wherein generating the first input of the trained deep learning model comprises:

selecting the plurality of first assembly locations; and

generating the first input based on the selected plurality of first assembly locations.

31. The system of claim 30 or any other preceding claim, wherein selecting the plurality of first assembly locations comprises:

determining, for each of the plurality of first assembly locations, a likelihood that the assembly incorrectly indicates a nucleotide at the location; and

selecting the plurality of first assembly locations using the determined likelihoods.

32. The system of claim 25 or any other preceding claim, wherein generating the first input to be provided to the trained deep learning model comprises: comparing a corresponding nucleotide sequence of the plurality of nucleotide sequences to the assembly.

33. The system of claim 25 or any other preceding claim, wherein generating a first input to be provided to the trained deep learning model to identify a nucleotide at a first one of the plurality of first assembly locations comprises:

for each of a plurality of nucleotides at each of one or more assembly positions in the neighborhood of the first assembly position:

determining a count indicative of a number of nucleotide sequences, of the plurality of nucleotide sequences, that indicate the nucleotide at the position;

determining a reference value based on whether the assembly indicates the nucleotide at the position;

determining an error value indicative of a difference between the count and the reference value; and

including the reference value and the error value in the first input.

34. The system of claim 33 or any other preceding claim, wherein determining a reference value based on whether the assembly indicates the nucleotide at the location comprises:

determining the reference value as a first value when the assembly indicates the nucleotide at the position; and

determining the reference value as a second value when the assembly does not indicate the nucleotide at the location.

35. The system of claim 34 or any other preceding claim, wherein:

the first value is the number of the plurality of nucleotide sequences; and

the second value is 0.

36. The system of claim 33 or any other preceding claim, wherein generating the first input to be provided to the trained deep learning model comprises arranging values into a data structure having columns, wherein:

a first column holds reference values and error values determined for the plurality of nucleotides at the first assembly position; and

a second column holds reference values and error values for the plurality of nucleotides at a second assembly position of the one or more assembly positions in the neighborhood of the first assembly position.

37. The system of claim 33 or any other preceding claim, wherein the one or more assembly positions in the neighborhood of the first assembly position comprise at least two assembly positions separate from the first assembly position.

38. The system of claim 25 or any other preceding claim, wherein the one or more likelihoods that each of one or more respective biopolymers is present at the assembly location comprise, for each of a plurality of nucleotides, a likelihood that the nucleotide is present at the assembly location; and

identifying biopolymers at the plurality of first assembly locations comprises: identifying a nucleotide at a first assembly location of the plurality of first assembly locations as a first nucleotide of the plurality of nucleotides by determining that the likelihood that the first nucleotide is present at the first assembly location is greater than the likelihood that a second nucleotide of the plurality of nucleotides is present at the first assembly location.

39. The system of claim 25 or any other preceding claim, wherein the instructions further cause the at least one computer hardware processor to generate the assembly from the plurality of nucleotide sequences.

40. The system of claim 39 or any other preceding claim, wherein generating the assembly from the plurality of nucleotide sequences comprises: determining a consensus sequence from the plurality of nucleotide sequences as the assembly.

41. The system of claim 39 or any other preceding claim, wherein generating the assembly from the plurality of nucleotide sequences comprises: applying an overlap-and-extend algorithm to the plurality of nucleotide sequences.

42. The system of claim 23 or any other preceding claim, wherein the instructions further cause the at least one computer hardware processor to:

access training data comprising biopolymer sequences obtained from sequencing a reference macromolecule and a predetermined assembly of the reference macromolecule; and

train a deep learning model using the training data to obtain the trained deep learning model.

43. The system of claim 42 or any other preceding claim, wherein the reference macromolecule is different from the macromolecule.

44. The system of claim 23 or any other preceding claim, wherein the deep learning model comprises a convolutional neural network.

45. At least one non-transitory computer-readable storage medium storing instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform a method of generating a macromolecular biopolymer assembly, the method comprising:

accessing a plurality of biopolymer sequences and an assembly, the assembly indicating biopolymers present at respective assembly locations;

generating a first input to be provided to a trained deep learning model using the plurality of biopolymer sequences and the assembly;

providing the first input to the trained deep learning model to obtain a corresponding first output, wherein the first output indicates, for each assembly location of a plurality of first assembly locations, one or more likelihoods that each of one or more respective biopolymers is present at the location;

identifying biopolymers at the plurality of first assembly locations using the first output of the trained deep learning model; and

updating the assembly to indicate the identified biopolymers at the plurality of first assembly locations to obtain an updated assembly.

46. The at least one non-transitory computer-readable storage medium of claim 45, wherein the macromolecule comprises a protein, the plurality of biopolymer sequences comprises a plurality of amino acid sequences, and the assembly indicates amino acids at corresponding assembly positions.

47. The at least one non-transitory computer-readable storage medium of claim 45 or any other preceding claim, wherein the macromolecule comprises a nucleic acid, the plurality of biopolymer sequences comprises a plurality of nucleotide sequences, and the assembly indicates nucleotides at respective assembly positions.

48. The at least one non-transitory computer-readable storage medium of claim 47 or any other preceding claim, wherein:

the assembly indicates a first nucleotide at a first assembly position of the plurality of first assembly positions;

identifying biopolymers at the plurality of first assembly positions comprises identifying a second nucleotide at the first assembly position; and

updating the assembly comprises updating the assembly to indicate the second nucleotide at the first assembly position.

49. The at least one non-transitory computer-readable storage medium of claim 47 or any other preceding claim, wherein, after updating the assembly to obtain an updated assembly, the method further comprises:

aligning the plurality of nucleotide sequences to the updated assembly;

generating a second input to be provided to the trained deep learning model using the plurality of nucleotide sequences and the updated assembly;

providing the second input to the trained deep learning model to obtain a respective second output, wherein the second output indicates, for each assembly position of a plurality of second assembly positions, one or more likelihoods that each of one or more respective nucleotides is present at the position;

identifying nucleotides at the plurality of second assembly positions based on the second output of the trained deep learning model; and

updating the updated assembly to indicate the identified nucleotides at the plurality of second assembly positions to obtain a second updated assembly.

50. The at least one non-transitory computer-readable storage medium of claim 47 or any other preceding claim, wherein the method further comprises aligning the plurality of nucleotide sequences to the assembly.

51. The at least one non-transitory computer-readable storage medium of claim 50 or any other preceding claim, wherein the plurality of nucleotide sequences comprises at least 9 nucleotide sequences.

52. The at least one non-transitory computer-readable storage medium of claim 47 or any other preceding claim, wherein generating the first input of the trained deep learning model comprises:

selecting the plurality of first assembly locations; and

generating the first input based on the selected plurality of first assembly locations.

53. The at least one non-transitory computer-readable storage medium of claim 52 or any other preceding claim, wherein selecting the plurality of first assembly locations comprises:

determining, for each of the plurality of first assembly locations, a likelihood that the assembly incorrectly indicates a nucleotide at the location; and

selecting the plurality of first assembly locations using the determined likelihoods.

54. The at least one non-transitory computer-readable storage medium of claim 47 or any other preceding claim, wherein generating the first input to be provided to the trained deep learning model comprises: comparing a corresponding nucleotide sequence of the plurality of nucleotide sequences to the assembly.

55. The at least one non-transitory computer-readable storage medium of claim 47 or any other preceding claim, wherein generating a first input to be provided to the trained deep learning model to identify a nucleotide at a first assembly location of the plurality of first assembly locations comprises:

for each of a plurality of nucleotides at each of one or more assembly positions in the neighborhood of the first assembly position:

determining a count indicative of a number of nucleotide sequences, of the plurality of nucleotide sequences, that indicate the nucleotide at the position;

determining a reference value based on whether the assembly indicates the nucleotide at the position;

determining an error value indicative of a difference between the count and the reference value; and

including the reference value and the error value in the first input.

56. The at least one non-transitory computer-readable storage medium of claim 55 or any other preceding claim, wherein determining a reference value based on whether the assembly indicates the nucleotide at the location comprises:

determining the reference value as a first value when the assembly indicates the nucleotide at the position; and

determining the reference value as a second value when the assembly does not indicate the nucleotide at the location.

57. The at least one non-transitory computer-readable storage medium of claim 56 or any other preceding claim, wherein:

the first value is the number of the plurality of nucleotide sequences; and

the second value is 0.

58. The at least one non-transitory computer-readable storage medium of claim 55 or any other preceding claim, wherein generating the first input to be provided to the trained deep learning model comprises arranging values into a data structure having columns, wherein:

a first column holds reference values and error values determined for the plurality of nucleotides at the first assembly position; and

a second column holds reference values and error values for a plurality of nucleotides at a second one of the one or more assembly positions in the neighborhood of the first assembly position.

59. The at least one non-transitory computer-readable storage medium of claim 55 or any other preceding claim, wherein the one or more assembly positions in the neighborhood of the first assembly position comprise at least two assembly positions separate from the first assembly position.

60. The at least one non-transitory computer-readable storage medium of claim 47 or any other preceding claim, wherein the one or more likelihoods that each of one or more respective biopolymers is present at the assembly location comprise, for each of a plurality of nucleotides, a likelihood that the nucleotide is present at the assembly location; and

identifying biopolymers at the plurality of first assembly locations comprises: identifying a nucleotide at a first assembly location of the plurality of first assembly locations as a first nucleotide of the plurality of nucleotides by determining that the likelihood that the first nucleotide is present at the first assembly location is greater than the likelihood that a second nucleotide of the plurality of nucleotides is present at the first assembly location.

61. The at least one non-transitory computer-readable storage medium of claim 47 or any other preceding claim, wherein the method further comprises: generating the assembly from the plurality of nucleotide sequences.

62. The at least one non-transitory computer-readable storage medium of claim 61 or any other preceding claim, wherein generating the assembly from the plurality of nucleotide sequences comprises: determining a consensus sequence from the plurality of nucleotide sequences as the assembly.

63. The at least one non-transitory computer-readable storage medium of claim 61 or any other preceding claim, wherein generating the assembly from the plurality of nucleotide sequences comprises: applying an overlap-and-extend algorithm to the plurality of nucleotide sequences.

64. The at least one non-transitory computer-readable storage medium of claim 45 or any other preceding claim, wherein the method further comprises:

accessing training data comprising biopolymer sequences obtained from sequencing a reference macromolecule and a predetermined assembly of the reference macromolecule; and

training a deep learning model using the training data to obtain the trained deep learning model.

65. The at least one non-transitory computer-readable storage medium of claim 64 or any other preceding claim, wherein the reference macromolecule is different from the macromolecule.

66. The at least one non-transitory computer-readable storage medium of claim 45 or any other preceding claim, wherein the deep learning model comprises a convolutional neural network.

Background

The present disclosure relates to generating biopolymer assemblies (e.g., genomic assemblies, nucleotide sequences, or protein sequences) of macromolecules (e.g., nucleic acids or proteins). A sequencing device can generate sequencing data that can be used to generate an assembly. As one example, the sequencing data may comprise nucleotide sequences of deoxyribonucleic acid (DNA) from a biological sample, which may be used to assemble (in whole or in part) a genome. As another example, the sequencing data may comprise amino acid sequences that can be used to assemble (in whole or in part) a protein sequence.

Disclosure of Invention

According to one aspect, a method of generating a macromolecular biopolymer assembly is provided. The method comprises performing, using at least one computer hardware processor: accessing a plurality of biopolymer sequences and an assembly, the assembly indicating biopolymers present at respective assembly locations; generating a first input to be provided to a trained deep learning model using the plurality of biopolymer sequences and the assembly; providing the first input to the trained deep learning model to obtain a corresponding first output, the first output indicating, for each of a plurality of first assembly locations, one or more likelihoods that each of one or more respective biopolymers is present at that location; identifying biopolymers at the plurality of first assembly locations using the first output of the trained deep learning model; and updating the assembly to indicate the identified biopolymers at the plurality of first assembly locations to obtain an updated assembly.
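This aspect can be sketched as a short Python polishing step. The disclosure does not specify a model interface, so every name below (`polish_once`, `majority_model`, and the rest) is a hypothetical stand-in; a simple per-position vote over the reads plays the role of the trained deep learning model's likelihood output:

```python
from typing import Callable, Dict, List

Sequence = str          # e.g. "ACGT..."
Assembly = List[str]    # one nucleotide per assembly position
Model = Callable[[List[Sequence], Assembly, int], Dict[str, float]]

def polish_once(assembly: Assembly, sequences: List[Sequence],
                model: Model, positions: List[int]) -> Assembly:
    """One round of the claimed method: for each selected assembly
    position, obtain per-nucleotide likelihoods (the "first output"),
    keep the most likely nucleotide, and update the assembly."""
    updated = list(assembly)
    for pos in positions:
        likelihoods = model(sequences, assembly, pos)
        updated[pos] = max(likelihoods, key=likelihoods.get)
    return updated

def majority_model(sequences: List[Sequence], assembly: Assembly,
                   pos: int) -> Dict[str, float]:
    """Toy stand-in for the trained model: likelihoods proportional to
    how often each nucleotide is observed at the position."""
    counts = {n: 0.0 for n in "ACGT"}
    for seq in sequences:
        if pos < len(seq):
            counts[seq[pos]] += 1.0
    total = sum(counts.values()) or 1.0
    return {n: c / total for n, c in counts.items()}

# Draft assembly with a likely error at position 2:
reads = ["ACGT", "ACGT", "ACCT"]
updated = polish_once(list("ACCT"), reads, majority_model, positions=[2])
```

Here the draft `C` at position 2 is replaced by `G`, the nucleotide the stand-in model deems most likely; a trained network would simply supply better likelihoods through the same interface.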

According to one embodiment, the macromolecule comprises a protein, the plurality of biopolymer sequences comprises a plurality of amino acid sequences, and the assembly indicates the amino acids at corresponding assembly positions.

According to one embodiment, the macromolecule comprises a nucleic acid, the plurality of biopolymer sequences comprises a plurality of nucleotide sequences, and the assembly indicates the nucleotides at the respective assembly positions.

According to one embodiment, the assembly indicates a first nucleotide at a first assembly position of the plurality of first assembly positions; identifying biopolymers at the plurality of first assembly positions includes identifying a second nucleotide at the first assembly position; and updating the assembly includes updating the assembly to indicate the second nucleotide at the first assembly position.

According to one embodiment, the method further comprises, after updating the assembly to obtain an updated assembly: aligning the plurality of nucleotide sequences to the updated assembly; generating a second input to be provided to the trained deep learning model using the plurality of nucleotide sequences and the updated assembly; providing a second input to the trained deep learning model to obtain a respective second output, the second output indicating, for each assembly position of the plurality of second assembly positions, one or more likelihoods that each of the one or more respective nucleotides is present at that position; identifying nucleotides at a plurality of second assembly locations based on a second output of the trained deep learning model; and updating the updated assembly to indicate the identified nucleotides at a plurality of second assembly positions to obtain a second updated assembly.
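The iterative embodiment above (update, re-align, generate a second input, update again) can be sketched as repeated passes until the assembly stabilizes. This is an illustrative sketch only: the re-alignment step is elided, and a majority vote again stands in for the trained model:

```python
from typing import List

def polish_round(assembly: List[str], reads: List[str]) -> List[str]:
    """One full pass: at every position, keep the nucleotide supported
    by the most reads (a stand-in for querying the trained model)."""
    out = []
    for pos, base in enumerate(assembly):
        counts = {}
        for read in reads:
            if pos < len(read):
                counts[read[pos]] = counts.get(read[pos], 0) + 1
        out.append(max(counts, key=counts.get) if counts else base)
    return out

def polish_until_stable(assembly: List[str], reads: List[str],
                        max_rounds: int = 5) -> List[str]:
    """Repeat the round (first output, second output, ...) until the
    assembly stops changing; a real pipeline would re-align the reads
    to the updated assembly between rounds."""
    current = list(assembly)
    for _ in range(max_rounds):
        updated = polish_round(current, reads)
        if updated == current:
            break
        current = updated
    return current
```

Stopping when two consecutive assemblies agree mirrors producing the "second updated assembly" only while the second output still changes positions.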

According to one embodiment, the method further comprises aligning the plurality of nucleotide sequences to an assembly. According to one embodiment, the plurality of nucleotide sequences comprises at least 5 nucleotide sequences. According to one embodiment, the plurality of nucleotide sequences comprises at least 9 nucleotide sequences. According to one embodiment, the plurality of nucleotide sequences comprises at least 10 nucleotide sequences.

According to one embodiment, generating the first input of the trained deep learning model comprises: selecting the plurality of first assembly locations; and generating the first input based on the selected plurality of first assembly locations. According to one embodiment, selecting the plurality of first assembly locations comprises: determining, for each of the plurality of first assembly locations, a likelihood that the assembly incorrectly indicates a nucleotide at the location; and selecting the plurality of first assembly locations using the determined likelihoods.
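The embodiment leaves the error-likelihood estimator open. One plausible sketch (all names hypothetical) scores each position by the fraction of covering reads that disagree with the draft nucleotide, and selects positions above a threshold so the model is only queried where an error is likely:

```python
from typing import List

def error_likelihoods(assembly: List[str], reads: List[str]) -> List[float]:
    """Per-position likelihood that the draft assembly is wrong, here
    estimated as the fraction of covering reads that disagree with it."""
    out = []
    for pos, base in enumerate(assembly):
        covering = [r[pos] for r in reads if pos < len(r)]
        if not covering:
            out.append(0.0)
        else:
            out.append(sum(b != base for b in covering) / len(covering))
    return out

def select_positions(assembly: List[str], reads: List[str],
                     threshold: float = 0.25) -> List[int]:
    """Keep only positions whose estimated error likelihood exceeds the
    threshold; these become the plurality of first assembly positions."""
    scores = error_likelihoods(assembly, reads)
    return [i for i, p in enumerate(scores) if p > threshold]
```

The threshold value is an assumption; any estimator that ranks positions by suspected error would fit the claim language equally well.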

According to one embodiment, generating the first input to be provided to the trained deep learning model comprises comparing respective nucleotide sequences of the plurality of nucleotide sequences to the assembly. According to one embodiment, generating the first input to be provided to the trained deep learning model to identify a nucleotide at a first one of the plurality of first assembly positions comprises, for each of a plurality of nucleotides at each of one or more assembly positions in a neighborhood of the first assembly position: determining a count indicating the number of the plurality of nucleotide sequences that indicate the nucleotide at the position; determining a reference value based on whether the assembly indicates the nucleotide at the position; determining an error value indicative of the difference between the count and the reference value; and including the reference value and the error value in the first input.

According to one embodiment, determining the reference value based on whether the assembly indicates the nucleotide at the position comprises: determining the reference value as a first value when the assembly indicates the nucleotide at the position; and determining the reference value as a second value when the assembly does not indicate the nucleotide at the position. According to one embodiment, the first value is the number of the plurality of nucleotide sequences; the second value is 0.

According to one embodiment, generating a first input to be provided to the trained deep learning model comprises arranging values into a data structure having columns, wherein: a first column holding reference and error values determined for a plurality of nucleotides at a first assembly position; the second column holds reference values and error values for a plurality of nucleotides at a second assembly position of the one or more assembly positions in the neighborhood of the first assembly position. According to one embodiment, the one or more assembly locations in the vicinity of the first assembly location comprise at least two assembly locations separate from the first assembly location.

According to one embodiment, the one or more likelihoods that each of the one or more respective biopolymers is present at the assembly position comprise, for each of a plurality of nucleotides, a likelihood that the nucleotide is present at the assembly position; and identifying the biopolymer at the plurality of first assembly positions comprises: identifying the nucleotide at the first of the plurality of first assembly positions as a first nucleotide of the plurality of nucleotides by determining that the likelihood that the first nucleotide is present at the first assembly position is greater than the likelihood that a second nucleotide of the plurality of nucleotides is present at the first assembly position.

According to one embodiment, the method further comprises generating the assembly from the plurality of nucleotide sequences. According to one embodiment, generating the assembly from the plurality of nucleotide sequences comprises determining a consensus sequence from the plurality of nucleotide sequences as the assembly. According to one embodiment, generating the assembly from the plurality of nucleotide sequences comprises applying an overlap-layout-consensus (OLC) algorithm to the plurality of nucleotide sequences.

According to one embodiment, the method further comprises: accessing training data comprising biopolymer sequences obtained from sequencing a reference macromolecule and a predetermined assembly of the reference macromolecule; and training the deep learning model using the training data to obtain the trained deep learning model. According to one embodiment, the reference macromolecule is different from the macromolecule. According to one embodiment, the deep learning model includes a Convolutional Neural Network (CNN).

According to another aspect, a system for generating a biopolymer assembly of a macromolecule is provided. The system comprises: at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: accessing a plurality of biopolymer sequences and an assembly indicating biopolymers present at respective assembly locations; generating a first input to be provided to the trained deep learning model using the plurality of biopolymer sequences and the assembly; providing the first input to the trained deep learning model to obtain a respective first output indicating, for each of a plurality of first assembly locations, one or more likelihoods that each of one or more respective biopolymers is present at that location; identifying a biopolymer at the plurality of first assembly locations using the first output of the trained deep learning model; and updating the assembly to indicate the identified biopolymers at the plurality of first assembly locations to obtain an updated assembly.

According to one embodiment, the macromolecule comprises a protein, the plurality of biopolymer sequences comprises a plurality of amino acid sequences, and the assembly indicates the amino acids at corresponding assembly positions.

According to one embodiment, the macromolecule comprises a nucleic acid, the plurality of biopolymer sequences comprises a plurality of nucleotide sequences, and the assembly indicates the nucleotides at the respective assembly positions.

According to one embodiment, the assembly indicates a first nucleotide at a first one of the plurality of first assembly positions; identifying the biopolymer at a plurality of first assembly positions includes identifying a second nucleotide at the first assembly position; and updating the assembly includes updating the assembly to indicate the second nucleotide at the first assembly position.

According to one embodiment, the instructions further cause the at least one computer hardware processor to, after updating the assembly to obtain an updated assembly: aligning the plurality of nucleotide sequences to the updated assembly; generating a second input to be provided to the trained deep learning model using the plurality of nucleotide sequences and the updated assembly; providing a second input to the trained deep learning model to obtain a respective second output indicative of, for each assembly position of the plurality of second assembly positions, one or more likelihoods that each of the one or more respective nucleotides is present at that position; identifying nucleotides at a plurality of second assembly locations based on a second output of the trained deep learning model; and updating the updated assembly to indicate the identified nucleotides at a plurality of second assembly positions to obtain a second updated assembly.

According to one embodiment, the instructions further cause the at least one computer hardware processor to perform aligning the plurality of nucleotide sequences to the assembly. According to one embodiment, the plurality of nucleotide sequences comprises at least 5 nucleotide sequences. According to one embodiment, the plurality of nucleotide sequences comprises at least 9 nucleotide sequences. According to one embodiment, the plurality of nucleotide sequences comprises at least 10 nucleotide sequences.

According to one embodiment, generating the first input to be provided to the trained deep learning model comprises: selecting a plurality of first assembly locations; and generating the first input based on the selected plurality of first assembly locations. According to one embodiment, selecting the plurality of first assembly locations comprises: determining a likelihood that the assembly erroneously indicates a nucleotide at each of the plurality of first assembly positions; and selecting the plurality of first assembly locations using the determined likelihoods.

According to one embodiment, generating the first input to be provided to the trained deep learning model comprises comparing respective nucleotide sequences of the plurality of nucleotide sequences to the assembly. According to one embodiment, generating the first input to be provided to the trained deep learning model to identify a nucleotide at a first one of the plurality of first assembly positions comprises, for each of a plurality of nucleotides at each of one or more assembly positions in a neighborhood of the first assembly position: determining a count indicating the number of the plurality of nucleotide sequences that indicate the nucleotide at the position; determining a reference value based on whether the assembly indicates the nucleotide at the position; determining an error value indicative of the difference between the count and the reference value; and including the reference value and the error value in the first input. According to one embodiment, determining the reference value based on whether the assembly indicates the nucleotide at the position comprises: determining the reference value as a first value when the assembly indicates the nucleotide at the position; and determining the reference value as a second value when the assembly does not indicate the nucleotide at the position. According to one embodiment, the first value is the number of the plurality of nucleotide sequences; the second value is 0. According to one embodiment, generating the first input to be provided to the trained deep learning model comprises arranging values into a data structure having columns, wherein: a first column holds the reference values and error values determined for the plurality of nucleotides at the first assembly position; and a second column holds the reference values and error values determined for the plurality of nucleotides at a second assembly position of the one or more assembly positions in the neighborhood of the first assembly position.
According to one embodiment, the one or more assembly locations in the vicinity of the first assembly location comprise at least two assembly locations separate from the first assembly location.

According to one embodiment, the one or more likelihoods that each of the one or more respective biopolymers is present at the assembly position comprise, for each of a plurality of nucleotides, a likelihood that the nucleotide is present at the assembly position; and identifying the biopolymer at the plurality of first assembly positions comprises: identifying the nucleotide at the first of the plurality of first assembly positions as a first nucleotide of the plurality of nucleotides by determining that the likelihood that the first nucleotide is present at the first assembly position is greater than the likelihood that a second nucleotide of the plurality of nucleotides is present at the first assembly position.

According to one embodiment, the instructions further cause the at least one computer hardware processor to perform generating the assembly from the plurality of nucleotide sequences. According to one embodiment, generating the assembly from the plurality of nucleotide sequences comprises determining a consensus sequence from the plurality of nucleotide sequences as the assembly. According to one embodiment, generating the assembly from the plurality of nucleotide sequences comprises applying an overlap-layout-consensus (OLC) algorithm to the plurality of nucleotide sequences.

According to one embodiment, the instructions further cause the at least one computer hardware processor to perform: accessing training data comprising biopolymer sequences obtained from sequencing a reference macromolecule and a predetermined assembly of the reference macromolecule; and training the deep learning model using the training data to obtain the trained deep learning model. According to one embodiment, the reference macromolecule is different from the macromolecule. According to one embodiment, the deep learning model includes a Convolutional Neural Network (CNN).

According to another aspect, a non-transitory computer-readable storage medium is provided. The non-transitory computer-readable storage medium stores instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform a method of generating a biopolymer assembly of a macromolecule. The method comprises: accessing a plurality of biopolymer sequences and an assembly indicating biopolymers present at respective assembly locations; generating a first input to be provided to the trained deep learning model using the plurality of biopolymer sequences and the assembly; providing the first input to the trained deep learning model to obtain a respective first output indicating, for each of a plurality of first assembly locations, one or more likelihoods that each of one or more respective biopolymers is present at that location; identifying a biopolymer at the plurality of first assembly locations using the first output of the trained deep learning model; and updating the assembly to indicate the identified biopolymers at the plurality of first assembly locations to obtain an updated assembly.

According to one embodiment, the macromolecule comprises a protein, the plurality of biopolymer sequences comprises a plurality of amino acid sequences, and the assembly indicates the amino acids at corresponding assembly positions.

According to one embodiment, the macromolecule comprises a nucleic acid, the plurality of biopolymer sequences comprises a plurality of nucleotide sequences, and the assembly indicates the nucleotides at the respective assembly positions.

According to one embodiment, the assembly indicates a first nucleotide at a first one of the plurality of first assembly positions; identifying the biopolymer at a plurality of first assembly positions includes identifying a second nucleotide at the first assembly position; and updating the assembly includes updating the assembly to indicate the second nucleotide at the first assembly position.

According to one embodiment, the method further comprises, after updating the assembly to obtain an updated assembly: aligning the plurality of nucleotide sequences to the updated assembly; generating a second input to be provided to the trained deep learning model using the plurality of nucleotide sequences and the updated assembly; providing a second input to the trained deep learning model to obtain a respective second output, the second output indicating, for each assembly position of the plurality of second assembly positions, one or more likelihoods that each of the one or more respective nucleotides is present at that position; identifying nucleotides at a plurality of second assembly locations based on a second output of the trained deep learning model; and updating the updated assembly to indicate the identified nucleotides at a plurality of second assembly positions to obtain a second updated assembly.

According to one embodiment, the method further comprises aligning the plurality of nucleotide sequences to the assembly. According to one embodiment, the plurality of nucleotide sequences comprises at least 5 nucleotide sequences. According to one embodiment, the plurality of nucleotide sequences comprises at least 9 nucleotide sequences. According to one embodiment, the plurality of nucleotide sequences comprises at least 10 nucleotide sequences.

According to one embodiment, generating the first input to be provided to the trained deep learning model comprises: selecting a plurality of first assembly locations; and generating the first input based on the selected plurality of first assembly locations. According to one embodiment, selecting the plurality of first assembly locations comprises: determining a likelihood that the assembly erroneously indicates a nucleotide at each of the plurality of first assembly positions; and selecting the plurality of first assembly locations using the determined likelihoods.

According to one embodiment, generating the first input to be provided to the trained deep learning model comprises comparing respective nucleotide sequences of the plurality of nucleotide sequences to the assembly. According to one embodiment, generating the first input to be provided to the trained deep learning model to identify a nucleotide at a first one of the plurality of first assembly positions comprises, for each of a plurality of nucleotides at each of one or more assembly positions in a neighborhood of the first assembly position: determining a count indicating the number of the plurality of nucleotide sequences that indicate the nucleotide at the position; determining a reference value based on whether the assembly indicates the nucleotide at the position; determining an error value indicative of the difference between the count and the reference value; and including the reference value and the error value in the first input. According to one embodiment, determining the reference value based on whether the assembly indicates the nucleotide at the position comprises: determining the reference value as a first value when the assembly indicates the nucleotide at the position; and determining the reference value as a second value when the assembly does not indicate the nucleotide at the position. According to one embodiment, the first value is the number of the plurality of nucleotide sequences; the second value is 0. According to one embodiment, generating the first input to be provided to the trained deep learning model comprises arranging values into a data structure having columns, wherein: a first column holds the reference values and error values determined for the plurality of nucleotides at the first assembly position; and a second column holds the reference values and error values determined for the plurality of nucleotides at a second assembly position of the one or more assembly positions in the neighborhood of the first assembly position.
According to one embodiment, the one or more assembly locations in the vicinity of the first assembly location comprise at least two assembly locations separate from the first assembly location.

According to one embodiment, the one or more likelihoods that each of the one or more respective biopolymers is present at the assembly position comprise, for each of a plurality of nucleotides, a likelihood that the nucleotide is present at the assembly position; and identifying the biopolymer at the plurality of first assembly positions comprises: identifying the nucleotide at the first of the plurality of first assembly positions as a first nucleotide of the plurality of nucleotides by determining that the likelihood that the first nucleotide is present at the first assembly position is greater than the likelihood that a second nucleotide of the plurality of nucleotides is present at the first assembly position.

According to one embodiment, the method further comprises generating the assembly from the plurality of nucleotide sequences. According to one embodiment, generating the assembly from the plurality of nucleotide sequences comprises determining a consensus sequence from the plurality of nucleotide sequences as the assembly. According to one embodiment, generating the assembly from the plurality of nucleotide sequences comprises applying an overlap-layout-consensus (OLC) algorithm to the plurality of nucleotide sequences.

According to one embodiment, the method further comprises: accessing training data comprising biopolymer sequences obtained from sequencing a reference macromolecule and a predetermined assembly of the reference macromolecule; and training the deep learning model using the training data to obtain the trained deep learning model. According to one embodiment, the reference macromolecule is different from the macromolecule. According to one embodiment, the deep learning model includes a Convolutional Neural Network (CNN).

Drawings

Various aspects and embodiments of the present application will be described with reference to the accompanying drawings. It should be understood that the drawings are not necessarily drawn to scale. Items appearing in multiple figures are denoted by the same reference numeral in all the figures in which they appear.

Fig. 1A-1C illustrate systems in which aspects of the technology described herein may be implemented, according to some embodiments of the technology described herein.

Fig. 2A-2D illustrate embodiments of an assembly system in accordance with some embodiments of the technology described herein.

Fig. 3A is an example process 300 for training a machine learning model for generating biopolymer assemblies, in accordance with some embodiments of the techniques described herein.

Fig. 3B is an example process 310 for generating biopolymer assemblies using the machine learning model obtained by the process of fig. 3A, according to some embodiments of the techniques described herein.

Fig. 4A-4C illustrate examples of generating inputs for a machine learning model in accordance with some embodiments of the technology described herein.

Fig. 5 illustrates an example for updating a biopolymer assembly in accordance with some embodiments of the technology described herein.

Fig. 6 shows the structure of an illustrative Convolutional Neural Network (CNN) model for generating biopolymer assemblies, in accordance with some embodiments of the techniques described herein.

Fig. 7 illustrates performance of assembly techniques implemented according to some embodiments of the techniques described herein relative to conventional techniques.

FIG. 8 is a block diagram of an illustrative computing device 800 that may be used to implement some embodiments of the techniques described herein.

Detailed Description

The macromolecule may be a protein or protein fragment, a DNA molecule or fragment (of any type of DNA), or an RNA molecule or fragment (of any type of ribonucleic acid (RNA)). The biopolymer can be an amino acid (e.g., when the macromolecule is a protein or fragment thereof) or a nucleotide (e.g., when the macromolecule is a DNA, RNA, or fragment thereof).

The inventors have developed systems that use machine learning techniques to generate macromolecular biopolymer assemblies. For example, systems developed by the inventors can be configured to employ machine learning techniques to generate genomic assemblies of DNA of an organism. As another example, a system developed by the inventors may be configured to employ machine learning techniques to generate amino acid sequences of proteins.

In some embodiments, the system can access one or more biopolymer sequences (e.g., generated by a sequencing device) and an initial assembly generated from those sequences. The assembly can indicate the biopolymers (e.g., nucleotides, amino acids) present at corresponding assembly positions. The system can correct errors in the biopolymers indicated by the initial assembly by: (1) generating inputs to be provided to a machine learning model using the sequences and the initial assembly; (2) providing the inputs to a trained machine learning model to obtain corresponding outputs; and (3) updating the initial assembly using the outputs obtained from the machine learning model to obtain an updated assembly. The updated assembly may have fewer biopolymer indication errors than the initial assembly.

In some embodiments, an assembly can include multiple positions and indications of biopolymers (e.g., nucleotides or amino acids) at the corresponding positions. As an example, an assembly may be a genomic assembly indicating the nucleotide at each position in the genome of an organism. As another example, the assembly may be a gene sequence that indicates a nucleotide sequence of a portion of the organism's DNA. As another example, the assembly may be an amino acid sequence of a protein (also referred to as a "protein sequence"). The biopolymer may be a nucleotide, an amino acid, or any other type of biopolymer. Biopolymer sequences may also be referred to herein as "sequences" or "reads".

Some conventional biopolymer assembly techniques may utilize sequencing techniques to generate biopolymer sequences for macromolecules (e.g., DNA, RNA, or proteins), and use the generated sequences to generate an assembly of the macromolecule. For example, a sequencing device may generate nucleotide sequences from a DNA sample of an organism, which sequences in turn may be used to generate a genomic assembly of the organism's DNA. As another example, the sequencing device may generate amino acid sequences from a protein sample, which in turn may be used to assemble a longer amino acid sequence of the protein. The computing device may apply an assembly algorithm to the sequences generated by the sequencing device to generate an assembly. For example, the computing device may apply an overlap-layout-consensus (OLC) assembly algorithm to the nucleotide sequences of the DNA sample to generate a genomic assembly of the organism, or a portion thereof.

One sequencing technique for generating nucleotide sequences from nucleic acid samples is second generation sequencing (also referred to as "short-read sequencing"), which generates nucleotide sequences of fewer than 1000 nucleotides (i.e., "short reads"). Sequencing technology has since evolved to third generation sequencing (also referred to as "long-read sequencing"), which generates nucleotide sequences of 1000 or more nucleotides (i.e., "long reads") and can span a greater portion of an assembly than second generation sequencing. However, the inventors have recognized that third generation sequencing is less accurate than second generation sequencing, and therefore assemblies generated from long reads are less accurate than assemblies generated from short reads. The inventors have also recognized that conventional error correction techniques for improving assembly accuracy are computationally intensive and time consuming. Thus, the inventors developed a machine learning technique for correcting errors in assemblies that (1) improves the accuracy of assemblies generated by third generation sequencing and (2) is more efficient than conventional error correction techniques.

Some embodiments described herein address all of the above issues recognized by the inventors with respect to generating an assembly. However, it should be understood that not every embodiment described herein addresses each of these issues. It should also be understood that embodiments of the technology described herein may be used for purposes other than addressing the above-described problems of biopolymer assembly. As one example, embodiments of the techniques described herein can be used to improve the accuracy of protein sequences generated from amino acid sequences. As another example, embodiments of the techniques described herein may be used to improve the accuracy of assemblies generated from short reads.

In some embodiments, the system may be configured to: (1) access an assembly (e.g., generated from a plurality of biopolymer sequences) indicating the biopolymers present at respective assembly positions; (2) generate a first input to be provided to the trained deep learning model using the plurality of biopolymer sequences and the assembly; (3) provide the first input to the trained deep learning model to obtain a respective first output, the first output indicating, for each of a plurality of first assembly positions, one or more likelihoods (e.g., probabilities) that each of one or more respective biopolymers is present at the assembly position; (4) identify a biopolymer at each of the plurality of first assembly positions using the first output of the trained deep learning model; and (5) update the assembly to indicate the identified biopolymers at the plurality of first assembly positions to obtain an updated assembly. In some embodiments, the system can be configured to align the plurality of biopolymer sequences to the assembly.
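Steps (3)-(5) above can be illustrated with a short sketch. The model interface (`predict_proba` returning one likelihood row per queried position) and all names are assumptions for illustration only, not part of the described system:

```python
NUCLEOTIDES = "ACGT"

def correction_pass(model, features, positions, assembly):
    """One correction pass: query the model at the selected positions
    and overwrite the assembly with the most likely nucleotide."""
    probs = model.predict_proba(features)  # one likelihood row per position
    updated = list(assembly)
    for pos, row in zip(positions, probs):
        # Step (4): pick the nucleotide with the greatest likelihood.
        best = max(range(len(NUCLEOTIDES)), key=lambda i: row[i])
        updated[pos] = NUCLEOTIDES[best]
    # Step (5): return the updated assembly.
    return "".join(updated)
```

For example, a model that assigns the highest likelihood to "C" at a queried position would cause that position of the assembly to be rewritten as "C".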

In some embodiments, the macromolecule can be a protein, the plurality of biopolymer sequences can be a plurality of amino acid sequences, and the assembly indicates the amino acids at the corresponding assembly positions. In some embodiments, the macromolecule can be a nucleic acid (e.g., DNA, RNA), the plurality of biopolymer sequences can be nucleotide sequences, and the assembly indicates the nucleotides at the respective assembly positions.

In some embodiments, the assembly indicates a first nucleotide (e.g., adenine) at a first assembly position of the plurality of assembly positions. Identifying the biopolymer at the plurality of first assembly positions includes identifying a second nucleotide (e.g., thymine), different from the first nucleotide, at the first assembly position; and updating the assembly includes updating the assembly to indicate the second nucleotide (e.g., thymine) at the first assembly position.

In some embodiments, the system may be configured to perform multiple iterations of the update. The system may be configured to, after updating the assembly to obtain an updated assembly: (1) aligning the plurality of nucleotide sequences to the updated assembly; (2) generating a second input to be provided to the trained deep learning model using the plurality of nucleotide sequences and the updated assembly; (3) providing a second input to the trained deep learning model to obtain a corresponding second output, the second output indicating, for each assembly location of the plurality of second assembly locations, a likelihood (e.g., a probability) that each of the one or more corresponding nucleotides is present at that location; (4) identifying nucleotides at a plurality of second assembly positions based on a second output of the trained deep learning model; and (5) updating the updated assembly to indicate the identified nucleotides at a plurality of second assembly positions to obtain a second updated assembly.
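The repeated align-and-update cycle described above can be sketched as a simple loop. Here `align` and `correct` are hypothetical placeholder callables standing in for a read aligner and the model-based update pass, and a fixed round count stands in for whatever stopping rule an implementation might use:

```python
def refine(reads, assembly, align, correct, n_rounds=2):
    """Run the align -> correct cycle for a fixed number of rounds.

    `align` re-aligns the reads to the latest assembly; `correct`
    performs one model-based update pass (steps (2)-(5) above).
    Both are illustrative placeholders, not a prescribed API.
    """
    current = assembly
    for _ in range(n_rounds):
        aligned = align(reads, current)      # (1) re-align to latest assembly
        current = correct(aligned, current)  # (2)-(5) one update pass
    return current
```

Any correction step can be plugged in; even a toy per-column majority vote over already-aligned reads converges to the read consensus after one round.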

In some embodiments, the system may be configured to generate the first input of the trained deep learning model by: (1) selecting a plurality of first assembly locations; and (2) generating a first input based on the selected plurality of first assembly locations. In some embodiments, the system may be configured to select the first plurality of assembly locations by: (1) determining a likelihood that assembly at the plurality of first assembly positions falsely indicates a nucleotide; and (2) selecting a plurality of first assembly locations using the determined likelihoods.

In some embodiments, the system may be configured to generate a first input (e.g., to determine a value of one or more features) to be provided to the trained deep learning model by comparing respective nucleotide sequences of the plurality of nucleotide sequences to the assembly. In some embodiments, the system may be configured to generate a first input to identify a nucleotide at a first one of a plurality of first assembly positions by, for each of a plurality of nucleotides at each of one or more assembly positions in a neighborhood of the first assembly position: (1) determining a count indicating the number of the plurality of nucleotide sequences that indicate the nucleotide at the assembly position; (2) determining a reference value based on whether the assembly indicates the nucleotide at the assembly position; (3) determining an error value indicative of the difference between the count and the reference value; and (4) including the reference value and the error value in the first input. In some embodiments, the system may be configured to determine the reference value based on whether the assembly indicates the nucleotide at the assembly position as follows: (1) determining the reference value as a first value (e.g., the number of the plurality of nucleotide sequences) when the assembly indicates the nucleotide at the assembly position; and (2) determining the reference value as a second value (e.g., 0) when the assembly does not indicate the nucleotide at the assembly position. In some embodiments, the system may be configured to use a neighborhood of 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, or 50 positions.
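The count/reference/error feature computation described above can be sketched as follows. The sketch assumes the reads have already been gap-aligned to the assembly (same length, with "-" for gaps); function and parameter names are illustrative:

```python
def position_features(aligned, assembly, pos, window=2, alphabet="ACGT"):
    """Reference/error feature columns for the neighborhood of `pos`.

    For each nucleotide at each position in the neighborhood:
      count     = number of aligned reads indicating that nucleotide there,
      reference = number of reads if the assembly indicates it, else 0,
      error     = count - reference.
    Returns one column per neighborhood position, each holding a
    (reference, error) pair for every nucleotide in the alphabet.
    """
    n_reads = len(aligned)
    columns = []
    for p in range(pos - window, pos + window + 1):
        column = []
        for nt in alphabet:
            count = sum(1 for read in aligned if read[p] == nt)
            reference = n_reads if assembly[p] == nt else 0
            column.append((reference, count - reference))
        columns.append(column)
    return columns
```

Positions where every read agrees with the assembly produce error values of zero; disagreements show up as nonzero error values, which is the signal the model consumes.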

In some embodiments, the system may be configured to generate a first input for identifying a nucleotide at a first assembly location by arranging values into a data structure having rows/columns, wherein: (1) a first row/column holds the reference and error values determined for the plurality of nucleotides at the first assembly location; and (2) a second row/column holds the reference and error values determined for the plurality of nucleotides at a second location in the neighborhood of the first assembly location.
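The row/column arrangement above can be sketched as follows, assuming the (reference, error) pairs have already been computed for each location in the neighborhood. The function name and the interleaved column layout are assumptions for illustration; other layouts (e.g., separate reference and error planes) would serve equally well.

```python
# Hypothetical sketch: one row per neighborhood location; columns
# interleave the reference and error values per nucleotide, in a
# fixed nucleotide order.

def arrange_window(per_position_features, order="ACGT"):
    matrix = []
    for feats in per_position_features:  # one dict per neighborhood location
        row = []
        for nt in order:
            reference, error = feats[nt]
            row.extend([reference, error])
        matrix.append(row)
    return matrix
```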

In some embodiments, the first output includes, for each assembly location and for each of the plurality of nucleotides, a likelihood (e.g., a probability) that the nucleotide is present at that assembly location. The system may be configured to identify the biopolymers at the plurality of first assembly locations in the assembly by identifying the nucleotide at a first one of the plurality of first assembly locations as a first one of the plurality of nucleotides. The system may identify the nucleotide at the first assembly location as the first nucleotide by determining that the likelihood of the first nucleotide being present at the first assembly location is greater than the likelihood of a second one of the plurality of nucleotides being present at the first assembly location.
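The selection rule above reduces to picking the nucleotide with the greatest model likelihood at the location. A minimal sketch, with a hypothetical dictionary output format:

```python
# Minimal sketch: the model output for one location is assumed to be a
# {nucleotide: probability} mapping; the identified nucleotide is the
# one with the greatest likelihood.

def identify_nucleotide(likelihoods):
    return max(likelihoods, key=likelihoods.get)
```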

In some embodiments, the system can be configured to generate an assembly (e.g., an initial assembly) from a plurality of nucleotide sequences. In some embodiments, the system can be configured to generate the assembly by determining a consensus sequence from the plurality of nucleotide sequences (e.g., by majority voting). In some embodiments, the system may be configured to generate the assembly from the plurality of nucleotide sequences by applying an overlap-layout-consensus (OLC) algorithm to the plurality of nucleotide sequences. In some embodiments, the system may be configured to: (1) access training data comprising biopolymer sequences obtained from sequencing a reference macromolecule and a predetermined biopolymer assembly of the reference macromolecule; and (2) train a deep learning model (e.g., a convolutional neural network or a recurrent neural network) using the training data to obtain the trained deep learning model. In some embodiments, the reference macromolecule used to train the deep learning model may be different from the macromolecule for which the assembly is being generated.
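The majority-voting consensus mentioned above can be sketched as follows, under the simplifying assumption that the reads are already aligned and of equal length (a real assembler would also handle alignment, gaps, and varying coverage):

```python
# Illustrative majority-vote consensus over pre-aligned, equal-length
# reads; an assumption-laden simplification of initial assembly.
from collections import Counter

def majority_consensus(aligned_reads):
    length = len(aligned_reads[0])
    consensus = []
    for pos in range(length):
        votes = Counter(read[pos] for read in aligned_reads)
        consensus.append(votes.most_common(1)[0][0])  # plurality winner
    return "".join(consensus)
```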

It should be appreciated that the techniques introduced above and discussed in more detail below may be implemented in any of a variety of ways, as the techniques are not limited to any particular implementation. Examples of implementation details provided herein are for illustrative purposes only. Further, the techniques disclosed herein may be used alone or in any suitable combination, as aspects of the techniques described herein are not limited to use with any particular technique or combination of techniques.

FIG. 1A illustrates a system 100 in which aspects of the technology described herein may be implemented. The system 100 includes one or more sequencing apparatuses 102, an assembly system 104, a model training system 106, and a data storage device 108A, each of which is connected to a network 111.

In some embodiments, the sequencing device 102 may be configured to generate sequencing data by sequencing one or more sample specimens 110 of macromolecules. For example, the sample specimen 110 may be a biological sample containing nucleic acids (e.g., DNA and/or RNA) or proteins (e.g., peptides). The sequencing data may include a biopolymer sequence of the sample specimen 110. The biopolymer sequence may be represented as a sequence of alphanumeric symbols indicating the order and location of the biopolymers present in the macromolecular sample. In some embodiments, the biopolymer sequence can be a nucleotide sequence generated by sequencing a biological sample. As an example, the nucleotide sequence can use: (1) "A" represents adenine; (2) "C" represents cytosine; (3) "G" represents guanine; (4) "T" represents thymine; (5) "U" represents uracil; and (6) '-' represents that no nucleotide is present at a position in the sequence. In some embodiments, the biopolymer sequence can be an amino acid sequence generated by sequencing a protein sample (e.g., a peptide). By way of example, the amino acid sequence may be an alphanumeric sequence that uses different alphanumeric characters to represent corresponding different amino acids that may be present in the protein.
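Symbol sequences like those above are commonly converted to a numeric form before being consumed by downstream models. The one-hot encoding below is one common illustrative choice, not one mandated by the text; the alphabet includes the gap symbol listed above.

```python
# One-hot encoding of an alphanumeric biopolymer sequence; the alphabet
# (including the "-" gap symbol) follows the symbols described above.

ALPHABET = "ACGTU-"

def one_hot(sequence, alphabet=ALPHABET):
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    vectors = []
    for symbol in sequence:
        vec = [0] * len(alphabet)   # all zeros ...
        vec[index[symbol]] = 1      # ... except a 1 at the symbol's slot
        vectors.append(vec)
    return vectors
```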

In some embodiments, the sequencing device 102 can be configured to generate nucleotide sequences by sequencing a nucleic acid sample (e.g., a DNA sample). In some embodiments, the sequencing device 102 can be configured to sequence a nucleic acid sample by synthesis. The sequencing device 102 can be configured to identify a nucleotide when the nucleotide is incorporated into a newly synthesized strand of nucleic acid that is complementary to the nucleic acid being sequenced. During sequencing, a polymerase (e.g., a DNA polymerase) can bind to (e.g., attach to) a priming location of a target nucleic acid molecule (referred to as a "primer"), and incorporate nucleotides into the primer by the action of the polymerase. The sequencing device 102 can be configured to detect each nucleotide incorporated. In some embodiments, the nucleotides can be associated with respective luminescent molecules (e.g., fluorophores) that emit light in response to excitation. The luminescent molecule may be excited when a corresponding nucleotide associated with the luminescent molecule is incorporated. The sequencing device 102 may include one or more sensors for detecting light emission. Each type of nucleotide may be associated with a corresponding type of luminescent molecule. The sequencing device 102 can identify the incorporated nucleotide by identifying the type of luminescent molecule based on the detected light emission. For example, the sequencing device 102 can use light emission intensity, lifetime, wavelength, or other properties to distinguish between different luminescent molecules. In some embodiments, the sequencing device 102 can be configured to detect an electrical signal generated during nucleotide incorporation to identify the incorporated nucleotide. The sequencing device 102 can include a sensor for detecting electrical signals and using these electrical signals to identify the incorporated nucleotides.

In some embodiments, the sequencing device 102 can be configured to sequence nucleic acids using different techniques than those described herein. Some embodiments are not limited to any particular nucleic acid sequencing technique described herein.

In some embodiments, the sequencing device 102 can be configured to generate amino acid sequences by sequencing a protein sample (e.g., a peptide). In some embodiments, the sequencing device 102 can be configured to sequence a protein sample using reagents that selectively bind to corresponding amino acids. A reagent may selectively bind to one or more types of amino acids but not to others. In some embodiments, the reagents may be associated with respective luminescent molecules. A luminescent molecule may be excited in response to an interaction between its associated reagent and an amino acid. In some embodiments, the sequencing device 102 can be configured to identify amino acids by detecting light emission of a luminescent molecule. The sequencing device 102 may include one or more sensors for detecting light emission. In some embodiments, each type of amino acid can be associated with a corresponding type of luminescent molecule. The sequencing device 102 can identify the amino acid by identifying the type of luminescent molecule based on the detected light emission. As an example, the sequencing device 102 can use light emission intensity, lifetime, wavelength, or other properties to distinguish between different luminescent molecules. In some embodiments, the sequencing device 102 can be configured to detect an electrical signal generated during a binding interaction between a reagent and an amino acid. The sequencing device 102 may include sensors for detecting electrical signals and using these signals to identify amino acids involved in the corresponding binding interaction.

In some embodiments, the sequencing device 102 can be configured to sequence proteins using different techniques than those described herein. Some embodiments are not limited to any particular protein sequencing technique described herein.

As shown in the embodiment of fig. 1A, the sequencing apparatus 102 may be configured to send sequencing data generated by the apparatus 102 to the data storage device 108A for storage. The sequencing data may include sequences generated by sequencing of the macromolecule sample. The sequencing data may be used by one or more other systems. As an example, sequencing data may be used by the assembly system 104 to generate an assembly of macromolecules. As another example, the sequencing data may be used by the model training system 106 as training data to train a machine learning model for use by the assembly system 104. Exemplary uses of sequencing data are described herein.

In some embodiments, the assembly system 104 may be a computing device configured to generate the assemblies 112 using sequencing data generated by the sequencing device 102. The assembly system 104 includes a machine learning model 104A, and the assembly system 104 generates assemblies using the machine learning model 104A. In some embodiments, the machine learning model 104A may be a trained machine learning model obtained from the model training system 106. Examples of machine learning models that may be used by the assembly system 104 are described herein.

In some embodiments, the assembly system 104 may be configured to generate the assembly 112 by updating an initial assembly. The initial assembly can be obtained by applying a conventional assembly algorithm to the sequencing data. In some embodiments, the assembly system 104 may be configured to generate the initial assembly itself by applying an assembly algorithm to sequencing data obtained from the sequencing device 102. As an example, the assembly system 104 can apply overlap-layout-consensus (OLC) assembly or de Bruijn graph (DBG) assembly to sequencing data (e.g., nucleotide sequences) from the data storage device 108A to generate the initial assembly. In some embodiments, the assembly system 104 may be configured to obtain an initial assembly generated by a system separate from the assembly system 104. As an example, the assembly system 104 can receive an initial assembly generated by a separate computing device that applies an assembly algorithm to sequencing data generated by the sequencing device 102.

In some embodiments, the assembly system 104 may be configured to use the trained machine learning model 104A to update or improve an assembly (e.g., an initial assembly obtained from application of an assembly algorithm). The assembly system 104 may be configured to update the assembly by correcting one or more errors in the assembly and/or confirming a biopolymer indication in the assembly. In some embodiments, the assembly system 104 may be configured to update the assembly by: (1) generating an input to the machine learning model 104A using the sequencing data and the assembly; (2) providing the generated input to the machine learning model 104A to obtain a corresponding output; and (3) updating the assembly using the output obtained from the machine learning model 104A. In some embodiments, for each of a plurality of positions in an assembly, the output of the machine learning model 104A may indicate one or more likelihoods that each of one or more respective biopolymers (e.g., nucleotides or amino acids) is present at that position in the assembly. As an example, for each location, the output may indicate a probability that the corresponding nucleotide is present at that location. In some embodiments, the assembly system 104 may be configured to: (1) identify biopolymers (e.g., nucleotides or amino acids) at assembly positions using the output obtained from the machine learning model 104A; and (2) update the assembly to indicate the identified biopolymers at those locations, thereby obtaining an updated assembly. Example techniques for updating an assembly using a machine learning model are described herein.
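The three-step update loop described above can be sketched as follows. Here `featurize` and `model` are hypothetical callables standing in for the feature generator and the trained machine learning model; neither name nor signature comes from the source.

```python
# Illustrative update pass: featurize each selected location, obtain
# per-nucleotide likelihoods from the model, and write the most likely
# nucleotide back into the assembly. `featurize` and `model` are
# hypothetical stand-ins.

def update_assembly(assembly, positions, featurize, model):
    updated = list(assembly)
    for pos in positions:
        likelihoods = model(featurize(assembly, pos))  # {nt: probability}
        updated[pos] = max(likelihoods, key=likelihoods.get)
    return "".join(updated)
```

This both corrects positions where the model disagrees with the assembly and confirms positions where the model's most likely nucleotide matches the existing indication.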

In some embodiments, the assembly system 104 may be configured to identify a location in the assembly to update (e.g., correct or confirm). The assembly system 104 can be configured to use the selected locations to generate inputs to the machine learning model 104A. In some embodiments, the assembly system 104 may be configured to identify the location to update by: (1) determining a likelihood that the indication of the biopolymer at the respective assembly location is incorrect; and (2) selecting a location to correct based on the determined likelihood. In some embodiments, the assembly system 104 may be configured to determine a numerical value indicating a likelihood that the indicated biopolymer at the respective location is incorrect, and select a location to update based on the likelihood value. As an example, the assembly system 104 may select a location for which the likelihood of being incorrect is greater than a threshold.

In some embodiments, the assembly system 104 may be configured to generate inputs to the machine learning model 104A by determining feature values for locations in the assembly. The assembly system 104 may be configured to determine the feature values using the assembly and the sequences from which the assembly was generated. Example features are described herein. In some embodiments, the assembly system 104 may be configured to generate an input to the machine learning model 104A for each of a plurality of locations. For each location, the assembly system 104 may be configured to determine feature values and provide the feature values as input to the machine learning model 104A to obtain a corresponding output. The assembly system 104 may be configured to use the output corresponding to the input provided for a location to correct the biopolymer indicated at that location, or to confirm that it is correct. In some embodiments, the plurality of locations may be all locations in the assembly. In some embodiments, the plurality of locations may be a subset of the locations in the assembly.

In embodiments where a subset of locations is updated, the assembly system 104 may be configured to select a subset of locations. The assembly system 104 may be configured to select the subset of locations in a variety of ways including: (1) determining a likelihood that the assembly incorrectly indicates the biopolymers at the plurality of locations; and (2) selecting a subset of locations from the plurality of locations using the likelihood. For example, the assembly system 104 may: (1) identifying a location for which the likelihood exceeds a threshold likelihood; and (2) selecting the identified locations as a subset of locations.
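The threshold-based selection above can be sketched as follows. Approximating the error likelihood as the fraction of aligned reads disagreeing with the assembly at a position is an assumption for illustration, not the only possible estimate; the function name and threshold value are likewise hypothetical.

```python
# Illustrative sketch: select positions whose estimated error likelihood
# (here, the fraction of disagreeing aligned reads -- an assumption)
# exceeds a threshold.

def select_positions(aligned_reads, assembly, threshold=0.5):
    selected = []
    total = len(aligned_reads)
    for pos, nt in enumerate(assembly):
        disagreeing = sum(1 for read in aligned_reads if read[pos] != nt)
        if disagreeing / total > threshold:
            selected.append(pos)
    return selected
```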

In some embodiments, the assembly system 104 may be configured to generate the input for a location to be corrected using feature values determined at one or more locations in the neighborhood of that location. For the selected location, the machine learning model 104A may utilize contextual information from surrounding locations in the assembly to generate an output for the selected location. In some embodiments, the neighborhood of a location may include: (1) the selected location; and (2) a set of locations surrounding the selected location. As an example, the neighborhood may be a window of locations centered at the selected location for which the machine learning model 104A generates output. The assembly system 104 may use a 5-position window, a 10-position window, a 15-position window, a 20-position window, a 25-position window, a 30-position window, a 35-position window, a 40-position window, a 45-position window, and/or a 50-position window.
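A window-shaped neighborhood centered on the selected location can be sketched as follows. Clamping the window at the ends of the assembly is an assumption for illustration; the text does not specify edge handling.

```python
# Illustrative neighborhood: a window of assembly locations centered on
# the selected location, clamped to the assembly bounds (an assumption).

def neighborhood(assembly_length, center, window_size=11):
    half = window_size // 2
    start = max(0, center - half)
    end = min(assembly_length, center + half + 1)
    return list(range(start, end))
```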

In some embodiments, the assembly system 104 may be configured to perform multiple update iterations to generate the final assembly 112. As an example, the assembly system 104 may: (1) performing a first iteration on the initial assembly to obtain a first updated assembly; and (2) performing a second iteration on the first updated assembly to obtain a second updated assembly. In some embodiments, the assembly system 104 may be configured to iteratively perform the updates. The assembly system 104 may be configured to perform update iterations until a condition is satisfied. Example conditions are described herein.
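The iterative refinement above can be sketched as follows. The stopping conditions shown ("no further changes" or a maximum iteration count) are illustrative assumptions; the text leaves the condition open.

```python
# Illustrative sketch: apply update passes until the assembly stops
# changing or an (assumed) maximum iteration count is reached.

def iterate_updates(assembly, update_pass, max_iterations=10):
    for _ in range(max_iterations):
        updated = update_pass(assembly)
        if updated == assembly:  # converged: nothing changed this pass
            break
        assembly = updated
    return assembly
```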

In some embodiments, the model training system 106 may be a computing apparatus configured to access data stored in the data store 108A and train a machine learning model using the accessed data for generating an assembly. In some embodiments, the model training system 106 may be configured to train separate machine learning models for different assembly systems. The machine learning models trained for the respective assembly systems may be tailored to the unique characteristics of the assembly systems. As an example, the model training system 106 may be configured to: (1) training a first machine learning model for a first assembly system; and (2) training a second machine learning model for the second assembly system. The independent machine learning model for each assembly system may be adapted to the unique error profile of the respective assembly system. For example, different assembly systems may employ different assembly algorithms to generate the initial assembly, and the machine learning model trained for each assembly system may be adapted to the error profile of the assembly algorithm.

In some embodiments, the model training system 106 may be configured to provide a single trained machine learning model to multiple assembly systems. As an example, the model training system 106 may aggregate assemblies from multiple assembly systems and train a single machine learning model on the aggregated data. A single machine learning model shared across assembly systems may mitigate model variation caused by differences in the assembly techniques those systems employ. In some embodiments, the model training system 106 may be configured to provide a single trained machine learning model for multiple sequencing devices. As an example, the model training system 106 can aggregate sequencing data from multiple sequencing devices and train a single machine learning model. A single machine learning model shared across sequencing devices may mitigate model variation caused by device-to-device differences.

In some embodiments, the model training system 106 may be configured to train the machine learning model using training data comprising: (1) biopolymer sequences obtained from sequencing one or more reference macromolecules (e.g., DNA, RNA, protein); and (2) one or more predetermined assemblies of the reference macromolecule(s). In some embodiments, the model training system 106 may be configured to use the indications of the biopolymers in the predetermined assembly as labels for training the machine learning model. The labels may represent the correct or desired indication at each assembly location. As an example, the training data may include nucleotide sequences from a sequenced DNA sample of an organism, as well as a predetermined genome assembly of the organism. In this example, the model training system 106 may use the indications of nucleotides in the predetermined genome assembly as labels when applying a supervised learning algorithm to the training data.
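The supervised setup above can be sketched as pairing per-position features from a draft assembly with labels taken from the predetermined reference assembly at the same positions. `featurize` is a hypothetical callable, and equal draft/reference lengths are assumed for brevity (real pipelines would align them first).

```python
# Illustrative sketch of labeled training-pair construction: features from
# the draft assembly, labels from the predetermined reference assembly.
# `featurize` is a hypothetical stand-in for the feature generator.

def make_training_pairs(featurize, draft_assembly, reference_assembly):
    return [(featurize(draft_assembly, pos), reference_assembly[pos])
            for pos in range(len(draft_assembly))]
```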

In some embodiments, the model training system 106 may be configured to access training data from an external database. By way of example, the model training system 106 may access: (1) sequencing data from the Pacific Biosciences RS II (PacBio) database and/or an Oxford Nanopore Technologies (ONT) MinION database; and (2) predetermined genome assemblies from the National Center for Biotechnology Information (NCBI) reference genome database. As another example, the model training system 106 may access protein sequencing data and associated proteome assemblies from the UniProt database and/or the Human Proteome Project (HPP) database.

In some embodiments, the model training system 106 may be configured to train the machine learning model by applying a supervised learning algorithm to the labeled training data. As an example, the model training system 106 can train a deep learning model (e.g., a neural network) using stochastic gradient descent. As another example, the model training system 106 may train a Support Vector Machine (SVM) by optimizing a cost function to identify the decision boundary of the SVM. As an example, the model training system 106 may: (1) generate inputs to the machine learning model using sequencing data and an assembly generated by applying an assembly algorithm to the sequencing data; (2) label the inputs using a predetermined assembly of the macromolecule (e.g., from a public database); and (3) apply a supervised training algorithm to the generated inputs and corresponding labels.

In some embodiments, the model training system 106 may be configured to train the machine learning model by applying an unsupervised learning algorithm to the training data. As an example, model training system 106 may identify clusters of a clustering model by performing k-means clustering. In some embodiments, the model training system 106 may be configured to: (1) generating an input to a machine learning model using the sequencing data and an assembly generated by applying an assembly algorithm to the sequencing data; and (2) applying an unsupervised learning algorithm to the generated input. As an example, the model training system 106 may train a cluster model, where each cluster of the model represents a respective nucleotide, and the cluster classification may indicate the nucleotide at a position in a genome assembly or gene sequence. As another example, the model training system 106 may train a clustering model, wherein each cluster of the model represents a respective amino acid, and the cluster classification may indicate the amino acid at a position in the protein sequence.
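A minimal 1-D k-means sketch (Lloyd's algorithm) illustrates the clustering idea above: each learned cluster could be associated with a nucleotide or amino acid, and assigning a new point to its nearest cluster then classifies it. The 1-D setting, function name, and parameters are all illustrative assumptions.

```python
# Minimal 1-D k-means (Lloyd's algorithm) sketch; all names and the 1-D
# setting are illustrative assumptions.
import random

def kmeans_1d(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # initialize from data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                     # assign each point to the
            nearest = min(range(k),          # nearest current center
                          key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]  # recompute means
    return sorted(centers)
```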

In some embodiments, the model training system 106 may be configured to train the machine learning model by applying a semi-supervised learning algorithm to the training data. In some embodiments, the model training system 106 may be configured to apply a semi-supervised learning algorithm to the training data by: (1) tagging a set of unlabeled training data by applying an unsupervised learning algorithm (e.g., clustering) to the training data; and (2) applying a supervised learning algorithm to the labeled training data. As an example, the model training system 106 may: (1) generating an input to a machine learning model using the sequencing data and an assembly generated by applying an assembly algorithm to the sequencing data; (2) applying an unsupervised learning algorithm to the generated input to label the input; and (3) applying a supervised learning algorithm to the labeled training data.

In some embodiments, the machine learning model may include a deep learning model (e.g., a neural network). In some embodiments, the deep learning model may include a Convolutional Neural Network (CNN). In some embodiments, the deep learning model may include a Recurrent Neural Network (RNN), a multi-layer perceptron, an auto-encoder, and/or a neural network model fit using connectionist temporal classification (CTC). In some embodiments, the machine learning model may include a clustering model. As an example, a clustering model may include a plurality of clusters, each cluster associated with a biopolymer (e.g., a nucleotide or an amino acid).
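Two forward-pass building blocks of such a convolutional model are sketched below: a valid 1-D convolution over a feature sequence, and a softmax that turns final-layer scores into per-nucleotide probabilities. This is purely illustrative; no training loop is shown, and weights would come from the model training system.

```python
# Illustrative forward-pass building blocks of a small convolutional
# model (no training shown; weights are assumed to be learned elsewhere).
import math

def conv1d(inputs, kernel, bias=0.0):
    """Valid 1-D convolution (really cross-correlation) over a sequence."""
    k = len(kernel)
    return [sum(inputs[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(len(inputs) - k + 1)]

def softmax(scores):
    """Turn final-layer scores into probabilities summing to 1."""
    m = max(scores)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```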

In some embodiments, the model training system 106 may be configured to train a separate machine learning model for each of a plurality of sequencing devices. A machine learning model trained for a respective sequencing device may be tailored to the unique characteristics of that sequencing device. As an example, the model training system 106 may: (1) training a first machine learning model for a first sequencing device; and (2) training a second machine learning model for the second sequencing device. Machine learning models trained for a respective sequencing device can be optimized for use with sequencing data generated by the sequencing device. For example, the machine learning model can be optimized for a particular sequencing technique used by the sequencing apparatus (e.g., third generation sequencing).

In some embodiments, the model training system 106 may be configured to periodically update a previously trained machine learning model. In some embodiments, the model training system 106 may be configured to update a previously trained model by updating values of one or more parameters of the machine learning model using new training data. In some embodiments, the model training system 106 may be configured to update the machine learning model by training a new machine learning model using a combination of previously obtained training data and the new training data.

In some embodiments, the model training system 106 may be configured to update the machine learning model in response to any of the different types of events. For example, in some embodiments, the model training system 106 may be configured to update the machine learning model in response to a user command. As an example, model training system 106 may provide a user interface via which a user may command the performance of a training process. In some embodiments, the model training system 106 may be configured to automatically (i.e., not in response to user commands) update the machine learning model, for example, in response to software commands. As another example, in some embodiments, the model training system 106 may be configured to update the machine learning model in response to detecting one or more conditions. For example, the model training system 106 may update the machine learning model in response to detecting expiration of the time period. As another example, the model training system 106 may update the machine learning model in response to receiving a threshold amount of new training data (e.g., a number of sequences and/or assemblies).

Although in the exemplary embodiment shown in FIG. 1A, the model training system 106 is separate from the assembly system 104, in some embodiments, the model training system 106 may be part of the assembly system 104. Although in the exemplary embodiment shown in FIG. 1A, the assembly system 104 is separate from the sequencing device 102, in some embodiments, the assembly system 104 can be a component of the sequencing device. In some embodiments, the sequencing device 102, the model training system 106, and the assembly system 104 may each be a component of a single system.

In some embodiments, data storage device 108A may be a system for storing data. In some embodiments, data storage 108A may include one or more databases maintained by one or more computing devices (e.g., servers). In some embodiments, data storage 108A may include one or more physical storage devices. By way of example, the physical storage device may include one or more solid state drives, hard disk drives, flash drives, and/or optical drives. In some embodiments, the data storage device 108A may include one or more files that store data. By way of example, the data storage 108A may include one or more text files that store data. As another example, the data storage device 108A may include one or more XML files. In some embodiments, the data storage 108A may be a storage device (e.g., a hard disk drive) of a computing apparatus. In some embodiments, data storage device 108A may be a cloud storage system.

In some embodiments, the network 111 may be a wireless network, a wired network, or any suitable combination thereof. As one example, the network 111 may be a Wide Area Network (WAN), such as the internet. In some embodiments, network 111 may be a Local Area Network (LAN). The local area network may be formed by wired and/or wireless connections between the sequencing apparatus 102, the assembly system 104, the model training system 106, and the data storage device 108A. Some embodiments are not limited to any particular type of network described herein.

FIG. 1B shows an example in which the system 100 is configured to generate a gene assembly. The gene assembly may be a genome assembly or a gene sequence; here, the output assembly 112 is a gene assembly. The sequencing device 102 may be configured to sequence the nucleic acid sample 110 to generate nucleotide sequences. As an example, the sequencing device 102 may sequence a DNA sample from an organism to generate nucleotide sequences. The nucleotide sequences generated by the sequencing device 102 can be stored in the data storage device 108B. The assembly system 104 may be configured to generate the gene assembly using the machine learning model 104A. As an example, the assembly system 104 may: (1) obtain an initial gene assembly by applying an assembly technique (e.g., OLC) to the nucleotide sequences generated by the sequencing device 102; and (2) update the initial gene assembly using the machine learning model 104A to obtain the gene assembly 112.

FIG. 1C shows an example in which the system 100 is configured to generate a protein sequence. Here, the output assembly 112 is a protein sequence. The sequencing device 102 may be configured to sequence the protein sample specimen 110 to generate amino acid sequences. As an example, the sequencing device 102 can sequence peptides from a protein to generate amino acid sequences. The amino acid sequences generated by the sequencing device 102 can be stored in the data storage device 108C. The assembly system 104 may be configured to generate the protein sequence using the machine learning model 104A. As an example, the assembly system 104 may: (1) apply an assembly algorithm to the amino acid sequences generated by the sequencing device 102 to obtain an initial protein sequence; and (2) update the initial protein sequence using the machine learning model 104A to obtain the protein sequence 112.

FIG. 2A illustrates an assembly system 200 for generating an assembly, in accordance with some embodiments of the technology described herein. The assembly system 200 may be the assembly system 104 described above with reference to FIGS. 1A-C. The assembly system 200 may be a computing device configured to generate an assembly 204 using the sequencing data 202. The assembly system 200 includes a number of components, including a feature generator 200A and a machine learning model 200B. The assembly system 200 may optionally include an assembler 200C.

In some embodiments, the feature generator 200A may be configured to determine values for one or more features that may be provided as input to the machine learning model. The feature generator 200A may be configured to determine the value of a feature using: (1) the sequence data 202; and (2) the assembly (e.g., obtained by applying an assembly algorithm to the sequence data 202). The sequence data 202 can include the plurality of sequences used by the assembly algorithm to generate the assembly. In some embodiments, the feature generator 200A may be configured to determine the value of the feature by comparing each sequence to the assembly. In some embodiments, the feature generator 200A may be configured to align a sequence with a portion of the assembly. For example, the feature generator 200A may align a sequence to a set of positions in the assembly, where the indication of the biopolymers at that set of positions was determined from the aligned sequences. The feature generator 200A may be configured to determine a value for a feature by comparing the aligned sequence to the biopolymers (e.g., nucleotides, amino acids) indicated at the set of positions in the assembly. Example techniques for determining values for features are described below with reference to FIGS. 4A-C.

As shown in the embodiment of fig. 2A, feature generator 200A may be configured to generate inputs to be provided to machine learning model 200B. In some embodiments, the feature generator 200A may be configured to generate an input for each of a plurality of locations in the assembly. In some embodiments, the feature generator 200A may be configured to select a location and generate an input using the selected location. In some embodiments, the feature generator 200A may be configured to select a location by: determining a likelihood that the assembly incorrectly indicates a biopolymer at the location, and selecting the location using the determined likelihood. In some embodiments, the feature generator 200A may be configured to determine the likelihood that the assembly incorrectly indicates a biopolymer at a location based on the number of sequences aligned to the location that specify a biopolymer different from the biopolymer indicated in the assembly. Feature generator 200A may be configured to generate an input for the location when the likelihood is determined to exceed a threshold likelihood.

In some embodiments, feature generator 200A may be configured to generate an input to be provided to machine learning model 200B for a target location in the assembly using: (1) a biopolymer identified at the target location; and (2) a biopolymer identified at one or more other locations in the neighborhood of the target location. In some embodiments, feature generator 200A may be configured to determine feature values at the target location and at other locations in the neighborhood of the target location. Feature values at other locations in the neighborhood may provide context to the machine learning model 200B for generating an output for the target location. In some embodiments, the size of the neighborhood may be a configurable parameter. For example, the size of the neighborhood may be specified by user input in a software application.

In some embodiments, feature generator 200A may be configured to generate an input that is a window comprising feature values determined at locations in the neighborhood of the target location. The neighborhood of the target location may include the target location and one or more other locations in a window around the target location. In some embodiments, the size of the window may be 2 locations, 3 locations, 5 locations, 10 locations, 15 locations, 20 locations, 25 locations, 30 locations, 35 locations, 40 locations, 45 locations, or 50 locations. In some embodiments, feature generator 200A may be configured to use a neighborhood size of 60 locations, 70 locations, 80 locations, 90 locations, or 100 locations. In some embodiments, the window may be centered on the target location.
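A windowed input of this kind can be sketched as follows, assuming per-position feature values stored in a list and padding with `None` beyond the ends of the assembly (the name `feature_window` and the padding choice are illustrative assumptions):

```python
def feature_window(features, target, size):
    """Return the feature values in a window of `size` positions centered on `target`.

    Positions falling outside the assembly are padded with None.
    """
    half = size // 2
    window = []
    for i in range(target - half, target - half + size):
        window.append(features[i] if 0 <= i < len(features) else None)
    return window
```

Each window supplies the model with the target position's feature value plus the context of its neighbors.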

In some embodiments, the machine learning model 200B may be the machine learning model 104A described above with reference to figs. 1A-C. As shown in the embodiment of fig. 2A, the machine learning model 200B may be configured to receive input from the feature generator 200A. The machine learning model 200B may be configured to generate outputs corresponding to respective inputs provided by the feature generator 200A. The machine learning model 200B may be configured to generate an output that is used by the assembly system 200 to identify a biopolymer (e.g., a nucleotide or an amino acid) at a location in the assembly. In some embodiments, the machine learning model 200B may be configured to output, for each location, a likelihood that each of a plurality of biopolymers is present at the location. As an example, the machine learning model 200B may output, for each of a plurality of nucleotides, a probability that the nucleotide is present at the location. As another example, the machine learning model 200B may output, for each of a plurality of amino acids, a probability that the amino acid is present at the position. In some embodiments, the assembly system 200 may be configured to identify the biopolymer at a location in the assembly as the one of the biopolymers with the greatest likelihood of being present at the location, as indicated by the output of the machine learning model 200B. As an example, the assembly system 200 can select, from a plurality of nucleotides, the one with the greatest likelihood of being present at the location. As another example, the assembly system 200 can select the one of a plurality of amino acids that is most likely to be present at the position.

In some embodiments, the assembly system 200 may be configured to generate the output assembly 204 using the output obtained from the machine learning model 200B. The assembly system 200 may be configured to update the assembly using the biopolymers identified at the locations in the assembly according to the output obtained from the machine learning model 200B. The assembly system 200 may be configured to update an assembly to indicate the identified biopolymer at a location in the assembly to obtain an output assembly 204. As one example, an assembly may indicate that adenine is at a first position in the assembly and guanine is at a second position in the assembly. In this example, the assembly system 200 may: (1) using the output obtained from the machine learning model 200B, the nucleotide at the first position is identified as thymine and the nucleotide at the second position is identified as guanine; and (2) updating the first position in the assembly to indicate thymine and leaving the nucleotide indicated at the second position unchanged to generate an output assembly 204. As shown in the above example, the assembly system 200 may use the output obtained from the machine learning model 200B to modify the biopolymer indication at the location in the assembly while keeping the biopolymer indication unchanged at other locations. For example, the assembly system 200 can determine that the biopolymer identified at the location in the assembly matches the biopolymer indicated in the assembly and leave the indication at the location unchanged in the updated assembly.
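The update step described above — picking, at each queried position, the biopolymer with the greatest likelihood in the model output, and leaving all other positions unchanged — can be sketched as follows (the dictionary-of-likelihoods representation is an illustrative assumption, not the disclosure's actual interface):

```python
def update_assembly(assembly, model_outputs):
    """Apply model outputs to an assembly string.

    `model_outputs` maps position -> {biopolymer: likelihood}; the biopolymer
    with the maximum likelihood is written at each queried position, and
    positions absent from `model_outputs` are left unchanged.
    """
    updated = list(assembly)
    for pos, likelihoods in model_outputs.items():
        updated[pos] = max(likelihoods, key=likelihoods.get)
    return "".join(updated)
```

Mirroring the adenine/guanine example in the text, a high likelihood for thymine at the first position changes it, while the second position stays guanine.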

As shown in the embodiment of fig. 2A, assembler 200C may be configured to provide the assembly to feature generator 200A. In some embodiments, assembler 200C may be configured to generate the assembly to be provided to feature generator 200A by applying an assembly algorithm to sequence data 202 (e.g., received from sequencing a macromolecule sample). As an example, assembler 200C may be configured to apply an assembly algorithm to nucleotide sequences included in sequence data 202 to generate the assembly. The assembly may then be provided to feature generator 200A to generate inputs to be provided to machine learning model 200B to obtain outputs for identifying the biopolymers at locations in the assembly. The assembly generated by assembler 200C may be updated by assembly system 200 using the outputs obtained from machine learning model 200B to generate the output assembly 204.

In some embodiments, assembler 200C may be configured to apply an overlap-layout-consensus (OLC) algorithm to nucleotide sequences included in sequence data 202 to generate an assembly. The sequencing device can sequence multiple copies of a biological sample comprising nucleic acids. Thus, for each portion (e.g., set of locations) of an assembly, the sequence data 202 can include a plurality of sequences that align with that portion of the assembly. The average number of sequences covering a position in an assembly may be referred to as the "coverage" of the sequences. Assembler 200C may be configured to apply the OLC algorithm to the sequences by: (1) generating an overlap graph based on the overlapping regions of the sequences; (2) using the overlap graph, generating a layout of sequences aligned to corresponding portions of the assembly (also referred to as "contigs"); and (3) for each set of sequences aligned to a portion of the assembly, obtaining a consensus of the sequences in the set to generate that portion of the assembly.

In some embodiments, assembler 200C may be configured to identify sequences having overlapping regions by comparing pairs of sequences to determine whether they include one or more identical biopolymer (e.g., nucleotide) subsequences. In some embodiments, assembler 200C may be configured to: (1) identify, as overlapping sequences, pairs of sequences that share identical subsequences of at least a threshold number of nucleotides (e.g., 3, 4, 5, 6, 8, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500); (2) determine the length (i.e., the number of nucleotides) of each overlapping region; and (3) generate an overlap graph based on the identified overlapping sequences and the lengths of the overlapping regions. The overlap graph may include vertices that are sequences and edges that connect respective pairs of overlapping sequences. The determined lengths may be used as labels for the edges in the overlap graph.
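One minimal way to sketch the overlap-identification step is with exact suffix-prefix overlaps, a common simplification of the shared-subsequence test described above (production assemblers use indexing and approximate matching; `overlap_length` and `overlap_graph` are illustrative names, not from this disclosure):

```python
def overlap_length(a, b, min_len=3):
    """Length of the longest suffix of `a` equal to a prefix of `b`,
    if at least `min_len` nucleotides; otherwise 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def overlap_graph(reads, min_len=3):
    """Edges (i, j, length) for each ordered pair of reads with a sufficient
    overlap; the length serves as the edge label."""
    edges = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = overlap_length(a, b, min_len)
                if k:
                    edges.append((i, j, k))
    return edges
```

For the reads `"TAGGC"` and `"GGCAT"`, the shared region `"GGC"` meets a threshold of 3, producing a single labeled edge.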

In some embodiments, assembler 200C may be configured to generate a layout of the sets of sequences aligned with corresponding portions of the assembly by concatenating the sequences together using the overlap graph. Assembler 200C may be configured to find paths through the overlap graph to join the sequences. As an example, assembler 200C may concatenate sets of alphanumeric characters representing nucleotides to obtain a concatenated sequence. In some embodiments, assembler 200C may apply a greedy algorithm to the overlap graph to identify the concatenated sequence. As an example, assembler 200C may apply a greedy algorithm to identify the shortest common superstring as the concatenated sequence.
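A toy version of the greedy layout step might look like the following; it repeatedly merges the pair of reads with the longest exact suffix-prefix overlap, which approximates the shortest-common-superstring objective. This is a sketch under simplifying assumptions (exact overlaps, no error tolerance), not a production layout algorithm:

```python
def _overlap(a, b, min_len=1):
    # Longest suffix of `a` equal to a prefix of `b`, if at least min_len long.
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(reads, min_len=1):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_pair = 0, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = _overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_pair = k, (i, j)
        if best_pair is None:          # no remaining overlaps: concatenate as-is
            return "".join(reads)
        i, j = best_pair
        merged = reads[i] + reads[j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads[0]
```

Greedy merging is a heuristic: it does not always find the true shortest superstring, but it illustrates how a layout is built by following the highest-weight edges of the overlap graph.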

In some embodiments, assembler 200C may be configured to generate the assembly using the layout sequences. In some embodiments, assembler 200C may identify multiple sets of layout sequences, wherein each set of layout sequences is aligned with a portion of the assembly. Assembler 200C may be configured to generate a portion of the assembly by determining a consensus of the layout sequences aligned with that portion. In some embodiments, assembler 200C may be configured to determine the consensus by determining that the biopolymer (e.g., nucleotide) at a location in the portion of the assembly is the biopolymer indicated at that location by a majority of the sequences aligned with the portion of the assembly. As an example, assembler 200C can generate an overlap graph of nucleotide sequences and identify four nucleotide sequences "TAGA", "TAGT", "TAGA", and "TAGC" corresponding to a set of four positions in the assembly. In this example, assembler 200C may determine that the consensus of the four nucleotide sequences is "TAGA" because all four nucleotide sequences indicate that the first three positions are "TAG" and a majority of the nucleotide sequences indicate that the fourth position is "A".

In some embodiments, the assembly system 200 may be configured to perform the consensus step of the OLC algorithm using machine learning techniques. Once assembler 200C has generated a layout for generating the assembly, the system may be configured to generate inputs for the machine learning model using the layout and the consensus assembly obtained from the layout. In some embodiments, the assembly system 200 may be configured to update the consensus assembly using the techniques described herein to obtain the output assembly 204.

In some embodiments, assembler 200C may be configured to apply to sequence data 202 an assembly algorithm described in the following article: "Assembly Algorithms for Next-Generation Sequencing Data", published in Genomics, Volume 95, Issue 6, June 2010, which is incorporated herein by reference in its entirety. In some embodiments, assembler 200C may be configured to apply an assembly algorithm different from the OLC algorithm to sequence data 202 to generate an assembly. In some embodiments, assembler 200C may be configured to apply a de Bruijn graph (DBG) assembly algorithm to sequence data 202. Some embodiments are not limited to a particular type of assembly algorithm. In some embodiments, assembler 200C may include a software application configured to generate assemblies using sequence data 202. By way of example, the system may include an HGAP, Falcon, Canu, Hinge, Miniasm, or Flye assembler. As another example, the system may include a SPAdes, Ray, ABySS, ALLPATHS-LG, or Trinity assembly application. Some embodiments are not limited to a particular assembler.

As shown in phantom in fig. 2A, in some embodiments, assembler 200C may not be included in the assembly system. The assembly system 200 may be configured to receive an assembly from a separate system and update the received assembly to generate the output assembly 204. As an example, a separate computing device may apply an assembly algorithm (e.g., OLC) to the sequence data 202 to generate an assembly, and send the generated assembly to the assembly system 200.

Fig. 2B illustrates an embodiment of the assembly system 200 described above with reference to fig. 2A, wherein the assembly system 200 is configured to perform multiple update iterations of the assembly, as indicated by the feedback arrow from the machine learning model 200B to the feature generator 200A. In some embodiments, the assembly system 200 may be configured to determine values of one or more features that may be provided as input to the machine learning model 200B after obtaining a first updated assembly. Feature generator 200A may be configured to determine the values of the features from: (1) the sequence data 202; and (2) the first updated assembly, obtained by updating the initial assembly that was generated by applying an assembly algorithm to the sequence data 202. The feature generator 200A may be configured to provide the determined values of the features as inputs to the machine learning model 200B to obtain outputs. The assembly system 200 may be configured to use the outputs from the machine learning model 200B to: (1) identify biopolymers at corresponding locations in the first updated assembly; and (2) update the first updated assembly to indicate the identified biopolymers at the corresponding locations to obtain a second updated assembly. The second updated assembly may be the assembly 204 output by the assembly system 200.

In some embodiments, the assembly system 200 may be configured to perform update iterations until a condition is satisfied. In some embodiments, the assembly system 200 may be configured to perform the update iterations until the system determines that a threshold number of iterations have been performed. In some embodiments, the threshold number of iterations may be set by user input (e.g., software commands or hard-coded values). In some embodiments, the assembly system 200 may be configured to determine the threshold number of iterations. As an example, the assembly system 200 may determine the threshold number of update iterations based on the type of assembly technique used to obtain the initial assembly. In some embodiments, the assembly system 200 may be configured to iteratively update the assembly until a specified stopping criterion is met. As an example, the assembly system 200 may: (1) determine a number of differences between a current assembly and a previous assembly obtained from the most recent update iteration; and (2) determine to stop iteratively updating the assembly when the number of differences is less than a threshold number of differences and/or when a percentage of differences is less than a threshold percentage.
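The stopping criterion above can be sketched as an update loop that halts once successive assemblies differ in at most a threshold number of positions, or a maximum iteration count is reached (`update_fn` stands in for one full model-driven update pass; the names are illustrative):

```python
def iterate_updates(assembly, update_fn, max_iters=10, max_diffs=0):
    """Repeatedly apply `update_fn` until the number of changed positions
    drops to `max_diffs` or fewer, or `max_iters` iterations are reached."""
    for _ in range(max_iters):
        new = update_fn(assembly)
        diffs = sum(1 for a, b in zip(assembly, new) if a != b)
        assembly = new
        if diffs <= max_diffs:
            break
    return assembly
```

With an update function that corrects one ambiguous base per pass, the loop stops on the first pass that leaves the assembly unchanged.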

Fig. 2C illustrates an embodiment of the assembly system 200 described above with reference to fig. 2A, wherein the assembly system 200 is configured to modify multiple positions of an assembly in parallel, as indicated by the multiple arrows from the feature generator 200A to the machine learning model 200B. As described with reference to fig. 2A, in some embodiments, feature generator 200A may be configured to generate an input to be provided to machine learning model 200B for each of a plurality of locations. In the embodiment of fig. 2C, the assembly system 200 may be configured to update multiple locations of the assembly in parallel. The assembly system 200 may be configured to: (1) update a first location in the assembly; and (2) begin updating a second location in the assembly before the updating of the first location in the assembly is completed. In some embodiments, the assembly system 200 may be configured to update multiple locations in parallel by generating and/or providing, in parallel, multiple inputs generated for multiple respective locations to the machine learning model 200B. As an example, feature generator 200A may: (1) generate and/or provide a first input to the machine learning model 200B for the first location; and (2) generate and/or provide a second input to the machine learning model 200B for the second location prior to obtaining the output from the machine learning model 200B corresponding to the first input.

In some embodiments, the assembly system 200 of fig. 2C may be a computing device including multiple processors configured to update multiple locations of an assembly in parallel. In some embodiments, the assembly system 200 may be configured to use a multi-threaded application, wherein each thread of the application is configured to update a respective location in the assembly in parallel with one or more other threads.
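A thread-based sketch of updating several positions in parallel might look like the following, where `call_fn` stands in for the feature-generation plus model-inference pipeline that maps a position to a base call (all names are illustrative assumptions):

```python
from concurrent.futures import ThreadPoolExecutor

def update_positions_parallel(assembly, positions, call_fn, workers=4):
    """Compute the model's base call for several positions concurrently,
    then apply all results to the assembly."""
    updated = list(assembly)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results pair with positions.
        for pos, base in zip(positions, pool.map(call_fn, positions)):
            updated[pos] = base
    return "".join(updated)
```

Because each position's call is independent, the calls can run on separate threads (or, in a heavier implementation, separate processes or devices) and be merged afterward.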

Fig. 2D illustrates an embodiment of the assembly system 200 described above with reference to fig. 2A, wherein the assembly system 200 is configured to: (1) performing a plurality of update iterations, as indicated by the arrows from the machine learning model 200B to the feature generator 200A; and (2) modify the assembled plurality of positions in parallel, as indicated by the plurality of arrows from the feature generator 200A to the machine learning model 200B. In some embodiments, the assembly system 200 may be configured to perform multiple update iterations as described above with reference to fig. 2B, and during each update cycle, update multiple locations in the assembly in parallel as described above with reference to fig. 2C.

Fig. 3A illustrates an example process 300 for training a machine learning model for generating biopolymer assemblies, in accordance with some embodiments of the techniques described herein. Process 300 may be performed by any suitable computing device. As an example, the process 300 may be performed by the model training system 106 described with reference to figs. 1A-C. Process 300 may be performed to train a machine learning model as described herein. As an example, the process 300 may be performed to train a deep learning model, such as the convolutional neural network (CNN) 600 described with reference to fig. 6.

In some embodiments, the machine learning model may be a deep learning model. In some embodiments, the deep learning model may be a neural network. As an example, the machine learning model may be a convolutional neural network (CNN) that generates outputs for identifying biopolymers (e.g., nucleotides, amino acids) at positions in the assembly. As another example, the machine learning model may be a neural network trained with a connectionist temporal classification (CTC) loss. In some embodiments, portions of the deep learning model may be trained separately. As an example, a deep learning model may have: a first portion that encodes input data into values of one or more features; and a second portion that receives the values of the features as input to generate an output identifying one or more biopolymers.

In some embodiments, the machine learning model may be a clustering model. In some embodiments, each cluster of the model may be associated with a biopolymer. As an illustrative example, the clustering model may include 5 clusters, where each cluster is associated with a respective nucleotide. For example, the first cluster may be associated with adenine; the second cluster may be associated with cytosine; the third cluster may be associated with guanine; the fourth cluster may be associated with thymine; and the fifth cluster may indicate that no nucleotide is present (e.g., at the position in the assembly). The number of clusters and the associated biopolymers are described herein for illustrative purposes.
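The five-cluster example can be sketched as follows, using one-dimensional feature values and fixed cluster centers purely for illustration; in practice the centers would be learned (e.g., by k-means) and the features would be higher-dimensional. The mapping and all names here are illustrative assumptions:

```python
# Fifth cluster (index 4) indicates no nucleotide is present at the position.
CLUSTER_TO_BASE = {0: "A", 1: "C", 2: "G", 3: "T", 4: None}

def nearest_cluster(x, centers):
    """Index of the cluster center closest to feature value x (1-D for brevity)."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - x))

def call_base(x, centers):
    """Map a feature value to a nucleotide via its nearest cluster."""
    return CLUSTER_TO_BASE[nearest_cluster(x, centers)]
```

A feature value is assigned to whichever cluster center it lies nearest, and the cluster's associated nucleotide (or the "no nucleotide" indication) becomes the call.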

The process 300 begins at block 302, where the system performing the process 300 accesses sequencing data from sequencing one or more reference macromolecules (e.g., DNA, RNA, or protein). In some embodiments, the system can be configured to access, from a database, sequencing data from sequencing a reference macromolecule. As one example, the system may access sequencing data obtained from sequencing bacteria from an ONG database. Sequencing data may be obtained from sequencing one or more samples of the macromolecule. For example, sequencing data may be obtained from a biological sample of Saccharomyces cerevisiae, which is a species of yeast. As another example, sequencing data may be obtained from sequencing a peptide sample of a protein. In some embodiments, sequencing data can include nucleotide sequences obtained from sequencing a biological sample comprising nucleic acids (e.g., DNA, RNA). In some embodiments, the sequencing data can include amino acid sequences obtained from sequencing a protein sample (e.g., peptides from a protein).

In some embodiments, the system may be configured to access sequencing data from a target sequencing technique such that a machine learning model may be trained to improve the accuracy of an assembly generated from sequencing data generated by the target sequencing technique. The machine learning model may be trained on an error distribution of the target sequencing technique such that the machine learning model may be optimized to correct errors characteristic of the target sequencing technique. In some embodiments, the system may be configured to access data obtained from third generation sequencing. In some embodiments, the third generation sequencing may be single molecule real-time sequencing. As an example, the system can access data obtained from a system for sequencing a nucleic acid sample by detecting light emission of a luminescent molecule associated with a nucleotide. As another example, the system can access data obtained from a system that sequences peptides by detecting light emission of luminescent molecules associated with reagents that selectively interact with amino acids. In some embodiments, the system may be configured to access data obtained from second generation sequencing. By way of example, the system can access sequencing data obtained from Sanger sequencing, Maxam-Gilbert sequencing, shotgun sequencing, pyrosequencing, combinatorial probe anchor synthesis, or sequencing by ligation. In some embodiments, the system may be configured to access data obtained from de novo sequencing of peptides. As an example, the system can access amino acid sequences obtained from tandem mass spectrometry. Some embodiments are not limited to a particular target sequencing technique.

Next, the process 300 proceeds to block 304, where the system accesses an assembly generated from at least a portion of the sequencing data obtained at block 302. In some embodiments, the system may be configured to access assemblies obtained by applying assembly algorithms (e.g., OLC assembly, DBG assembly) to the sequencing data. In some embodiments, the system may be configured to access the assembly by applying an assembly algorithm to the sequencing data. In some embodiments, the system may be configured to access predetermined assemblies generated by applying one or more assembly algorithms to the sequencing data. As an example, these assemblies may have been previously generated by a separate computing device and stored in a database. For example, the database from which the sequencing data is obtained may also store assemblies generated by applying one or more assembly algorithms to the sequencing data.

In some embodiments, the system may be configured to access assemblies generated according to a target assembly technique such that a machine learning model may be trained to correct errors that are characteristic of the target assembly technique. The machine learning model may be trained on an error distribution of the target assembly technique such that the machine learning model may be optimized to correct error characteristics of the target assembly technique. In some embodiments, the system may be configured to access assemblies generated by a particular assembly algorithm and/or software application. As an example, the system may access assemblies generated by Canu, Miniasm, or Flye assemblers. In some embodiments, the system may be configured to access assemblies generated from a class of assemblers. As an example, the system may access assemblies generated from a greedy algorithm assembler or a diagramming assembler. Some embodiments are not limited to a particular assembly technique.

Next, process 300 proceeds to block 306, where the system accesses one or more predetermined assemblies of the reference macromolecules. In some embodiments, a predetermined assembly of a reference macromolecule may represent the actual or correct assembly of the corresponding macromolecule. As such, the system may be configured to label the training data with the predetermined assemblies of the reference macromolecules. As an example, the system may access a reference genome of an organism's DNA from the NCBI database. In this example, the system can use the reference genome to determine labels for performing supervised learning to train a machine learning model for identifying nucleotides in a genome assembly. As another example, the system can access a reference protein sequence of a protein from the UniProt database and use the reference protein sequence to determine labels for performing supervised learning to train a machine learning model for identifying amino acids in a protein sequence.

Next, the process 300 proceeds to block 308, where the system trains the machine learning model using the data accessed at blocks 302-306. In some embodiments, the system may be configured to: (1) generate inputs for the machine learning model using the sequencing data accessed at block 302 and the assembly accessed at block 304; (2) label the generated inputs using the predetermined assembly accessed at block 306; and (3) apply a supervised learning algorithm to the labeled training data. In some embodiments, the system may be configured to generate the inputs to the machine learning model by generating values of one or more features using the sequencing data. In some embodiments, the system may be configured to determine values of the features for each location in the assembly. As an example, the system may determine the feature values for a location by: (1) determining counts for the respective nucleotides, wherein each count indicates the number of nucleotide sequences indicating the presence of the respective nucleotide at that location; and (2) determining the values of the features using the counts. Example techniques for generating inputs and labeling the inputs are described herein with reference to figs. 4A-C.
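The count-based feature described above can be sketched as a per-position count matrix over a pileup of aligned bases; the pileup representation and the function name are illustrative assumptions:

```python
def position_counts(pileup, alphabet="ACGT"):
    """For each assembly position, count how many aligned sequences indicate
    each nucleotide. `pileup[i]` lists the bases aligned to position i; the
    result is one count vector (ordered by `alphabet`) per position."""
    return [[col.count(base) for base in alphabet] for col in pileup]
```

Each row of the resulting matrix is a candidate feature vector for one assembly position, ready to be windowed and fed to the model.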

In some embodiments, the system may be configured to train a deep learning model using the labeled training data. In some embodiments, the system may be configured to train a decision tree model using the labeled training data. In some embodiments, the system may be configured to train a support vector machine (SVM) using the labeled training data. In some embodiments, the system may be configured to train a naive Bayes classifier (NBC) using the labeled training data.

In some embodiments, the system may be configured to train the machine learning model using stochastic gradient descent. The system may iteratively change parameters of the machine learning model to optimize an objective function to obtain a trained machine learning model. For example, the system may use stochastic gradient descent to train the filters of a convolutional neural network and/or the weights of a neural network.
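As a toy illustration of stochastic gradient descent (not the actual objective or parameterization of the models described here), a single weight can be fit to a squared-error objective by repeatedly stepping against the gradient of the per-sample loss:

```python
def sgd_step(w, x, y, lr=0.1):
    """One SGD step for the per-sample objective (w*x - y)^2 / 2:
    w <- w - lr * (w*x - y) * x."""
    grad = (w * x - y) * x
    return w - lr * grad

def sgd(w, samples, lr=0.1, epochs=100):
    """Sweep over (x, y) samples repeatedly, taking one step per sample."""
    for _ in range(epochs):
        for x, y in samples:
            w = sgd_step(w, x, y, lr)
    return w
```

Each step moves the parameter a small amount in the direction that reduces the current sample's loss; over many passes the parameter converges toward the minimizer of the objective.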

In some embodiments, the system may be configured to perform supervised training using the labeled training data. In some embodiments, the system may be configured to train the machine learning model by: (1) providing the generated inputs to the machine learning model to obtain corresponding outputs; (2) identifying, using the outputs, biopolymers present at locations in the assembly; and (3) training the machine learning model based on differences between the identified biopolymers and the biopolymers indicated at those locations in the reference assembly. The biopolymer indicated at a location in the reference assembly may serve as the label for the respective input. The difference may provide a measure of how well the machine learning model, configured with its current set of parameters, predicts the labels. The parameters of the machine learning model may be updated using stochastic gradient descent and/or any other iterative optimization technique suitable for training the model. As an example, the system may be configured to update one or more parameters of the model based on the determined differences.

In some embodiments, the system may apply an unsupervised training algorithm to a set of unlabeled training data. Although the embodiment of fig. 3A includes accessing predetermined assemblies of reference macromolecules at block 306, in some embodiments, the system may be configured to perform training without accessing the predetermined assemblies. In these embodiments, the system may be configured to apply an unsupervised training algorithm to the training data to train the machine learning model. The system may be configured to train the model by: (1) generating an input of a model using the sequencing data and an assembly generated from the sequencing data; and (2) applying an unsupervised training algorithm to the generated input. In some embodiments, the machine learning model may be a clustering model, and the system may be configured to identify clusters of the clustering model by applying an unsupervised learning algorithm to the training data. Each cluster can be associated with a biopolymer (e.g., a nucleotide or an amino acid). As an example, the system may perform k-means clustering to identify clusters (e.g., cluster centers) using training data.

In some embodiments, the system may be configured to apply a semi-supervised learning algorithm to the training data. The system can: (1) labeling a set of unlabeled training data by applying an unsupervised learning algorithm (e.g., clustering) to the training data; and (2) applying a supervised learning algorithm to the labeled training data. As an example, the system can apply k-means clustering to inputs generated from sequencing data and assemblies obtained from sequencing data to cluster the inputs. The system may then tag each input with a classification based on cluster membership. Subsequently, the system can train the machine learning model by applying a stochastic gradient descent algorithm and/or any other iterative optimization technique to the labeled data.

After training the machine learning model at block 308, the process 300 ends. In some embodiments, the system may be configured to store the trained machine learning model. The system may store values of one or more trained parameters of the machine learning model. As an example, the machine learning model may include one or more neural networks, and the system may store values of the trained weights of the neural networks. As another example, the machine learning model may include a convolutional neural network, and the system may store one or more trained filters of the convolutional neural network. In some embodiments, the system may be configured to store the trained machine learning model (e.g., in the assembly system 104) for use in generating assemblies (e.g., genomic assemblies, protein sequences, or portions thereof).

In some embodiments, the system may be configured to obtain new training data and update the machine learning model with it. In some embodiments, the system may be configured to update the machine learning model by training a new machine learning model using the new training data. In some embodiments, the system may be configured to update the machine learning model by retraining it with the new training data to update one or more parameters of the machine learning model. As an example, outputs generated by the model and the corresponding input data may be used as training data, together with previously obtained training data. In some embodiments, the system may be configured to iteratively update the trained machine learning model using input data and corresponding outputs identifying amino acids (e.g., obtained from performing process 310 described below with reference to fig. 3B). As an example, the system may be configured to provide input data to a first trained machine learning model (e.g., a teacher model) and obtain an output identifying one or more amino acids. Subsequently, the system can retrain the machine learning model using the input data and the corresponding output to obtain a second trained machine learning model (e.g., a student model).

In some embodiments, the system may be configured to train a separate machine learning model for each of a plurality of sequencing techniques. Data obtained from a sequencing technique can be used to train a machine learning model for the respective sequencing technique. The machine learning model may be adjusted for error distributions of the sequencing technique. In some embodiments, the system may be configured to train a separate machine learning model for each of a plurality of assembly techniques. The assemblies obtained from the assembly techniques may be used to train a machine learning model for the respective assembly techniques. The machine learning model may be adjusted for an error profile of the assembly technique.

In some embodiments, the system may be configured to train a general purpose machine learning model to be used for a plurality of sequencing techniques. The generic machine learning model may be trained using data aggregated from multiple sequencing techniques. In some embodiments, the system may be configured to train a generic machine learning model to be used for a variety of assembly techniques. The generic machine learning model may be trained using assemblies generated using a variety of assembly techniques.

Fig. 3B illustrates an exemplary process 310 for generating an assembly (e.g., a genomic assembly, a gene sequence, a protein sequence, or a portion thereof) using a trained machine learning model obtained from process 300, in accordance with some embodiments of the techniques described herein. Process 310 may be performed by any suitable computing device. As an example, the process 310 may be performed by the assembly system 104 described above with reference to fig. 1A-C.

The process 310 begins at block 312, where the system performs an assembly algorithm (e.g., OLC assembly or DBG assembly) on sequencing data to generate an assembly. As an example, the system may apply an assembly algorithm to nucleotide sequences generated by sequencing a DNA sample. As another example, the system may apply an assembly algorithm to amino acid sequences generated by sequencing peptide samples from a protein. The system may apply an assembly algorithm as described above with reference to assembler 200C of figs. 2A-D. In some embodiments, the system may include an assembly application. The system may be configured to generate the assembly by executing the assembly application. Examples of assembly applications are described herein.

As indicated by the dashed line around block 312, in some embodiments, the system may not perform the assembly algorithm. The system may obtain assemblies generated by separate systems (e.g., separate computing devices) and perform the steps of blocks 314 through 322 to update the obtained assemblies.

Next, process 310 proceeds to block 314, where the system accesses the sequencing data and the assembly. In some embodiments, the system may be configured to access an assembly generated by the system (e.g., at block 312). In some embodiments, the system may be configured to access an assembly generated by a separate system. As an example, the system may receive an assembly generated by a software application executing on a computing device separate from the system. In some embodiments, the system may be configured to access an assembly generated according to a target assembly technique (e.g., an algorithm and/or software application whose assemblies the machine learning model trained in process 300 has been optimized to update (e.g., to correct errors)). As an example, a machine learning model can be trained on assemblies generated by the Canu assembly application, and the system can access assemblies generated by the Canu assembly application.

In some embodiments, the system may be configured to access sequencing data that includes the biopolymer sequences used to generate the accessed assembly. As an example, the accessed sequencing data may include the nucleotide sequences to which an assembly algorithm was applied to generate a genomic assembly or gene sequence. As another example, the accessed sequencing data can include the amino acid sequences to which an assembly algorithm was applied to generate a protein sequence. In some embodiments, the system can be configured to access sequencing data generated by a target sequencing technique whose output the machine learning model trained in process 300 has been optimized to update. As an example, a machine learning model can be trained on sequencing data generated by third generation sequencing, and the system can access sequencing data generated by third generation sequencing.

Next, the process 310 proceeds to block 316, where the system uses the sequencing data and the assembly to generate input to be provided to the machine learning model. In some embodiments, the system may be configured to generate inputs for respective locations in the assembly. The system may be configured to generate an input for a set of locations in the assembly by: (1) aligning sequences from the sequencing data to the set of locations in the assembly; and (2) comparing the biopolymers of the aligned sequences with the biopolymers indicated at those locations in the assembly to determine the values of one or more features. In some embodiments, the system may be configured to align a sequence to a set of locations in the assembly by identifying, from the sequencing data, the sequence as indicative of the biopolymers at that set of locations. As an example, the assembly may include positions indexed from 1 to 10000, and the system may determine that the nucleotide sequences "TAGGTC", "TAGGTTC", "TAGGCC", and "TAGGTC" are each aligned with the positions indexed 5-10 in the assembly. In this example, the system can compare each nucleotide sequence to the biopolymers indicated at the positions indexed 5-10 in the assembly to determine the feature values. Examples of features and the generation of feature values are described with reference to figs. 4A-C.
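As a sketch of the first part of this feature-generation step, assuming the sequences have already been aligned to the same assembly positions (the alignment itself is not shown), per-position nucleotide counts can be computed as follows; `position_counts` is a hypothetical helper written for this illustration:

```python
from collections import Counter

def position_counts(aligned_seqs):
    """For each position covered by the aligned sequences, count how many
    sequences indicate each nucleotide ('A', 'C', 'G', 'T') or a gap ('-')."""
    length = len(aligned_seqs[0])
    # One Counter per position, tallying the symbol each sequence indicates.
    return [Counter(seq[i] for seq in aligned_seqs) for i in range(length)]
```

These per-position counts are the raw material compared against the assembly's indications to produce the feature values described with reference to figs. 4A-C.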

In some embodiments, the system may be configured to generate inputs for respective locations in the assembly. The system can be configured to generate the input for a location to provide to a machine learning model to obtain an output that can be used to identify the biopolymer (e.g., nucleotide, amino acid) present at that location in the assembly. In some embodiments, the system may be configured to generate the input for a location in the assembly based on the biopolymer indicated at the location and the biopolymers indicated at one or more other locations in the neighborhood of the location. The input may provide the machine learning model with context around the location in the assembly, which the model uses to generate a corresponding output. The system may be configured to generate the input for the location by determining values of one or more features at the location and at the other locations in its neighborhood. As an example, the system may: (1) select a location; (2) identify a neighborhood of locations centered at the selected location; and (3) generate the input as the feature values at the selected location and at each location in the neighborhood.

In some embodiments, the system may be configured to use a neighborhood of a set size. Example neighborhood sizes are described herein. In some embodiments, the number of locations in the neighborhood used by the system may be a configurable parameter. For example, the system may receive user input (e.g., in a software application) specifying a neighborhood size to be used. In some embodiments, the system may be configured to determine a neighborhood size. As an example, the system may determine the neighborhood size based on a sequencing technique by which sequencing data is generated and/or an assembly technique by which assembly is generated.
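A minimal sketch of extracting the neighborhood input for one location, assuming the feature values are stored as a 2-D array with one column per assembly position; the default `half_width=12` mirrors the 12-positions-per-side neighborhood used in the example of fig. 4C, and the function name is a hypothetical one chosen for this sketch:

```python
import numpy as np

def neighborhood_input(features, pos, half_width=12):
    """Return the window of feature columns centered at `pos`, zero-padding
    where the window extends past the ends of the assembly."""
    n_feat, length = features.shape
    window = np.zeros((n_feat, 2 * half_width + 1))
    lo = max(0, pos - half_width)
    hi = min(length, pos + half_width + 1)
    # Offset of the copied region inside the zero-padded window.
    start = lo - (pos - half_width)
    window[:, start:start + (hi - lo)] = features[:, lo:hi]
    return window
```

Making `half_width` a parameter reflects the configurable neighborhood size described above.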

In some embodiments, the system may be configured to generate the inputs to be provided to the machine learning model by: (1) selecting locations in the assembly; and (2) generating a corresponding input for each selected location. In some embodiments, the system may be configured to select a location in the assembly by: determining a likelihood that the assembly incorrectly indicates the biopolymer at the location, and using the determined likelihood to select locations for which to generate inputs. As an example, the system may determine whether the likelihood that the assembly incorrectly indicates a biopolymer at a location exceeds a threshold likelihood, and generate an input for the location if the likelihood exceeds the threshold likelihood. In some embodiments, the system may be configured to determine the likelihood that a location incorrectly indicates a biopolymer based on the number of aligned sequences indicating that the biopolymer is present at the location. The system can determine the likelihood as the difference between the number of sequences indicating the biopolymer at the location and the total number of aligned sequences. As an example, based on the consensus of a set of 9 nucleotide sequences, an assembly may indicate that thymine is present at a position in the assembly, where 4 nucleotide sequences indicate thymine at the position, 2 nucleotide sequences indicate guanine at the position, and 3 nucleotide sequences indicate adenine at the position. In this example, the system may determine the likelihood that the assembly incorrectly indicates the biopolymer at the position as the difference between the number of nucleotide sequences indicating thymine (4) and the total number of nucleotide sequences (9), obtaining a value of 5. The system may determine that 5 is greater than a threshold difference (e.g., 1, 2, 3, or 4) and, as a result, generate an input for the position.
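The worked example above can be sketched as follows; `disagreement` and `positions_to_update` are hypothetical helpers computing the difference used as the likelihood of an incorrect indication:

```python
def disagreement(counts):
    """Difference between the total number of aligned sequences at a position
    and the number supporting the most common (consensus) nucleotide."""
    total = sum(counts.values())
    return total - max(counts.values())

def positions_to_update(per_position_counts, threshold=4):
    """Select positions whose disagreement exceeds the threshold; inputs are
    generated only for these positions."""
    return [i for i, counts in enumerate(per_position_counts)
            if disagreement(counts) > threshold]
```

For the 9-sequence example in the text (thymine 4, adenine 3, guanine 2), the disagreement is 9 − 4 = 5, which exceeds a threshold of 4, so an input would be generated for that position.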

In some embodiments, the system may be configured to use a threshold difference of 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10. In some embodiments, the threshold difference may be a configurable parameter. The threshold likelihood used by the system may affect the number of locations for which the system generates inputs to be provided to the model. As an example, the system may receive the value of the threshold as user input to a software application. In some embodiments, the system may use a fixed threshold likelihood. As an example, the value of the threshold likelihood may be hard-coded. In some embodiments, the system may be configured to automatically determine the threshold likelihood. As an example, the system may determine the threshold likelihood based on the assembly technique by which the assembly was generated and/or the sequencing technique by which the sequencing data was generated.

In some embodiments, the system may be configured to generate the input for a position as a 2-D matrix. In some embodiments, each row/column of the matrix may hold the feature values determined at a corresponding location in the assembly. In some embodiments, the system may be configured to generate the input as an image, wherein the pixels of the image hold the feature values. As an example, each row/column of the image may hold the feature values determined at a respective location in the assembly.

Next, the process 310 proceeds to block 318, where the system provides the input generated at block 316 to the machine learning model to obtain a corresponding output. In some embodiments, the system may be configured to provide the inputs generated for the respective locations in the assembly as separate inputs to the machine learning model. As an example, the system may provide the set of feature values determined at a target location and at the locations in its neighborhood as an input to the machine learning model to obtain a corresponding output for the target location. In some embodiments, the system may be configured to provide the inputs generated for multiple locations in parallel (e.g., as described above with reference to figs. 2C-D). As an example, the system may: (1) provide a first input generated for a first location to the model; and (2) provide a second input generated for a second location to the model prior to obtaining the first output corresponding to the first input. In some embodiments, the system may be configured to provide the inputs generated for the multiple locations in sequence. For example, the system may: (1) provide a first input generated for a first location to the model to obtain a corresponding first output; and (2) after obtaining the first output, provide a second input generated for a second location to obtain a corresponding second output.

In some embodiments, the output corresponding to an input provided to the machine learning model may indicate, for each of a plurality of locations in the assembly, a likelihood that each of one or more biopolymers is present at that location. As one example, for each of a plurality of locations in a genome assembly, the output can indicate a likelihood (e.g., a probability) that each of one or more nucleotides (e.g., adenine, guanine, thymine, cytosine) is present at the location. As another example, for each of a plurality of positions in a protein sequence, the output may indicate a likelihood that each of one or more amino acids is present at that position. In some embodiments, the output may indicate a likelihood that no biopolymer is present at a location in the assembly. As an example, the output may indicate the likelihood of the "-" character at that location in the assembly.

In some embodiments, the model may provide outputs corresponding to respective locations in the assembly. The system may provide inputs generated for a target location in an assembly and obtain corresponding outputs indicating a likelihood that each of the one or more biopolymers is present at the target location. As an example, the system can provide inputs generated for a location in a genome assembly and obtain a corresponding output indicating a likelihood that each nucleotide in a set of 4 possible nucleotides (e.g., adenine, guanine, thymine, cytosine) is present at the location. For example, the likelihood may be the probability value that each nucleotide is present at that position.

Next, process 310 proceeds to block 320, where the system uses the output obtained from the model to identify the biopolymer at the location in the assembly. In some embodiments, the system may be configured to identify the biopolymer at the location in the assembly by identifying, for each of the locations, the biopolymer present at the location using the output obtained for the location in response to the respective input provided to the model. The output from the model may include a plurality of sets of output values corresponding to respective locations. Each set of output values may specify a likelihood that each of the one or more biopolymers is present at a respective location in the assembly. The system may identify the biopolymer at the respective location as the biopolymer having the greatest likelihood of being present at that location. As an example, a set of output values for a first location in an assembly may indicate the following likelihoods for that location: adenine (A) 0.1, cytosine (C) 0.6, guanine (G) 0.1, thymine (T) 0.15, and blank (-) 0.05. In this example, the system can identify cytosine (C) at that location in the assembly. In some embodiments, the output from the model corresponding to the input generated for a location may be a classification specifying the biopolymer at the location. As an example, the output from the model may be a classification of adenine (A), cytosine (C), guanine (G), thymine (T), or blank (-).
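The greatest-likelihood selection in the example above amounts to a one-line argmax; a sketch assuming the model output is a mapping from symbols to probabilities:

```python
def identify_biopolymer(likelihoods):
    """Return the symbol with the greatest likelihood from a model output
    mapping symbols (e.g., 'A', 'C', 'G', 'T', '-') to probabilities."""
    return max(likelihoods, key=likelihoods.get)
```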

Next, process 310 proceeds to block 322, where the system updates the assembly to obtain an updated assembly. The system may be configured to update the assembly based on the biopolymers identified at block 320. In some embodiments, the system may be configured to update the assembly by updating the indication of the biopolymer at the location in the assembly. In some cases, the biopolymer identified at block 320 as being present at the location may differ from the biopolymer indicated in the assembly. In these cases, the system may modify the biopolymer indication at the location in the assembly. As an example, the system may: (1) use the output of the model to identify that thymine "T" is present at a first position in the assembly indicated as adenine "A"; and (2) change the indication at the first position in the assembly from adenine "A" to thymine "T". In some cases, the biopolymer identified as being present at a location may be the same as the biopolymer indicated at that location in the assembly. In these cases, the system may not alter the biopolymer indication at that location in the assembly. As an example, the system may: (1) use the output of the model to identify that thymine "T" is present at a first position in the assembly indicated as thymine "T"; and (2) leave the indication at the first position unchanged.

In some embodiments, the system may be configured to update multiple locations in the assembly in parallel. As an example, the system may: (1) begin updating a first location in the assembly; and (2) begin updating a second location in the assembly before completing the update at the first location. In some embodiments, the system may be configured to update the locations in the assembly sequentially. As an example, the system may: (1) update a first location in the assembly; and (2) update a second location in the assembly after the update at the first location is completed.
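The update step of block 322 can be sketched as a simple write of the identified biopolymers into the assembly; a minimal sequential sketch under the assumption that the assembly is represented as a string of symbols:

```python
def update_assembly(assembly, identified):
    """Write each identified biopolymer at its position in the assembly;
    positions where the identification matches the existing indication are
    effectively left unchanged."""
    chars = list(assembly)
    for pos, symbol in identified.items():
        chars[pos] = symbol
    return "".join(chars)
```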

In some embodiments, after updating the assembly at block 322 to obtain a first updated assembly, process 310 may return to block 316, as indicated by the dashed line from block 322 to block 316. In some embodiments, the system may be configured to generate input for the machine learning model using the first updated assembly and the sequencing data. As an example, the system can generate an input to the model using a set of nucleotide sequences from the sequencing data and the first updated assembly. The system can align the nucleotide sequences with the corresponding positions of the first updated assembly to generate the input for the machine learning model as described above. The system may then perform the actions at blocks 316 through 322 to obtain a second updated assembly. In some embodiments, the assembly system may be configured to perform iterations until a condition is satisfied.

In some embodiments, the system may be configured to perform update iterations until the system determines that a threshold number of iterations have been performed. In some embodiments, the threshold number of iterations may be set by user input (e.g., a software command) or may be a hard-coded value. In some embodiments, the system may be configured to determine the threshold number of iterations. As an example, the system may determine the threshold number of update iterations based on the type of assembly technique used to obtain the initial assembly. In some embodiments, the system may be configured to perform update iterations until the system detects that the assembly has converged. As an example, the assembly system may: (1) determine the number of differences between the current assembly and the previous assembly obtained from the most recent iteration; and (2) determine to stop performing update iterations when the number of differences is less than a threshold number or percentage of differences.
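The iterate-until-converged behavior described above can be sketched as an outer loop around one polishing pass; `polish_once` stands in for a single round of blocks 316-322 and is a hypothetical callable supplied by the caller:

```python
def polish_until_converged(assembly, polish_once, max_iters=10, tol=0):
    """Iterate a polishing pass until the number of changed positions falls
    to `tol` or fewer, or `max_iters` passes have been performed."""
    for _ in range(max_iters):
        updated = polish_once(assembly)
        # Count positions that changed in this pass.
        n_diff = sum(a != b for a, b in zip(assembly, updated))
        assembly = updated
        if n_diff <= tol:
            break  # Converged: the assembly stopped changing.
    return assembly
```

`max_iters` corresponds to the threshold number of iterations and `tol` to the threshold number of differences mentioned above.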

In some embodiments, the system may be configured to perform a single update to the assembly, and process 310 may end at block 322 after performing the single update to the assembly. The updated assembly may be output by the system as an output assembly. As an example, the system may output a genome assembly in which errors in the assembly have been corrected, such that the output assembly is more accurate than the initial assembly accessed at block 314. As another example, the system may output a protein sequence in which the error has been corrected such that the output protein sequence is more accurate than the initial protein sequence accessed at block 314.

In some embodiments, the system may be configured to perform a first number of update iterations for a first portion of the assembly and a second number of update iterations for a second portion of the assembly. As an example, the system may update the locations of the genome assembly indexed 1-100 multiple times (e.g., by performing multiple iterations of the actions at blocks 316-322) and update the locations of the genome assembly indexed 101-200 once (e.g., by performing the actions at blocks 316-322 once). The system may be configured to determine an assembly portion to update multiple times based on the number of locations in the portion that likely indicate an incorrect biopolymer. As an example, the system may: (1) determine the number of locations in a window of locations (e.g., 25, 50, 75, 100, or 1000 locations) for which the likelihood of an incorrect biopolymer indication exceeds a threshold likelihood; and (2) determine to perform multiple update iterations for the window of locations when that number exceeds a threshold number of locations.

Figs. 4A-C illustrate examples of generating inputs to be provided to a machine learning model in accordance with some embodiments of the technology described herein.

Fig. 4A shows an array 400 comprising nucleotide sequences 401 (labeled "pile" in fig. 4A), an assembly 402 of biopolymers generated from the nucleotide sequences 401, and labels 404 for the biopolymers at corresponding positions in the assembly. As an example, the data shown in fig. 4A may be training data obtained from performing process 300 for training a machine learning model, wherein: (1) the sequencing data 401 and the assembly 402 are obtained at blocks 302 and 304; and (2) the labels 404 are obtained at block 306. As another example, the sequencing data 401 and the assembly 402 may be obtained at blocks 312 and/or 314 of the process 310 of generating an assembly using a trained machine learning model.

As shown in the example of fig. 4A, sequencing data 401 includes a nucleotide sequence generated from sequenced DNA. Each row of sequencing data 401 is a nucleotide sequence. As shown in the example of fig. 4A, the nucleotide sequence is represented as a sequence of alphanumeric characters, where "a" represents adenine, "C" represents cytosine, "G" represents guanine, "T" represents thymine, and "-" represents the absence of a nucleotide at that position. The exemplary alphanumeric characters described herein are for illustrative purposes, as some embodiments are not limited to a particular set of alphanumeric characters to represent the corresponding nucleotides or deletions thereof.

In the embodiment of fig. 4A, the assembly 402 is generated from the nucleotide sequences 401. In some embodiments, the assembly 402 may be obtained by applying an assembly algorithm (e.g., OLC assembly) to the sequencing data 401. In the example of fig. 4A, the assembly 402 is obtained by taking the consensus of the nucleotide sequences. The consensus is determined by majority voting among the nucleotide sequences at each position in the assembly 402, where the system identifies the biopolymer indicated by the greatest number of nucleotide sequences at that position. The system may be configured to, for each of a plurality of nucleotides: (1) determine the number of nucleotide sequences that vote for the nucleotide (e.g., by indicating that the nucleotide is present at the position); and (2) identify the nucleotide with the greatest number of votes to be indicated at the position. By way of example, for the position of the highlighted column 406: (1) 4 of the sequences indicate adenine, 3 of the sequences indicate cytosine, and 2 of the sequences indicate guanine; and (2) the position in assembly 402 indicates adenine. As another example, for the first position in assembly 402, all of the nucleotide sequences indicate cytosine, so assembly 402 indicates cytosine at the first position.
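The majority-vote consensus described above can be sketched as follows, assuming the sequences are aligned and of equal length (ties broken by first occurrence, a simplifying assumption of this sketch):

```python
from collections import Counter

def consensus(aligned_seqs):
    """Per-position majority vote over aligned sequences of equal length."""
    length = len(aligned_seqs[0])
    return "".join(
        # most_common(1) returns the (symbol, count) pair with the top count.
        Counter(seq[i] for seq in aligned_seqs).most_common(1)[0][0]
        for i in range(length)
    )
```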

In the embodiment of fig. 4A, the labels 404 may indicate the expected biopolymer at each location in the assembly 402. In some embodiments, the system can be configured to determine the labels from a reference genome. For example, the system can obtain the nucleotide sequences from sequencing of DNA samples from an organism, obtain the assembly 402 by applying an assembly algorithm to the nucleotide sequences, and obtain the labels 404 from a known reference genome of the organism (e.g., from the NCBI database). The labels 404 may represent the true or correct biopolymer indication for each location, for supervised training and/or for determining the accuracy of a generated assembly.

Fig. 4B illustrates an array 410 of values determined from the data 400 shown in fig. 4A. Array 410 shows an intermediate step in generating the input to the machine learning model for the location of column 406 in assembly 402. Array 410 includes a set of rows labeled "pile", which represent the nucleotide sequences of fig. 4A. For each position in the assembly, the system determines a count for each nucleotide of a plurality of nucleotides, wherein the count indicates the number of nucleotide sequences indicating that nucleotide at the position in the assembly. Each entry in the "pile" portion of the array 410 holds a nucleotide count. As an example, column 412 in fig. 4B has a count of 4 for adenine, 3 for cytosine, 2 for guanine, 0 for thymine, and 0 for no nucleotide. As another example, the first column of array 410 has a count of 0 for adenine, 9 for cytosine, 0 for guanine, 0 for thymine, and 0 for no nucleotide.

Array 410 also includes a set of rows labeled "assembly" in fig. 4B, which represent the assembly 402 of fig. 4A. For each position in the assembly 402, the array 410 includes a column of values determined from the nucleotide indicated at that position. For each position, the system can assign a reference value to each of the plurality of nucleotides, wherein the reference value indicates whether the nucleotide is indicated at the position in the assembly. As an example, in the column labeled 412 in fig. 4B, the assembly portion has: (1) a value of 9 for adenine, because it is the nucleotide indicated at the corresponding position in assembly 402; and (2) a value of 0 for each of the other nucleotides, as they are not indicated at the corresponding position in the assembly 402. As another example, the assembly portion of the first column of array 410 has: (1) a value of 9 for cytosine, because it is the nucleotide indicated at the corresponding position in assembly 402; and (2) a value of 0 for each of the other nucleotides, as they are not indicated at the corresponding position in the assembly 402. As shown in the example of fig. 4B, in some embodiments, when a nucleotide is indicated at an assembly position, the reference value assigned to the nucleotide at that position is equal to the number of aligned nucleotide sequences (e.g., 9 in the example of fig. 4A).

Fig. 4C shows an array 420 of feature values generated using the values in array 410 of fig. 4B. In some embodiments, the array 420 may be provided as an input to a machine learning model to obtain a corresponding output. In the example of fig. 4C, array 420 is the input to be provided to the model for the position in the assembly corresponding to column 422. Array 420 includes feature values determined at a target location corresponding to column 422, and feature values determined for 24 locations in the neighborhood of the target location. Array 420 includes feature values for 12 locations to the left of the target location and 12 locations to the right of the target location.

In the pile portion of the array 420, each column specifies an error value for each of the plurality of nucleotides. The error value for a nucleotide in a column indicates the difference between: (1) the number of nucleotide sequences indicating that nucleotide at the position in the assembly 402 corresponding to the column, and (2) the reference value assigned to the nucleotide in the assembly portion of the array 420. By way of example, for column 422 of fig. 4C, the values are determined as follows: (1) adenine: |4 − 9| = 5; (2) cytosine: |3 − 0| = 3; (3) guanine: |2 − 0| = 2; (4) thymine: |0 − 0| = 0; and (5) blank: |0 − 0| = 0. The assembly portion of array 420 may be the same as the assembly portion of array 410 of fig. 4B.

In some embodiments, the pile values in the array 420 may indicate the likelihood that the assembly 402 erroneously identifies the nucleotide at a location. The system may use these values to select the locations for which inputs to the machine learning model are to be generated. As shown in fig. 4C, the non-zero pile values are highlighted. In some embodiments, the system may be configured to determine to generate an input to be provided to the machine learning model for a location when a pile value at the location exceeds a threshold. For example, the system may determine to generate an input for the location in the assembly 402 corresponding to column 422 by determining that the difference of 5 determined for adenine exceeds a threshold difference of 4.

In some embodiments, the array 420 may be provided as an input to a machine learning model to update a location in the assembly (e.g., the location corresponding to column 422). The system can use the corresponding output obtained from the machine learning model to identify the nucleotide present at that location in the assembly and update the assembly accordingly. In some embodiments, the array 420 may be one of a plurality of inputs provided to the machine learning model as part of training the model. The system may use the respective outputs obtained from the machine learning model and the labels 404 to determine adjustments to one or more parameters of the machine learning model. As an example, the machine learning model may be a neural network, and the system may use the differences between the nucleotides identified from the output of the machine learning model and the labels to determine one or more adjustments to the weights of the neural network.

Although the exemplary embodiment of fig. 4A shows data relating to nucleic acids, in some embodiments, the data may relate to proteins. For example, the sequences 401 may be amino acid sequences, the assembly 402 may be a protein sequence, and the labels 404 may indicate the reference amino acid for each position in the protein sequence. The system can determine the values shown in figs. 4B-C based on the amino acid sequences, the protein sequence, and/or the labels.

Fig. 5 illustrates a process of updating an assembly in accordance with some embodiments of the technology described herein. FIG. 5 illustrates generating inputs from the assembly data 500 to be provided to the machine learning model 502 to generate an updated assembly 508. For example, the assembly data 500 may be in the form of the data described above with reference to fig. 4C. The illustrated update process may be performed by the assembly system 104 described above with reference to fig. 1A-C.

As shown in the embodiment of FIG. 5, the system selects locations 504A and 506A in the assembly to be updated. By way of example, the system may select the locations 504A, 506A by: (1) determining a likelihood that the assembly incorrectly indicates a biopolymer (e.g., a nucleotide or amino acid) at each of a set of positions in the assembly; and (2) determining that the likelihoods at locations 504A, 506A each exceed a threshold likelihood. When the system selects the locations 504A, 506A, the system may determine to generate corresponding inputs to be provided to the machine learning model 502.

As shown in the embodiment of FIG. 5, the system generates a first input 504B corresponding to location 504A and a second input 506B corresponding to location 506A. The system may generate each of the inputs 504B, 506B as described above with reference to fig. 4A-C. For example, the system may generate each of the inputs 504B, 506B by: (1) selecting a neighborhood of locations centered at the location; (2) determining a value of one or more features at each location in the neighborhood; and (3) using the value of the feature as an input for the location. In some embodiments, the system may be configured to store the values of the features in a data structure. By way of example, the system may store the values in a two-dimensional array or image, as shown in FIG. 4C.
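The three-step input generation above (select a neighborhood, compute feature values, store them in a two-dimensional array) can be sketched with a simple windowing routine. This is an illustrative sketch, not the patent's implementation; the function name, the single toy feature, and the zero-padding at the assembly ends are assumptions, and the window width of 25 echoes the 10 × 25 example input of fig. 4C.

```python
import numpy as np

def make_window(features, center, width=25):
    """features: (num_features, assembly_length) array of per-position
    feature values. Returns a (num_features, width) window centered at
    `center`, zero-padded where the window extends past the assembly ends."""
    num_features, length = features.shape
    half = width // 2
    window = np.zeros((num_features, width), dtype=features.dtype)
    lo = max(0, center - half)                # clip to assembly start
    hi = min(length, center - half + width)   # clip to assembly end
    dest = lo - (center - half)               # offset inside the window
    window[:, dest:dest + (hi - lo)] = features[:, lo:hi]
    return window

# A toy feature array with one feature per position, values 0..29.
features = np.arange(30, dtype=float).reshape(1, 30)
print(make_window(features, 15, width=5))   # [[13. 14. 15. 16. 17.]]
print(make_window(features, 0, width=5))    # [[0. 0. 0. 1. 2.]]
```

Each selected position yields one such fixed-size array, matching the data-structure description above (a two-dimensional array or image of feature values over the neighborhood).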

As shown in the embodiment of fig. 5, the system provides each of the generated inputs 504B, 506B as inputs to the machine learning model 502 to obtain corresponding outputs. Output 504C corresponds to input 504B generated for location 504A, and output 506C corresponds to input 506B generated for location 506A. In some embodiments, the system may be configured to provide the inputs 504B, 506B to the machine learning model 502 in sequence. As an example, the system may: (1) providing the input 504B to the machine learning model 502 to obtain a corresponding output 504C; and (2) after obtaining the output 504C, provide the input 506B to the machine learning model 502 to obtain a corresponding output 506C. In some embodiments, the system may be configured to provide the inputs 504B, 506B to the machine learning model 502 in parallel. As an example, the system may: (1) providing input 504B to machine learning model 502; and (2) provide input 506B to machine learning model 502 before obtaining output 504C corresponding to input 504B.

As shown in the embodiment of fig. 5, each of the outputs 504C, 506C indicates a likelihood that each of one or more nucleotides is present at the corresponding position in the assembly. In the embodiment of fig. 5, the likelihood is a probability. By way of example, output 504C specifies: (1) for each of four different nucleotides, the probability that the nucleotide is present at position 504A; and (2) the probability (represented by the "-" character) that no nucleotide is present at position 504A. In output 504C, the probability of adenine is 0.2, the probability of cytosine is 0.5, the probability of guanine is 0.1, the probability of thymine is 0.1, and the probability of no nucleotide at position 504A is 0.1. As another example, output 506C specifies: (1) for each of four different nucleotides, the probability that the nucleotide is present at position 506A; and (2) the probability (represented by the "-" character) that no nucleotide is present at position 506A. In this example, the probability of adenine is 0.6, the probability of cytosine is 0.1, the probability of guanine is 0.2, the probability of thymine is 0.05, and the probability of no nucleotide at position 506A is 0.05.

As shown in the embodiment of FIG. 5, the system updates the positions in the assembly using the outputs obtained from the machine learning model 502 to obtain an updated assembly 508. In some embodiments, the system may be configured to update the assembly by: (1) identifying the nucleotide present at a location using the output obtained from the machine learning model; and (2) updating the location in the assembly to indicate the identified nucleotide, thereby obtaining the updated assembly 508. As shown in the example of fig. 5, the system updates the location 504A in the initial assembly by: (1) using output 504C to determine that cytosine has the highest likelihood of being present at the location; and (2) setting the corresponding position 508A in the updated assembly 508 to indicate that cytosine "C" is at that position. As another example, the system updates the location 506A in the initial assembly by: (1) using output 506C to determine that adenine has the highest likelihood of being present at the location; and (2) setting the corresponding position 508B in the updated assembly 508 to indicate adenine "A". In some cases, the system may: (1) determine that the nucleotide identified at a location using the output obtained from the machine learning model 502 matches the nucleotide already indicated at that location; and (2) keep the indication at that location unchanged in the updated assembly 508.
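The update step can be sketched as picking the most probable symbol from each output and writing it into the assembly. This is a minimal illustration using the probability values from the fig. 5 example; the starting sequence and the positions used here are made up, and the handling of the "-" (no nucleotide) call as a deletion is an assumption.

```python
def apply_output(assembly, position, probabilities):
    """Write the most probable symbol into the assembly at `position`;
    a "-" call removes the position entirely."""
    best = max(probabilities, key=probabilities.get)
    updated = list(assembly)
    if best == "-":
        del updated[position]
    else:
        updated[position] = best
    return "".join(updated)

# Probability outputs from the fig. 5 example.
output_504c = {"A": 0.2, "C": 0.5, "G": 0.1, "T": 0.1, "-": 0.1}   # -> cytosine
output_506c = {"A": 0.6, "C": 0.1, "G": 0.2, "T": 0.05, "-": 0.05}  # -> adenine

assembly = "ATGTAGT"                               # toy initial assembly
assembly = apply_output(assembly, 2, output_504c)  # G -> C
assembly = apply_output(assembly, 5, output_506c)  # G -> A
print(assembly)  # ATCTAAT
```

If the most probable nucleotide already matches the assembly, the write is a no-op, corresponding to the "keep the indication unchanged" case described above.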

Although the updated assembly 508 is shown as being separate from the initial assembly, in some embodiments, the updated assembly 508 may be an updated version of the initial assembly. For example, the system may store the initial assembly in memory and update the values of the initial assembly in memory to obtain an updated assembly 508. In some embodiments, the system may generate the updated assembly 508 as a different assembly than the initial assembly. For example, the system may store the initial assembly at a first memory location and the updated assembly 508 as a separate assembly at a second memory location.

In some embodiments, the system may be configured to perform the updates sequentially at the locations in the initial assembly. As an example, the system may: (1) update the location 508A in the updated assembly 508 using the output 504C; and (2) after the update is completed at location 508A, update location 508B in the updated assembly 508 using output 506C. In some embodiments, the system may be configured to perform the updates in parallel at the locations in the initial assembly. As an example, the system may: (1) begin updating location 508A using output 504C; and (2) begin updating location 508B using output 506C before the update is completed at location 508A.

In some embodiments, the system may be configured to perform the processes of generating inputs for respective locations in the assembly, providing the inputs to the machine learning model 502, and updating the locations in the assembly using outputs from the machine learning model in parallel. As an example, the system may: (1) begin generating input for initially assembled position 504A; and (2) begin generating inputs for the initially assembled location 506A before the update of the location at location 504A is completed. By parallelizing the assembly update, the system makes the process of generating the assembly more efficient (e.g., by requiring less time). The system may parallelize the process by using multiple processors and/or using multiple application threads.
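One way to realize the parallelism described above is to fan the selected positions out across a thread pool, collect the per-position calls, and apply them to the assembly. This is a sketch under stated assumptions: `toy_model` is a stand-in for the trained machine learning model 502, and the neighborhood extraction is reduced to a simple substring.

```python
from concurrent.futures import ThreadPoolExecutor

def toy_model(window):
    """Placeholder returning fixed probabilities; a real system would run
    the trained network on the feature window."""
    return {"A": 0.6, "C": 0.1, "G": 0.2, "T": 0.05, "-": 0.05}

def process_position(assembly, pos):
    """Generate input for one position, run the model, return the call."""
    window = assembly[max(0, pos - 2): pos + 3]   # toy neighborhood input
    probs = toy_model(window)
    return pos, max(probs, key=probs.get)

initial = "ATGTAGT"
positions = [2, 5]
with ThreadPoolExecutor(max_workers=4) as pool:
    calls = list(pool.map(lambda p: process_position(initial, p), positions))

# Apply all calls to produce the updated assembly.
corrected = list(initial)
for pos, base in calls:
    corrected[pos] = base
corrected_assembly = "".join(corrected)
print(corrected_assembly)  # ATATAAT with the fixed toy probabilities
```

Reading from the immutable initial assembly and writing each result to a distinct position keeps the parallel workers independent, which is what makes this kind of parallelization safe.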

Although the embodiment of fig. 5 shows updating a portion of the genome assembly, some embodiments may implement the process shown to update a protein sequence or portion thereof. For example, the initial assembly may be a protein sequence. Subsequently, the system can generate inputs for locations in the protein sequence to provide to the machine learning model 502. The system can obtain an output indicative of a likelihood (e.g., probability) that each of the plurality of amino acids is present at the position. Subsequently, the system can update the initial protein sequence to obtain an updated protein sequence.

Fig. 6 illustrates an example of a convolutional neural network model 600 for generating assemblies in accordance with some embodiments of the technology described herein. In some embodiments, the convolutional neural network model 600 may be trained by performing the process 300 described above with reference to fig. 3A. In some embodiments, the trained convolutional neural network model 600 obtained from process 300 may be used to perform process 310 to generate an assembly as described above with reference to fig. 3B.

In some embodiments, model 600 is configured to receive input generated from sequencing data and from assemblies generated from the sequencing data. As an example, the model 600 may be the machine learning model used by the assembly system 104 described above with reference to fig. 1A-C. Sequencing data may include biopolymer sequences (e.g., nucleotide sequences or amino acid sequences). In some embodiments, the system may be configured to determine values of one or more features and provide the determined values as input to the model 600. As an example, the system may determine values of the features at a neighborhood of a location in the assembly and provide the determined values for the neighborhood of the location as input to the model 600. Example inputs and techniques for generating inputs are described herein.

In the exemplary embodiment of FIG. 6, model 600 includes a first convolutional layer 602 that receives the input provided to model 600. In the first layer 602, the system convolves the input with 64 3 × 5 filters (Conv3 × 5 × 64), represented as a 3 × 5 × 64 matrix. For example, the system may convolve a 10 × 25 input matrix (e.g., as shown in fig. 4C) with each channel of the 3 × 5 × 64 matrix to obtain an output. The first layer 602 includes a rectified linear unit (ReLU) as the activation function that the system applies to the output of the convolution. In some embodiments, the first layer 602 may also include a pooling layer to reduce the size of the convolved output.

In the exemplary embodiment of fig. 6, the model includes a second convolutional layer 604, which receives the output of the first layer 602. In the second layer 604, the system convolves the input with a set of 128 3 × 5 filters (Conv3 × 5 × 128), represented as a 3 × 5 × 128 matrix. The system may convolve the output from the first convolutional layer 602 with the 3 × 5 × 128 filter bank. The second convolutional layer 604 includes the ReLU function as the activation function that the system applies to the output of the convolution. In some embodiments, the second layer 604 may also include a pooling layer to reduce the size of the convolved output. The output of the second convolutional layer 604 is then passed to a third convolutional layer 606. In the third layer 606, the system convolves the input with a set of 256 3 × 5 filters, represented as a 3 × 5 × 256 matrix. The system then applies the ReLU activation function to the output of the convolution. In some embodiments, the third layer 606 may also include a pooling layer to reduce the size of the convolved output.

In the exemplary embodiment of FIG. 6, model 600 includes a dense (fully connected) layer 608 having 5 units, each of which receives 256 input values. The system may flatten the output obtained from the third convolutional layer 606 to provide it as input to the dense layer 608. The dense layer 608 may output a plurality of values, where each value indicates a likelihood that a corresponding biopolymer (e.g., nucleotide or amino acid) is present at the location for which input is provided to the model 600. As an example, the dense layer may output five values, where each value indicates a likelihood that a respective nucleotide (e.g., adenine, cytosine, guanine, or thymine) or no nucleotide is present at the position. The system may apply a normalized exponential (softmax) function to the output of the dense layer 608 to obtain a set of probability values that sum to 1. As shown in the exemplary embodiment of fig. 6, the system applies the softmax function to the output of the dense layer 608 to obtain an output 610 of 5 probabilities indicating the probability of each respective nucleotide (or no nucleotide) being present at the position in the assembly. The output 610 may be used to update the assembly (e.g., as described above with reference to fig. 5).
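The final stage can be illustrated numerically: five raw scores from the dense layer pass through softmax to become five probabilities (A, C, G, T, "-") summing to 1. The score values below are made up for illustration and are not taken from the model.

```python
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)   # subtract max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

dense_scores = np.array([1.2, 2.9, 0.4, 0.4, 0.1])  # hypothetical dense output
probs = softmax(dense_scores)
print({base: round(float(p), 3) for base, p in zip("ACGT-", probs)})
```

Because softmax is monotonic, the nucleotide with the largest dense-layer score (here cytosine) also receives the largest probability in output 610, which is what the update step of fig. 5 then selects.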

Fig. 7 illustrates results of executing the techniques described herein, in accordance with some embodiments. Each graph shows the improvement in accuracy that these techniques provide over conventional techniques. In fig. 7, Canu and Miniasm are two conventional assembly techniques. Miniasm + Racon denotes Miniasm with Racon error correction applied. Canu + Quorum is an implementation of the assembly techniques described herein used to revise assemblies generated by Canu. Miniasm + Quorum is an implementation of the assembly techniques described herein used to revise assemblies generated by Miniasm.

As shown in fig. 7, Miniasm + Quorum has a significantly lower error rate than Miniasm + Racon for each data sample. As an example, for E. coli from 30× PacBio data, each iteration of Miniasm + Quorum (represented by the connected points) has an error rate of less than 100 errors per 100 kilobases, while Miniasm + Racon has a minimum error rate of about 200 errors per 100 kilobases. As another example, for E. coli from 30× ONT data, each iteration of Miniasm + Quorum has an error rate of about 400 errors per 100 kilobases, while Miniasm + Racon has an error rate of about 500 errors per 100 kilobases.

As shown in fig. 7, Canu + Quorum provides higher accuracy than Canu alone. Although Canu itself incorporates conventional error correction techniques, the techniques described herein provide improved assembly accuracy. As an example, Canu has an error rate of greater than 500 errors per 100 kilobases for E. coli from 30× ONT data, while each iteration of Canu + Quorum has an error rate of less than 350 errors per 100 kilobases.

As shown in fig. 7, the techniques described herein may provide improved assembly accuracy without increasing the amount of computation time spent on error correction. By way of example, Miniasm + Quorum achieves better accuracy than Miniasm + Racon in substantially the same number of CPU hours. As another example, Canu + Quorum achieves better accuracy than Canu alone without substantially increasing the number of CPU hours required to correct the assembly.

In some embodiments, the systems and techniques described herein may be implemented using one or more computing devices. However, embodiments are not limited to operation with any particular type of computing device. By way of illustration, FIG. 8 is a block diagram of an illustrative computing device 800. Computing device 800 may include one or more processors 802 and one or more tangible, non-transitory computer-readable storage media (e.g., memory 804). The memory 804 may store, in a tangible, non-transitory computer-recordable medium, computer program instructions that, when executed, implement any of the functionality described above. The processor 802 may be coupled to the memory 804 and may execute such computer program instructions to cause the functionality to be performed.

Computing device 800 may also include network input/output (I/O) interface 806 via which the computing device may communicate with other computing devices (e.g., over a network); and may also include one or more user I/O interfaces 808 via which a computing device can provide output to, and receive input from, a user. The user I/O interface may include devices such as a keyboard, a mouse, a microphone, a display device (e.g., a monitor or touch screen), a speaker, a camera, and/or various other types of I/O devices.

The above-described embodiments may be implemented in any of numerous ways. For example, the embodiments may be implemented using hardware, software, or a combination thereof. When implemented in software, the software code can be executed on any suitable processor (e.g., a microprocessor) or collection of processors, whether provided in a single computing device or distributed among multiple computing devices. It should be understood that any component or collection of components that performs the functions described above can generically be considered one or more controllers that control those functions. The controller(s) can be implemented in numerous ways, such as with dedicated hardware or with general-purpose hardware (e.g., one or more processors) programmed using microcode or software to perform the functions recited above.

In this regard, it should be appreciated that one implementation of the embodiments described herein includes at least one computer-readable storage medium (e.g., RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other tangible, non-transitory computer-readable storage media) encoded with a computer program (i.e., a plurality of executable instructions) that, when executed on one or more processors, performs the functions described above for one or more embodiments. The computer readable medium may be transportable, such that the program stored thereon can be loaded onto any computing device to implement aspects of the techniques discussed herein. Furthermore, it should be understood that reference to a computer program is not limited to an application running on a host computer, wherein the computer program performs any of the above-described functions when executed. Rather, the terms computer program and software are used herein in a generic sense to refer to any type of computer code (e.g., application software, firmware, microcode, or any other form of computer instructions) that can be employed to program one or more processors to implement aspects of the techniques discussed herein.

The various features and aspects of the present disclosure may be used alone, in combination, or in a variety of arrangements not specifically discussed in the foregoing embodiments, and are therefore not limited in their application to the details and arrangement of components set forth in the foregoing description or illustrated in the drawings. For example, aspects described in one embodiment may be combined in any manner with aspects described in other embodiments.

The terms "approximately," "substantially," and "about" may be used to indicate within ±20% of a target value in some embodiments, within ±10% of a target value in some embodiments, within ±5% of a target value in some embodiments, and within ±2% of a target value in some embodiments. The terms "approximately" and "about" may include the target value.

Additionally, the concepts disclosed herein may be embodied as a method, examples of which have been provided herein. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different from that illustrated, which may include performing some acts simultaneously, even though they are shown as sequential acts in the illustrative embodiments.

Use of ordinal terms such as "first," "second," "third," etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another, or the temporal order in which acts of a method are performed; such terms are used merely as labels to distinguish one claim element having a certain name from another element having the same name (but for use of the ordinal term).

Also, the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of "including," "comprising," or "having," "containing," "involving," and variations thereof herein, is meant to encompass the items listed thereafter and equivalents thereof as well as additional items.
