Method and system for improving safety of gene editing technology

Document No.: 600278. Publication date: 2021-05-04.

Reading note: this technology, "Method and system for improving safety of gene editing technology", was designed and created by Li Xiaoguang (李晓光) on 2021-01-19. Main content: The invention discloses a method and a system for improving the safety of a gene editing technology, wherein the method comprises the following steps: obtaining a first exogenous gene for gene editing; obtaining first cell gene sequence information after the first exogenous gene is introduced; obtaining a first gene region which is prone to off-target effects according to the first exogenous gene; obtaining second cell gene sequence information within the first gene region; inputting the second cell gene sequence information into a first training model to obtain first output information of the first training model, wherein the first output information identifies result information of whether the second cell gene sequence satisfies a predetermined condition; and when the first output information satisfies the predetermined condition, obtaining a first result of the safety of the gene editing technology. The method solves the technical problems of low safety, high off-target rate and insufficiently accurate off-target prediction of the gene editing technology in the prior art.

1. A method of increasing the safety of gene editing techniques, wherein the method comprises:

obtaining a first exogenous gene for gene editing;

obtaining the gene sequence information of a first cell after the first exogenous gene is introduced;

obtaining a first gene region which is prone to off-target effects according to the first exogenous gene;

obtaining second cell gene sequence information in the first gene region according to the first cell gene sequence information and the first gene region;

inputting the second cell gene sequence information into a first training model, wherein the first training model is obtained by training a plurality of groups of training data, and each group of training data in the plurality of groups comprises: second cell gene sequence information, and result information identifying whether the second cell gene sequence satisfies a predetermined condition;

obtaining first output information of the first training model, wherein the first output information identifies result information of whether the second cell gene sequence satisfies a predetermined condition;

and when the first output information satisfies the predetermined condition, obtaining a first result of the safety of the gene editing technology.

2. The method of claim 1, wherein when the first output information satisfies the predetermined condition, the method further comprises:

obtaining a second gene region which is not prone to off-target effects according to the first exogenous gene;

obtaining third cell gene sequence information in the second gene region according to the first cell gene sequence information and the second gene region;

inputting the third cell gene sequence information into a second training model, wherein the second training model is obtained by training a plurality of groups of training data, and each group of training data in the plurality of groups comprises: third cell gene sequence information, and result information identifying whether the third cell gene sequence information satisfies the predetermined condition;

obtaining second output information of the second training model, wherein the second output information identifies result information of whether the third cell gene sequence satisfies the predetermined condition;

and when the second output information indicates that the predetermined condition is satisfied, obtaining a second result of the safety of the gene editing technology.

3. The method of claim 1, wherein after obtaining gene sequence information of the first cell after introduction of the first exogenous gene, the method further comprises:

inputting the first cell gene sequence information into a first predictive model;

obtaining first potential off-target site information;

inputting the first potential off-target site information into a first alignment model.

4. The method of claim 3, wherein after obtaining second cellular gene sequence information within the first gene region, the method further comprises:

inputting the second cellular gene sequence information to the first predictive model;

obtaining second potential off-target site information;

inputting the second potential off-target site information into the first alignment model;

outputting a first comparison result by the first alignment model;

and determining the first gene region according to the first comparison result.

5. The method of claim 1, wherein the method comprises:

after obtaining the first exogenous gene for gene editing, the method further comprises:

obtaining first variety information of the first exogenous gene;

obtaining first reference genome information of the first exogenous gene according to the first variety information;

obtaining first target information according to the first reference genome information;

and introducing the first exogenous gene according to the first target information.

6. The method of claim 1, wherein the method comprises:

generating a first verification code according to the first cell gene sequence information, wherein the first verification code corresponds to the first cell gene sequence information;

generating a second verification code according to the second cell gene sequence information and the first verification code; by analogy, generating an Nth verification code according to the Nth cell gene sequence information and the (N-1)th verification code, wherein N is a natural number greater than 1;

and taking each piece of cell gene sequence information together with its corresponding verification code as a storage unit, and copying and storing the storage units on M devices, wherein M is a natural number greater than 1.

7. The method of claim 6, wherein the method comprises:

taking the Nth cell gene sequence information and the Nth verification code as an Nth storage unit;

obtaining the recording time of the Nth storage unit, wherein the recording time of the Nth storage unit represents the time required to record the Nth storage unit;

obtaining, according to the recording time of the Nth storage unit, a first device with the largest memory among the M devices;

and sending the recording right of the Nth storage unit to the first device.

8. A system for improving the safety of gene editing techniques, wherein the system comprises:

a first obtaining unit for obtaining a first exogenous gene for gene editing;

a second obtaining unit for obtaining gene sequence information of the first cell after the first exogenous gene is introduced;

a third obtaining unit for obtaining a first gene region which is prone to off-target effects according to the first exogenous gene;

a fourth obtaining unit configured to obtain second cell gene sequence information within the first gene region based on the first cell gene sequence information and the first gene region;

a first input unit, configured to input the second cell gene sequence information into a first training model, where the first training model is obtained by training multiple sets of training data, and each set of training data in the multiple sets includes: second cell gene sequence information, and result information identifying whether the second cell gene sequence satisfies a predetermined condition;

a fifth obtaining unit, configured to obtain first output information of the first training model, where the first output information identifies result information of whether the second cell gene sequence satisfies a predetermined condition;

a sixth obtaining unit configured to obtain a first result of safety of a gene editing technique when the first output information satisfies the predetermined condition.

9. A system for improving the safety of gene editing techniques, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the program.

Technical Field

The invention relates to the technical field of gene editing, in particular to a method and a system for improving the safety of a gene editing technology.

Background

Gene editing, also known as genome editing or genome engineering, is an emerging and relatively precise genetic engineering technique capable of modifying a specific target gene in the genome of an organism. However, the off-target effect remains a main limiting factor in whether gene editing technology can be widely applied. How to correctly evaluate and detect the off-target effect, and how to provide corresponding strategies to reduce it, is an important research direction in the current gene editing field.

In the process of implementing the technical scheme of the invention in the embodiment of the present application, the inventor of the present application finds that the above-mentioned technology has at least the following technical problems:

the safety of the gene editing technology is low, the off-target rate is high, and off-target prediction is not accurate enough.

Disclosure of Invention

The embodiment of the application provides a method for improving the safety of a gene editing technology, wherein the method comprises the following steps: obtaining a first exogenous gene for gene editing; obtaining the gene sequence information of a first cell after the first exogenous gene is introduced; obtaining a first gene region which is prone to off-target effects according to the first exogenous gene; obtaining second cell gene sequence information in the first gene region according to the first cell gene sequence information and the first gene region; inputting the second cell gene sequence information into a first training model, wherein the first training model is obtained by training a plurality of groups of training data, and each group of training data in the plurality of groups comprises: second cell gene sequence information, and result information identifying whether the second cell gene sequence satisfies a predetermined condition; obtaining first output information of the first training model, wherein the first output information identifies result information of whether the second cell gene sequence satisfies the predetermined condition; and when the first output information satisfies the predetermined condition, obtaining a first result of the safety of the gene editing technology.

In another aspect, the present application further provides a system for improving the safety of a gene editing technology, wherein the system comprises: a first obtaining unit for obtaining a first exogenous gene for gene editing; a second obtaining unit for obtaining gene sequence information of the first cell after the first exogenous gene is introduced; a third obtaining unit for obtaining a first gene region which is prone to off-target effects according to the first exogenous gene; a fourth obtaining unit configured to obtain second cell gene sequence information within the first gene region based on the first cell gene sequence information and the first gene region; a first input unit, configured to input the second cell gene sequence information into a first training model, where the first training model is obtained by training multiple sets of training data, and each set of training data in the multiple sets includes: second cell gene sequence information, and result information identifying whether the second cell gene sequence satisfies a predetermined condition; a fifth obtaining unit, configured to obtain first output information of the first training model, where the first output information identifies result information of whether the second cell gene sequence satisfies the predetermined condition; a sixth obtaining unit configured to obtain a first result of safety of a gene editing technique when the first output information satisfies the predetermined condition.

In a further aspect, the embodiment of the present application further provides a system for improving the safety of the gene editing technology, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method according to the first aspect when executing the program.

One or more technical solutions provided in the embodiments of the present application have at least the following technical effects or advantages:

The edited gene sequence is input into the training model, so that the gene sequence of the off-target-prone region is detected and possible off-target events are predicted. Because the training model can continuously learn and acquire experience in processing data, the edited genome can be compared accurately, which achieves the technical purposes of improving the accuracy of off-target detection and improving the safety of the gene editing technology.

The foregoing is a summary of the present disclosure, and embodiments of the present disclosure are described below to make the technical means of the present disclosure more clearly understood.

Drawings

FIG. 1 is a schematic flow chart of a method for improving the safety of a gene editing technology according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a system for improving the safety of a gene editing technique according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of an exemplary electronic device according to an embodiment of the present application.

Description of reference numerals: a first obtaining unit 11, a second obtaining unit 12, a third obtaining unit 13, a fourth obtaining unit 14, a first input unit 15, a fifth obtaining unit 16, a sixth obtaining unit 17, a bus 300, a receiver 301, a processor 302, a transmitter 303, a memory 304, and a bus interface 305.

Detailed Description

The embodiment of the application provides a method and a system for improving the safety of a gene editing technology, solves the technical problems that the safety of the gene editing technology is low, the off-target rate is high and the off-target prediction is not accurate enough in the prior art, and achieves the technical aims of carrying out genome comparison based on a training model, thereby improving the accuracy of off-target detection and improving the safety of the gene editing technology. Hereinafter, example embodiments of the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are merely some embodiments of the present application and not all embodiments of the present application, and it should be understood that the present application is not limited to the example embodiments described herein.

Summary of the application

The off-target effect remains a main limiting factor in whether gene editing technology can be widely applied, and how to correctly evaluate and detect the off-target effect and provide corresponding strategies to reduce it is an important research direction in the current gene editing field. The prior art suffers from the technical problems of low safety of the gene editing technology, a high off-target rate, and insufficiently accurate off-target prediction.

In view of the above technical problems, the technical solution provided by the present application has the following general idea:

The embodiment of the application provides a method for improving the safety of a gene editing technology, wherein the method comprises the following steps: obtaining a first exogenous gene for gene editing; obtaining the gene sequence information of a first cell after the first exogenous gene is introduced; obtaining a first gene region which is prone to off-target effects according to the first exogenous gene; obtaining second cell gene sequence information in the first gene region according to the first cell gene sequence information and the first gene region; inputting the second cell gene sequence information into a first training model, wherein the first training model is obtained by training a plurality of groups of training data, and each group of training data in the plurality of groups comprises: second cell gene sequence information, and result information identifying whether the second cell gene sequence satisfies a predetermined condition; obtaining first output information of the first training model, wherein the first output information identifies result information of whether the second cell gene sequence satisfies the predetermined condition; and when the first output information satisfies the predetermined condition, obtaining a first result of the safety of the gene editing technology.

Having thus described the general principles of the present application, various non-limiting embodiments thereof will now be described in detail with reference to the accompanying drawings.

Example one

As shown in fig. 1, the present application provides a method for improving the safety of a gene editing technology, wherein the method includes:

step S100: obtaining a first exogenous gene for gene editing;

Specifically, gene editing relies on genetically engineered nucleases, also known as "molecular scissors", to generate site-specific double-strand breaks (DSBs) at specific locations in the genome, inducing the organism to repair the DSBs by non-homologous end joining (NHEJ) or homologous recombination (HR). Because this repair process is error-prone, it results in targeted mutations. The first exogenous gene is a guide sequence for gene editing of a DNA sequence.

Step S200: obtaining the gene sequence information of a first cell after the first exogenous gene is introduced;

specifically, after the first exogenous gene is introduced into an original gene sequence under the action of nuclease, the gene sequence information of the first cell is obtained, and a foundation is laid for predicting the off-target effect.

Step S300: obtaining a first gene region which is prone to off-target effects according to the first exogenous gene;

Specifically, the first gene region is a region of the first cell gene sequence in which off-target events are likely to occur. The first gene region is determined by predicting potential off-target sites in the first cell gene sequence, and the gene sequence within the first gene region is then analyzed and detected.

Step S400: obtaining second cell gene sequence information in the first gene region according to the first cell gene sequence information and the first gene region;

specifically, after the first gene region is determined, the gene sequence information within the first gene region is detected, thereby achieving the evaluation of the safety of the entire gene sequence by evaluating the safety of gene editing within the first gene region.

Step S500: inputting the second cell gene sequence information into a first training model, wherein the first training model is obtained by training a plurality of groups of training data, and each group of training data in the plurality of groups comprises: second cell gene sequence information, and result information identifying whether the second cell gene sequence satisfies a predetermined condition;

step S600: obtaining first output information of the first training model, wherein the first output information identifies result information of whether the second cell gene sequence satisfies a predetermined condition;

Specifically, the first training model is a machine learning model. Such a model learns continuously from a large amount of data, is corrected as it learns, and eventually gains enough experience to process other data. The model is obtained by training on a plurality of groups of training data, and training the neural network model on these data is essentially a supervised learning process. Each group of training data comprises: second cell gene sequence information, and identification information that labels whether the second cell gene sequence satisfies the predetermined condition. Given the second cell gene sequence information, the machine learning model outputs result information on whether the second cell gene sequence satisfies the predetermined condition, and this output is checked against the identification information. If the output is consistent with the identification information, supervised learning on this group of data is finished and the next group is learned. If the output is inconsistent with the identification information, the machine learning model adjusts itself until it reaches the expected accuracy, and supervised learning then proceeds to the next group of data.

The training data thus continuously correct and optimize the machine learning model, and the supervised learning process improves the accuracy with which the model processes data, so that more accurate result information on whether the second cell gene sequence satisfies the predetermined condition is obtained.

Step S700: when the first output information satisfies the predetermined condition, obtaining a first result of the safety of the gene editing technology.

Specifically, the predetermined condition concerns whether the second cell gene sequence exhibits an off-target effect. If genome detection shows that the second cell gene sequence is identical to the original gene sequence, the second cell gene sequence satisfies the predetermined condition. When the first output information indicates that the condition is satisfied, the evaluation information of the first result of the safety of the gene editing technology is obtained.
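As a minimal sketch of steps S400 and S700, the region extraction and the "identical to the original sequence" check can be written as below. The half-open coordinate convention, the function names, and the toy sequences are assumptions made for illustration; the sketch also assumes substitution-only edits so that both sequences share coordinates, and it replaces the trained model's output with a direct comparison.

```python
from typing import Tuple

Region = Tuple[int, int]  # half-open [start, end) coordinates (assumed convention)


def extract_region(sequence: str, region: Region) -> str:
    """Return the part of `sequence` that lies inside `region` (step S400)."""
    start, end = region
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("region lies outside the sequence")
    return sequence[start:end]


def satisfies_condition(original: str, edited: str, region: Region) -> bool:
    """True when the edited sequence is unchanged inside `region`,
    i.e. no off-target change is visible there (step S700)."""
    return extract_region(original, region) == extract_region(edited, region)


# Hypothetical toy data: a single base differs at index 17.
original_genome = "ACGTACGTTTGACCAGTACG"
edited_genome = "ACGTACGTTTGACCAGTTCG"

print(satisfies_condition(original_genome, edited_genome, (4, 12)))
print(satisfies_condition(original_genome, edited_genome, (12, 20)))
```

The first region does not contain the changed base, so it passes the check; the second region contains it and fails.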

Further, step S700 in the embodiment of the present application further includes:

step S701: obtaining a second gene region which is not prone to off-target effects according to the first exogenous gene;

step S702: obtaining third cell gene sequence information in the second gene region according to the first cell gene sequence information and the second gene region;

step S703: inputting the third cell gene sequence information into a second training model, wherein the second training model is obtained by training a plurality of groups of training data, and each group of training data in the plurality of groups comprises: third cell gene sequence information, and result information identifying whether the third cell gene sequence information satisfies the predetermined condition;

step S704: obtaining second output information of the second training model, wherein the second output information identifies result information whether the third cell gene sequence satisfies a predetermined condition;

step S705: and obtaining a second result of the safety of the gene editing technology when the second output information shows that the predetermined condition is met.

Specifically, the second gene region is a region of the first cell gene sequence information that is not prone to off-target effects. The gene sequence information of the second gene region, that is, the third cell gene sequence information, is obtained and input into the second training model. The second training model is obtained by training on a plurality of groups of training data, and training the neural network model on these data is essentially a supervised learning process. Each group of training data comprises: third cell gene sequence information, and identification information that labels whether the third cell gene sequence information satisfies the predetermined condition. Given the third cell gene sequence information, the machine learning model outputs result information on whether the third cell gene sequence information satisfies the predetermined condition, and this output is checked against the identification information. If the output is consistent with the identification information, supervised learning on this group of data is finished and the next group is learned. If the output is inconsistent with the identification information, the machine learning model adjusts itself until it reaches the expected accuracy, and supervised learning then proceeds to the next group of data. If genome detection shows that the third cell gene sequence is identical to the original gene sequence, the third cell gene sequence satisfies the predetermined condition, and the evaluation information of the second result of the safety of the gene editing technology is obtained.
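The supervised learning procedure described for the first and second training models can be illustrated with a small sketch. The source does not specify a model architecture, so a simple perceptron over one-hot encoded sequences is used here purely as a stand-in; the function names, the toy labeled data, and the stopping rule are all assumptions.

```python
from typing import List, Tuple

BASES = "ACGT"


def one_hot(seq: str) -> List[float]:
    """Encode a DNA string as a flat one-hot vector (4 values per base)."""
    return [1.0 if base == b else 0.0 for base in seq for b in BASES]


def predict(weights: List[float], bias: float, x: List[float]) -> int:
    """Return 1 (condition satisfied) or 0 (condition not satisfied)."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0


def train(data: List[Tuple[str, int]], target_accuracy: float = 1.0,
          max_epochs: int = 100, lr: float = 1.0) -> Tuple[List[float], float, float]:
    """Perceptron training: adjust the model only when its output
    disagrees with the label, and stop at the expected accuracy."""
    weights = [0.0] * len(one_hot(data[0][0]))
    bias = 0.0
    accuracy = 0.0
    for _ in range(max_epochs):
        for seq, label in data:
            x = one_hot(seq)
            error = label - predict(weights, bias, x)
            if error != 0:  # output inconsistent with the label: adjust
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
        correct = sum(predict(weights, bias, one_hot(s)) == y for s, y in data)
        accuracy = correct / len(data)
        if accuracy >= target_accuracy:
            break
    return weights, bias, accuracy


# Hypothetical labeled groups: 1 = identical to the original, 0 = changed.
training_data = [("ACGT", 1), ("ACTT", 0), ("AGGT", 0), ("TCGT", 0)]
weights, bias, accuracy = train(training_data)
print(accuracy, predict(weights, bias, one_hot("ACGT")))
```

A real system would use far longer sequences, many more training groups, and a held-out set for measuring accuracy; the loop structure (predict, compare with the label, adjust on disagreement, repeat until the expected accuracy) is the part that mirrors the description.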

Further, step S200 in the embodiment of the present application further includes:

step S201: inputting the first cell gene sequence information into a first predictive model;

step S202: obtaining first potential off-target site information;

step S203: inputting the first potential off-target site information into a first alignment model.

Specifically, one of the simplest and most effective methods for off-target detection is whole genome sequencing. The first prediction model predicts potential off-target sites in the first cell gene sequence information, and the first potential off-target site information is input into the first alignment model to be aligned with the original genome.
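The source does not say how the first prediction model finds potential off-target sites. One common heuristic, shown here only as an assumed stand-in, is to scan the genome for windows that differ from the guide sequence by a small number of mismatches. The function names and toy sequences are hypothetical, and the sketch ignores PAM requirements, insertions and deletions, and the reverse strand.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def potential_off_target_sites(genome: str, guide: str,
                               max_mismatches: int = 2) -> List[Tuple[int, int]]:
    """Return (position, mismatches) for every window that is similar to
    the guide but not identical; exact matches are treated as on-target."""
    k = len(guide)
    sites = []
    for i in range(len(genome) - k + 1):
        d = hamming(genome[i:i + k], guide)
        if 0 < d <= max_mismatches:
            sites.append((i, d))
    return sites


# Hypothetical toy data: the guide matches exactly at index 2 and with
# one mismatch at index 11.
genome = "TTACGTAGGCCACGAAGGTT"
guide = "ACGTAGG"
print(potential_off_target_sites(genome, guide))
```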

Further, step S201 in the embodiment of the present application further includes:

step S2011: inputting the second cellular gene sequence information to the first predictive model;

step S2012: obtaining second potential off-target site information;

step S2013: inputting the second potential off-target site information into the first alignment model;

step S2014: outputting a first comparison result by the first alignment model;

step S2015: and determining the first gene region according to the first comparison result.

Specifically, the potential off-target site information of the second cell gene sequence information is obtained, and the first alignment model aligns its genome information against the first potential off-target sites. If a potential off-target site is unchanged, no off-target event has occurred there; if a potential off-target site has changed, an off-target event may have occurred. The first comparison result is then used to determine the off-target-prone region in the first cell gene sequence information, namely the first gene region.
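A rough sketch of this comparison step follows: each candidate site is checked for a change between the original and edited sequences, and the off-target-prone region is taken as the interval covering the changed sites. The flank width, the function names, and the toy data are assumptions, and edits are assumed to be substitutions so that coordinates stay aligned.

```python
from typing import List, Optional, Tuple


def changed_sites(original: str, edited: str,
                  sites: List[int], site_len: int) -> List[int]:
    """Start positions of candidate sites whose sequence differs after editing."""
    return [s for s in sites
            if original[s:s + site_len] != edited[s:s + site_len]]


def off_target_region(changed: List[int], site_len: int, flank: int,
                      genome_len: int) -> Optional[Tuple[int, int]]:
    """Smallest half-open interval covering all changed sites plus a flank,
    or None when no candidate site has changed."""
    if not changed:
        return None
    start = max(0, min(changed) - flank)
    end = min(genome_len, max(changed) + site_len + flank)
    return (start, end)


# Hypothetical toy data: the base at index 14 changed from A to T.
original = "TTACGTAGGCCACGAAGGTT"
edited = "TTACGTAGGCCACGTAGGTT"
candidates = [11]  # start positions of potential off-target sites

changed = changed_sites(original, edited, candidates, site_len=7)
print(off_target_region(changed, site_len=7, flank=2, genome_len=len(original)))
```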

Further, step S100 in the embodiment of the present application further includes:

step S101: after obtaining the first exogenous gene for gene editing, the method further comprises:

step S102: obtaining first variety information of the first exogenous gene;

step S103: obtaining first reference genome information of the first exogenous gene according to the first variety information;

step S104: obtaining first target information according to the first reference genome information;

step S105: and introducing the first exogenous gene according to the first target information.

Specifically, even within the same species, there are considerable genetic differences between varieties and lines, which can greatly affect off-target behavior and gene editing accuracy. Therefore, when editing the genome of a given variety, using that variety's reference genome for target design can effectively improve the safety of gene editing and reduce off-target events. The variety of the first exogenous gene is determined, a reference genome is obtained according to the first variety information, and the first target information is obtained from it, so that the first exogenous gene is introduced according to the first target information.
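To illustrate why the variety's own reference genome matters, the sketch below looks up a reference genome by variety and checks whether a guide sequence has exactly one exact match in it. The variety names, the toy genomes, and the uniqueness criterion are all hypothetical; the source does not describe how the target information is derived.

```python
from typing import Dict, List

# Hypothetical toy reference genomes that differ by a single base.
REFERENCE_GENOMES: Dict[str, str] = {
    "variety_A": "TTACGTAGGCCACGAAGGTT",
    "variety_B": "TTACGTAGGCCACGTAGGTT",
}


def find_targets(variety: str, guide: str) -> List[int]:
    """Start positions of all exact matches of `guide` in the
    reference genome of the given variety."""
    reference = REFERENCE_GENOMES[variety]
    positions = []
    start = reference.find(guide)
    while start != -1:
        positions.append(start)
        start = reference.find(guide, start + 1)
    return positions


def is_unique_target(variety: str, guide: str) -> bool:
    """A guide is treated as usable here only if it matches exactly once."""
    return len(find_targets(variety, guide)) == 1


guide = "ACGTAGG"
for variety in REFERENCE_GENOMES:
    print(variety, find_targets(variety, guide), is_unique_target(variety, guide))
```

In this toy data the same guide is unique in one variety but matches twice in the other, which is the kind of between-variety difference the paragraph above warns about.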

Further, step S400 in the embodiment of the present application further includes:

step S401: generating a first verification code according to the first cell gene sequence information, wherein the first verification code corresponds to the first cell gene sequence information;

step S402: generating a second verification code according to the second cell gene sequence information and the first verification code; by analogy, generating an Nth verification code according to the Nth cell gene sequence information and the (N-1)th verification code, wherein N is a natural number greater than 1;

step S403: taking each piece of cell gene sequence information together with its corresponding verification code as a storage unit, and copying and storing the storage units on M devices, wherein M is a natural number greater than 1.

Specifically, in order to ensure the safety of the gene sequence information, a first verification code is generated according to the first cell gene sequence information, the first verification code and the first cell gene sequence information being in one-to-one correspondence; a second verification code is generated according to the second cell gene sequence information and the first verification code …, and so on. The first cell gene sequence information and the first verification code serve as a first storage unit, the second cell gene sequence information and the second verification code serve as a second storage unit …, and so on, giving N storage units in total. The verification code information serves as subject identification information, which distinguishes one subject from the others. When the cell gene sequence information needs to be retrieved, each subsequent node receives the data stored by the previous node, verifies it through a "consensus mechanism", and then stores it. The storage units are linked in series by hashing, so that the cell gene sequence information is not easily lost or damaged, and this blockchain-based data processing ensures the security of gene sequence information storage during gene editing.
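The chained verification codes of steps S401 to S403 can be sketched as a simple hash chain. SHA-256 is an assumption (the source only mentions "Hash technology"), as are the function names and the dictionary layout of a storage unit; the consensus mechanism between nodes is not modeled.

```python
import copy
import hashlib
from typing import Dict, List

StorageUnit = Dict[str, str]


def verification_code(sequence_info: str, previous_code: str = "") -> str:
    """Hash the sequence together with the previous code (empty for the first)."""
    return hashlib.sha256((previous_code + sequence_info).encode("utf-8")).hexdigest()


def build_storage_units(sequences: List[str]) -> List[StorageUnit]:
    """Create N storage units, each code depending on the one before it."""
    units: List[StorageUnit] = []
    previous = ""
    for seq in sequences:
        code = verification_code(seq, previous)
        units.append({"sequence": seq, "code": code})
        previous = code
    return units


def verify_chain(units: List[StorageUnit]) -> bool:
    """Recompute every code; any altered sequence breaks the chain."""
    previous = ""
    for unit in units:
        if verification_code(unit["sequence"], previous) != unit["code"]:
            return False
        previous = unit["code"]
    return True


def replicate(units: List[StorageUnit], m: int) -> List[List[StorageUnit]]:
    """Copy the storage units onto M devices (M must be greater than 1)."""
    if m < 2:
        raise ValueError("M must be greater than 1")
    return [copy.deepcopy(units) for _ in range(m)]


units = build_storage_units(["ACGTACGT", "ACGTAGGT", "TTGACCAG"])
devices = replicate(units, m=3)
print(verify_chain(devices[0]))

devices[1][0]["sequence"] = "ACGTACGA"  # tamper with one copy
print(verify_chain(devices[1]))
```

Because each code depends on the previous one, changing any stored sequence invalidates the chain from that unit onward, and the untouched copies on the other devices can be used to detect and repair the damage.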

Further, step S402 in the embodiment of the present application further includes:

step S4021: taking the Nth cell gene sequence information and the Nth verification code as an Nth storage unit;

step S4022: obtaining the recording time of the Nth storage unit, wherein the recording time of the Nth storage unit represents the time required to record the Nth storage unit;

step S4023: obtaining, according to the recording time of the Nth storage unit, a first device with the largest memory among the M devices;

step S4024: and sending the recording right of the Nth storage unit to the first device.

Specifically, the Nth cell gene sequence information and the Nth verification code are divided to generate a plurality of blocks, and the blocks are added to the blockchain by the device node after they have been verified. The recording time of the Nth storage unit is the time taken to verify the obtained Nth verification code information and Nth cell gene sequence information through the "consensus mechanism", store them once verification passes, and append them to the existing blocks. The shorter the recording time of the Nth storage unit, the faster the transport capacity of the device node. Selecting the device with the fastest transport capacity as the block recording device improves the real-time performance of off-chain data interaction in the blockchain, guarantees the safe, effective and stable operation of the decentralized blockchain system, and improves the efficiency of blockchain message processing, thereby achieving the technical effect of improving the security of cell gene sequence information storage.
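Following the explanation above, in which the device with the shortest recording time is treated as the fastest node, the selection of the recording device can be sketched as a minimum over measured times. The device names and timings are hypothetical, and how recording times are actually measured is not specified in the source.

```python
from typing import Dict


def select_recording_device(recording_times: Dict[str, float]) -> str:
    """Return the device with the shortest recording time for the Nth
    storage unit; that device receives the recording right."""
    if not recording_times:
        raise ValueError("at least one device is required")
    return min(recording_times, key=recording_times.get)


# Hypothetical recording times in seconds for M = 3 devices.
times = {"device_1": 0.42, "device_2": 0.17, "device_3": 0.30}
print(select_recording_device(times))
```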

In summary, the method for improving the safety of the gene editing technology provided by the embodiment of the present application has the following technical effects:

Example two

Based on the same inventive concept as the method for improving the safety of the gene editing technology in the previous embodiment, the present invention further provides a system for improving the safety of the gene editing technology, as shown in fig. 2, the system comprises:

a first obtaining unit 11, wherein the first obtaining unit 11 is used for obtaining a first exogenous gene for gene editing;

a second obtaining unit 12, wherein the second obtaining unit 12 is used for obtaining the gene sequence information of the first cell after the first exogenous gene is introduced;

a third obtaining unit 13, wherein the third obtaining unit 13 is used for obtaining a first gene region which is prone to off-target effects according to the first exogenous gene;

a fourth obtaining unit 14, wherein the fourth obtaining unit 14 is configured to obtain second cell gene sequence information in the first gene region according to the first cell gene sequence information and the first gene region;

a first input unit 15, where the first input unit 15 is configured to input the second cell gene sequence information into a first training model, where the first training model is obtained by training multiple sets of training data, and each set of training data in the multiple sets includes: second cell gene sequence information, and result information identifying whether the second cell gene sequence satisfies a predetermined condition;

a fifth obtaining unit 16, where the fifth obtaining unit 16 is configured to obtain first output information of the first training model, where the first output information identifies result information of whether the second cell gene sequence satisfies a predetermined condition;

a sixth obtaining unit 17, where the sixth obtaining unit 17 is configured to obtain a first result of the safety of the gene editing technology when the first output information satisfies the predetermined condition.