Method for storing character dot matrix by using DNA character code

Document No.: 1127734 · Published: 2020-10-02

Reading note: this technique, "Method for storing character dot matrix by using DNA character code", was designed by 任兆瑞 on 2020-06-24. Abstract: The invention discloses a method for storing character dot matrices by using DNA character codes. Each character or symbol dot matrix is first represented by a DNA character code of the form N+1: N bases followed by one check base computed from those N bases. The resulting character code is attached to the beginning and the end of the corresponding character or symbol dot matrix, and the result is added to a dictionary; text is then stored directly as the check-coded character codes. The method avoids the long repeated DNA sequences that dot-matrix DNA storage tends to produce, and because short DNA character codes replace the dot matrices when recording text, it greatly increases storage density.

1. A method for storing a character dot matrix by using DNA character codes, characterized by comprising the following steps:

step S1, for the dot matrix of each character or symbol in the character library, randomly generating a DNA sequence of N bases as its character code;

step S2, generating a check code from the binary encoding of the character code of step S1, and appending the check code to the end of the character code to form a character code with a check code;

step S3, adding the character code obtained in step S2 to the beginning and the end of the corresponding character or symbol dot matrix, and adding the result to a dictionary;

and step S4, storing the character dot matrix by using the DNA character code obtained in step S2.

2. The method of claim 1, wherein the check code of step S2 is one DNA base, and the first bit of the binary encoding of the check code is generated by counting the '1's in the odd-numbered bit positions of the binary encoding of the character code: the bit is 1 if the count is odd and 0 if the count is even;

the second bit of the binary encoding of the check code is generated by counting the '1's in the even-numbered bit positions of the binary encoding of the character code: the bit is 1 if the count is odd and 0 if the count is even.

3. The method of claim 1, wherein the character code generated in step S2 has a length of 8 to 10 DNA bases.

4. The method of claim 1, wherein in step S1 a plurality of character codes are generated for the dot matrix of a frequently used character or symbol.

Technical Field

The invention discloses a method for storing a character dot matrix by using DNA character codes, belonging to the field of biotechnology and information technology.

Background

DNA is an important carrier of genetic material: a linear or circular double-helix biological macromolecule produced by billions of years of evolution. In the 1950s, scientists confirmed DNA as the carrier of genetic information in organisms. Compared with common information storage media, it has natural advantages for data storage: the storage density is high, and 1 g of DNA can store the contents of all the books in the world; storage is stable and long-lived, lasting from tens of thousands to millions of years; and it is easy to carry, whether inside bacteria or other organisms or kept in a container for long periods.

Among conventional storage media, paper and similar media represent information mainly as graphics and characters, while carriers such as optical discs represent it as binary 0/1 signals. DNA represents information through the order of its four bases A/T/G/C: different sequences represent different information, which is equivalent to a quaternary (base-4) system. In general, the A/T/G/C bases in a DNA molecule should be evenly distributed and long repeated segments should be kept to a minimum. When DNA storage is based on character dot matrices, the DNA sequence of each character comprises hundreds of bases, and the dot matrices of different Chinese characters may share many identical segments, which produces many repeated DNA segments. In addition, the same Chinese character always produces the same DNA sequence, so characters or symbols that occur frequently produce still more repeats. Such pervasive repeated sequences greatly hinder the replication and synthesis of the DNA.

Disclosure of Invention

Technical problem to be solved by the invention

Existing character dot-matrix DNA storage produces many repeated sequences, which hinder replication and synthesis of the DNA. To solve this problem, the invention provides a method for storing a character dot matrix by using DNA character codes.

Technical scheme

In order to solve the technical problems, the invention adopts the following technical scheme:

A method for storing a character dot matrix by using DNA character codes comprises the following steps:

step 1, for the dot matrix of each character or symbol in a character library, randomly generating a DNA sequence of N bases as its character code;

step 2, generating a check code from the binary encoding of the character code and appending it to the end of the character code, giving a character code with a total length of 8 to 10 DNA bases;

step 3, adding the character code obtained in the previous step to the beginning and the end of the corresponding character or symbol dot matrix, and adding the result to the dictionary;

and step 4, storing the character dot matrix by using the DNA character code obtained in step 2.

Further, the check code in step 2 is one DNA base. The first bit of its binary encoding is generated by counting the '1's in the odd-numbered bit positions of the binary encoding of the character code: the bit is 1 if the count is odd and 0 if the count is even. The second bit is generated by counting the '1's in the even-numbered bit positions of the binary encoding of the character code: the bit is 1 if the count is odd and 0 if the count is even.

Further, to avoid repetition, a plurality of character codes are generated for the dot matrix of each frequently used character or symbol.
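Step 1 and the multiple-code provision can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `generate_code_bodies`, the `copies` parameter and the fixed seed are hypothetical, and the check base of step 2 is deliberately not computed here.

```python
import random

BASES = "ACGT"

def generate_code_bodies(symbols, n=9, copies=None, seed=0):
    """Assign unique random N-base sequences to each symbol (step 1 only).

    `copies` (hypothetical parameter) maps a frequently used symbol to the
    number of alternative codes it should receive; other symbols get one.
    """
    copies = copies or {}
    needed = sum(copies.get(s, 1) for s in symbols)
    if needed > 4 ** n:
        raise ValueError("not enough distinct N-base sequences")
    rng = random.Random(seed)   # fixed seed only to make the sketch repeatable
    used = set()
    table = {}
    for s in symbols:
        codes = []
        while len(codes) < copies.get(s, 1):
            candidate = "".join(rng.choice(BASES) for _ in range(n))
            if candidate not in used:   # every code must be unique
                used.add(candidate)
                codes.append(candidate)
        table[s] = codes
    return table

# Three alternative 9-base bodies for a frequent character, one for the others.
table = generate_code_bodies(["的", "中", ","], n=9, copies={"的": 3})
```

Drawing candidates at random and rejecting duplicates is only practical while the library uses a small part of the 4^N code space, which matches the sizes the description discusses.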

Advantageous effects

Compared with the prior art, the technical scheme provided by the invention has the following beneficial effects:

the storage method avoids generating long repeated DNA sequences and so overcomes this shortcoming of character dot-matrix DNA storage;

the invention records character information with short DNA character codes, which greatly increases storage density;

the principle and steps of the storage method are simple, making it easy to adopt and apply in this technical field.

Drawings

FIG. 1 is a block flow diagram of the method of the present invention;

FIG. 2 is a diagram illustrating an exemplary check code generation according to the present invention.

Detailed Description

For a further understanding of the invention, reference should be made to the following detailed description taken in conjunction with the accompanying drawings.

FIG. 1 shows the steps of the method for storing a character dot matrix by using DNA character codes. Each character or symbol in the character library is represented by a DNA sequence of 8 to 10 bases, and each character code has the form N+1. For a 10-base character code, for example, the first 9 bases are randomly generated and the last base is a check code computed from the first 9, used to detect whether those 9 bases contain an error.

The check code makes it possible to judge whether a character code has been changed by mutation while the DNA is in storage. In general, the probability of a single-base error during DNA replication is less than 10^-6. If an erroneous base occurs in a 10-base character code, the check code and the preceding 9 bases no longer match. The check code is computed from the 18-bit binary encoding of the 9 bases (two bits per ACGT base). Its first bit is 1 if the number of 1s in the odd-numbered bit positions is odd and 0 if that number is even; its second bit is 1 if the number of 1s in the even-numbered bit positions is odd and 0 if that number is even. The resulting two bits (00, 01, 10 or 11) are then converted to a base (A, C, G or T), which is the check code. FIG. 2 shows an example of generating the check code for an initial character code: the 9-base character code TCTATGTTC yields the check code C, so the final character code for the Chinese character is TCTATGTTCC.
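The check-code rule can be sketched in a few lines. The text does not spell out which base corresponds to which bit pair; the mapping A=00, C=01, G=10, T=11 used below is an assumption, chosen because it follows the A, C, G, T order in the text and reproduces the FIG. 2 example. The function names are hypothetical.

```python
# Assumed 2-bit encoding of the bases (not stated explicitly in the source).
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}

def check_base(body: str) -> str:
    """Return the check base for the N-base body of a character code."""
    bits = "".join(BASE_TO_BITS[b] for b in body)
    odd_positions = bits[0::2]    # 1st, 3rd, 5th ... bit (1-based numbering)
    even_positions = bits[1::2]   # 2nd, 4th, 6th ... bit
    first = odd_positions.count("1") % 2    # 1 if the count of 1s is odd
    second = even_positions.count("1") % 2  # 1 if the count of 1s is odd
    return BITS_TO_BASE[f"{first}{second}"]

def add_check(body: str) -> str:
    """Append the check base, giving the N+1 character code."""
    return body + check_base(body)

def is_valid(code: str) -> bool:
    """True if the last base matches the check base of the preceding bases."""
    return check_base(code[:-1]) == code[-1]

print(add_check("TCTATGTTC"))   # TCTATGTTCC, as in the FIG. 2 example
print(is_valid("TCTATGTTCA"))   # False: the last base no longer matches
```

Under this encoding, substituting any single base changes at least one of its two bits, and each changed bit flips exactly one of the two parities, so every single-base substitution in the 10-base code is detected. Two substitutions can cancel each other out and pass the check.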

In the dictionary, the same character code is placed before and after the DNA sequence of each character or symbol dot matrix to mark its beginning and end, and character information is stored directly as character codes. In an N+1 character code, the length N determines the capacity of the character library: with N = 4 bases (ACGT), the codes can represent at most 4^4 = 256 different characters, while N = 9 gives 4^9 = 262,144, i.e. over 260,000 different characters.
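The capacity figures and the dictionary layout can be illustrated with a short sketch. The helper names and the placeholder dot-matrix sequence are hypothetical; how a dot matrix is itself converted to bases is outside this passage and is not modelled here.

```python
def capacity(n: int) -> int:
    """Number of distinct N-base code bodies over the alphabet A, C, G, T."""
    return 4 ** n

def dictionary_entry(code: str, lattice_dna: str) -> str:
    """Flank the dot-matrix sequence with the same code at both ends."""
    return code + lattice_dna + code

def parse_entry(entry: str, code_len: int):
    """Split an entry back into (code, dot-matrix sequence)."""
    head, tail = entry[:code_len], entry[-code_len:]
    if head != tail:
        raise ValueError("leading and trailing character codes differ")
    return head, entry[code_len:-code_len]

print(capacity(4))   # 256
print(capacity(9))   # 262144

# Placeholder sequence standing in for a real dot-matrix encoding.
lattice = "AAGGCCTT" * 4
entry = dictionary_entry("TCTATGTTCC", lattice)
code, recovered = parse_entry(entry, code_len=10)
```

Comparing the leading and trailing codes when an entry is read back gives a simple consistency check on the dictionary itself.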

When the dot-matrix library is used directly for storage, the dot matrix of each Chinese character comprises hundreds of DNA bases, and within each message identical Chinese characters produce identical sequences, which greatly hinders replication and synthesis of the DNA. With character codes, the length per character drops to no more than 10 bases, so long repeated sequences are essentially avoided. Frequently occurring characters or symbols, such as the most commonly used characters and punctuation marks like commas and periods, can each be assigned several different character codes, which further reduces repeated DNA sequences in storage. In addition, compared with storing the dot matrices alone, recording character information with character codes can increase storage density by tens of times.
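A rough sketch of both points follows. The code table is hypothetical, and the density estimate assumes a 24x24 dot matrix stored at two bits per base; the text only says that a dot matrix takes hundreds of bases, so these figures are illustrative rather than the patent's own.

```python
from itertools import cycle

def make_encoder(code_table):
    """Build an encoder that rotates through each symbol's alternative codes."""
    rotations = {ch: cycle(codes) for ch, codes in code_table.items()}

    def encode(text: str) -> str:
        # Each occurrence of a symbol takes the next code in its rotation.
        return "".join(next(rotations[ch]) for ch in text)

    return encode

def density_gain(lattice_bits: int, code_bases: int) -> float:
    """Bases needed for one dot matrix divided by bases in one code."""
    lattice_bases = lattice_bits / 2   # assumption: 2 bits per base
    return lattice_bases / code_bases

# Hypothetical table: a frequent character gets two codes, the comma one.
encode = make_encoder({
    "的": ["TCTATGTTCC", "ACGTACGTAA"],
    ",": ["GATTACAGAC"],
})
dna = encode("的,的的")   # successive "的" use different codes

print(density_gain(24 * 24, 10))   # 28.8 under the stated assumptions
```

Because every alternative code maps back to the same symbol, the reader does not need to know which alternative was chosen; rotation is just one simple selection policy.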

According to statistics, most books use a fairly small number of distinct characters: the Chinese version of the Bible uses more than 3,100 different Chinese characters, the Xinhua Dictionary records about 10,000 Chinese characters, and the English version of the Bible uses about 11,000 different English words. With the method of the invention, taking a 10-base character code (9 + 1) as an example, the total code capacity is 4^9 = 262,144, about 260,000. Such a large capacity leaves ample redundancy, so several dot matrices and character codes can be assigned to frequently used characters or symbols to reduce repeated DNA sequences. Moreover, 20,000 character codes are enough for most information to be stored, which is less than 10% of the total capacity. If a random mutation occurs in the DNA sequence of a stored character code, then besides the mismatch between the character code and its check code, there is a more than 90% chance that the erroneous code has no corresponding dot matrix in the dictionary. The error can therefore be recognized and corrected when the information is later read.
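The two layers of error detection described here can be sketched as follows. The bit mapping A=00, C=01, G=10, T=11, the function names and the one-entry dictionary are assumptions made for illustration; the 20,000-code figure is the one given in the text.

```python
def check_base(body: str) -> str:
    # Compact parity form under the assumed mapping A=00, C=01, G=10, T=11:
    # G and T carry a 1 in the first bit, C and T carry a 1 in the second.
    first = sum(b in "GT" for b in body) % 2
    second = sum(b in "CT" for b in body) % 2
    return "ACGT"[2 * first + second]

def classify(code: str, dictionary) -> str:
    """Apply the check-base test first, then the dictionary lookup."""
    if check_base(code[:-1]) != code[-1]:
        return "check mismatch"
    if code not in dictionary:
        return "unknown code"
    return "ok"

def unused_fraction(assigned: int, n: int = 9) -> float:
    """Share of the 4**n code bodies that has no dictionary entry."""
    return 1 - assigned / 4 ** n

dictionary = {"TCTATGTTCC": "<dot matrix placeholder>"}
print(classify("TCTATGTTCC", dictionary))   # ok
print(classify("TCTATGTTCA", dictionary))   # check mismatch
print(classify("ACAATGTTCC", dictionary))   # unknown code
print(unused_fraction(20_000) > 0.9)        # True
```

The third call shows why the dictionary lookup adds value: `ACAATGTTCC` differs from the stored code in two bases whose effects on the parities cancel, so it passes the check-base test and is caught only because it has no dictionary entry.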

The invention and its embodiments have been described above schematically and without limitation; what is shown in the drawings is only one embodiment of the invention, and the actual structure is not limited to it. Therefore, structures and embodiments similar to this technical solution that a person skilled in the art, taught by it, designs without inventive effort and without departing from the spirit of the invention shall fall within the scope of protection of the invention.
