Encoding method, decoding method, apparatus and computer readable storage medium

Document No.: 661714 Publication date: 2021-04-27

Reading note: this technology, "Encoding method, decoding method, apparatus and computer readable storage medium", was designed by 戴俊彪, 李敏, 王洋 and 黄小罗 on 2020-12-11. Abstract: The application discloses an encoding method, a decoding method, an apparatus and a computer-readable storage medium, and belongs to the field of storage technologies. The method includes: determining the plurality of characters that appear in the character sequence corresponding to binary data, and the frequency of each character; constructing a first coding table according to the frequencies of the characters and the ternary codes of the leaf nodes in a preset complete ternary tree, where the preset complete ternary tree is a complete ternary tree containing 2^n leaf nodes and the first coding table indicates the ternary code corresponding to each of the characters; encoding the character sequence according to the first coding table to obtain target ternary data; and encoding the target ternary data according to a preset correspondence to determine a target base sequence. Because the coding table is built from a complete ternary tree with a preset, fixed rule, there is no need to construct a Huffman tree with the minimum weighted path length from the character frequencies, so the coding rule is simpler and coding efficiency is improved.

1. A method of encoding, the method comprising:

determining a character sequence corresponding to binary data to be coded;

determining a plurality of characters appearing in the character sequence and a frequency of each character, the frequency of each character being used to indicate the number of times each character appears in the character sequence;

constructing a first coding table according to the frequencies of the plurality of characters and the ternary codes of the leaf nodes in a preset complete ternary tree; wherein the preset complete ternary tree is a complete ternary tree containing 2^n leaf nodes, the paths between each parent node and its child nodes in the preset complete ternary tree are encoded as 0, 1 and 2 in order, the first coding table is used for indicating the ternary code corresponding to each character in the plurality of characters, and n is the number of bits of an encoding unit of the binary data;

coding the character sequence according to the first coding table to obtain target ternary data;

and coding the target ternary data according to a preset corresponding relation, and determining a target base sequence according to a coding result, wherein the preset corresponding relation is used for indicating the corresponding relation between the ternary data and the base pairs.

2. The method of claim 1, wherein constructing the first coding table according to the frequencies of the plurality of characters and the ternary codes of the leaf nodes in the predetermined full ternary tree comprises:

sorting the plurality of characters in descending order of frequency to obtain the sorted characters;

selecting, from the ternary codes of the leaf nodes in the preset complete ternary tree and in order of code length from short to long, the first several ternary codes, the number of selected ternary codes being equal to the number of characters;

sorting the selected ternary codes in order of code length from short to long to obtain the sorted ternary codes;

establishing a one-to-one correspondence between the sorted characters and the sorted ternary codes to obtain the first coding table, wherein the higher the frequency of a character in the first coding table, the shorter the ternary code corresponding to that character.

3. The method of claim 1, wherein constructing the first coding table according to the frequencies of the plurality of characters and the ternary codes of the leaf nodes in the predetermined full ternary tree comprises:

numbering the leaf nodes in the preset complete ternary tree sequentially, in order of the code length of their corresponding ternary codes, to obtain the sequence number of each leaf node;

establishing a corresponding relation between the sequence number of each leaf node and the ternary code to obtain a second code table;

and constructing the first coding table according to the frequencies of the characters and the second coding table.

4. The method of claim 1, wherein before constructing the first coding table according to the frequencies of the plurality of characters and the ternary codes of the leaf nodes in the predetermined full ternary tree, the method further comprises:

constructing a complete ternary tree containing 2^n leaf nodes;

and encoding the paths between each parent node and its child nodes in the complete ternary tree as 0, 1 and 2 in order to obtain the preset complete ternary tree, wherein the ternary code of each leaf node in the preset complete ternary tree is the path code from the root node to that leaf node.

5. The method of claim 1, wherein the predetermined correspondence includes a plurality of two-bit ternary data and a base pair corresponding to each two-bit ternary data.

6. The method of claim 5, wherein the preset correspondence comprises at least one of the following correspondences:

00 corresponds to AT;

11 corresponds to AC;

22 corresponds to AG;

01 corresponds to GC;

02 corresponds to GT;

10 corresponds to CT;

20 corresponds to GA;

12 corresponds to GG;

21 corresponds to TG.

7. The method of claim 1, wherein the encoding the target ternary data according to a preset correspondence and determining a target base sequence according to the encoding result comprises:

coding the target ternary data according to the preset corresponding relation to obtain a first base sequence;

replacing an initiation codon appearing in the first base sequence to obtain a second base sequence, wherein the second base sequence does not contain the initiation codon;

and determining the target base sequence according to the second base sequence.

8. The method of claim 7, wherein the initiation codon is a specified sequence of three bases;

the replacing of the initiation codon present in the first base sequence comprises:

performing, according to a preset replacement rule, base replacement on the last two bases of the two base pairs in which the initiation codon appears in the first base sequence, until the first base sequence after base replacement does not contain the initiation codon.

9. The method of claim 8, wherein the preset replacement rule comprises at least one of the following replacement rules:

GT is replaced with AA;

GA is replaced with TT;

GG is replaced with CC;

GC is replaced with TA;

GT is replaced with CA.

10. The method of claim 7, wherein said determining the target base sequence from the second base sequence comprises:

performing base substitution on the second base sequence according to a preset substitution target and a preset substitution rule, wherein the preset substitution target comprises at least one of GC content requirements and GC distribution requirements;

and determining a second base sequence which meets the preset replacement target and is subjected to base replacement as the target base sequence.

11. The method according to any one of claims 1 to 10, wherein before encoding the target ternary data according to the preset correspondence, the method further comprises:

converting the number of characters into ternary data to obtain first data, and converting the plurality of characters, arranged in descending order of frequency, into ternary data to obtain second data, wherein the number of characters refers to the number of distinct characters in the plurality of characters;

splicing the first data, the second data and the target ternary data to obtain spliced ternary data;

the encoding the target ternary data according to the preset corresponding relation comprises the following steps:

and coding the spliced ternary data according to the preset corresponding relation.

12. The method according to any one of claims 1 to 10, wherein after the encoding the target ternary data according to the preset correspondence and determining the target base sequence according to the encoding result, the method further comprises:

carrying out segmentation treatment on the target base sequence to obtain a plurality of first segmented base sequences;

adding segmentation information to each first segmentation base sequence in the plurality of first segmentation base sequences to obtain a plurality of second segmentation base sequences, wherein the segmentation information comprises at least one of a primer, address information and an error correction code;

generating a plurality of deoxyribonucleic acid (DNA) molecules based on the plurality of second segmented base sequences.

13. A method of decoding, the method comprising:

acquiring a target base sequence to be decoded;

decoding the target base sequence according to a preset corresponding relation, and determining target ternary data according to a decoding result, wherein the preset corresponding relation is used for indicating the corresponding relation between the ternary data and base pairs;

decoding the target ternary data according to the ternary codes of the leaf nodes in a preset complete ternary tree to obtain a character sequence, wherein the preset complete ternary tree is a complete ternary tree containing 2^n leaf nodes, the paths between each parent node and its child nodes in the preset complete ternary tree are encoded as 0, 1 and 2 in order, and n is the number of bits of an encoding unit of the binary data;

and determining binary data corresponding to the character sequence.

14. The method of claim 13, wherein the obtaining of the target base sequence to be decoded comprises:

respectively obtaining a second segmented base sequence based on each DNA of the plurality of DNAs to obtain a plurality of second segmented base sequences, wherein each second segmented base sequence comprises a first segmented base sequence and segmentation information, and the segmentation information comprises at least one of a primer, address information and an error correction code;

and acquiring a first segmented base sequence in the plurality of second segmented base sequences according to the segmentation information in the plurality of second segmented base sequences, and performing splicing treatment on the first segmented base sequence in the plurality of second segmented base sequences to obtain the target base sequence.

15. The method of claim 13, wherein the decoding the target base sequence according to the predetermined correspondence and determining target ternary data based on the decoding result comprises:

decoding the target base sequence according to the preset corresponding relation to obtain spliced ternary data, wherein the spliced ternary data comprises first data, second data and the target ternary data, the first data is obtained by carrying out ternary conversion on the number of characters, and the second data is obtained by carrying out ternary conversion on a plurality of characters according to the sequence of the frequencies of the characters from large to small;

and acquiring the target ternary data from the spliced ternary data.

16. The method of claim 15, wherein before decoding the target ternary data according to the ternary codes of the leaf nodes in the preset full ternary tree, the method further comprises:

acquiring the first data and the second data from the spliced ternary data;

determining, according to the first data and the second data respectively, the number of characters and the plurality of characters sorted in descending order of frequency;

the decoding the target ternary data according to the ternary codes of all leaf nodes in the preset complete ternary tree to obtain the character sequence comprises the following steps:

constructing a first coding table according to a plurality of characters which are ordered according to the frequency of the characters from large to small and the ternary codes of all leaf nodes in the preset complete ternary tree, wherein the first coding table is used for indicating the ternary codes corresponding to each character in the plurality of characters;

and decoding the target ternary data according to the first coding table to obtain the character sequence.

17. The method according to any one of claims 13 to 16, wherein before decoding the target base sequence according to the predetermined correspondence, the method further comprises:

carrying out base substitution on the target base sequence according to a reverse substitution rule of a preset substitution rule to obtain a first base sequence;

the decoding the target base sequence according to the preset corresponding relation comprises:

and decoding the first base sequence according to the preset corresponding relation.

18. The method of claim 17, wherein the preset replacement rule comprises at least one of the following replacement rules:

GT is replaced with AA;

GA is replaced with TT;

GG is replaced with CC;

GC is replaced with TA;

GT is replaced with CA.

19. An encoding apparatus, characterized in that the apparatus comprises:

the first determining module is used for determining a character sequence corresponding to binary data to be coded;

a second determining module, configured to determine a plurality of characters appearing in the character sequence and a frequency of each character, where the frequency of each character is used to indicate a number of times that each character appears in the character sequence;

the first construction module is used for constructing a first coding table according to the frequencies of the plurality of characters and the ternary codes of the leaf nodes in a preset complete ternary tree; wherein the preset complete ternary tree is a complete ternary tree containing 2^n leaf nodes, the paths between each parent node and its child nodes in the preset complete ternary tree are encoded as 0, 1 and 2 in order, the first coding table is used for indicating the ternary code corresponding to each character in the plurality of characters, and n is the number of bits of an encoding unit of the binary data;

the first coding module is used for coding the character sequence according to the first coding table to obtain target ternary data;

and the second coding module is used for coding the target ternary data according to a preset corresponding relation and determining a target base sequence based on a coding result, wherein the preset corresponding relation is used for indicating the corresponding relation between the ternary data and the base pairs.

20. An apparatus for decoding, the apparatus comprising:

the acquisition module is used for acquiring a target base sequence to be decoded;

a first decoding module, configured to decode the target base sequence according to a preset correspondence, and determine target ternary data based on a decoding result, where the preset correspondence is used to indicate a correspondence between ternary data and base pairs;

a second decoding module, configured to decode the target ternary data according to the ternary codes of the leaf nodes in a preset complete ternary tree to obtain a character sequence, where the preset complete ternary tree is a complete ternary tree containing 2^n leaf nodes, the paths between each parent node and its child nodes in the preset complete ternary tree are encoded as 0, 1 and 2 in order, and n is the number of bits of an encoding unit of the binary data;

and the first determining module is used for determining binary data corresponding to the character sequence.

21. An encoding apparatus, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any one of claims 1-12 when executing the computer program.

22. A decoding apparatus, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any one of claims 13-18 when executing the computer program.

23. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-12 or claims 13-18.

Technical Field

The present application relates to the field of storage technologies, and in particular, to an encoding method, a decoding method, and a computer-readable storage medium.

Background

DNA (deoxyribonucleic acid) storage technology is a technology that stores data using artificially synthesized DNA as the storage medium. To store binary data in DNA, the binary data is first encoded into a base sequence, and DNA is then artificially synthesized from the encoded base sequence. However, when binary data is encoded directly into a base sequence, long repetitive base runs are likely to appear in the encoded sequence, which is unfavorable for artificial DNA synthesis.

In the related art, to avoid base sequences that are unfavorable for artificial DNA synthesis, the binary data may first be encoded into ternary data, and the ternary data may then be encoded into a base sequence. When encoding binary data into ternary data, a general Huffman coding method may be used to reduce the length of the ternary data. In general Huffman coding, the characters appearing in the character sequence corresponding to the binary data are used as nodes of a Huffman tree, and the frequency of each character is used as the weight of the corresponding node, to construct the Huffman tree with the minimum weighted path length. The weighted path length of a node is the product of the path length from the root node to that node and the weight of the node. After the Huffman tree with the minimum weighted path length is obtained, a coding table is constructed according to the ternary code of each node in the Huffman tree; the coding table indicates the ternary code corresponding to each of the characters. The character sequence corresponding to the binary data is then encoded according to the coding table to obtain ternary data, and the ternary data is encoded into a base sequence, giving the target base sequence corresponding to the binary data.

However, in this approach, constructing the Huffman tree with the minimum weighted path length from the character frequencies is a complicated process, so the coding rule is complex and coding efficiency is low.

Disclosure of Invention

The embodiment of the application provides an encoding method, a decoding method, a device and a computer readable storage medium, which can be used for solving the problems of complex encoding rule and low encoding efficiency in the related technology. The technical scheme is as follows:

in a first aspect, an encoding method is provided, and the method includes:

determining a character sequence corresponding to binary data to be coded;

determining a plurality of characters appearing in the character sequence and a frequency of each character, the frequency of each character being used to indicate the number of times each character appears in the character sequence;

constructing a first coding table according to the frequencies of the plurality of characters and the ternary codes of the leaf nodes in a preset complete ternary tree; wherein the preset complete ternary tree is a complete ternary tree containing 2^n leaf nodes, the paths between each parent node and its child nodes in the preset complete ternary tree are encoded as 0, 1 and 2 in order, the first coding table is used for indicating the ternary code corresponding to each character in the plurality of characters, and n is the number of bits of an encoding unit of the binary data;

coding the character sequence according to the first coding table to obtain target ternary data;

and coding the target ternary data according to a preset corresponding relation, and determining a target base sequence according to a coding result, wherein the preset corresponding relation is used for indicating the corresponding relation between the ternary data and the base pairs.

Optionally, the constructing a first coding table according to the frequencies of the plurality of characters and the ternary codes of the leaf nodes in the preset full ternary tree includes:

sorting the plurality of characters in descending order of frequency to obtain the sorted characters;

selecting, from the ternary codes of the leaf nodes in the preset complete ternary tree and in order of code length from short to long, the first several ternary codes, the number of selected ternary codes being equal to the number of characters;

sorting the selected ternary codes in order of code length from short to long to obtain the sorted ternary codes;

establishing a one-to-one correspondence between the sorted characters and the sorted ternary codes to obtain the first coding table, wherein the higher the frequency of a character in the first coding table, the shorter the ternary code corresponding to that character.
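The table construction described in these steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the tie-breaking rule for characters with equal frequency (smaller character value first), and the example leaf codes for a 4-leaf tree are not given in the application.

```python
from collections import Counter


def build_first_coding_table(char_sequence, leaf_codes):
    """Map each distinct character to a ternary code, shortest codes first.

    leaf_codes: ternary codes of the leaves of the preset tree, as strings
    of '0', '1' and '2'. Their input order does not matter.
    """
    freqs = Counter(char_sequence)
    if len(freqs) > len(leaf_codes):
        raise ValueError("more distinct characters than leaf codes")
    # Characters in descending order of frequency. Ties are broken by
    # character value (an assumption; the text gives no tie-break rule).
    chars = sorted(freqs, key=lambda c: (-freqs[c], c))
    # Codes from short to long, keeping as many as there are characters.
    codes = sorted(leaf_codes, key=lambda code: (len(code), code))[:len(chars)]
    return dict(zip(chars, codes))


def encode_characters(char_sequence, table):
    """Concatenate the ternary code of every character in the sequence."""
    return "".join(table[c] for c in char_sequence)


# Hypothetical leaf codes of a 4-leaf tree (n = 2).
example_leaf_codes = ["00", "01", "1", "2"]
table = build_first_coding_table("AABAC", example_leaf_codes)
trits = encode_characters("AABAC", table)
```

In this example the most frequent character `A` receives the one-digit code `1`, and the encoded ternary data is `112100`.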

Optionally, the constructing a first coding table according to the frequencies of the plurality of characters and the ternary codes of the leaf nodes in the preset full ternary tree includes:

sequentially numbering each leaf node in the preset complete ternary tree according to the coding length of the corresponding ternary code to obtain the serial number of each leaf node;

establishing a corresponding relation between the sequence number of each leaf node and the ternary code to obtain a second code table;

and constructing the first coding table according to the frequencies of the characters and the second coding table.

Optionally, the determining a character sequence corresponding to binary data to be encoded includes:

and converting the binary data into ASCII characters to obtain the character sequence.
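One way to read this step is sketched below, assuming an encoding unit of n = 8 bits so that each byte is treated as one character out of 2^8 = 256 possibilities. The function name `to_character_sequence` and the zero-padding of a final short unit are assumptions of this sketch, not details from the application.

```python
from collections import Counter


def to_character_sequence(data: bytes, n: int = 8):
    """Split binary data into n-bit encoding units, returned as integers.

    With n = 8 each unit is one byte. Zero-padding the final unit when the
    bit count is not a multiple of n is an assumption of this sketch.
    """
    bits = "".join(f"{byte:08b}" for byte in data)
    if len(bits) % n:
        bits += "0" * (n - len(bits) % n)
    return [int(bits[i:i + n], 2) for i in range(0, len(bits), n)]


chars = to_character_sequence(b"AABAC")
# Number of times each character appears in the sequence.
frequencies = Counter(chars)
```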

Optionally, the preset correspondence includes a plurality of two-bit ternary data and a base pair corresponding to each two-bit ternary data.

Optionally, the preset correspondence includes at least one of the following correspondences:

00 corresponds to AT;

11 corresponds to AC;

22 corresponds to AG;

01 corresponds to GC;

02 corresponds to GT;

10 corresponds to CT;

20 corresponds to GA;

12 corresponds to GG;

21 corresponds to TG.
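The correspondence listed above amounts to a lookup table over pairs of ternary digits. A minimal Python sketch follows; the function names are hypothetical, and rejecting an odd number of digits is an assumption, since the text does not say how such input is padded.

```python
# Two ternary digits -> base pair, copied from the list above.
PAIR_TO_BASES = {
    "00": "AT", "11": "AC", "22": "AG",
    "01": "GC", "02": "GT", "10": "CT",
    "20": "GA", "12": "GG", "21": "TG",
}
# Inverse table, used when decoding a base sequence.
BASES_TO_PAIR = {bases: pair for pair, bases in PAIR_TO_BASES.items()}


def ternary_to_bases(trits: str) -> str:
    """Encode ternary data two digits at a time."""
    if len(trits) % 2:
        raise ValueError("odd number of ternary digits")
    return "".join(PAIR_TO_BASES[trits[i:i + 2]] for i in range(0, len(trits), 2))


def bases_to_ternary(bases: str) -> str:
    """Decode a base sequence two bases at a time."""
    if len(bases) % 2:
        raise ValueError("odd number of bases")
    return "".join(BASES_TO_PAIR[bases[i:i + 2]] for i in range(0, len(bases), 2))


first_base_sequence = ternary_to_bases("112100")
```

Because the nine base pairs are all distinct, the table is invertible and `bases_to_ternary` recovers the original digits.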

Optionally, the encoding the target ternary data according to a preset correspondence, and determining the target base sequence according to the encoding result includes:

coding the target ternary data according to the preset corresponding relation to obtain a first base sequence;

replacing an initiation codon appearing in the first base sequence to obtain a second base sequence, wherein the second base sequence does not contain the initiation codon;

and determining the target base sequence according to the second base sequence.

Optionally, the initiation codon is a specified sequence of three bases;

the replacing of the initiation codon present in the first base sequence comprises:

performing, according to a preset replacement rule, base replacement on the last two bases of the two base pairs in which the initiation codon appears in the first base sequence, until the first base sequence after base replacement does not contain the initiation codon.

Optionally, the preset replacement rule includes at least one of the following replacement rules:

GT is replaced with AA;

GA is replaced with TT;

GG is replaced with CC;

GC is replaced with TA;

GT is replaced with CA.

Optionally, said determining said target base sequence from said second base sequence comprises:

performing base substitution on the second base sequence according to a preset substitution target and a preset substitution rule, wherein the preset substitution target comprises at least one of GC content requirements and GC distribution requirements;

and determining a second base sequence after base substitution meeting the preset substitution target as the target base sequence.
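Checking a candidate sequence against a GC target might look like the sketch below. The passage gives no numeric thresholds, so the bounds `low` and `high` and the sliding `window` used to approximate a GC distribution requirement are hypothetical parameters, as are the function names.

```python
def gc_content(bases: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)


def meets_gc_target(bases: str, low: float = 0.4, high: float = 0.6, window=None) -> bool:
    """Check overall GC content and, optionally, every sliding window.

    low, high and window are hypothetical stand-ins for the GC content and
    GC distribution requirements, which are not quantified in the text.
    """
    if not low <= gc_content(bases) <= high:
        return False
    if window is not None:
        for i in range(0, max(len(bases) - window, 0) + 1):
            if not low <= gc_content(bases[i:i + window]) <= high:
                return False
    return True
```

A sequence that fails the check would be passed through further base replacement and checked again.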

Optionally, before encoding the target ternary data according to the preset corresponding relationship, the method further includes:

converting the character count and the character frequency order into ternary data to obtain first data and second data respectively, wherein the character count is the number of distinct characters in the plurality of characters, and the character frequency order is the plurality of characters arranged in descending order of frequency;

splicing the first data, the second data and the target ternary data to obtain spliced ternary data;

the encoding the target ternary data according to the preset corresponding relation comprises the following steps:

and coding the spliced ternary data according to the preset corresponding relation.
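The splicing step can be illustrated as follows. The passage does not say how wide each ternary field is, so this sketch assumes a fixed width of 6 ternary digits per field (3^6 = 729, enough for a count or a character value up to 256); `FIELD_WIDTH` and all function names are hypothetical.

```python
FIELD_WIDTH = 6  # hypothetical: 3**6 = 729 covers values up to 256


def to_ternary(value: int, width: int) -> str:
    """Fixed-width base-3 representation of a non-negative integer."""
    if not 0 <= value < 3 ** width:
        raise ValueError("value does not fit in the given width")
    digits = []
    for _ in range(width):
        value, digit = divmod(value, 3)
        digits.append(str(digit))
    return "".join(reversed(digits))


def splice(chars_by_frequency, target_trits: str) -> str:
    """First data (count) + second data (characters) + target ternary data."""
    first = to_ternary(len(chars_by_frequency), FIELD_WIDTH)
    second = "".join(to_ternary(c, FIELD_WIDTH) for c in chars_by_frequency)
    return first + second + target_trits


def unsplice(spliced: str):
    """Recover the frequency-ordered characters and the target ternary data."""
    count = int(spliced[:FIELD_WIDTH], 3)
    end = FIELD_WIDTH * (1 + count)
    chars = [int(spliced[i:i + FIELD_WIDTH], 3)
             for i in range(FIELD_WIDTH, end, FIELD_WIDTH)]
    return chars, spliced[end:]


spliced = splice([65, 66, 67], "112100")
```

Carrying the character order in the data lets a decoder rebuild the same coding table from the fixed tree without needing the frequencies themselves.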

Optionally, before constructing the first coding table according to the frequencies of the plurality of characters and the ternary codes of the leaf nodes in the preset full ternary tree, the method further includes:

constructing a complete ternary tree containing 2^n leaf nodes;

and encoding the paths between each parent node and its child nodes in the complete ternary tree as 0, 1 and 2 in order from left to right to obtain the preset complete ternary tree, wherein the ternary code of each leaf node in the preset complete ternary tree is the path code from the root node to that leaf node.
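The leaf codes of such a tree can be enumerated without building node objects. The sketch below assumes that "complete" means the tree is filled level by level from left to right, so that node i in level order (counting from 0) has children 3i+1, 3i+2 and 3i+3 whenever those indices exist; the application's own tree layout may differ, and the function name `leaf_codes` is hypothetical.

```python
def leaf_codes(num_leaves: int):
    """Ternary codes of the leaves of a level-order complete ternary tree.

    Codes are returned in level order, which is also short-to-long order.
    """
    if num_leaves < 1:
        raise ValueError("need at least one leaf")
    # A level-order tree with N nodes has (N + 1) // 3 internal nodes,
    # so grow N until the leaf count reaches num_leaves.
    total = num_leaves
    while total - (total + 1) // 3 < num_leaves:
        total += 1
    codes = []
    for node in range(total):
        if 3 * node + 1 < total:
            continue  # has at least one child, so it is not a leaf
        digits = []
        while node > 0:
            # Step to the parent; the remainder is the edge label 0, 1 or 2.
            node, digit = divmod(node - 1, 3)
            digits.append(str(digit))
        codes.append("".join(reversed(digits)))
    return codes


codes_n2 = leaf_codes(2 ** 2)
codes_n8 = leaf_codes(2 ** 8)
```

Under this assumption the 256 leaf codes for n = 8 are 5 or 6 ternary digits long, and because they are all leaves of one tree, no code is a prefix of another.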

Optionally, after the encoding the target ternary data according to the preset correspondence and determining the target base sequence based on the encoding result, the method further includes:

carrying out segmentation treatment on the target base sequence to obtain a plurality of first segmented base sequences;

adding segmentation information to each first segmentation base sequence in the plurality of first segmentation base sequences to obtain a plurality of second segmentation base sequences, wherein the segmentation information comprises at least one of a primer, address information and an error correction code;

generating a plurality of DNAs based on the plurality of second segment base sequences.
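Segmentation with address information can be sketched as below. This is a simplified illustration: the address is kept as a plain integer rather than encoded as bases, primers and error correction codes are omitted, and the `Segment` class, function names and segment length are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    address: int  # position of the segment within the target base sequence
    payload: str  # the first segmented base sequence


def segment_sequence(bases: str, segment_length: int):
    """Cut a base sequence into fixed-length segments tagged with addresses."""
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    return [Segment(address=k, payload=bases[i:i + segment_length])
            for k, i in enumerate(range(0, len(bases), segment_length))]


def reassemble(segments) -> str:
    """Join segment payloads in address order, whatever order they arrive in."""
    return "".join(s.payload for s in sorted(segments, key=lambda s: s.address))


segments = segment_sequence("ACTGATGG", 3)
```

The address is what allows the decoder to restore the original order after the segments have been read back in an arbitrary order.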

In a second aspect, a decoding method is provided, the method comprising:

acquiring a target base sequence to be decoded;

decoding the target base sequence according to a preset corresponding relation, and determining target ternary data based on a decoding result, wherein the preset corresponding relation is used for indicating the corresponding relation between the ternary data and base pairs;

decoding the target ternary data according to the ternary codes of the leaf nodes in a preset complete ternary tree to obtain a character sequence, wherein the preset complete ternary tree is a complete ternary tree containing 2^n leaf nodes, the paths between each parent node and its child nodes in the preset complete ternary tree are encoded as 0, 1 and 2 in order, and n is the number of bits of an encoding unit of the binary data;

and determining binary data corresponding to the character sequence.

Optionally, the acquiring a target base sequence to be decoded includes:

obtaining a second segmented base sequence stored in each DNA in the plurality of DNAs to obtain a plurality of second segmented base sequences, wherein each second segmented base sequence comprises a first segmented base sequence and segmentation information, and the segmentation information comprises at least one of a primer, address information and an error correction code;

and acquiring a first segmented base sequence in the plurality of second segmented base sequences according to the segmentation information in the plurality of second segmented base sequences, and performing splicing treatment on the first segmented base sequence in the plurality of second segmented base sequences to obtain the target base sequence.

Optionally, the decoding the target base sequence according to a preset correspondence, and determining target ternary data based on a decoding result includes:

decoding the target base sequence according to the preset correspondence to obtain spliced ternary data;

and acquiring first data, second data and the target ternary data from the spliced ternary data, wherein the first data is obtained by carrying out ternary conversion on the number of characters, and the second data is obtained by carrying out ternary conversion on various characters according to the sequence of the frequencies of the characters from large to small.

Optionally, before decoding the target ternary data according to the ternary codes of the leaf nodes in the preset full ternary tree, the method further includes:

determining, according to the first data and the second data respectively, the number of characters and the plurality of characters sorted in descending order of frequency;

the decoding the target ternary data according to the ternary codes of all leaf nodes in the preset complete ternary tree to obtain the character sequence comprises the following steps:

constructing a first coding table according to a plurality of characters which are ordered according to the frequency of the characters from large to small and the ternary codes of all leaf nodes in the preset complete ternary tree, wherein the first coding table is used for indicating the ternary codes corresponding to each character in the plurality of characters;

and decoding the target ternary data according to the first coding table to obtain the character sequence.
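Because every code in the first coding table is the path to a distinct leaf of one tree, no code is a prefix of another, and the ternary data can be decoded digit by digit. A minimal sketch follows, with a hypothetical function name and a small example table that is not taken from the application.

```python
def decode_ternary(trits: str, table: dict):
    """Decode ternary data with a prefix-free coding table.

    table maps each character to its ternary code, as in the first
    coding table; it is inverted here for lookup by code.
    """
    code_to_char = {code: char for char, code in table.items()}
    decoded, current = [], ""
    for digit in trits:
        current += digit
        if current in code_to_char:
            # A complete code has been read; emit its character.
            decoded.append(code_to_char[current])
            current = ""
    if current:
        raise ValueError("trailing digits do not form a complete code")
    return decoded


# Hypothetical table: 'A' is the most frequent character.
example_table = {"A": "1", "B": "2", "C": "00"}
characters = decode_ternary("112100", example_table)
```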

Optionally, before decoding the target base sequence according to the preset correspondence, the method further includes:

carrying out base substitution on the target base sequence according to a reverse substitution rule of a preset substitution rule to obtain a first base sequence;

the decoding the target base sequence according to the preset corresponding relation comprises:

and decoding the first base sequence according to the preset corresponding relation.

Optionally, the preset replacement rule includes at least one of the following replacement rules:

GT is replaced with AA;

GA is replaced with TT;

GG is replaced with CC;

GC is replaced with TA;

GT is replaced with CA.

In a third aspect, an encoding apparatus is provided, the apparatus comprising:

the first determining module is used for determining a character sequence corresponding to binary data to be coded;

a second determining module, configured to determine a plurality of characters appearing in the character sequence and a frequency of each character, where the frequency of each character is used to indicate a number of times that each character appears in the character sequence;

the first construction module is used for constructing a first coding table according to the frequencies of the characters and the ternary codes of all leaf nodes in the preset complete ternary tree; wherein the preset complete ternary tree is a complete ternary tree containing 2^n leaf nodes, paths between a parent node and its child nodes in the preset complete ternary tree are sequentially encoded as 0, 1 and 2, the first coding table is used for indicating the ternary code corresponding to each character in the plurality of characters, and n is the number of bits of the coding unit of the binary data;

the first coding module is used for coding the character sequence according to the first coding table to obtain target ternary data;

and the second coding module is used for coding the target ternary data according to a preset corresponding relation and determining a target base sequence according to a coding result, wherein the preset corresponding relation is used for indicating the corresponding relation between the ternary data and the base pairs.

Optionally, the first construction module is configured to:

sequencing the characters according to the sequence of the frequencies of the characters from large to small to obtain the sequenced characters;

according to the sequence of the code length from short to long, determining a plurality of ternary codes which are ordered at the top from the ternary codes of each leaf node in the preset complete ternary tree, wherein the number of the ternary codes is the same as that of the characters;

sequencing the plurality of ternary codes according to the sequence of the code length from short to long to obtain the sequenced plurality of ternary codes;

establishing a one-to-one correspondence relationship between the sorted characters and the sorted ternary codes to obtain the first code table; the larger the frequency of the characters in the first coding table is, the shorter the ternary codes corresponding to the characters are.

Optionally, the first construction module is configured to:

sequentially numbering each leaf node in the preset complete ternary tree according to the coding length of the corresponding ternary code to obtain the serial number of each leaf node;

establishing a corresponding relation between the sequence number of each leaf node and the ternary code to obtain a second code table;

and constructing the first coding table according to the frequencies of the characters and the second coding table.

Optionally, the first determining module is configured to:

and converting the binary data into ASCII characters to obtain the character sequence.

Optionally, the preset correspondence includes a plurality of two-bit ternary data and a base pair corresponding to each two-bit ternary data.

Optionally, the preset correspondence includes at least one of the following correspondences:

00 corresponds to AT;

11 corresponds to AC;

22 corresponds to AG;

01 corresponds to GC;

02 corresponds to GT;

10 corresponds to CT;

20 corresponds to GA;

12 corresponds to GG;

21 corresponds to TG.
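As an illustration of the correspondence listed above, the nine pairs can be held in a lookup table and applied two ternary digits at a time. The sketch below assumes all nine correspondences are used together and that the ternary string has even length; how an odd-length string would be padded is not stated here, so the sketch simply rejects it. The names `PAIR_TO_BASES`, `ternary_to_bases` and `bases_to_ternary` are hypothetical.

```python
# Two-digit ternary data -> base pair, as listed in the correspondence above.
PAIR_TO_BASES = {
    "00": "AT", "11": "AC", "22": "AG",
    "01": "GC", "02": "GT", "10": "CT",
    "20": "GA", "12": "GG", "21": "TG",
}
# The nine base pairs are all distinct, so the mapping can be inverted.
BASES_TO_PAIR = {bases: pair for pair, bases in PAIR_TO_BASES.items()}


def ternary_to_bases(ternary: str) -> str:
    """Encode an even-length ternary string into a base sequence."""
    if len(ternary) % 2:
        raise ValueError("sketch assumes an even number of ternary digits")
    return "".join(PAIR_TO_BASES[ternary[i:i + 2]]
                   for i in range(0, len(ternary), 2))


def bases_to_ternary(bases: str) -> str:
    """Decode a base sequence back into ternary digits."""
    if len(bases) % 2:
        raise ValueError("sketch assumes an even number of bases")
    return "".join(BASES_TO_PAIR[bases[i:i + 2]]
                   for i in range(0, len(bases), 2))


print(ternary_to_bases("001221"))
```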

Optionally, the second encoding module is configured to:

the coding unit is used for coding the target ternary data according to the preset corresponding relation to obtain a first base sequence;

a replacing unit, configured to replace an initiation codon appearing in the first base sequence to obtain a second base sequence, where the second base sequence does not include the initiation codon;

a determining unit configured to determine the target base sequence from the second base sequence.

Optionally, the initiation codon is a designated three-base sequence; the replacement unit is configured to:

perform, according to a preset replacement rule, base replacement on the last two bases of the two base pairs in which the initiation codon appears in the first base sequence, until the first base sequence after base replacement no longer contains the initiation codon.

Optionally, the preset replacement rule includes at least one of the following replacement rules:

replacing GT with AA;

replacing GA with TT;

replacing GG with CC;

replacing GC with TA;

replacing GT with CA.

Optionally, the determining unit is configured to:

performing base substitution on the second base sequence according to a preset substitution target and a preset substitution rule, wherein the preset substitution target comprises at least one of GC content requirements and GC distribution requirements;

and determining a second base sequence after base substitution meeting the preset substitution target as the target base sequence.

Optionally, the apparatus further comprises:

the conversion module is used for converting the number of the characters into ternary data to obtain first data, wherein the number of the characters refers to the number of various characters; converting the characters in the sequence from large frequency to small frequency into ternary data to obtain second data;

the splicing module is used for splicing the first data, the second data and the target ternary data to obtain spliced ternary data;

and the second coding module is used for coding the spliced ternary data according to the preset corresponding relation.

Optionally, the apparatus further comprises a second construction module configured to:

construct a full ternary tree containing 2^n leaf nodes;

and sequentially encode the paths between a parent node and its child nodes in the full ternary tree as 0, 1 and 2 from left to right to obtain the preset full ternary tree, wherein the ternary code of each leaf node in the preset full ternary tree is the path code from the root node to that leaf node.

Optionally, the apparatus further comprises:

the segmentation module is used for carrying out segmentation processing on the target base sequence to obtain a plurality of first segmented base sequences;

an adding module, configured to add segmentation information to each of the plurality of first segment base sequences to obtain a plurality of second segment base sequences, where the segmentation information includes at least one of a primer, address information, and an error correction code;

a storage module for generating a plurality of DNAs based on the plurality of second segmented base sequences.

In a fourth aspect, there is provided a decoding apparatus, the apparatus comprising:

the acquisition module is used for acquiring a target base sequence to be decoded;

a first decoding module, configured to decode the target base sequence according to a preset correspondence, and determine target ternary data based on a decoding result, where the preset correspondence is used to indicate a correspondence between ternary data and base pairs;

a second decoding module, configured to decode the target ternary data according to the ternary codes of the leaf nodes in the preset complete ternary tree to obtain a character sequence, wherein the preset complete ternary tree is a complete ternary tree containing 2^n leaf nodes, paths between a parent node and its child nodes in the preset complete ternary tree are sequentially encoded as 0, 1 and 2, and n is the number of bits of the coding unit of the binary data;

and the first determining module is used for determining binary data corresponding to the character sequence.

Optionally, the obtaining module is configured to:

obtaining a second segmented base sequence stored in each DNA in the plurality of DNAs to obtain a plurality of second segmented base sequences, wherein each second segmented base sequence comprises a first segmented base sequence and segmentation information, and the segmentation information comprises at least one of a primer, address information and an error correction code;

and acquiring a first segmented base sequence in the plurality of second segmented base sequences according to the segmentation information in the plurality of second segmented base sequences, and performing splicing treatment on the first segmented base sequence in the plurality of second segmented base sequences to obtain the target base sequence.

Optionally, the first decoding module is configured to:

decoding the target base sequence according to the preset corresponding relation to obtain splicing ternary data;

and acquiring first data, second data and the target ternary data from the spliced ternary data, wherein the first data is obtained by carrying out ternary conversion on the number of characters, and the second data is obtained by carrying out ternary conversion on various characters according to the sequence of the frequencies of the characters from large to small.

Optionally, the apparatus further comprises:

the second determining module is used for respectively determining the number of the characters and various characters which are sequenced according to the frequency of the characters from large to small according to the first data and the second data;

the second decoding module is to:

constructing a first coding table according to a plurality of characters which are ordered according to the frequency of the characters from large to small and the ternary codes of all leaf nodes in the preset complete ternary tree, wherein the first coding table is used for indicating the ternary codes corresponding to each character in the plurality of characters;

and decoding the target ternary data according to the first coding table to obtain the character sequence.

Optionally, the apparatus further comprises:

a replacing module, configured to perform base replacement on the target base sequence according to a reverse replacement rule of a preset replacement rule to obtain a first base sequence;

the first decoding module is configured to decode the first base sequence according to the preset correspondence.

Optionally, the preset replacement rule includes at least one of the following replacement rules:

replacing GT with AA;

replacing GA with TT;

GG is replaced by CC;

replacing GC with TA;

replacing GT with CA.

In a fifth aspect, there is provided an encoding apparatus, the apparatus comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements any one of the encoding methods of the first aspect when executing the computer program.

In a sixth aspect, there is provided a decoding apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements any of the decoding methods of the second aspect when executing the computer program.

In a seventh aspect, a computer-readable storage medium is provided, which stores a computer program, and the computer program realizes any one of the encoding methods described in the first aspect or any one of the decoding methods described in the second aspect when being executed by a processor.

In an eighth aspect, there is provided a computer program product for implementing any one of the encoding methods of the first aspect or any one of the decoding methods of the second aspect when the computer program product is executed.

In a ninth aspect, a coding and decoding system is provided, which comprises a coding device and a decoding device, wherein the coding device is used for realizing any one of the coding methods described in the first aspect, and the decoding device is used for realizing any one of the decoding methods described in the second aspect.

The technical scheme provided by the embodiment of the application has the following beneficial effects:

in the embodiment of the application, a complete ternary tree with a fixed rule is preset, so that when binary data are coded, a coding table can be directly constructed on the basis of the complete ternary tree with the fixed rule and character frequencies in a character sequence corresponding to the binary data, then the binary data are coded into ternary data on the basis of the constructed coding table, and the ternary data are coded into a base sequence on the basis of a preset corresponding relation. Because the complete ternary tree with the preset fixed rule is used for constructing the coding table for coding, the Huffman tree with the minimum weighted path length does not need to be constructed based on the character frequency during coding, the coding rule is simpler, and the coding efficiency is improved.

Drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of an encoding method provided in an embodiment of the present application;

Fig. 2 is a schematic diagram of a preset full ternary tree provided in an embodiment of the present application;

Fig. 3 is a schematic diagram of a second base sequence fragment provided in an embodiment of the present application;

Fig. 4 is a flowchart of a decoding method provided in an embodiment of the present application;

Fig. 5 is a block diagram of an encoding apparatus according to an embodiment of the present application;

Fig. 6 is a block diagram of a decoding apparatus according to an embodiment of the present application;

Fig. 7 is a block diagram of a computer device according to an embodiment of the present application.

Detailed Description

To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

Before explaining the embodiments of the present application in detail, an application scenario of the embodiments of the present application will be described.

The encoding method and the decoding method provided by the embodiment of the application can be applied to the DNA storage technology. The DNA storage technology is a technology for storing data by using artificially synthesized DNA as a storage medium, and has the advantages of high efficiency, large storage capacity, long storage time, easy acquisition, and maintenance-free, and has recently become a research hotspot in the field of storage technology. In the DNA storage technology, for binary data to be stored, it is necessary to encode the binary data into a base sequence first, and then artificially synthesize DNA based on the encoded base sequence to store the binary data in the DNA. Therefore, encoding binary data into a base sequence is an important research point in DNA storage technology.

The base sequence generally includes four bases: A (adenine), G (guanine), T (thymine), and C (cytosine). Bases, also called nucleobases or nitrogenous bases in biochemistry, are nitrogen-containing compounds that form nucleosides, which in turn are components of nucleotides. Bases, nucleosides, and nucleotides are the basic building blocks of nucleic acids.

In the related art, since a base sequence is composed of the four bases A, G, T and C, binary data represented by 0 and 1 can be directly converted into quaternary data represented by A, G, T and C when encoding binary data into a base sequence, thereby obtaining the base sequence. For example, 00 in the binary data may be converted to A, 01 to G, 10 to T, and 11 to C. However, such direct conversion tends to produce long runs of repeated bases. For example, if the binary data contains many 0s, long runs of A will appear in the converted base sequence. Because an overly long repetitive base sequence is unfavorable for artificial DNA synthesis, directly encoding binary data into a base sequence in this way is unfavorable for artificially synthesizing DNA.
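The direct conversion described above, and the repeated-base problem it causes, can be seen in a short sketch. The mapping is the example one given in the text (00 to A, 01 to G, 10 to T, 11 to C); the function name `direct_encode` is hypothetical.

```python
# Example mapping from the text: each pair of bits becomes one base.
DIRECT_MAP = {"00": "A", "01": "G", "10": "T", "11": "C"}


def direct_encode(data: bytes) -> str:
    """Convert binary data directly into bases, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(DIRECT_MAP[bits[i:i + 2]] for i in range(0, len(bits), 2))


print(direct_encode(b"\x1b"))      # bits 00 01 10 11
print(direct_encode(b"\x00\x00"))  # all-zero input gives a run of one base
```

Two zero bytes already yield eight consecutive A bases, which is the kind of long repeat that hinders synthesis.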

In the related art, in order to avoid the problem that the base sequence directly encoded by binary data is not favorable for artificial synthesis of DNA, the binary data may be encoded into ternary data, and then the ternary data may be encoded into the base sequence.

In one implementation in the related art, the binary data represented by 0 and 1 is converted into ternary data represented by 0, 1 and 2 by standard base conversion between binary and ternary, and the ternary data is then encoded into a base sequence. However, under standard base conversion, the character represented by each 8-bit binary value needs 6 ternary digits (since 3^5 = 243 < 256 <= 3^6 = 729), so the encoded ternary data is long and wastes considerable storage space.
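The length cost of standard base conversion can be checked with a small sketch. The helper name `to_ternary` is hypothetical; it converts a non-negative integer into a fixed-width ternary string and fails if the width is too small.

```python
def to_ternary(value: int, width: int) -> str:
    """Return `value` as a ternary string padded to exactly `width` digits."""
    digits = []
    for _ in range(width):
        value, remainder = divmod(value, 3)
        digits.append(str(remainder))
    if value:
        raise ValueError("value does not fit in the requested width")
    return "".join(reversed(digits))


print(to_ternary(255, 6))   # the largest byte value fits in 6 ternary digits
print(3 ** 5, 3 ** 6)       # 5 digits cover 243 values, 6 digits cover 729
```

Five ternary digits cover only 243 values, so every one of the 256 byte values must be written with six digits under this scheme.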

In another implementation in the related art, to reduce the coding length of the ternary data, the binary data may be encoded into ternary data by general Huffman coding. For example, for each file to be encoded, the plurality of characters appearing in the character sequence corresponding to the binary data in the file and the frequency of each character are determined; a Huffman tree with the minimum weighted path length is then constructed according to the frequencies of the characters; and an encoding table is constructed according to the ternary codes of the nodes in the Huffman tree, so that the character sequence corresponding to the binary data can be encoded according to the encoding table. That is, a different Huffman tree with the minimum weighted path length must be constructed for each file to be encoded according to that file's character frequencies. Because constructing such a tree from character frequencies is complicated, the encoding rule is complicated and the encoding efficiency is low.

In yet another implementation in the related art, to avoid the complexity of creating a different Huffman tree for each file, general Huffman coding may be used in advance to construct one Huffman coding tree shared by different files. For example, the character frequencies in a large number of files are counted in advance, and a single Huffman coding tree is constructed for different files according to the counted character frequencies.

Specifically, the process of constructing a shared Huffman coding tree for different files according to the counted character frequencies may include the following steps: 1) Each of the 256 characters that can be represented by binary data is taken as a node, the frequency of the character is taken as the weight ω_i of the corresponding node, and n ternary trees are constructed to obtain a set of ternary trees F = {T_1, T_2, T_3, ..., T_n}, where n = 256 and each ternary tree T_i contains only a root node with weight ω_i, whose subtrees are empty. The frequency of a character is the number of times the character appears in a large number of files, counted in advance. The number n of nodes used to construct the Huffman tree needs to satisfy (n - 1) mod 2 = 0; if it does not, one redundant node with weight 0 is added, so that n = n + 1 satisfies the condition. 2) The three nodes with the smallest current weights are selected from F to form a subtree, and the weight of the root node of the subtree is the sum of the weights of its child nodes. 3) These three nodes are deleted from F, and the newly obtained ternary tree is placed in F. 4) Steps 2) and 3) are repeated until F contains only one tree. The remaining tree is the Huffman tree with the minimum weighted path length, and it can be used as the Huffman tree shared by different files. An encoding table can then be constructed according to the ternary codes of the nodes in this Huffman tree.
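Steps 1) to 4) of this related-art construction can be sketched with a priority queue. The sketch is illustrative only: the function name `ternary_huffman`, the use of `heapq`, the tie-breaking counter and the left-to-right digit assignment are assumptions, not details given in the text.

```python
import heapq
from itertools import count


def ternary_huffman(freqs: dict) -> dict:
    """Build a ternary Huffman code (symbol -> digit string) from frequencies."""
    tie = count()  # unique tie-breaker so heap entries never compare payloads
    heap = [(weight, next(tie), symbol) for symbol, weight in freqs.items()]
    # Step 1: the node count n must satisfy (n - 1) mod 2 == 0;
    # otherwise add one redundant node of weight 0.
    if (len(heap) - 1) % 2 != 0:
        heap.append((0, next(tie), None))
    heapq.heapify(heap)
    # Steps 2-4: repeatedly merge the three lightest trees into one.
    while len(heap) > 1:
        lightest = [heapq.heappop(heap) for _ in range(3)]
        total = sum(weight for weight, _, _ in lightest)
        subtree = [payload for _, _, payload in lightest]
        heapq.heappush(heap, (total, next(tie), subtree))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, list):          # internal node with three children
            for digit, child in enumerate(node):
                walk(child, prefix + str(digit))
        elif node is not None:              # skip the redundant node
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


print(ternary_huffman({"a": 5, "b": 2, "c": 1, "d": 1}))
```

Even in this compact form, the tree depends on the input frequencies and must be rebuilt or stored per frequency table, which is the overhead the preset tree is meant to avoid.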

Subsequently, when each file is encoded, the character sequence corresponding to the binary data in the file can be encoded according to the encoding table to obtain ternary data, and the ternary data is then encoded into a base sequence, thereby obtaining a target base sequence corresponding to the binary data. However, the shared Huffman tree must still be constructed as a Huffman tree with the minimum weighted path length according to character frequencies, so the algorithm is complex and the encoding efficiency is low. In addition, so that the original data can be recovered by subsequent decoding, the encoding table needs to be stored outside the encoded data. However, an externally stored encoding table has low reliability: if it is changed, or cannot be accurately acquired because of software or hardware faults, the encoded data cannot be accurately decoded, which reduces data reliability. If the encoding table is instead stored together with the encoded data, the encoded data requires more storage space, which may offset the space saved by Huffman coding.

In the embodiment of the present application, to solve the above problems that encoding binary data into ternary data by general Huffman coding involves a complex encoding rule and low encoding efficiency, a complete ternary tree with a fixed rule is preset. During encoding, an encoding table can then be constructed directly from this tree and the character frequencies in the character sequence corresponding to the binary data; the binary data is encoded into ternary data based on the constructed encoding table, and the ternary data is then encoded into a base sequence based on a preset correspondence. Because the complete ternary tree with the fixed rule is preset, a Huffman tree with the minimum weighted path length does not need to be constructed from the character frequencies during encoding, so the encoding rule is simpler and the encoding efficiency is improved. Moreover, because the encoding rule is simple, it does not need to be stored in the encoded data in the DNA or outside the encoded data; instead, it can serve as the default rule for encoding binary data into ternary data, so that binary data to be encoded is encoded according to it by default. The encoding efficiency can thus be improved while data reliability is ensured.

Next, a coding method provided in an embodiment of the present application will be described in detail.

Fig. 1 is a flowchart of an encoding method provided in an embodiment of the present application, where the encoding method is applied to an encoding apparatus, where the encoding apparatus may be a computer device such as a terminal or a server, and the terminal includes, but is not limited to, a mobile phone, a tablet computer, a desktop computer, a Personal Digital Assistant (PDA), and the like, and the embodiment of the present application is not limited thereto. Referring to fig. 1, the method includes:

Step 101: A character sequence corresponding to binary data to be encoded is determined.

In the embodiment of the present application, for binary data to be encoded, the binary data may be first encoded in a ternary manner in the form of characters, so as to encode the binary data into ternary data, and then encode the ternary data into a base sequence. Before ternary coding binary data in the form of characters, a character sequence corresponding to the binary data needs to be determined.

The character sequence corresponding to the binary data may be an ASCII (American Standard Code for Information Interchange) character sequence. For example, the binary data may be converted to ASCII characters, thereby obtaining the character sequence.

The binary data to be encoded may be encoded in units of bytes (8-bit binary data) or may be encoded in units of half bytes (4-bit binary data), which is not limited in the embodiment of the present application. If encoding is performed in units of bytes, 8-bit binary data can represent 256 characters; if encoding is performed in units of half bytes, 4-bit binary data can represent 16 characters.

In a possible implementation, a target file to be stored may be obtained, and the text stored in the target file is used as the character sequence corresponding to the binary data to be encoded. Alternatively, a target file to be stored is obtained, the data stored in the target file is converted into binary data to obtain the binary data to be encoded, and the character sequence corresponding to the binary data is then determined.

Step 102: a plurality of characters occurring in the sequence of characters and a frequency of each character is determined, the frequency of each character indicating a number of times each character occurs in the sequence of characters.

For the character sequence corresponding to the binary data, a plurality of characters appearing in the character sequence and the frequency of each character in the plurality of characters can be counted, so that an encoding table can be constructed according to the frequencies of the plurality of characters.

For example, the target file to be stored may be a text file whose content is "Coding by Huffman"; this text is the character sequence corresponding to the binary data to be encoded. For ease of understanding, the text "Coding by Huffman" can be converted into the hexadecimal data "436f64696e6720627920487566666d616e" and represented in hexadecimal form.

A total of 13 distinct characters appear in the hexadecimal data "436f64696e6720627920487566666d616e". The 13 characters and their corresponding frequencies are: ('6',13), ('4',3), ('7',3), ('2',3), ('9',2), ('e',2), ('0',2), ('3',1), ('f',1), ('8',1), ('5',1), ('d',1), ('1',1).
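The frequency count in this example can be reproduced with a few lines of Python. The sketch treats each hexadecimal digit (a half byte) as one character, as the example does; keeping characters of equal frequency in order of first appearance is an assumption of the sketch, although it happens to match the order listed above.

```python
from collections import Counter

text = "Coding by Huffman"
# Represent the ASCII text as hexadecimal digits, one digit per half byte.
hex_digits = text.encode("ascii").hex()
frequencies = Counter(hex_digits)
# most_common() sorts by descending count; equal counts keep insertion order.
ranked = frequencies.most_common()

print(hex_digits)
for character, frequency in ranked:
    print(character, frequency)
```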

Step 103: and constructing a first coding table according to the frequencies of the characters and the ternary codes of all leaf nodes in the preset complete ternary tree.

In the embodiment of the present application, a complete ternary tree with a fixed rule is preset, so that when binary data is encoded, the encoding table can be constructed directly from this tree and the character frequencies in the character sequence corresponding to the binary data. A Huffman tree with the minimum weighted path length therefore does not need to be constructed from the character frequencies during encoding, and the encoding rule is simpler.

The preset complete ternary tree is a complete ternary tree containing 2^n leaf nodes, where n is the number of bits of the coding unit of the binary data. A leaf node is a node that has no children. The paths between a parent node and its child nodes in the complete ternary tree are sequentially encoded as 0, 1 and 2; for example, they may be encoded as 0, 1 and 2 from left to right, that is, the leftmost path is encoded as 0, the middle path as 1, and the rightmost path as 2. The ternary code of each leaf node in the preset complete ternary tree is the code of the path from the root node to that leaf node.

For example, if the binary data is encoded in units of bytes, the number of bits n of the coding unit of the binary data is 8, and the preset complete ternary tree is a complete ternary tree containing 2^8 (256) leaf nodes. If the binary data is encoded in units of half bytes, n is 4, and the preset complete ternary tree is a complete ternary tree containing 2^4 (16) leaf nodes.

Referring to fig. 2, fig. 2 is a schematic structural diagram of a preset full ternary tree according to an embodiment of the present application. As shown in fig. 2, the preset full ternary tree is a full ternary tree including 16 leaf nodes. It comprises 4 layers of nodes, where the first-layer node is the root node, the third layer includes 5 leaf nodes, and the fourth layer includes 11 leaf nodes. The 16 leaf nodes are numbered 0, 1, 2, ..., 15 from left to right. The paths between parent nodes and child nodes in the preset full ternary tree are sequentially encoded as 0, 1 and 2 from left to right, and the ternary code of each leaf node is the code of the path from the root node to that leaf node. For example, the leaf node with sequence number 0 has the ternary code 000, the leaf node with sequence number 1 has the ternary code 001, and the leaf node with sequence number 2 has the ternary code 002. It can be seen that, for the preset full ternary tree including 16 leaf nodes, the ternary codes of 11 leaf nodes are 3 bits long and the ternary codes of 5 leaf nodes are 2 bits long. For example, the ternary codes of the leaf nodes with sequence numbers 0, 1, ..., 10 are 3 bits long, and those of the leaf nodes with sequence numbers 11, 12, ..., 15 are 2 bits long.

In addition, if the preset full ternary tree is a full ternary tree including 256 leaf nodes, it includes seven layers of nodes, where the first-layer node is the root node, the sixth layer includes 236 leaf nodes, and the seventh layer includes 20 leaf nodes. The 256 leaf nodes in the preset full ternary tree are numbered 0, 1, 2, ..., 255 from left to right. The paths between parent nodes and child nodes in the full ternary tree are sequentially encoded as 0, 1 and 2 from left to right, and the ternary code of each leaf node is the code of the path from the root node to that leaf node. For example, the leaf node with sequence number 0 has the ternary code 000000, the leaf node with sequence number 1 has the ternary code 000001, and the leaf node with sequence number 2 has the ternary code 000002. It can be seen that, for the preset full ternary tree including 256 leaf nodes, the ternary codes of 236 leaf nodes are 5 bits long and the ternary codes of 20 leaf nodes are 6 bits long. For example, the ternary codes of the leaf nodes with sequence numbers 0, 1, ..., 19 are 6 bits long, and those of the leaf nodes with sequence numbers 20, 21, ..., 255 are 5 bits long.
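The leaf codes described for the 16-leaf and 256-leaf trees can be generated programmatically. The following sketch assumes the tree is filled level by level from left to right, which is consistent with the layer counts given above; the function name `preset_leaf_codes` and the array-style node indexing (children of node i are 3i+1, 3i+2, 3i+3) are choices made for this illustration.

```python
def preset_leaf_codes(n_bits: int) -> list:
    """Return the ternary codes of the 2**n_bits leaves, left to right."""
    target = 2 ** n_bits
    # Grow a level-order-filled ternary tree until it has `target` leaves.
    # A tree with `total` nodes has (total + 1) // 3 internal nodes.
    total = 1
    while total - (total + 1) // 3 < target:
        total += 1

    codes = []

    def visit(node: int, prefix: str) -> None:
        children = [c for c in (3 * node + 1, 3 * node + 2, 3 * node + 3)
                    if c < total]
        if not children:                 # no children: this node is a leaf
            codes.append(prefix)
        for digit, child in enumerate(children):
            visit(child, prefix + str(digit))  # paths coded 0, 1, 2 left to right

    visit(0, "")
    return codes


codes_16 = preset_leaf_codes(4)
codes_256 = preset_leaf_codes(8)
print(codes_16)
print(sum(len(c) == 5 for c in codes_256), sum(len(c) == 6 for c in codes_256))
```

Because the tree depends only on n, both the encoder and the decoder can regenerate the same code list without exchanging it.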

For a character sequence corresponding to binary data to be encoded, a first encoding table may be constructed based on frequencies of various characters appearing in the character sequence and ternary encoding of each leaf node in a preset full ternary tree. The first coding table is used for indicating the ternary coding corresponding to each character in the multiple characters and converting the character sequence corresponding to the binary data into the ternary data. For example, the first encoding table includes a plurality of characters and ternary data corresponding to each character.

As an example, the operation of constructing the first encoding table according to the frequencies of the plurality of characters and the ternary encoding of each leaf node in the preset full ternary tree may include: sorting the characters according to the sequence of the frequencies of the characters from large to small to obtain the sorted characters; according to the sequence of the code length from short to long, determining a plurality of ternary codes which are sequenced at the top from ternary codes of all leaf nodes in a preset complete ternary tree, wherein the number of the ternary codes is the same as that of various characters; sequencing the determined ternary codes according to the sequence of the code length from short to long to obtain a plurality of sequenced ternary codes; and establishing a one-to-one corresponding relation between the sorted characters and the sorted ternary codes to obtain a first code table. The larger the frequency of the characters in the first coding table is, the shorter the ternary codes corresponding to the characters are.

That is, the first encoding table includes a plurality of characters and their corresponding ternary codes. These ternary codes are the shortest ones among the ternary codes of the leaf nodes in the preset complete ternary tree, and the frequency of a character is inversely related to the length of its ternary code: the higher the frequency of a character, the shorter its ternary code. Therefore, the average code length of the ternary data obtained by encoding according to the first encoding table is shorter, which reduces the length of the encoded data and saves storage space.

As an example, the preset complete ternary tree may be constructed before the first encoding table is constructed according to the frequencies of the characters and the ternary codes of the leaf nodes in the preset complete ternary tree. The process of constructing the preset complete ternary tree may include: constructing a complete ternary tree containing 2^n leaf nodes, and coding the paths between parent nodes and child nodes in the complete ternary tree as 0, 1 and 2 in sequence from left to right, to obtain the preset complete ternary tree. The ternary code of each leaf node in the preset complete ternary tree is the path code from the root node to that leaf node.

For example, referring to fig. 2, a full ternary tree including 16 leaf nodes may be constructed first, and then paths between a parent node and a child node in the full ternary tree are sequentially encoded as 0,1, and 2 from left to right, so as to obtain the preset full ternary tree shown in fig. 2.
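As a hedged sketch, the leaf codes of such a tree can be generated directly without building node objects. The function names `to_ternary` and `leaf_codes` are hypothetical, and the layout rule (the deepest layer is filled from the left, with the leftmost nodes of the layer above receiving the children) is inferred from the 16-leaf and 256-leaf examples in this description rather than stated as a formula in the text.

```python
def to_ternary(value, width):
    """Return `value` as a base-3 string, zero-padded on the left to `width`."""
    digits = []
    for _ in range(width):
        value, r = divmod(value, 3)
        digits.append(str(r))
    return "".join(reversed(digits))

def leaf_codes(n_leaves):
    """Ternary path codes of the leaves, left to right (serial 0..n-1)."""
    depth = 0
    while 3 ** (depth + 1) <= n_leaves:
        depth += 1                          # deepest completely filled layer
    full = 3 ** depth                       # number of nodes in that layer
    internal = -(-(n_leaves - full) // 2)   # ceil: leftmost nodes with children
    deep = n_leaves - (full - internal)     # leaves one layer further down
    codes = [to_ternary(i, depth + 1) for i in range(deep)]
    codes += [to_ternary(i, depth) for i in range(internal, full)]
    return codes

codes16 = leaf_codes(16)
codes256 = leaf_codes(256)
```

Under these assumptions `leaf_codes(16)` gives eleven 3-digit codes followed by five 2-digit codes, and `leaf_codes(256)` gives twenty 6-digit codes followed by 236 5-digit codes.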

As an example, after the preset complete ternary tree is constructed, the leaf nodes in the preset complete ternary tree may be numbered sequentially in order of the code length of their ternary codes, so as to obtain the serial number of each leaf node. A correspondence between the serial number of each leaf node and its ternary code is then established to obtain a second encoding table, and the first encoding table is constructed according to the frequencies of the characters and the second encoding table.

For example, the leaf nodes in the preset full ternary tree may be sequentially numbered according to the sequence of the code length of the corresponding ternary code from long to short or from short to long. In addition, the starting serial number of the leaf node may be 0 or 1, which is not limited in this embodiment of the present application.

The operation of constructing the first encoding table according to the frequencies of the characters and the second encoding table may include: sorting the characters in descending order of frequency; selecting the top-ranked serial numbers from the second encoding table in descending order of serial number, the number of selected serial numbers being equal to the number of distinct characters; and establishing a one-to-one correspondence between the characters sorted in descending order of frequency and the ternary codes corresponding to the serial numbers sorted in descending order, to obtain the first encoding table.

For example, assuming that binary data is encoded in units of half bytes, and a preset full ternary tree includes 16 leaf nodes, a second encoding table constructed based on the preset full ternary tree may be as shown in table 1 below:

TABLE 1

Serial number Ternary coding Serial number Ternary coding
0 000 8 022
1 001 9 100
2 002 10 101
3 010 11 11
4 011 12 12
5 012 13 20
6 020 14 21
7 021 15 22

Assume that the character sequence corresponding to the binary data is the above-mentioned hexadecimal data "436f64696e6720627920487566666d616e", which contains 13 distinct characters. The 13 characters and their frequencies are: ('6',13), ('4',3), ('7',3), ('2',3), ('9',2), ('e',2), ('0',2), ('3',1), ('f',1), ('8',1), ('5',1), ('d',1), ('1',1). The first encoding table may be constructed based on the frequencies of the 13 characters and the second encoding table shown in table 1 above, and the constructed first encoding table may be as shown in table 2 below.

TABLE 2

Serial number | Character | Ternary code | Serial number | Character | Ternary code
0 | - | 000 | 8 | 3 | 022
1 | - | 001 | 9 | 0 | 100
2 | - | 002 | 10 | e | 101
3 | 1 | 010 | 11 | 9 | 11
4 | d | 011 | 12 | 2 | 12
5 | 5 | 012 | 13 | 7 | 20
6 | 8 | 020 | 14 | 4 | 21
7 | f | 021 | 15 | 6 | 22

As shown in table 2, the ternary codes corresponding to the 13 largest serial numbers are selected from table 1 and used as the ternary codes of the 13 characters, and the higher the frequency of a character, the larger the serial number of its ternary code.
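The serial-number-based construction can be sketched as follows. The function name `build_first_table_by_serial` is hypothetical, the second encoding table is hard-coded from table 1, and breaking frequency ties by first appearance is an assumption that happens to match the character order listed above.

```python
from collections import Counter

# Second encoding table (table 1): list index = serial number.
SECOND_TABLE = ["000", "001", "002", "010", "011", "012", "020", "021",
                "022", "100", "101", "11", "12", "20", "21", "22"]

def build_first_table_by_serial(chars, second_table=SECOND_TABLE):
    """Pair characters (frequency descending) with serial numbers (descending)."""
    ranked = [c for c, _ in Counter(chars).most_common()]
    last = len(second_table) - 1
    serials = range(last, last - len(ranked), -1)   # 15, 14, 13, ...
    return {c: second_table[s] for c, s in zip(ranked, serials)}

first_table = build_first_table_by_serial("436f64696e6720627920487566666d616e")
```

For the example character sequence this reproduces the assignments of table 2, with '6' mapped to 22 and '1' mapped to 010.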

For another example, if the binary data is encoded in units of bytes, and the predetermined full ternary tree includes 256 leaf nodes, the second encoding table constructed based on the predetermined full ternary tree may be as shown in table 3 below:

TABLE 3

Thereafter, for the above character sequence "436f64696e6720627920487566666d616e", a first encoding table may be constructed based on the frequencies of the 13 characters appearing in the character sequence and the second encoding table shown in table 3 above. The constructed first encoding table may be as shown in table 4 below.

TABLE 4

Serial number | Character | Ternary code | Serial number | Character | Ternary code
239 | - | 22101 | 247 | 3 | 22200
240 | - | 22102 | 248 | 0 | 22201
241 | - | 22110 | 249 | e | 22202
242 | 1 | 22111 | 250 | 9 | 22210
243 | d | 22112 | 251 | 2 | 22211
244 | 5 | 22120 | 252 | 7 | 22212
245 | 8 | 22121 | 253 | 4 | 22220
246 | f | 22122 | 254 | 6 | 22221

As shown in table 4, the ternary codes corresponding to the 13 selected serial numbers are taken from table 3 and used as the ternary codes of the 13 characters, and the higher the frequency of a character, the larger the serial number of its ternary code.

In addition, as can be seen from the encoding table constructed based on the preset complete ternary tree containing 256 leaf nodes, for the 256 kinds of characters represented by 8-bit binary data, some characters are encoded as 5-digit ternary data and the others as 6-digit ternary data. Compared with a standard base conversion between binary and ternary, in which every 8 bits of binary data require 6 ternary digits, this encoding mode can reduce the average code length of the binary data and save the storage space of the encoded data.

Step 104: Encode the character sequence according to the first encoding table to obtain target ternary data.

That is, each character in the character sequence corresponding to the binary data may be converted into a corresponding ternary code according to the first code table, so as to obtain the target ternary data.

For example, if the character sequence "436f64696e6720627920487566666d616e" is encoded according to the first encoding table shown in table 2, the target ternary data "210222202122212211221012220121002212201112100210202001222222222220112201022101" can be obtained.
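A minimal sketch of this step, assuming the first encoding table is held as a dictionary (here hard-coded from table 2; the names `FIRST_TABLE` and `encode_chars` are hypothetical):

```python
# First encoding table taken from table 2: character -> ternary code.
FIRST_TABLE = {"6": "22", "4": "21", "7": "20", "2": "12", "9": "11",
               "e": "101", "0": "100", "3": "022", "f": "021",
               "8": "020", "5": "012", "d": "011", "1": "010"}

def encode_chars(chars, table=FIRST_TABLE):
    """Replace every character by its ternary code and concatenate."""
    return "".join(table[c] for c in chars)

target = encode_chars("436f64696e6720627920487566666d616e")
```

For the example character sequence the result is 78 ternary digits long, because the 34 hexadecimal characters consist of 22 characters with 2-digit codes and 12 characters with 3-digit codes.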

In a possible implementation manner, besides encoding the character sequence according to the first encoding table to obtain the target ternary data, the number of characters may be converted into ternary data to obtain first data, and a plurality of characters in the order of the frequencies of the characters from large to small are converted into ternary data to obtain second data.

The number of characters refers to the number of distinct characters and may be expressed in decimal. For convenience of explanation, the characters arranged in descending order of frequency are also referred to as the character frequency ordering. For example, the decimal number of characters may be converted into ternary data using the standard base conversion between decimal and ternary, and the character frequency ordering may be converted into ternary data using the base conversion between the number system of the characters and ternary. For example, if the characters are ASCII characters, the character frequency ordering may be converted into ternary data according to the conversion relationship between ASCII characters and ternary.
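The base conversion can be illustrated with a short sketch. The helper names are hypothetical, and representing each hexadecimal character with a fixed width of three ternary digits is an assumption inferred from the worked example in this description (for instance '6' becoming 020), not a rule stated in the text.

```python
def to_ternary(n):
    """Standard conversion of a non-negative integer to a base-3 string."""
    if n == 0:
        return "0"
    digits = []
    while n:
        n, r = divmod(n, 3)
        digits.append(str(r))
    return "".join(reversed(digits))

def hex_char_to_ternary(ch, width=3):
    """Ternary form of one hexadecimal character (fixed width is assumed)."""
    return to_ternary(int(ch, 16)).zfill(width)

first_data = to_ternary(13)   # number of distinct characters in the example
```

Three ternary digits are enough for any hexadecimal character because 3^3 = 27 is at least 16.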

In one possible implementation, the character sequence corresponding to the binary data may also be stored in a specified field format. For example, a designated file may be generated based on the character sequence corresponding to the binary data. The designated file includes 3 fields. The first field is used to store the number of characters (Byte Count, BC). The second field is used to store the character frequency ordering (BF). The third field is used to store the encoded data (Encoding Data), which is the target ternary data obtained by encoding the character sequence corresponding to the binary data.

Referring to fig. 3, fig. 3 is a schematic diagram illustrating the field format of a designated file according to an embodiment of the present application. As shown in fig. 3, the designated file includes three fields: a first field for storing the number of characters, a second field for storing the character frequency ordering, and a third field for storing the encoded data.

For example, the designated file generated based on the character sequence "436f64696e6720627920487566666d616e" is shown in table 5 below. As shown in table 5, the first field of the designated file stores the number of characters "13", the second field stores the character frequency ordering "64729e03f85d1", and the third field stores the target ternary data "210222202122212211221012220121002212201112100210202001222222222220112201022101" obtained by encoding the character sequence corresponding to the binary data.

TABLE 5

After the designated file in the designated file format is generated based on the character sequence corresponding to the binary data, the data in the first field and the second field in the designated file can be respectively converted into ternary data, so that the data stored in each field in the designated file are the ternary data. That is, the number of characters in the first field and the frequency sequence of characters in the second field are converted into ternary data, so as to obtain the first data and the second data. Therefore, the data stored in the designated file is the spliced ternary data obtained by splicing the first data, the second data and the target ternary data.

For example, for the specified file shown in table 5 above, after converting the data in the first field and the second field in table 5 into ternary data, the specified file shown in table 6 below can be obtained.

TABLE 6

As shown in the above table 6, the number of characters "13" in the first field in table 5 may be converted into ternary data "111", and the character frequency ordering "64729e03f85d1" in the second field in table 5 may be converted into ternary data "020011021002021112000010120022012111001".

Step 105: Encode the target ternary data according to a preset correspondence, and determine a target base sequence according to the encoding result.

The preset correspondence indicates the correspondence between ternary data and base pairs. According to the preset correspondence, the ternary data can be converted, one coding unit at a time, into the corresponding base pairs, thereby obtaining a base sequence consisting of the base pairs.

As an example, the coding unit of the ternary data may be two digits, so that every two digits of ternary data are converted in turn into the corresponding base pair, resulting in a base sequence consisting of base pairs.

As one example, the preset correspondence may include a plurality of two-bit ternary data and a base pair corresponding to each two-bit ternary data. For example, the preset correspondence relationship may be as shown in table 7 below:

TABLE 7

Ternary data of two bits Corresponding base pair
00 AT
11 AC
22 AG
01 GC
02 GT
10 CT
20 GA
12 GG
21 TG

In addition, by encoding the ternary data based on the predetermined correspondence relationship shown in table 7, the number of repeated base sequences in the base sequences obtained by encoding is 2 at the maximum, and thus the occurrence of long repeated base sequences can be further reduced.

It should be noted that the embodiment of the present application only takes the correspondence shown in table 7 above as an example of the preset correspondence. Table 7 does not limit the preset correspondence; in other embodiments, the preset correspondence may take other forms, which is not limited in the embodiment of the present application.

As an example, the operation of encoding the target ternary data according to the preset correspondence defined by the preset rule and determining the target base sequence according to the encoding result includes the following steps:

1) Encode the target ternary data according to the preset correspondence to obtain a first base sequence.

That is, the first nucleotide sequence can be obtained by converting the target ternary data into a nucleotide sequence consisting of A, G, C, T based on the predetermined correspondence relationship.

For example, the first nucleotide sequence can be obtained by encoding the ternary data "210222202122212211221012220121002212201112100210202001222222222220112201022101" according to the predetermined correspondence shown in table 7: "TG GT AG GA TG AG TG AG AC AG CT GG AG GC TG AT AG GG GA AC GG CT GT CT GA GA GC AG AG AG AG AG GA AC AG GC GT TG GC".
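The conversion in this example can be sketched as a table lookup over two-digit units. The names `PAIR_FOR` and `ternary_to_bases` are hypothetical, and rejecting an odd number of digits is an assumption, since the text does not say how a trailing single digit would be handled.

```python
# Preset correspondence from table 7: two ternary digits -> base pair.
PAIR_FOR = {"00": "AT", "11": "AC", "22": "AG", "01": "GC", "02": "GT",
            "10": "CT", "20": "GA", "12": "GG", "21": "TG"}

def ternary_to_bases(ternary):
    """Convert ternary data to a list of base pairs, two digits at a time."""
    if len(ternary) % 2:
        # Assumption: padding of odd-length input is not specified in the text.
        raise ValueError("odd number of ternary digits")
    return [PAIR_FOR[ternary[i:i + 2]] for i in range(0, len(ternary), 2)]

pairs = ternary_to_bases("21022220")   # first eight digits of the example
```

The first eight digits of the example, 21 02 22 20, map to TG GT AG GA, which matches the start of the first base sequence quoted above.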

Further, the spliced ternary data can be encoded according to the preset correspondence to obtain the first base sequence. The spliced ternary data is obtained by splicing the first data, the second data and the target ternary data, where the first data is obtained by ternary conversion of the number of characters and the second data is obtained by ternary conversion of the character frequency ordering. Therefore, the encoded base sequence also carries character information such as the number of characters and the character frequency ordering, which makes it convenient to decode the encoded base sequence later based on this information and recover the original data.

For example, after the splicing ternary data "111020011021002021112000010120022012111001210222202122212211221012220121002212201112100210202001222222222220112201022101" shown in table 6 is encoded according to the preset correspondence shown in table 7, the obtained first base sequence is: "AC CT GA GC CT TG AT GA TG AC GA AT GC GC GA GT GA GG AC CT GC TG GT AG GA TG AG TG AG AC AG CT GG AG GC TG AT AG GG GA AC GG CT GT CT GA GA GC AG AG AG AG AG GA AC AG GC GT TG GC". Wherein the GC content in the first base sequence is 57.5%.

2) And replacing the initiation codon appearing in the first base sequence to obtain a second base sequence, wherein the second base sequence does not contain the initiation codon.

The initiation codon (start codon) is the codon that specifies the initiation site of protein synthesis. A codon is a sequence of three nucleotide residues on DNA that encodes a particular amino acid. For example, the initiation codon may be a designated three-base sequence such as ATG. Since the initiation codon affects the synthesis of DNA, it is desirable to avoid the initiation codon in the encoded base sequence.

In the embodiment of the present application, in order to avoid the occurrence of the start codon in the encoded base sequence, after the first base sequence is obtained by encoding, the start codon in the first base sequence may be replaced to replace the start codon in the first base sequence with another base sequence, so as to obtain a second base sequence not including the start codon.

As an example, if the start codon is a designated three-base sequence, the last two bases of the two base pairs in which the start codon appears in the first base sequence may be replaced, according to a preset replacement rule, with another base pair, and this is repeated until the first base sequence after base replacement no longer contains the start codon.

Assuming that the ternary data is encoded in units of base pairs (two bases) and the start codon is a designated three-base sequence, any occurrence of the start codon lies within four consecutive bases formed by two adjacent base pairs, and occupies either the first three or the last three of those four bases. For this reason, in the process of replacing the start codon appearing in the first base sequence, the last two bases of the two base pairs where the start codon appears may be replaced, so that the start codon is replaced with another base sequence, thereby reducing the occurrence of the start codon.

For example, for the initiation codon ATG, the last two bases of the two base pairs in which it appears may be GT, GA, GG, TG or GC. If these base pairs are replaced with other base pairs, the two base pairs after replacement no longer form ATG.

As an example, the preset replacement rule may include at least one of the following replacement rules: replacing GT with AA; replacing GA with TT; replacing GG with CC; replacing GC with TA; replacing TG with CA. For example, the preset replacement rule E = {GT: AA, GA: TT, GG: CC, TG: CA, GC: TA}.
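The replacement loop can be sketched as follows, under stated assumptions: the function name `remove_start_codons` is hypothetical, the input is a list of base pairs produced with table 7, the start codon is ATG, and occurrences are handled one at a time from left to right (the text does not fix the order of replacement).

```python
RULE_E = {"GT": "AA", "GA": "TT", "GG": "CC", "TG": "CA", "GC": "TA"}

def remove_start_codons(pairs, codon="ATG", rule=RULE_E):
    """Rewrite the second pair of each two-pair window containing `codon`."""
    pairs = list(pairs)
    for _ in range(len(pairs) + 1):        # each pass rewrites at most one pair
        k = "".join(pairs).find(codon)     # base index of the first occurrence
        if k < 0:
            return pairs
        j = (k + 2) // 2                   # index of the pair holding the codon's last base
        pairs[j] = rule[pairs[j]]
    raise RuntimeError("start codon could not be removed")

fixed_a = remove_start_codons(["AT", "GA"])                    # codon in first three bases
fixed_b = remove_start_codons(["GA", "TG"])                    # codon in last three bases
fixed_c = remove_start_codons(["AG", "GA", "TG", "AT", "GC"])  # two occurrences
```

The replacement pairs AA, TT, CC, CA and TA never begin with G and are never TG, so a pair that has been rewritten is not selected again and the loop terminates.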

3) And determining the target base sequence according to the second base sequence.

When the only replacement target for the first base sequence is that the replaced sequence contains no initiation codon, the second base sequence can be used directly as the target base sequence. If there are additional replacement targets, the second base sequence can be further replaced according to those targets to obtain the target base sequence.

As one example, the other replacement targets may include at least one of a GC content requirement and a GC distribution requirement. Here, the GC content requirement indicates the content of GC in the target base sequence. For example, the GC content requirement may be that the GC content is 50%, is close to 50%, or is 50% or more and 55% or less. Being close to 50% means that the difference between the GC content and 50% is within a preset error range; for example, the preset error range may be [0, 1]. The GC distribution requirement indicates the distribution of GC in the target base sequence.
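A GC content check can be sketched as follows. This assumes GC content in the usual sense of the fraction of bases that are G or C; the function names and the default bounds of 50% and 55% (taken from the example requirement above) are illustrative only.

```python
def gc_content(seq):
    """Fraction of bases in `seq` that are G or C (spaces are ignored)."""
    bases = seq.replace(" ", "")
    return sum(b in "GC" for b in bases) / len(bases)

def meets_gc_requirement(seq, low=0.50, high=0.55):
    """True if the GC content lies within [low, high]."""
    return low <= gc_content(seq) <= high

balanced = gc_content("AC GT")
```

A sequence that fails the check would then be passed through further base replacement, for example GC to TA, and checked again.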

As an example, base replacement may be continued on the second base sequence according to a preset replacement target until the second base sequence after base replacement satisfies the preset replacement target, and the second base sequence after base replacement is then determined as the target base sequence. The preset replacement target includes at least one of a GC content requirement and a GC distribution requirement. In this way, the GC content and GC distribution of the encoded base sequence can be controlled, so that the encoded base sequence is more suitable for artificial DNA synthesis.

When base replacement is performed on the second base sequence, a designated base or a designated base sequence in the second base sequence may be replaced. For example, base replacement may be performed on the second base sequence according to the above preset replacement rule, such as replacing GT with AA, GA with TT, GG with CC, GC with TA, and TG with CA wherever they appear in the second base sequence, until the replaced second base sequence satisfies the preset replacement target.

For example, after the splicing ternary data is encoded according to the predetermined correspondence relationship shown in table 7, the repeated base sequence in the obtained first base sequence is 2 at the maximum. If the first base sequence has "ATG", the last two bases of the two base pairs with "ATG" can be replaced according to the preset replacement rule, and if the first base sequence after replacement has "ATG", the replacement is continued until the first base sequence after replacement does not contain "ATG". If the substitution target further includes a GC content requirement, and the substituted first base sequence does not satisfy the GC content requirement, the substituted first base sequence may be further subjected to base substitution according to the preset substitution rule until the GC content requirement is satisfied. The length of the repeated base sequence of the target base sequence obtained by the substitution treatment is 6 or less at the maximum, and thus the length of the repeated base sequence of the encoded base sequence can be 6 or less.

For another example, after the splicing ternary data "111020011021002021112000010120022012111001210222202122212211221012220121002212201112100210202001222222222220112201022101" shown in table 6 is encoded according to the preset correspondence shown in table 7, the obtained first base sequence is: "AC CT GA GC CT TG AT GA TG AC GA AT GC GC GA GT GA GG AC CT GC TG GT AG GA TG AG TG AG AC AG CT GG AG GC TG AT AG GG GA AC GG CT GT CT GA GA GC AG AG AG AG AG GA AC AG GC GT TG GC".

After the last two bases in the two base pairs with the initiation codon 'ATG' are replaced for the first time according to the preset replacement rule E, the sequence of the first base sequence after replacement is obtained: "AC CT GA GC CT TG AT TT TG AC GA AT TA GC GA GT GA GG AC CT GC TG GT AG GA GA AG TG AG AC AG CT GG AG GC TG AT AG GG GA AC GG CT GT CT GA GA GC AG AG AG AG AG GA AC AG GC GT TG GC".

After the last two bases in the two base pairs with the initiation codon 'ATG' are replaced for the second time according to the preset replacement rule E, the first base sequence after replacement is obtained as follows: "AC CT GA GC CT TG AT TT TG AC GA AT TA GC GA GT GA GG AC CT GC TG GT AG GA GA AG TG AG AC AG CT GG AG GC TG AT AG GG GA AC GG CT GT CT GA GA GC AG AG AG AG AG GA AC AG GC GT TG GC".

In order to ensure that the replaced first base sequence meets the GC content requirement, base replacement can be continued on the replaced first base sequence, for example by replacing GC with TA, so as to obtain the target base sequence: "AC CT GA TA CT TG AT TT TG AC GA AT TA TA GA GT GA GG AC CT TA TG GT AG GA GA AG TG AG AC AG CT GG AG TA TG AT AG GG GA AC GG CT GT CT GA GA TA AG AG AG AG AG GA AC AG TA GT TG TA".

After the target base sequence finally encoded is obtained, a DNA can be generated based on the target base sequence. In addition, in consideration of the restriction of the synthetic length of DNA, it is also possible to first segment the target base sequence and then synthesize DNA based on the segmented base sequence. As an example, the target base sequence is segmented, and the process of synthesizing DNA based on the segmented base sequence may include the following steps 106-108.

Step 106: Segment the target base sequence to obtain a plurality of first segmented base sequences.

The sequence length of each first segmented base sequence may be less than or equal to a length threshold. The length threshold may be preset; for example, the length threshold may be 50 or 60.

As an example, the target base sequence may be segmented based on a length threshold, resulting in a plurality of first segmented base sequences, such that the sequence length of each first segmented base sequence is less than or equal to the length threshold.
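A minimal sketch of threshold-based segmentation, assuming the target base sequence is held as a plain string of bases and that consecutive fixed-size slices are acceptable (the function name `split_sequence` and the default threshold of 60 are illustrative):

```python
def split_sequence(bases, max_len=60):
    """Cut `bases` into consecutive segments of at most `max_len` bases."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [bases[i:i + max_len] for i in range(0, len(bases), max_len)]

segments = split_sequence("ACCTGATACTTGATTTTGAC", max_len=8)
```

Every segment except possibly the last has exactly `max_len` bases, and concatenating the segments in order restores the original sequence.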

Step 107: Add segmentation information to each of the plurality of first segmented base sequences to obtain a plurality of second segmented base sequences.

The segmentation information may include at least one of an added primer, address information, and an error correction code. A primer is a molecule with a specific nucleotide sequence that initiates synthesis at the start of nucleotide polymerization, and usually consists of two artificially synthesized oligonucleotide sequences. For example, the primers include a first primer and a second primer. The first primer may be the base sequence CCAGTAT and the second primer the base sequence AGTATCA. The address information indicates the position of the first segmented base sequence in the file corresponding to the target base sequence. The error correction code is a code that allows the decoding end to automatically correct errors that occur during data storage.

As an example, the added address information may be address information in a base sequence form obtained by converting decimal address information into ternary data and encoding the converted ternary data according to a preset correspondence.
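The address conversion can be sketched as follows. The function name `address_to_bases` is hypothetical, the correspondence is the one in table 7, and left-padding with a zero when the ternary form has an odd number of digits is an assumption that the text does not state.

```python
# Preset correspondence from table 7: two ternary digits -> base pair.
PAIR_FOR = {"00": "AT", "11": "AC", "22": "AG", "01": "GC", "02": "GT",
            "10": "CT", "20": "GA", "12": "GG", "21": "TG"}

def address_to_bases(address):
    """Decimal address -> ternary digits -> base sequence via table 7."""
    digits = ""
    n = address
    while n:
        n, r = divmod(n, 3)
        digits = str(r) + digits
    digits = digits or "0"
    if len(digits) % 2:
        digits = "0" + digits          # assumed padding to whole two-digit units
    return "".join(PAIR_FOR[digits[i:i + 2]] for i in range(0, len(digits), 2))

addr5 = address_to_bases(5)     # 5 is 12 in ternary
addr13 = address_to_bases(13)   # 13 is 111 in ternary, padded to 0111
```

A real layout would probably also fix the address width so that every segment carries an address field of the same length, which this sketch does not attempt.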

Further, the segment information may further include at least one of direction information and file type information. Wherein the direction information is used to indicate a sequence direction of the first segmented base sequence, and the file type information is used to indicate whether the first segmented base sequence belongs to a program file or a data file.

Referring to FIG. 3, FIG. 3 is a diagram of a second segmented base sequence provided in the embodiments of the present application. As shown in FIG. 3, the second segmented base sequence includes primers, address information, an error correction code, direction information, file type information, and the first segmented base sequence. The primers comprise the base sequence CCAGTAT and the base sequence AGTATCA. The address information is the base sequence TAC. The direction information is A and T. The file type information is C. The first segmented base sequence is TACGAGCAT.

The plurality of second segmented base sequences are obtained by respectively adding the segmentation information on each first segmented base sequence in the plurality of first segmented base sequences, so that the plurality of first segmented base sequences can be obtained based on the segmentation information in the plurality of second segmented base sequences in the decoding process, and the plurality of first segmented base sequences can be spliced.

Step 108: a plurality of DNAs are generated based on the plurality of second fragmented base sequences.

That is, one DNA may be generated based on each of the plurality of second segmented base sequences, respectively, to obtain a plurality of artificially synthesized DNAs to store the plurality of second segmented base sequences in the plurality of DNAs, respectively.

It should be noted that the preset full ternary tree described in the embodiment of the present application is a full ternary tree defined by a fixed rule, and for example, the preset full ternary tree may be directly constructed based on a default encoding rule during encoding. The encoding rule refers to an encoding method described in the embodiment of the present application, and may be obtained by pre-configuration or pre-negotiation. Any binary data to be coded can be coded according to the coding rule. Moreover, because the encoding rule is simple, the encoding rule does not need to be stored in the encoded data of the DNA or outside the encoded data, but the encoding rule can be used as a default encoding rule for encoding binary data into ternary data, so that the binary data to be encoded can be encoded according to the encoding rule by default, and the encoding efficiency can be improved under the condition of ensuring the reliability of the data.

In the embodiment of the application, a complete ternary tree with a fixed rule is preset, so that when binary data are coded, a coding table can be directly constructed on the basis of the complete ternary tree with the fixed rule and character frequencies in a character sequence corresponding to the binary data, then the binary data are coded into ternary data on the basis of the constructed coding table, and the ternary data are coded into a base sequence on the basis of a preset corresponding relation. Because the complete ternary tree with the preset fixed rule is used for constructing the coding table for coding, the Huffman tree with the minimum weighted path length does not need to be constructed based on the character frequency during coding, the coding rule is simpler, and the coding efficiency is improved.

Fig. 4 is a flowchart of a decoding method according to an embodiment of the present application. The decoding method is applied to a decoding apparatus, and the decoding apparatus may be a computer device such as a terminal or a server. The terminal includes but is not limited to a mobile phone, a tablet computer, a desktop computer, a Personal Digital Assistant (PDA), and the like, which is not limited in the embodiment of the present application. Referring to fig. 4, the method includes:

Step 401: Acquire a target base sequence to be decoded.

The decoding process described in the present embodiment is a reverse process of the encoding process described in the embodiment of fig. 1, and is used to decode the base sequence stored in the DNA into binary data. Before decoding, the target base sequence to be decoded needs to be obtained.

The target base sequence consists of the four bases A, G, T and C. The target base sequence may be obtained from one DNA or from a plurality of DNAs, which is not limited in the present application.

As an example, a segmented base sequence stored in each of a plurality of DNAs may be obtained to obtain a plurality of segmented base sequences, and then the plurality of segmented base sequences may be subjected to a splicing process to obtain a target base sequence.

For example, the second segmented base sequence stored in each of a plurality of DNAs may be obtained, resulting in a plurality of second segmented base sequences. Each second segmented base sequence includes a first segmented base sequence and segmentation information, and the segmentation information includes at least one of a primer, address information, and an error correction code. Then, according to the segmentation information, the first segmented base sequence is extracted from each second segmented base sequence, and the extracted first segmented base sequences are spliced to obtain the target base sequence.

For example, the target nucleotide sequence can be obtained by splicing a plurality of the segmented nucleotide sequences as follows: "AC CT GA GC CT TG AT CA TT AC GA AT AA GC GA GT GA GG AC CT GC TG GT AG GA TT AG TG AG AC AG CT GG AG GC TG AT AG GG GA AC GG CT GT CT GA GA AA AG AG AG AG AG GA AC AG GC GT TG AA".

Step 402: Decode the target base sequence according to the preset correspondence, and determine target ternary data based on the decoding result.

The preset correspondence indicates the correspondence between ternary data and base pairs, and can be used for decoding a base sequence into ternary data. For the specific content of the preset correspondence, refer to the related description in the embodiment of fig. 1, which is not repeated here.
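Decoding with the preset correspondence amounts to inverting the table used for encoding. The sketch below assumes the correspondence of table 7 and a base sequence that contains only pairs from that table (that is, after any reverse replacement has been applied); the names `TERNARY_FOR` and `bases_to_ternary` are hypothetical.

```python
# Table 7 (two ternary digits -> base pair) and its inverse.
PAIR_FOR = {"00": "AT", "11": "AC", "22": "AG", "01": "GC", "02": "GT",
            "10": "CT", "20": "GA", "12": "GG", "21": "TG"}
TERNARY_FOR = {pair: digits for digits, pair in PAIR_FOR.items()}

def bases_to_ternary(bases):
    """Convert a base sequence back to ternary data, one base pair at a time."""
    bases = bases.replace(" ", "")
    return "".join(TERNARY_FOR[bases[i:i + 2]] for i in range(0, len(bases), 2))

ternary = bases_to_ternary("TG GT AG GA")
```

Because the nine base pairs in table 7 are all distinct, the inverse mapping is unambiguous; a pair outside the table raises `KeyError` in this sketch.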

As an example, the target base sequence may be decoded according to a preset correspondence to obtain splicing ternary data; and acquiring the first data, the second data and the target ternary data from the spliced ternary data. The first data is obtained by performing ternary conversion on the number of characters, and the second data is obtained by performing ternary conversion on a plurality of characters in the order from large to small according to the frequency of the characters.

In addition, the number of characters and the characters sorted according to the frequency of the characters from large to small can be respectively determined according to the acquired first data and the acquired second data, so that the target ternary data can be decoded according to the characters sorted according to the frequency of the characters from large to small.

As another example, base substitution may first be performed on the target base sequence according to the reverse of a preset replacement rule to obtain a first base sequence, and the first base sequence may then be decoded according to the preset correspondence.

For the specific content of the preset replacement rule, refer to the related description in the embodiment of fig. 1; details are not repeated here. For example, the reverse replacement rule may include at least one of the following: replacing AA with GT; replacing TT with GA; replacing CC with GG; replacing TA with GC; replacing CA with GT.

The first base sequence is decoded according to the preset correspondence in the same way as the target base sequence; for the specific decoding process, refer to the description above, which is not repeated here.

For example, restoring the replaced bases in the target base sequence to the original bases, according to the reverse of the preset replacement rule E in the example of fig. 1, yields the following first base sequence:

"AC CT GA GC CT TG AT GA TG AC GA AT GC GC GA GT GA GG AC CT GC TG GT AG GA TG AG TG AG AC AG CT GG AG GC TG AT AG GG GA AC GG CT GT CT GA GA GC AG AG AG AG AG GA AC AG GC GT TG GC".

Assuming that the preset correspondence is as shown in Table 7, decoding the first base sequence according to Table 7 yields the following spliced ternary data:

"111020011021002021112000010120022012111001210222202122212211221012220121002212201112100210202001222222222220112201022101".
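The pair-by-pair decoding can be illustrated with a short sketch. The mapping below is transcribed from the nine correspondences enumerated for the encoding apparatus in this application and is assumed to match Table 7; the function name `bases_to_ternary` is hypothetical.

```python
# Base pair -> two ternary digits, transcribed from the preset
# correspondence listed in this application (assumed to match Table 7).
PAIR_TO_TERNARY = {
    "AT": "00", "AC": "11", "AG": "22",
    "GC": "01", "GT": "02", "CT": "10",
    "GA": "20", "GG": "12", "TG": "21",
}


def bases_to_ternary(bases: str) -> str:
    """Decode a base sequence two bases at a time into ternary digits."""
    seq = bases.replace(" ", "").upper()
    if len(seq) % 2:
        raise ValueError("base sequence must contain whole base pairs")
    try:
        return "".join(PAIR_TO_TERNARY[seq[i:i + 2]] for i in range(0, len(seq), 2))
    except KeyError as exc:
        raise ValueError(f"base pair {exc.args[0]!r} is not in the correspondence") from None


# First six base pairs of the first base sequence in the example.
print(bases_to_ternary("AC CT GA GC CT TG"))  # 111020011021
```

Pairs outside the correspondence (for example, pairs introduced by the replacement rule) are rejected, so the reverse replacement has to run before this step.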

Then, the first data, the second data, and the target ternary data may be acquired from the spliced ternary data. As an example, the first m digits of the spliced ternary data are acquired as the first data, and the number of characters is determined from the first data; the second data is then acquired from the remaining data based on the number of characters; finally, the data following the second data is acquired as the target ternary data.

Here, m is a positive integer and may be preset, for example to 3 or 6. The first data may be converted into decimal data, and the resulting decimal value is taken as the number of characters.

For example, the first 3 digits "111" of the spliced ternary data may be acquired as the first data and converted into the decimal value 13 (1 × 9 + 1 × 3 + 1), so the number of characters is 13. Then the 13 × 3 = 39 ternary digits "020011021002021112000010120022012111001" following "111" are acquired as the second data and converted into characters, giving the 13 characters "64729e03f85d1" in descending order of frequency.

In addition, the data following the second data may be acquired as the target ternary data, giving:

"210222202122212211221012220121002212201112100210202001222222222220112201022101".
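The header split described above can be sketched as follows. This is a minimal sketch on a small made-up input, assuming m = 3 and three ternary digits per character as in the example. Reading each three-digit group as the value of a hexadecimal digit is an assumption, and `split_spliced` is a hypothetical name.

```python
from typing import List, Tuple

HEX_DIGITS = "0123456789abcdef"


def split_spliced(spliced: str, m: int = 3, width: int = 3) -> Tuple[int, List[str], str]:
    """Split spliced ternary data into (character count, characters, target data).

    m     -- number of ternary digits holding the character count
    width -- ternary digits per character (3 in the worked example);
             reading each group as a hexadecimal digit is an assumption
    """
    count = int(spliced[:m], 3)
    end = m + count * width
    if len(spliced) < end:
        raise ValueError("spliced data is shorter than its header claims")
    second = spliced[m:end]
    chars = [HEX_DIGITS[int(second[i:i + width], 3)] for i in range(0, len(second), width)]
    return count, chars, spliced[end:]


# Header "010" -> 3 characters; groups "020", "011", "021" -> 6, 4, 7.
print(split_spliced("010" + "020011021" + "2221"))  # (3, ['6', '4', '7'], '2221')
```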

Step 403: Decode the target ternary data according to the ternary codes of the leaf nodes in the preset complete ternary tree to obtain the character sequence.

The preset complete ternary tree is a complete ternary tree containing 2^n leaf nodes. The paths between each parent node and its child nodes are encoded as 0, 1, and 2 in order from left to right, and the ternary code of each leaf node is the path code from the root node to that leaf node. Here, n is the number of bits in a coding unit of the binary data and may be preset, for example to 4 or 8.
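The leaf codes of such a tree can be enumerated with a short sketch. It assumes that "complete" means the levels are filled from top to bottom and from left to right; the function name `leaf_codes` is hypothetical, and the tree figures and tables of this application remain authoritative.

```python
from collections import deque
from typing import List


def leaf_codes(num_leaves: int) -> List[str]:
    """Ternary path codes of the leaves of a complete ternary tree.

    Leaves are expanded breadth-first, left to right, until the tree
    has `num_leaves` leaves; edges to children are labelled 0, 1, 2.
    Returned codes are in breadth-first order (short codes first).
    """
    if num_leaves < 2:
        raise ValueError("need at least two leaves")
    leaves = deque([""])  # the root starts out as the only leaf
    while len(leaves) < num_leaves:
        parent = leaves.popleft()
        # Expanding a leaf removes it and adds up to three children.
        children = min(3, num_leaves - len(leaves))
        leaves.extend(parent + str(d) for d in range(children))
    return list(leaves)


codes = leaf_codes(2 ** 4)  # n = 4 -> 16 leaf nodes
print(codes)
```

For n = 4 this gives five two-digit codes (11, 12, 20, 21, 22) and eleven three-digit codes (000 through 101). No code is a prefix of another, which is what allows the ternary data to be decoded without separators.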

As an example, a first encoding table may be constructed from the plurality of characters sorted in descending order of frequency and the ternary codes of the leaf nodes in the preset complete ternary tree, where the first encoding table indicates the ternary code corresponding to each of the plurality of characters. The target ternary data is then decoded according to the first encoding table to obtain the character sequence.

It should be noted that, for how the first encoding table is constructed from the sorted characters and the ternary codes of the leaf nodes in the preset complete ternary tree, refer to the related description of step 103 in the embodiment of fig. 1; details are not repeated here.

As an example, if the characters sorted in descending order of frequency are "64729e03f85d1" and the preset complete ternary tree contains 16 leaf nodes, the first encoding table constructed from these characters and the ternary codes of the leaf nodes may be the encoding table shown in Table 2 above.

Thereafter, the target ternary data "210222202122212211221012220121002212201112100210202001222222222220112201022101" may be decoded according to the first encoding table shown in Table 2, giving the character sequence "436f64696e6720627920487566666d616e".
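The table lookup can be sketched as a prefix decoder. The code list below holds the 13 shortest leaf codes of a 16-leaf tree in an assumed order (shortest first, right to left within a length), chosen because it reproduces the example above; Table 2 remains authoritative, and the function names are hypothetical.

```python
from typing import Dict, Sequence


def build_decode_table(chars: Sequence[str], codes: Sequence[str]) -> Dict[str, str]:
    """Map ternary code -> character; chars and codes are paired in order."""
    if len(chars) > len(codes):
        raise ValueError("more characters than leaf codes")
    return dict(zip(codes, chars))


def decode_ternary(data: str, table: Dict[str, str]) -> str:
    """Prefix-decode ternary digits; leaf codes never prefix one another."""
    out, current = [], ""
    for digit in data:
        current += digit
        if current in table:
            out.append(table[current])
            current = ""
    if current:
        raise ValueError("trailing digits do not form a complete code")
    return "".join(out)


# ASSUMED code order (not taken from Table 2): shortest first,
# right to left within each code length.
CODES = ["22", "21", "20", "12", "11", "101", "100",
         "022", "021", "020", "012", "011", "010"]
table = build_decode_table("64729e03f85d1", CODES)
print(decode_ternary("2102222021", table))  # 436f
```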

Step 404: Determine the binary data corresponding to the character sequence.

That is, the character sequence may be converted into binary data, thereby obtaining the decoded binary data. For example, if the character sequence consists of ASCII characters, the ASCII characters may be converted into binary data.
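As an illustration of this last step, the sketch below assumes n = 4, so that each character of the sequence is a hexadecimal digit standing for one 4-bit coding unit. That reading is an assumption drawn from the example character sequence, and `chars_to_bytes` is a hypothetical name.

```python
def chars_to_bytes(char_sequence: str) -> bytes:
    """Pack hexadecimal characters (4 bits each) into bytes.

    Assumes a 4-bit coding unit; an odd number of characters is rejected
    by bytes.fromhex.
    """
    return bytes.fromhex(char_sequence)


data = chars_to_bytes("436f64696e6720627920487566666d616e")
print(data)  # b'Coding by Huffman'
```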

In the embodiments of the present application, a complete ternary tree with a fixed rule is preset. When a base sequence is decoded, it is first decoded according to the preset correspondence to obtain ternary data, and the ternary data can then be decoded directly according to the complete ternary tree with the fixed rule. Because the encoding table used for decoding is constructed from the preset complete ternary tree with the fixed rule, there is no need to construct a Huffman tree with the minimum weighted path length based on character frequencies during decoding; the decoding rule is simpler, and decoding efficiency is improved. Moreover, because the decoding rule is simple, it does not need to be stored in the encoded data of the DNA or outside the encoded data; it can instead serve as the default rule for converting between binary data and ternary data, so that a base sequence to be decoded can be decoded by default according to this rule, which improves decoding efficiency while ensuring data reliability.

Fig. 5 is a block diagram of an encoding apparatus 500 according to an embodiment of the present disclosure, which may be integrated in a computer device. As shown in fig. 5, the apparatus includes:

a first determining module 501, configured to determine a character sequence corresponding to binary data to be encoded;

a second determining module 502, configured to determine a plurality of characters appearing in the character sequence and a frequency of each character, where the frequency of each character is used to indicate the number of times each character appears in the character sequence;

a first constructing module 503, configured to construct a first encoding table according to the frequencies of the plurality of characters and the ternary codes of the leaf nodes in a preset complete ternary tree, where the preset complete ternary tree is a complete ternary tree containing 2^n leaf nodes, the paths between each parent node and its child nodes in the preset complete ternary tree are encoded as 0, 1, and 2 in order, the first encoding table indicates the ternary code corresponding to each of the plurality of characters, and n is the number of bits in a coding unit of the binary data;

a first encoding module 504, configured to encode the character sequence according to the first encoding table to obtain target ternary data;

a second encoding module 505, configured to encode the target ternary data according to a preset corresponding relationship, and determine a target base sequence according to the encoding result, where the preset corresponding relationship is used to indicate a corresponding relationship between the ternary data and base pairs.

Optionally, the first constructing module 503 is configured to:

sort the plurality of characters in descending order of frequency to obtain the sorted characters;

select, from the ternary codes of the leaf nodes in the preset complete ternary tree and in order of code length from short to long, the top-ranked ternary codes, where the number of selected ternary codes equals the number of characters;

sort the selected ternary codes in order of code length from short to long to obtain the sorted ternary codes;

establish a one-to-one correspondence between the sorted characters and the sorted ternary codes to obtain the first encoding table, in which the higher the frequency of a character, the shorter its ternary code.
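The four steps above can be sketched as follows. This is a minimal sketch on a made-up input: ties between equally frequent characters are broken by first appearance, which is an assumption, and the order of codes within one length is left to the caller. The name `build_first_table` is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def build_first_table(char_sequence: str, leaf_codes: Sequence[str]) -> Dict[str, str]:
    """Pair characters (most frequent first) with codes (shortest first)."""
    counts = Counter(char_sequence)
    # sorted() is stable, so ties keep first-appearance order (an assumption).
    chars: List[str] = sorted(counts, key=lambda c: -counts[c])
    if len(chars) > len(leaf_codes):
        raise ValueError("more distinct characters than leaf codes")
    # Stable sort by length keeps the caller's order within each length.
    codes = sorted(leaf_codes, key=len)[:len(chars)]
    return dict(zip(chars, codes))


# Leaf codes of a 5-leaf complete ternary tree, used only to keep the example small.
table = build_first_table("abacabad", ["1", "2", "00", "01", "02"])
print(table)  # {'a': '1', 'b': '2', 'c': '00', 'd': '01'}
```

Because the tree is fixed in advance, only the sorted character list has to travel with the data; the decoder can rebuild the same table from it.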

Optionally, constructing the first encoding table according to the frequencies of the plurality of characters and the ternary codes of the leaf nodes in the preset complete ternary tree includes:

numbering the leaf nodes in the preset complete ternary tree sequentially by the code length of their ternary codes to obtain the sequence number of each leaf node;

establishing a correspondence between the sequence number of each leaf node and its ternary code to obtain a second encoding table;

constructing the first encoding table according to the frequencies of the plurality of characters and the second encoding table.

Optionally, the first determining module 501 is configured to:

convert the binary data into ASCII characters to obtain the character sequence.

Optionally, the preset correspondence includes a plurality of two-digit ternary data items and the base pair corresponding to each two-digit ternary data item.

Optionally, the preset correspondence includes at least one of the following correspondences:

00 corresponds to AT;

11 corresponds to AC;

22 corresponds to AG;

01 corresponds to GC;

02 corresponds to GT;

10 corresponds to CT;

20 corresponds to GA;

12 corresponds to GG;

21 corresponds to TG.
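The correspondence listed above can be applied two ternary digits at a time, as in the sketch below. The function name `ternary_to_bases` is hypothetical, and because the handling of an odd number of digits is not described here, the sketch simply rejects such input.

```python
# Two ternary digits -> base pair, transcribed from the list above.
TERNARY_TO_PAIR = {
    "00": "AT", "11": "AC", "22": "AG",
    "01": "GC", "02": "GT", "10": "CT",
    "20": "GA", "12": "GG", "21": "TG",
}


def ternary_to_bases(ternary: str) -> str:
    """Encode ternary digits two at a time into base pairs."""
    if len(ternary) % 2:
        # Padding for odd-length input is not specified here, so reject it.
        raise ValueError("expected an even number of ternary digits")
    return "".join(TERNARY_TO_PAIR[ternary[i:i + 2]] for i in range(0, len(ternary), 2))


print(ternary_to_bases("111020011021"))  # ACCTGAGCCTTG
```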

Optionally, the second encoding module 505 includes:

an encoding unit, configured to encode the target ternary data according to the preset correspondence to obtain a first base sequence;

a replacing unit, configured to replace the initiation codon appearing in the first base sequence to obtain a second base sequence, where the second base sequence does not include the initiation codon;

a determining unit, configured to determine the target base sequence based on the second base sequence.

Optionally, the initiation codon is a designated three-base sequence, and the replacing unit is configured to:

perform, according to a preset replacement rule, base replacement on the last two bases of the two base pairs in which the initiation codon appears in the first base sequence, until the first base sequence after base replacement no longer contains the initiation codon.

Optionally, the preset replacement rule includes at least one of the following replacement rules:

replacing GT with AA;

replacing GA with TT;

replacing GG with CC;

replacing GC with TA;

replacing GT with CA.

Optionally, the determining unit is configured to:

performing base substitution on the second base sequence according to a preset substitution target and a preset substitution rule, wherein the preset substitution target comprises at least one of GC content requirements and GC distribution requirements;

and determining the second base sequence after replacement which meets the preset replacement target as the target base sequence.
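A GC content check of the kind mentioned above can be sketched as follows. The 0.4 to 0.6 window is a placeholder rather than a value from this application, the GC distribution requirement is not covered, and both function names are hypothetical.

```python
def gc_content(bases: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = bases.replace(" ", "").upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def meets_gc_target(bases: str, low: float = 0.4, high: float = 0.6) -> bool:
    """Check a HYPOTHETICAL GC content window; the bounds are placeholders."""
    return low <= gc_content(bases) <= high


print(gc_content("AC GT"))        # 0.5
print(meets_gc_target("GG CC"))   # False
```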

Optionally, the apparatus further comprises:

a conversion module, configured to convert the number of characters into ternary data to obtain first data, where the number of characters is the number of distinct characters, and to convert the plurality of characters, sorted in descending order of frequency, into ternary data to obtain second data;

a splicing module, configured to splice the first data, the second data, and the target ternary data to obtain spliced ternary data;

where the second encoding module 505 is configured to encode the spliced ternary data according to the preset correspondence.
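The conversion and splicing modules can be sketched together. The sketch assumes a 3-digit count field (m = 3) and 3 ternary digits per character, and it stores each character by its value as a hexadecimal digit, which is an assumption. The names `to_ternary` and `splice` are hypothetical.

```python
def to_ternary(value: int, width: int) -> str:
    """Fixed-width ternary string for a non-negative integer."""
    digits = []
    for _ in range(width):
        value, rem = divmod(value, 3)
        digits.append(str(rem))
    if value:
        raise ValueError("value does not fit in the requested width")
    return "".join(reversed(digits))


def splice(chars_by_frequency: str, target_ternary: str, m: int = 3, width: int = 3) -> str:
    """first data (count) + second data (characters) + target ternary data."""
    first = to_ternary(len(chars_by_frequency), m)
    # Assumption: each character is a hexadecimal digit, stored by value.
    second = "".join(to_ternary(int(c, 16), width) for c in chars_by_frequency)
    return first + second + target_ternary


print(to_ternary(13, 3))       # 111
print(splice("647", "2221"))   # 0100200110212221
```

A 3-digit count field can represent at most 26 distinct characters, which is why a larger m such as 6 is needed for larger coding units.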

Optionally, the apparatus further includes a second constructing module, configured to:

construct a complete ternary tree containing 2^n leaf nodes;

encode the paths between each parent node and its child nodes in the complete ternary tree as 0, 1, and 2 in order from left to right to obtain the preset complete ternary tree, where the ternary code of each leaf node in the preset complete ternary tree is the path code from the root node to that leaf node.

Optionally, the apparatus further comprises:

a segmentation module, configured to perform segmentation processing on the target base sequence to obtain a plurality of first segmented base sequences;

an adding module, configured to add segmentation information to each of the plurality of first segment base sequences to obtain a plurality of second segment base sequences, where the segmentation information includes at least one of a primer, address information, and an error correction code;

a storage module for generating a plurality of DNAs based on the plurality of second segmented base sequences.
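The segmentation module can be illustrated with a short sketch. The segment length and the use of a plain integer index as address information are assumptions, primers and error correction codes are omitted, and the name `segment` is hypothetical.

```python
from typing import List, Tuple


def segment(bases: str, length: int) -> List[Tuple[int, str]]:
    """Split a base sequence into (address, payload) chunks of `length` bases.

    The integer address is a stand-in for the address information; the
    last chunk may be shorter than `length`.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    return [(i // length, bases[i:i + length]) for i in range(0, len(bases), length)]


print(segment("ACCTGAGCCTTG", 5))  # [(0, 'ACCTG'), (1, 'AGCCT'), (2, 'TG')]
```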

In the embodiment of the application, a complete ternary tree with a fixed rule is preset, so that when binary data are coded, a coding table can be directly constructed on the basis of the complete ternary tree with the fixed rule and character frequencies in a character sequence corresponding to the binary data, then the binary data are coded into ternary data on the basis of the constructed coding table, and the ternary data are coded into a base sequence on the basis of a preset corresponding relation. Because the complete ternary tree with the preset fixed rule is used for constructing the coding table, a Huffman tree with the minimum weighted path length does not need to be constructed based on the character frequency during coding, the coding rule is simpler, and the coding efficiency is improved.

It should be noted that: in the encoding apparatus provided in the above embodiment, only the division of the functional modules is illustrated, and in practical applications, the functions may be allocated to different functional modules according to needs, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the encoding apparatus and the encoding method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.

Fig. 6 is a block diagram of a decoding apparatus 600 according to an embodiment of the present application, which may be integrated in a computer device. As shown in fig. 6, the apparatus includes:

an obtaining module 601, configured to obtain a target base sequence to be decoded;

a first decoding module 602, configured to decode the target base sequence according to a preset correspondence, and determine target ternary data based on a decoding result, where the preset correspondence is used to indicate a correspondence between the ternary data and base pairs;

a second decoding module 603, configured to decode the target ternary data according to the ternary codes of the leaf nodes in a preset complete ternary tree to obtain a character sequence, where the preset complete ternary tree is a complete ternary tree containing 2^n leaf nodes, the paths between each parent node and its child nodes in the preset complete ternary tree are encoded as 0, 1, and 2 in order, and n is the number of bits in a coding unit of the binary data;

a first determining module 604, configured to determine binary data corresponding to the character sequence.

Optionally, the obtaining module 601 is configured to:

acquiring a second segmented base sequence stored in each DNA in the plurality of DNAs to obtain a plurality of second segmented base sequences, wherein each second segmented base sequence comprises a first segmented base sequence and segmentation information, and the segmentation information comprises at least one of a primer, address information and an error correction code;

and acquiring a first segmented base sequence in the plurality of second segmented base sequences according to the segmentation information in the plurality of second segmented base sequences, and performing splicing treatment on the first segmented base sequence in the plurality of second segmented base sequences to obtain the target base sequence.

Optionally, the first decoding module 602 is configured to:

decoding the target base sequence according to the preset corresponding relation to obtain splicing ternary data;

and acquiring first data, second data and the target ternary data from the spliced ternary data, wherein the first data is obtained by carrying out ternary conversion on the number of characters, and the second data is obtained by carrying out ternary conversion on various characters in the sequence from large to small according to the frequency of the characters.

Optionally, the apparatus further comprises:

a second determining module, configured to determine, from the first data and the second data respectively, the number of characters and the plurality of characters sorted in descending order of frequency;

where the second decoding module 603 is configured to:

construct a first encoding table according to the plurality of characters sorted in descending order of frequency and the ternary codes of the leaf nodes in the preset complete ternary tree, where the first encoding table indicates the ternary code corresponding to each of the plurality of characters;

decode the target ternary data according to the first encoding table to obtain the character sequence.

Optionally, the apparatus further comprises:

a replacing module, which is used for carrying out base replacement on the target base sequence according to a reverse replacing rule of a preset replacing rule to obtain a first base sequence;

the first decoding module 602 is configured to decode the first base sequence according to the preset corresponding relationship.

Optionally, the preset replacement rule includes at least one of the following replacement rules:

replacing GT with AA;

replacing GA with TT;

replacing GG with CC;

replacing GC with TA;

replacing GT with CA.

In the embodiments of the present application, a complete ternary tree with a fixed rule is preset. When a base sequence is decoded, it is first decoded according to the preset correspondence to obtain ternary data, and the ternary data can then be decoded directly according to the complete ternary tree with the fixed rule. Because the encoding table used for decoding is constructed from the preset complete ternary tree with the fixed rule, there is no need to construct a Huffman tree with the minimum weighted path length based on character frequencies during decoding; the decoding rule is simpler, and decoding efficiency is improved. Moreover, because the decoding rule is simple, it does not need to be stored in the encoded data of the DNA or outside the encoded data; it can instead serve as the default rule for converting between binary data and ternary data, so that a base sequence to be decoded can be decoded by default according to this rule, which improves decoding efficiency while ensuring data reliability.

It should be noted that: in the decoding device provided in the above embodiment, only the division of the above functional modules is taken as an example for decoding, and in practical applications, the above function distribution may be completed by different functional modules as needed, that is, the internal structure of the device is divided into different functional modules to complete all or part of the above described functions. In addition, the decoding apparatus and the decoding method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.

Fig. 7 is a block diagram of a computer device 700 according to an embodiment of the present disclosure. The computer device 700 may be an encoding apparatus or a decoding apparatus. By way of example, the computer device 700 may be an electronic device such as a mobile phone, a tablet computer, a smart television, a multimedia player device, a wearable device, a desktop computer, a server, and so on. The computer device 700 may be used to implement the encoding method or the decoding method provided in the above-described embodiments.

Generally, the computer device 700 includes: a processor 701 and a memory 702.

The processor 701 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 701 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 701 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 701 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 701 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

Memory 702 may include one or more computer-readable storage media, which may be non-transitory. Memory 702 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 702 is used to store at least one instruction for execution by processor 701 to implement the encoding method or decoding method provided by method embodiments herein.

In some embodiments, the computer device 700 may also optionally include: a peripheral interface 703 and at least one peripheral. The processor 701, the memory 702, and the peripheral interface 703 may be connected by buses or signal lines. Various peripheral devices may be connected to peripheral interface 703 via a bus, signal line, or circuit board. Specifically, the peripheral device may include: at least one of a display screen 704, an audio circuit 705, a communication interface 706, and a power supply 707.

Those skilled in the art will appreciate that the configuration illustrated in FIG. 7 is not intended to be limiting of the computer device 700 and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components may be employed.

In an exemplary embodiment, there is also provided a computer-readable storage medium having stored thereon instructions that, when executed by a processor, implement the above-described encoding method or decoding method.

In an exemplary embodiment, there is also provided a computer program product for implementing the above-described encoding method or decoding method when the computer program product is executed.

In an exemplary embodiment, an encoding and decoding system is further provided. The system includes an encoding apparatus and a decoding apparatus, where the encoding apparatus is configured to implement the above encoding method, and the decoding apparatus is configured to implement the above decoding method.

It should be understood that "a plurality" herein means two or more. "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates an "or" relationship between the associated objects before and after it.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.
