Method and device for compressing and decompressing extensible markup language (XML) document

Document No.: 1087373 Publication date: 2020-10-20

Abstract: This technology, "Method and device for compressing and decompressing extensible markup language (XML) document", was devised by 薛军超 on 2020-05-18. The invention provides a method and a device for compressing and decompressing an XML document. The compression method comprises the following steps: receiving and reading the XML document before compression; counting the number of uses of the tag strings and of the attribute strings in the XML document separately; sorting the tag strings and the attribute strings separately according to the counts; mapping the tag strings or attribute strings to mapping characters in sorted order and establishing a mapping table; and replacing the tag strings or attribute strings in the XML document with the corresponding mapping characters according to the mapping table to obtain the compressed XML document. By mapping the long tag and attribute strings that recur in large numbers to single-byte, or at most double-byte, characters, the invention significantly reduces the storage occupied by repeated tag and attribute strings and thereby achieves compression.

1. A method for compressing an extensible markup language (XML) document, characterized by comprising the following steps:

receiving and reading an XML document before compression;

counting the number of uses of the tag strings and of the attribute strings in the XML document separately;

sorting the tag strings and the attribute strings separately according to the counting result;

mapping the tag strings or the attribute strings to mapping characters in sequence according to the sorting result, and establishing a mapping table;

and replacing the tag strings or the attribute strings in the XML document with the corresponding mapping characters according to the mapping table to obtain the compressed XML document.

2. The method for compressing an XML document according to claim 1, wherein sorting the tag strings and the attribute strings separately comprises: sorting the tag strings and the attribute strings separately from high to low by number of uses;

and, for tag strings or attribute strings with the same number of uses, sorting by the number of characters in the corresponding string.

3. The method for compressing an XML document according to claim 1, wherein the number of uses is counted with a counter, and each time a tag string or attribute string occurs, the count of the corresponding string is increased by one.

4. The method of compressing extensible markup language (XML) documents according to claim 1, wherein said mapping tables include a tag mapping table and an attribute mapping table, said tag mapping table and said attribute mapping table being stored in a compressed package as separate files.

5. The method for compressing an XML document according to claim 1, wherein said mapping characters are single-byte characters, and there are 115 such mapping characters, each of which conforms to the XML standard definition and can be used as a tag name.

6. The method for compressing an XML document according to claim 1, wherein said mapping characters are double-byte characters, the first byte of a double-byte character being the underscore character "_" and the second byte being a single-byte character;

or the first byte and the second byte of the double-byte character are both single-byte characters.

7. A method for decompressing an extensible markup language (XML) document, characterized by comprising the following steps:

acquiring a compressed package, wherein the compressed package comprises a compressed XML document and a mapping table;

reading the compressed XML document and the mapping table in the compressed package;

looking up, in the mapping table, the mapping relation between the tag strings or attribute strings and the mapping characters;

and replacing the mapping characters with their corresponding tag strings and attribute strings according to the mapping relation to obtain the XML document before compression.

8. An apparatus for compressing an extensible markup language (XML) document, comprising:

a receiving and reading module, used for receiving and reading the XML document before compression;

a counting module, used for counting the number of uses of the tag strings and of the attribute strings in the XML document separately;

a sorting module, used for sorting the tag strings and the attribute strings separately according to the counting result;

a mapping module, used for mapping the tag strings or the attribute strings to the mapping characters in sequence according to the sorting result and establishing a mapping table;

and a replacing module, used for replacing the tag strings or the attribute strings in the XML document with the corresponding mapping characters according to the mapping table to obtain the compressed XML document.

9. The apparatus for compressing an extensible markup language (XML) document according to claim 8, wherein said mapping table comprises a tag mapping table and an attribute mapping table, the apparatus further comprising a storage module for storing said tag mapping table and said attribute mapping table in a compressed package as separate files.

10. An apparatus for decompressing an extensible markup language (XML) document, comprising:

a compressed package acquisition module, used for acquiring a compressed package, the compressed package comprising a compressed XML document and a mapping table;

a compressed package reading module, used for reading the compressed XML document and the mapping table in the compressed package;

a mapping relation lookup module, used for looking up, in the mapping table, the mapping relation between the tag strings or attribute strings and the mapping characters;

and a mapping character replacing module, used for replacing the mapping characters with their corresponding tag strings and attribute strings according to the mapping relation to obtain the XML document before compression.

Technical Field

The invention relates to the technical field of computers, in particular to a method and a device for compressing and decompressing an extensible markup language (XML) document.

Background

OFD (Open Fixed-layout Document) is China's fixed-layout document format, defined by the national standard GB/T 33190-2016 on fixed-layout documents for electronic file storage and exchange. OFD meets the needs of informatization in China and is becoming one of the basic standards for information applications across industries. In a fixed-layout document, the page content carries fixed appearance information such as position, size, and color, so it is displayed identically on every terminal.

An OFD file is a compressed package in ZIP format with the file suffix ".ofd", so an OFD file is essentially a ZIP package. According to the standard, an OFD file contains two types of data files: one is the document and page information stored in XML format, which defines the basic information of the fixed-layout document; the other is resource information stored in other formats, such as font, image, and multimedia resource files.

XML (Extensible Markup Language) is an internationally popular general-purpose data description format with many advantages, such as being simple, easy to understand, and easy to apply. OFD adopts XML as the format for describing the basic information of the fixed-layout document, so XML can be regarded as the core data format of OFD.

The XML approach to marking up data also has a shortcoming. In OFD, a large number of XML files use the same markup, and these markup names are human-readable strings. An OFD file therefore contains a large number of repeated strings, a consequence of the format characteristics of XML. This makes OFD documents large, and a large amount of system memory and storage space is consumed when compressing or decompressing them.

Disclosure of Invention

In view of the above problems, the present invention provides a method and an apparatus for compressing and decompressing an extensible markup language (XML) document.

To solve the above technical problems, the invention adopts the following technical scheme: a method for compressing an XML document, comprising the following steps: receiving and reading an XML document before compression; counting the number of uses of the tag strings and of the attribute strings in the XML document separately; sorting the tag strings and the attribute strings separately according to the counting result; mapping the tag strings or the attribute strings to mapping characters in sequence according to the sorting result, and establishing a mapping table; and replacing the tag strings or the attribute strings in the XML document with the corresponding mapping characters according to the mapping table to obtain the compressed XML document.

Preferably, sorting the tag strings and the attribute strings separately specifically includes: sorting the tag strings and the attribute strings separately from high to low by number of uses; and, for tag strings or attribute strings with the same number of uses, sorting by the number of characters in the corresponding string.

Preferably, the number of uses is counted with a counter, and each time a tag string or attribute string occurs, the count of the corresponding string is increased by one.

Preferably, the mapping table includes a tag mapping table and an attribute mapping table, and the tag mapping table and the attribute mapping table are stored in the compressed package as independent files.

Preferably, the mapping characters are single-byte characters, and there are 115 such mapping characters, each of which conforms to the XML standard definition and can be used as a tag name.

Preferably, the mapping character is a double-byte character, the first byte of which is the underscore character "_" and the second byte of which is a single-byte character; or the first byte and the second byte of the double-byte character are both single-byte characters.

According to another aspect of the present invention, there is provided a method for decompressing an XML document, including the following steps: acquiring a compressed package, wherein the compressed package comprises a compressed XML document and a mapping table; reading the compressed XML document and the mapping table in the compressed package; looking up, in the mapping table, the mapping relation between the tag strings or attribute strings and the mapping characters; and replacing the mapping characters with their corresponding tag strings and attribute strings according to the mapping relation to obtain the XML document before compression.

According to another aspect of the present invention, there is provided an apparatus for compressing an XML document, including: a receiving and reading module, used for receiving and reading the XML document before compression; a counting module, used for counting the number of uses of the tag strings and of the attribute strings in the XML document separately; a sorting module, used for sorting the tag strings and the attribute strings separately according to the counting result; a mapping module, used for mapping the tag strings or the attribute strings to the mapping characters in sequence according to the sorting result and establishing a mapping table; and a replacing module, used for replacing the tag strings or the attribute strings in the XML document with the corresponding mapping characters according to the mapping table to obtain the compressed XML document.

Preferably, the mapping table includes a tag mapping table and an attribute mapping table, and the apparatus further includes a storage module configured to store the tag mapping table and the attribute mapping table in the compressed package as independent files.

According to another aspect of the present invention, there is provided an apparatus for decompressing an XML document, including: a compressed package acquisition module, used for acquiring a compressed package, the compressed package comprising a compressed XML document and a mapping table; a compressed package reading module, used for reading the compressed XML document and the mapping table in the compressed package; a mapping relation lookup module, used for looking up, in the mapping table, the mapping relation between the tag strings or attribute strings and the mapping characters; and a mapping character replacing module, used for replacing the mapping characters with their corresponding tag strings and attribute strings according to the mapping relation to obtain the XML document before compression.

Compared with the prior art, the invention has the following beneficial effects. Long tag and attribute strings that recur in large numbers are mapped to single-byte, or at most double-byte, characters, which significantly reduces the storage occupied by the repeated tag and attribute strings and thereby achieves compression. Because the tag and attribute strings are counted and sorted before the mapping replacement is carried out, the strings that occur most often are mapped to single characters first, which reduces storage further.

Drawings

The disclosure of the present invention is illustrated with reference to the accompanying drawings. It is to be understood that the drawings are designed solely for the purposes of illustration and not as a definition of the limits of the invention. In the drawings, like reference numerals are used to refer to like parts. Wherein:

FIG. 1 is a flow chart of a method for compressing XML documents according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a compressing apparatus for XML documents according to an embodiment of the present invention;

FIG. 3 is a flowchart of a decompression method of an XML document according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an apparatus for decompressing an XML document according to an embodiment of the present invention.

Detailed Description

It is easily understood that, based on the technical solution of the present invention, a person skilled in the art can propose various alternative structures and implementations without departing from the spirit of the present invention. Therefore, the following detailed description and the accompanying drawings merely illustrate the technical solution of the present invention and should not be construed as the whole of the present invention or as limiting its technical solution.

An embodiment according to the present invention is shown in FIG. 1. A method for compressing an extensible markup language (XML) document comprises the following steps:

S101: Receive and read the XML document before compression. The XML document comprises a plurality of XML files, and each XML file contains tag strings and attribute strings.

S102: Count the number of uses of the tag strings and of the attribute strings in the XML document separately. Specifically, a counter tallies the uses of the tag strings or attribute strings in an XML file: each time a tag string or attribute string occurs, the count of the corresponding string is increased by one, until all XML files have been counted.

S103: Sort the tag strings and the attribute strings separately according to the counting result. Strings are ordered from high to low by number of uses; tag strings or attribute strings with the same number of uses are ordered by the number of characters in the string, in either ascending or descending order.
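Steps S102 and S103 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it parses each file with Python's `xml.etree.ElementTree`, so a tag is counted once per element rather than once per textual occurrence, the sample documents carry no namespaces, and ties are broken by descending length (the text allows either direction). The helper names `count_names` and `rank` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def count_names(xml_texts):
    """Tally tag names and attribute names across several XML files."""
    tag_counts, attr_counts = Counter(), Counter()
    for text in xml_texts:
        for elem in ET.fromstring(text).iter():
            tag_counts[elem.tag] += 1           # add one per occurrence
            for attr_name in elem.attrib:
                attr_counts[attr_name] += 1
    return tag_counts, attr_counts

def rank(counts):
    """Order names by use count (high to low); ties go to the longer name.

    The final key (the name itself) only makes the order deterministic.
    """
    return sorted(counts, key=lambda name: (-counts[name], -len(name), name))

# Two tiny sample files (made up for illustration).
docs = [
    '<Page><Layer ID="1"><TextObject ID="2" Boundary="0 0 10 10"/></Layer></Page>',
    '<Page><Layer ID="3"><TextObject ID="4"/><TextObject ID="5"/></Layer></Page>',
]
tags, attrs = count_names(docs)
print(rank(tags))   # ['TextObject', 'Layer', 'Page']
print(rank(attrs))  # ['ID', 'Boundary']
```

Tags and attributes are tallied in separate counters because the method keeps a separate mapping table for each.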

S104: Map the tag strings or attribute strings to the mapping characters in sequence according to the sorting result, and establish a mapping table. The mapping table comprises a tag mapping table and an attribute mapping table, which are stored in the compressed package as independent files.

In this embodiment, the mapping characters include single-byte characters and double-byte characters, and there are 115 single-byte mapping characters that conform to the XML standard definition and can be used as tag names, as shown in table 1.

ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcde

fghijklmnopqrstuvwxyz

[Figures BDA0002495975800000041 to BDA0002495975800000043: the remaining rows of Table 1 are images in the source and are not reproduced here.]

TABLE 1

When the number of tag strings or attribute strings does not exceed 115, they can be mapped one by one to the single-byte mapping characters in Table 1, so that every multi-byte tag string or attribute string in the XML document is mapped to a single-byte mapping character.

When the number of tag strings or attribute strings exceeds 115, the first 115 strings are mapped one by one according to Table 1, and the strings after the 115th are mapped to double-byte characters. The first byte of these double-byte characters is fixed as the underscore character "_", and the second byte is again a single-byte mapping character from Table 1, so that at most 230 tag strings or attribute strings can be mapped.

Specifically, when the number of tag strings or attribute strings in the XML document exceeds 230, the second byte of the double-byte character still runs through the single-byte mapping characters of Table 1 in order, and the first byte may be any single-byte mapping character. This gives 115 + 115 × 115 = 13340 mappings in total, which meets the mapping requirements of most XML documents.
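The three tiers of code assignment described above can be sketched as a generator. This is an illustration only: the full 115-character Table 1 is not recoverable from the text, so `ALPHABET` below is an assumption that holds just the 53 characters that survive, and the names `mapping_codes`, `build_mapping`, and `capacity` are hypothetical. Because "_" is itself one of the single-byte mapping characters, the "_"-prefixed pairs are part of the 115 × 115 two-character space, which is why the total is 115 + 115 × 115.

```python
# Assumption: only the 53 characters of Table 1 that survive as text.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"

def mapping_codes(alphabet=ALPHABET):
    """Yield codes in assignment order: single characters, then
    '_'-prefixed pairs, then every remaining two-character pair."""
    yield from alphabet
    for second in alphabet:
        yield "_" + second
    for first in alphabet:
        if first == "_":
            continue  # these pairs were already produced above
        for second in alphabet:
            yield first + second

def build_mapping(ranked_names, alphabet=ALPHABET):
    """Give the most frequent names the shortest codes."""
    codes = mapping_codes(alphabet)
    table = {}
    for name in ranked_names:
        code = next(codes, None)
        if code is None:
            raise ValueError("more names than available codes")
        table[name] = code
    return table

def capacity(n):
    """Number of distinct codes for an alphabet of n characters."""
    return n + n * n

print(capacity(115))  # 13340
print(build_mapping(["TextObject", "Layer", "Page"]))
# {'TextObject': 'A', 'Layer': 'B', 'Page': 'C'}
```

Feeding the generator a frequency-ranked list is what lets the most-used strings receive the one-character codes.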

S105: Replace the tag strings or attribute strings in the XML document with the corresponding mapping characters according to the mapping table, and store the result to obtain the compressed XML document.
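The replacement in S105 might look like the sketch below. It assumes the document is parsed and re-serialized with `xml.etree.ElementTree`, which can normalize whitespace and drops the XML declaration; the source does not say whether the replacement is done on a parsed tree or directly on the text. The function name `rename` and the sample tables are hypothetical.

```python
import xml.etree.ElementTree as ET

def rename(xml_text, tag_map, attr_map):
    """Replace tag and attribute names using two separate mapping tables."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.tag = tag_map.get(elem.tag, elem.tag)
        renamed = {attr_map.get(k, k): v for k, v in elem.attrib.items()}
        elem.attrib.clear()
        elem.attrib.update(renamed)
    return ET.tostring(root, encoding="unicode")

# Hypothetical mapping tables; tags and attributes each have their own table,
# so the code "A" can stand for a tag in one and an attribute in the other.
tag_map = {"TextObject": "A", "Layer": "B", "Page": "C"}
attr_map = {"ID": "A", "Boundary": "B"}

original = '<Page><Layer ID="1"><TextObject ID="2" Boundary="0 0 10 10" /></Layer></Page>'
compressed = rename(original, tag_map, attr_map)
print(compressed)
print(len(original), len(compressed))
```

Attribute values and text content are left untouched; only names are shortened.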

The following takes the OFD file containing the XML document as an example to further explain a compression method of the XML document.

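In an OFD file, there are 147 XML tag strings in total that conform to the OFD standard, as shown in Table 2.

[Figure BDA0002495975800000051: Table 2 is an image in the source and is not reproduced here.]

TABLE 2

In an OFD file, there are 126 XML attribute strings in total that conform to the OFD standard, as shown in Table 3.

[Figure BDA0002495975800000052: Table 3 is an image in the source and is not reproduced here.]

TABLE 3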

In this embodiment, after the number of uses of the tag strings and attribute strings in the XML document has been counted, the tag strings and attribute strings are each sorted from high to low by number of uses. The tag strings or attribute strings are then mapped to the mapping characters in sorted order to establish a mapping table, and the mapping table is stored in the OFD compressed package as an independent file (for example, "fm.

TABLE 4

The tag strings or attribute strings in the XML document are then replaced with the corresponding mapping characters according to the mapping table to obtain the compressed XML document. That is, the mapping character "a" in Table 4 replaces the tag string "OFD" in the XML document.

As shown in FIG. 2, the present invention also discloses an apparatus for compressing an extensible markup language (XML) document, which comprises:

The receiving and reading module 110, used for receiving and reading the XML document before compression.

The counting module 120, used for counting the number of uses of the tag strings and of the attribute strings in the XML document separately.

The sorting module 130, used for sorting the tag strings and the attribute strings separately according to the counting result.

The mapping module 140, used for mapping the tag strings or attribute strings to the mapping characters in sequence according to the sorting result and establishing a mapping table, which includes a tag mapping table and an attribute mapping table.

The replacing module 150, used for replacing the tag strings or attribute strings in the XML document with the corresponding mapping characters according to the mapping table to obtain the compressed XML document.

The apparatus further includes a storage module 160, used for storing the tag mapping table and the attribute mapping table in the compressed package as independent files.
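What the storage module does can be illustrated with Python's standard `zipfile` module. This is a sketch under assumptions: the file names `tag_map.json` and `attr_map.json` and the JSON encoding are hypothetical choices made for the example, since the source does not specify the mapping file format.

```python
import io
import json
import zipfile

def write_package(xml_files, tag_map, attr_map):
    """Write compressed XML files plus the two mapping tables into a ZIP."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in xml_files.items():
            zf.writestr(name, text)
        # Hypothetical file names: one independent file per mapping table.
        zf.writestr("tag_map.json", json.dumps(tag_map))
        zf.writestr("attr_map.json", json.dumps(attr_map))
    return buf.getvalue()

package = write_package(
    {"Page.xml": '<C><B A="1"><A A="2" /></B></C>'},
    {"TextObject": "A", "Layer": "B", "Page": "C"},
    {"ID": "A"},
)
with zipfile.ZipFile(io.BytesIO(package)) as zf:
    print(zf.namelist())
```

Keeping the tables as ordinary package members means a reader that finds no mapping file can treat the package as a conventional one.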

As shown in FIG. 3, the present invention also discloses a method for decompressing an XML document, which comprises the following steps:

S201: Acquire a compressed package, wherein the compressed package comprises a compressed XML document and a mapping table.

S202: Read the compressed XML document and the mapping table in the compressed package. The mapping table file is read from the compressed package, wherein the mapping table file is the fm. Otherwise, decompression is carried out according to the conventional steps, and the subsequent steps need not be executed.

S203: Look up, in the mapping table, the mapping relation between the mapping characters and the tag strings or attribute strings.

S204: Read each XML file and, according to the mapping relation, replace the mapping characters with their corresponding tag strings and attribute strings to obtain the XML document before compression; then decompress according to the conventional steps.
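Steps S201 to S204 can be sketched end to end as follows. The sketch assumes the mapping tables are stored as JSON under the hypothetical names `tag_map.json` and `attr_map.json`, and that XML members end in `.xml`; neither detail comes from the source. When no mapping file is present, the XML files are returned unchanged, mirroring the "conventional steps" branch of S202.

```python
import io
import json
import zipfile
import xml.etree.ElementTree as ET

def restore(package_bytes):
    """Return {file name: XML text} with original tag and attribute names."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        names = zf.namelist()
        files = {n: zf.read(n).decode("utf-8") for n in names if n.endswith(".xml")}
        if "tag_map.json" not in names or "attr_map.json" not in names:
            return files  # no mapping tables: treat as a conventional package
        tag_map = json.loads(zf.read("tag_map.json"))
        attr_map = json.loads(zf.read("attr_map.json"))
    # Invert the tables: mapping character -> original string.
    tag_back = {code: name for name, code in tag_map.items()}
    attr_back = {code: name for name, code in attr_map.items()}
    restored = {}
    for file_name, text in files.items():
        root = ET.fromstring(text)
        for elem in root.iter():
            elem.tag = tag_back.get(elem.tag, elem.tag)
            renamed = {attr_back.get(k, k): v for k, v in elem.attrib.items()}
            elem.attrib.clear()
            elem.attrib.update(renamed)
        restored[file_name] = ET.tostring(root, encoding="unicode")
    return restored

# Build a small package in memory to exercise the function.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("Page.xml", '<C><B A="1"><A A="2" /></B></C>')
    zf.writestr("tag_map.json", json.dumps({"TextObject": "A", "Layer": "B", "Page": "C"}))
    zf.writestr("attr_map.json", json.dumps({"ID": "A"}))
result = restore(buf.getvalue())
print(result["Page.xml"])
```

Inverting each table separately matters because the same mapping character can denote a tag in one table and an attribute in the other.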

As shown in FIG. 4, the present invention also discloses an apparatus for decompressing an extensible markup language (XML) document, comprising:

The compressed package acquisition module 210, used for acquiring a compressed package, the compressed package comprising the compressed XML document and the mapping table.

The compressed package reading module 220, used for reading the compressed XML document and the mapping table in the compressed package.

The mapping relation lookup module 230, used for looking up, in the mapping table, the mapping relation between the mapping characters and the tag strings or attribute strings.

The mapping character replacing module 240, used for replacing the mapping characters with their corresponding tag strings and attribute strings according to the mapping relation to obtain the XML document before compression.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

It is easy to understand that the method and apparatus for compressing and decompressing an XML document proposed by the present invention are not limited to the XML documents in an OFD file. They can be extended to compressed package files that contain a large number of XML documents, such as OFD, DOCX, XLSX, and PPTX, and to all XML, HTML, and XHTML formats or other similar markup language formats.

In summary, the method and apparatus for compressing and decompressing an XML document replace long tag and attribute strings that recur in large numbers with single-byte, or at most double-byte, mapping characters, which significantly reduces storage and achieves compression. Because the tag and attribute strings are counted and sorted before the mapping replacement is carried out, the strings that occur most often are mapped to single characters first, which reduces storage further. Tests show that for OFD files the storage space after compression is reduced by 20% to 50%, and the compression effect is even more pronounced for some special files.

The technical scope of the present invention is not limited to the above description, and those skilled in the art can make various changes and modifications to the above-described embodiments without departing from the technical spirit of the present invention, and such changes and modifications should fall within the protective scope of the present invention.
