Method, device, equipment and storage medium for reading XML file

Document No.: 1937611 Publication date: 2021-12-07

Reading note: this technique, "Method, device, equipment and storage medium for reading XML file", was designed by 吴文兵 on 2020-06-01. Main content: The application discloses a method, an apparatus, a device, and a storage medium for reading an XML file. The method comprises: acquiring an XML file comprising M XML records, where M is an integer greater than or equal to 1; according to the sequence of the M XML records, sequentially reading the fields in each XML record by using N XPaths to correspondingly obtain N fields, where N is an integer greater than or equal to 1; and, according to the sequence of the M XML records, determining the line character string corresponding to the N fields read from each XML record, and converting each line character string into a character string satisfying the DataX protocol for processing by a DataX write thread. With this technical scheme, on the one hand, complex XML data can be processed because XML fields are read with XPath; on the other hand, records are read sequentially one at a time, so each resulting character string can be converted into a character string satisfying the DataX protocol for processing by a DataX write thread.

1. A method of reading an XML file, the method comprising:

acquiring an XML file comprising M XML records, wherein M is an integer greater than or equal to 1;

according to the sequence of the M XML records, sequentially reading the fields in each XML record by using N Xpaths to correspondingly obtain N fields, wherein N is an integer greater than or equal to 1;

and according to the sequence of the M XML records, determining a line character string corresponding to the N fields read from each XML record, and converting each line character string into a character string meeting a DataX protocol for the processing of a DataX write thread.

2. The method according to claim 1, wherein sequentially reading the fields in each XML record by using N XPaths according to the sequence of the M XML records to correspondingly obtain N fields comprises:

assembling N XPaths according to the reading requirement of the XML file;

and, while the M XML records are scanned in a stream scanning mode, sequentially reading the fields in each XML record by using the N XPaths according to the sequence of the M XML records to correspondingly obtain N fields.

3. The method according to claim 2, wherein assembling N XPaths according to the reading requirement of the XML file comprises:

determining N circular paths and N extraction fields which are in one-to-one correspondence with the N circular paths according to the reading requirement of the XML file;

and assembling an XPath from each of the N circular paths and its corresponding extraction field to obtain the N XPaths.

4. The method according to claim 1, wherein the determining the line character strings corresponding to the N fields read from each XML record according to the sequence of the M XML records comprises:

determining N fields read from each XML record according to the sequence of the M XML records;

and converting the N fields read from each XML record into a corresponding line character string by using a separator.

5. The method of claim 1, wherein converting each of the line strings into a string satisfying a DataX protocol for processing by a DataX writer thread comprises:

serializing each of the line strings into a byte sequence string;

and converting each byte sequence character string into a character string meeting the DataX protocol for the write thread processing of the DataX.

6. The method of claim 5, further comprising:

sequentially storing the M line character strings into a first blocking queue according to the sequence of obtaining the line character strings, wherein the first blocking queue is a queue supporting blocking operation;

and when the first blocking queue is determined to be not empty, sequentially extracting each line character string from the first blocking queue, and serializing each line character string into a byte sequence character string.

7. The method of claim 6, wherein converting each of the byte sequence strings into a string satisfying a DataX protocol for a DataX write thread to process comprises:

converting each byte sequence character string into a character string meeting a DataX protocol;

and storing the character strings meeting the DataX protocol into a second blocking queue, so that the writing thread of the DataX can sequentially extract each character string meeting the DataX protocol from the second blocking queue for processing.

8. An apparatus for reading an XML file, the apparatus comprising:

an acquisition module, configured to acquire an XML file comprising M XML records, wherein M is an integer greater than or equal to 1;

the reading module is used for reading the fields in each XML record by using N XPaths in sequence according to the sequence of the M XML records to correspondingly obtain N fields, wherein N is an integer greater than or equal to 1;

and the determining module is used for determining the line character strings corresponding to the N fields read from each XML record according to the sequence of the M XML records, and converting each line character string into a character string meeting a DataX protocol for the processing of a writing thread of the DataX.

9. A device for reading an XML file, comprising a memory and a processor, wherein the memory stores a computer program executable on the processor, and the processor, when executing the program, implements the steps of the method for reading an XML file according to any one of claims 1 to 7.

10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method for reading an XML file according to any one of claims 1 to 7.

Technical Field

The present application relates to database technology, and relates to, but is not limited to, a method, an apparatus, a device, and a storage medium for reading an XML file.

Background

DataX is an offline synchronization tool for heterogeneous data sources. It implements stable and efficient data synchronization between various heterogeneous data sources, including relational databases, the Hadoop Distributed File System (HDFS), Hive, Open Data Processing Service (ODPS), HBase, File Transfer Protocol (FTP), and the like. Hive is a data warehouse framework built on Hadoop and is a general-purpose, scalable data processing platform; HBase is a distributed, column-oriented open-source database.

Currently, DataX supports extraction from FTP and HDFS, but has the following problems: first, only text files (TXT) can be read; second, the schema of the TXT file must be a two-dimensional table, and a two-dimensional table cannot store unstructured data such as Extensible Markup Language (XML). As a result, the way DataX reads TXT files cannot be applied directly to reading XML files. XML is a subset of the Standard Generalized Markup Language and is a markup language for marking up electronic files so that they have structure.

Disclosure of Invention

In view of the above, the present application provides a method and an apparatus for reading an XML file, a device, and a storage medium to solve at least one problem in the prior art.

The technical scheme of the embodiment of the application is realized as follows:

In a first aspect, the present application provides a method for reading an XML file, the method comprising:

acquiring an XML file comprising M XML records, wherein M is an integer greater than or equal to 1; according to the sequence of the M XML records, sequentially reading the fields in each XML record by using N Xpaths to correspondingly obtain N fields, wherein N is an integer greater than or equal to 1; and according to the sequence of the M XML records, determining a line character string corresponding to the N fields read from each XML record, and converting each line character string into a character string meeting a DataX protocol for the processing of a DataX write thread.

In a second aspect, the present application provides an apparatus for reading an XML file, the apparatus comprising:

an acquisition module, configured to acquire an XML file comprising M XML records, wherein M is an integer greater than or equal to 1; a reading module, configured to sequentially read the fields in each XML record by using N XPaths according to the sequence of the M XML records to correspondingly obtain N fields, wherein N is an integer greater than or equal to 1; and a determining module, configured to determine the line character string corresponding to the N fields read from each XML record according to the sequence of the M XML records, and to convert each line character string into a character string satisfying the DataX protocol for processing by a DataX write thread.

In a third aspect, the present application provides a device for reading an XML file, including a memory and a processor, where the memory stores a computer program executable on the processor, and the processor implements the steps of the above method for reading an XML file when executing the program.

In a fourth aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above method for reading an XML file.

According to the method, an XML file needing to be read is firstly obtained, then, according to the sequence of M XML records, N XPaths are used for sequentially reading fields in each XML record to obtain N fields, and finally, the N fields read from each XML record are sequentially converted into corresponding character strings meeting the DataX protocol for the processing of the writing thread of the DataX. Therefore, according to the technical scheme provided by the application, on one hand, complex XML data can be processed by reading the XML field by using Xpath; on the other hand, the character strings are sequentially read one by one, and the read character strings can be converted into character strings meeting the DataX protocol for being processed by a writing thread of the DataX.

In some embodiments, it is first described how, according to the reading requirement of the XML file, N circular paths and the N extraction fields in one-to-one correspondence with them are assembled into N XPaths. It is then described that the document can be scanned sequentially by using a SAX element processing parser and that, during the scan, the fields in each XML record are read in turn with the N XPaths according to the sequence of the M XML records, correspondingly obtaining N fields. Thus, on the one hand, reading the fields record by record during the scan makes it convenient to process the read fields subsequently for the DataX write thread; on the other hand, reading XML with XPath makes it convenient to extract both element nodes and attribute nodes.

In some embodiments, it is described that a field is first converted into a corresponding line string by a delimiter, then each of the line strings is serialized into a byte sequence string, and finally each of the byte sequence strings is converted into a string satisfying a DataX protocol for processing by a DataX write thread. Thus, the method of separator and serialization can be used for converting the field into the character string meeting the DataX protocol for the write thread processing of the DataX.

In some embodiments, processing data by using blocking queues is described. The records parsed from the XML file are stored in a blocking queue for consumption by the DataX reading thread, so that DataX can read the XML file one record at a time. This avoids loading a large file into memory all at once and thereby reduces the memory pressure on DataX.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art without inventive efforts, wherein:

Fig. 1 is a schematic implementation flow chart of a method for reading an XML file according to an embodiment of the present application;

Fig. 2 is a schematic implementation flow chart of another method for reading an XML file according to an embodiment of the present application;

Fig. 3A is a schematic implementation flow chart of another method for reading an XML file according to an embodiment of the present application;

Fig. 3B is a schematic implementation flow chart of another method for reading an XML file according to an embodiment of the present application;

Fig. 4 is a schematic implementation flow chart of another method for reading an XML file according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of an apparatus for reading an XML file according to an embodiment of the present application;

Fig. 6 is a schematic hardware entity diagram of a device for reading an XML file according to an embodiment of the present application.

Detailed Description

An embodiment of the present application provides a method for reading an XML file. Fig. 1 is a schematic implementation flow chart of the method; as shown in Fig. 1, the method includes:

Step S101, obtaining an XML file comprising M XML records, wherein M is an integer greater than or equal to 1;

An XML file includes M XML records. To read an XML file, the XML file is first obtained.

Step S102, according to the sequence of the M XML records, sequentially reading the fields in each XML record by using N XPaths to correspondingly obtain N fields, wherein N is an integer greater than or equal to 1;

XPath (XML Path Language) is a language for locating parts of an XML document; it is used to navigate through the elements and attributes of an XML document. N XPaths indicate that N elements or attributes need to be read from each XML record. "According to the sequence of the M XML records" means that the document is scanned sequentially when the XML records are read: the N XPaths are used to read the fields in each XML record in turn, and each XML record correspondingly yields N fields.

Step S103, according to the sequence of the M XML records, determining a line character string corresponding to the N fields read from each XML record, and converting each line character string into a character string meeting a DataX protocol for the processing of a writing thread of DataX.

The M XML records are read in sequence, and each time an XML record is read, the line character string corresponding to the N fields read from that record is determined. After a line character string is obtained, it is converted into a character string satisfying the DataX protocol and provided to the DataX write thread, which consumes it and writes it to the target data source.

In the embodiment of the application, an XML file to be read is firstly obtained, then, according to the sequence of M XML records, N XPaths are used for sequentially reading the fields in each XML record to obtain N fields, and finally, the N fields read from each XML record are sequentially converted into corresponding character strings meeting the DataX protocol for the processing of the writing thread of the DataX. Therefore, according to the technical scheme provided by the embodiment of the application, on one hand, complex XML data can be processed by reading the XML field by using Xpath; on the other hand, the character strings are sequentially read one by one, and the read character strings can be converted into character strings meeting the DataX protocol for being processed by a writing thread of the DataX.
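The three steps above can be illustrated with a minimal Python sketch. It is not the DataX implementation (DataX itself is a Java tool); the sample document, the function names `read_field` and `read_rows`, and the comma separator are assumptions made for illustration. Python's standard `xml.etree.ElementTree` supports only a subset of XPath and cannot select an attribute with a trailing `/@id` step, so the sketch handles the attribute case by hand. For brevity it parses the whole document at once rather than streaming it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; element and attribute names are illustrative only.
SAMPLE_XML = """<ROOT><LEAF>
  <ITEM id="1"><advicegroup>A</advicegroup></ITEM>
  <ITEM id="2"><advicegroup>B</advicegroup></ITEM>
</LEAF></ROOT>"""


def read_field(record, field):
    """Return one field of a record.

    '@name' reads an attribute; anything else reads the text of a child element.
    A missing attribute or element yields an empty string.
    """
    if field.startswith("@"):
        return record.get(field[1:], "")
    return record.findtext(field, default="")


def read_rows(xml_text, loop_path, fields, separator=","):
    """Read every record under loop_path in document order and join its N fields."""
    root = ET.fromstring(xml_text)
    steps = loop_path.strip("/").split("/")
    if steps[0] != root.tag:
        return []  # the path does not start at this document's root node
    # ElementTree paths are relative to the root element, so drop the root step.
    records = root.findall("/".join(steps[1:])) if len(steps) > 1 else [root]
    return [separator.join(read_field(r, f) for f in fields) for r in records]


rows = read_rows(SAMPLE_XML, "/ROOT/LEAF/ITEM", ["@id", "advicegroup"])
```

Here M is 2 (two `ITEM` records) and N is 2 (one attribute and one element), so `rows` holds one line character string per record, in document order.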

An embodiment of the present application provides another method for reading an XML file. Fig. 2 is a schematic implementation flow chart of the method; as shown in Fig. 2, the method includes:

step S201, obtaining an XML file comprising M XML records, wherein M is an integer greater than or equal to 1;

step S202, determining N circular paths and N extraction fields which are in one-to-one correspondence with the N circular paths according to the reading requirement of the XML file;

The circular path is the full path from the root node to the start node of the repeating (loop) data in the XML file. Its steps are separated by the "/" symbol, and the character string after the first "/" is the root node by default, for example: /ROOT/LEAF/ITEM. The extraction field (Field) can be set to an attribute (@id) or an element (advicegroup pid) as required. Each circular path requires one extraction field. According to the reading requirement of the XML file, N circular paths and N extraction fields in one-to-one correspondence with them are determined.

Step S203, assembling an Xpath according to each of the N cyclic paths and the corresponding N extraction fields to obtain the N Xpaths;

Each circular path and its corresponding extraction field are assembled into an XPath. For example, if the circular path is /ROOT/LEAF/ITEM and the extraction field is an attribute (@id) or an element (advicegroup), the XPath assembled from LoopPath and Field falls into one of the following two cases:

1) when the extraction field is an attribute, the assembled XPath is:

/ROOT/LEAF/ITEM/@id, which extracts the @id attribute of the nodes in the XML file that correspond to the circular path /ROOT/LEAF/ITEM;

2) when the extraction field is an element, the assembled XPath is:

/ROOT/LEAF/ITEM/advicegroup pid, which extracts the advicegroup pid element of the nodes in the XML file that correspond to the circular path /ROOT/LEAF/ITEM.

In this way, the N circular paths and the corresponding N extraction fields are assembled into N XPaths.
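The assembly rule in this step amounts to joining each circular path with its extraction field. A minimal Python sketch follows; the function names `assemble_xpath` and `assemble_xpaths` are hypothetical, and `advicegroup` is used only as a sample element name.

```python
def assemble_xpath(loop_path, field):
    """Join one circular path and one extraction field into an XPath string."""
    if not loop_path.startswith("/"):
        raise ValueError("a circular path must start with '/'")
    # Avoid a doubled '/' if the circular path was written with a trailing slash.
    return loop_path.rstrip("/") + "/" + field.strip()


def assemble_xpaths(loop_paths, fields):
    """Assemble N XPaths from N circular paths and their N extraction fields."""
    if len(loop_paths) != len(fields):
        raise ValueError("each circular path needs exactly one extraction field")
    return [assemble_xpath(path, field) for path, field in zip(loop_paths, fields)]


xpaths = assemble_xpaths(
    ["/ROOT/LEAF/ITEM", "/ROOT/LEAF/ITEM"],
    ["@id", "advicegroup"],
)
```

The length check enforces the one-to-one correspondence between circular paths and extraction fields that the step requires.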

Step S204, when the M XML records are scanned in a stream scanning mode, sequentially reading the fields in each XML record by using N XPaths according to the sequence of the M XML records, and correspondingly obtaining N fields;

The stream scanning may use an element processing parser (ElementHandler) of the Simple API for XML (SAX). Its working principle is, simply put, to scan the document sequentially and to notify an event processing function whenever it reaches the start or end of the document or the start or end of an element; the event processing function performs the corresponding action, and scanning then continues in the same way until the end of the document. This is called the stream scanning mode. Here the event processing function reads the fields in each XML record by using XPath. The M XML records can be scanned with an ElementHandler, and during the scan the fields in each XML record are read in turn with the N XPaths according to the sequence of the M XML records, correspondingly obtaining N fields.

Step S205, determining a line character string corresponding to the N fields read from each XML record according to the sequence of the M XML records, and converting each line character string into a character string meeting a DataX protocol for processing by a DataX write thread.

The embodiment of the present application first describes how, according to the reading requirement of the XML file, N circular paths and the N extraction fields in one-to-one correspondence with them are assembled into N XPaths. It then describes that the document can be scanned sequentially by using a SAX element processing parser and that, during the scan, the fields in each XML record are read in turn with the N XPaths according to the sequence of the M XML records, correspondingly obtaining N fields. Thus, on the one hand, reading the fields record by record during the scan makes it convenient to process the read fields subsequently for the DataX write thread; on the other hand, reading XML with XPath makes it convenient to extract both element nodes and attribute nodes.
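The stream scanning described in this embodiment can be sketched in Python with the standard library's event-driven `xml.etree.ElementTree.iterparse`, used here as a stand-in for the SAX ElementHandler named in the text. The function names, the sample document, and the path-stack matching are assumptions for illustration, not the patented implementation.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; element and attribute names are illustrative only.
SAMPLE_BYTES = (
    b'<ROOT><LEAF>'
    b'<ITEM id="1"><advicegroup>A</advicegroup></ITEM>'
    b'<ITEM id="2"><advicegroup>B</advicegroup></ITEM>'
    b'</LEAF></ROOT>'
)


def read_field(record, field):
    """'@name' reads an attribute; anything else reads a child element's text."""
    if field.startswith("@"):
        return record.get(field[1:], "")
    return record.findtext(field, default="")


def stream_records(source, loop_path, fields):
    """Yield the N field values of each record, one record at a time, in document order."""
    stack = []  # tags of the elements that are currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            continue
        # 'end' event: the element and all of its children are fully parsed.
        path = "/" + "/".join(stack)
        stack.pop()
        if path == loop_path:
            yield [read_field(elem, f) for f in fields]
            # Drop the record's children, text and attributes once it has been
            # consumed, so finished records do not stay in memory in full.
            elem.clear()


field_sets = list(
    stream_records(io.BytesIO(SAMPLE_BYTES), "/ROOT/LEAF/ITEM", ["@id", "advicegroup"])
)
```

Because `stream_records` is a generator, a caller that consumes it lazily only ever holds the record currently being processed, which is the property the stream scanning mode is meant to provide.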

An embodiment of the present application provides another method for reading an XML file. Fig. 3A is a schematic implementation flow chart of the method; as shown in Fig. 3A, the method includes:

Step S301, acquiring an XML file comprising M XML records, wherein M is an integer greater than or equal to 1;

step S302, according to the sequence of the M XML records, sequentially reading the fields in each XML record by using N XPaths to correspondingly obtain N fields, wherein N is an integer greater than or equal to 1;

step S303, determining N fields read from each XML record according to the sequence of the M XML records;

step S304, converting the N fields read from each XML record into a corresponding line character string by using a separator;

The N fields read are joined by the configured separator into a line character string (lineString) of type String; each line character string corresponds to one XML record.

Step S305, serializing each line character string into a byte sequence character string;

The conversion of the byte sequence character string into a character string satisfying the DataX protocol can be performed by a write function.

Step S306, converting each byte sequence character string into a character string meeting a DataX protocol for the write thread processing of the DataX.

The embodiment of the application mainly describes that firstly, a field is converted into a corresponding line character string by using a separator, then each line character string is serialized into a byte sequence character string, and finally each byte sequence character string is converted into a character string meeting a DataX protocol for being processed by a writing thread of the DataX. Thus, the method of separator and serialization can be used for converting the field into the character string meeting the DataX protocol for the write thread processing of the DataX.
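The separator and serialization steps can be sketched as follows in Python. The source does not specify the layout of a character string "satisfying the DataX protocol", so `to_record` below is a purely hypothetical stand-in that models such a record as a list of string columns; the tab separator and UTF-8 encoding are likewise assumptions.

```python
def to_line_string(fields, separator="\t"):
    """Join the N fields of one XML record into a line character string."""
    return separator.join(fields)


def serialize_line(line, encoding="utf-8"):
    """Serialize a line character string into a byte sequence."""
    return line.encode(encoding)


def to_record(payload, separator="\t", encoding="utf-8"):
    """Hypothetical stand-in for the DataX-protocol conversion.

    Decodes the byte sequence and splits it back into columns. The real
    DataX record format is not described in the source text.
    """
    return payload.decode(encoding).split(separator)


line = to_line_string(["1", "A"])
payload = serialize_line(line)
record = to_record(payload)
```

One design point worth noting: a plain separator join is only reversible when no field value contains the separator itself, so the configured separator should be chosen with the data in mind.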

An embodiment of the present application provides another method for reading an XML file. Fig. 3B is a schematic implementation flow chart of the method; as shown in Fig. 3B, the method includes:

step S311, obtaining an XML file comprising M XML records, wherein M is an integer greater than or equal to 1;

step S312, according to the sequence of the M XML records, sequentially reading the fields in each XML record by using N XPaths to correspondingly obtain N fields, wherein N is an integer greater than or equal to 1;

step S313, determining a line character string corresponding to the N fields read from each XML record according to the sequence of the M XML records;

step S314, sequentially storing the M line character strings into a first blocking queue according to the sequence of obtaining the line character strings, wherein the first blocking queue is a queue supporting blocking operation;

A blocking queue is a queue that supports blocking operations, including blocking addition: when the queue is full, the thread inserting an element is blocked until the queue is no longer full, at which point the element can be added. The blocking queue follows the first-in-first-out (FIFO) rule. While the first blocking queue is not full, the M line character strings are stored in it in the order in which they were obtained. The converted XML records are put into a blocking queue because, when the read records are later converted into the DataX protocol, DataX needs to read and convert the records one by one. If the converted XML records were not put into a blocking queue, all of them might be placed in memory at once to wait for conversion, in which case DataX could not convert the records into the DataX protocol normally.

Step S315, when the first blocking queue is determined to be not empty, sequentially extracting each line character string from the first blocking queue, and serializing each line character string into a byte sequence character string;

Determining that the first blocking queue is not empty means that it holds character strings waiting to be processed. The line character strings are then taken out of the first blocking queue one by one in FIFO order, and each one is serialized into a byte sequence character string.

Step S316, converting each byte sequence character string into a character string meeting a DataX protocol;

step S317, storing the character strings meeting the DataX protocol into a second blocking queue, so that the DataX write thread sequentially extracts each character string meeting the DataX protocol from the second blocking queue for processing.

The character strings satisfying the DataX protocol are stored in the second blocking queue so that the DataX write thread can take them out one by one for processing.

The embodiment of the present application mainly describes processing data by using blocking queues. The records parsed from the XML file are stored in a blocking queue for consumption by the DataX reading thread, so that DataX can read the XML file one record at a time. This avoids loading a large file into memory all at once and thereby reduces the memory pressure on DataX.
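The two-queue arrangement of steps S314 to S317 can be sketched in Python with bounded `queue.Queue` objects, whose `put` blocks when the queue is full and whose `get` blocks when it is empty. The stage names, the end-of-stream sentinel, the queue capacity, and the `convert` callback are all assumptions for illustration; the source does not say how the end of the data is signalled between stages.

```python
import queue
import threading

_DONE = object()  # hypothetical end-of-stream marker, not part of the source text


def run_pipeline(lines, convert, capacity=2):
    """Pass line strings through two bounded FIFO queues and return what the writer saw."""
    first = queue.Queue(maxsize=capacity)   # line character strings
    second = queue.Queue(maxsize=capacity)  # converted records
    written = []

    def parse_stage():
        for line in lines:
            first.put(line)  # blocks while the first queue is full
        first.put(_DONE)

    def convert_stage():
        while True:
            line = first.get()  # blocks while the first queue is empty
            if line is _DONE:
                second.put(_DONE)
                return
            # Serialize to bytes, then apply the caller-supplied conversion.
            second.put(convert(line.encode("utf-8")))

    def write_stage():
        while True:
            record = second.get()
            if record is _DONE:
                return
            written.append(record)  # stand-in for writing to the target data source

    threads = [threading.Thread(target=stage) for stage in (parse_stage, convert_stage, write_stage)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return written


result = run_pipeline(["1,A", "2,B", "3,C"], lambda data: data.decode("utf-8").split(","))
```

With one producer and one consumer per queue, FIFO ordering keeps the records in the order in which they were parsed, and the small capacity bounds how many records are held in memory at any time.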

In the related art, DataX already implements reading data from a local file and converting it into the DataX protocol. A read plug-in of DataX extracts data in batches from a data storage system and converts it into the DataX standard data exchange protocol; a write plug-in converts data from the DataX standard data exchange protocol into a specific data storage type and writes it to the destination data store. Any read plug-in of DataX can be connected seamlessly with any write plug-in, so that data can be exchanged between arbitrary heterogeneous sources.

However, although DataX implements reading data from a local file and converting it into the DataX protocol, it currently supports reading TXT files only, and it requires the schema of the TXT file to be a two-dimensional table. TXT is a plain-text format shipped with Microsoft operating systems; it is the most common file format and mainly stores textual information.

The embodiment of the application provides a method by which DataX reads one record at a time from an XML file located on File Transfer Protocol (FTP), Secure File Transfer Protocol (SFTP), or the Hadoop Distributed File System (HDFS), and converts the record into the DataX protocol, making up for DataX's lack of XML extraction capability. The XML file is parsed based on SAXReader, an XML parsing scheme; by working in an event stream mode, it avoids loading the whole XML file into memory at once and thus reduces memory pressure.

An embodiment of the present application provides another method for reading an XML file. Fig. 4 is a schematic implementation flow chart of the method; as shown in Fig. 4, the method includes:

Step S401, setting a DataXXMLReader plug-in in DataX, wherein the plug-in is used for setting N circular paths and N extraction fields of an XML file to be read, and N is an integer greater than or equal to 1;

A circular path (LoopPath) is the full path in an XML file from the root node to the start node of the repeating (loop) data. Its steps are separated by the "/" symbol, and the character string after the first "/" is the root node by default, for example: /ROOT/LEAF/ITEM. The extraction field (Field) can be set to an attribute (@id) or an element (advicegroup pid) as required.

Step S402, the DataXXMLReader assembles N XPaths according to the received N circular paths and the N extraction fields;

XPath is a language for finding information in XML files; it is used to navigate through the elements and attributes of an XML file. For example, if the circular path is /ROOT/LEAF/ITEM and the extraction field is an attribute (@id) or an element (advicegroup), the XPath assembled from LoopPath and Field falls into one of the following two cases:

1) when the extraction field is an attribute, the assembled XPath is:

/ROOT/LEAF/ITEM/@id, which extracts the @id attribute of the nodes in the XML file that correspond to the circular path /ROOT/LEAF/ITEM;

2) when the extraction field is an element, the assembled XPath is:

/ROOT/LEAF/ITEM/advicegroup pid, which extracts the advicegroup pid element of the nodes in the XML file that correspond to the circular path /ROOT/LEAF/ITEM.

Step S403, when the SAX element processing parser is used for scanning a file with M XML records, the DataXXMLreader reads fields to be extracted by using N XPaths to obtain M field sets to be extracted, wherein M is an integer greater than or equal to 1, and each field set comprises N fields;

the SAX element processing parser (ElementHandler) has the working principle of sequentially scanning a document, informing an event processing function when the document starts and ends, the elements start and end, the document ends and the like are scanned, performing corresponding actions by the event processing function, and continuing the same scanning until the document ends. This scanning mode is called a scanning mode by stream. The event processing function here is Xpath, that is, when scanning an XML document using the SAX element processing parser, when scanning to the places of document start and end, element start and end, document end, etc., N xpaths are used to read the fields to be extracted, and M field sets are obtained by scanning M XML records, where each field set includes N fields.

Step S404, the DataXXMLReader converts each of the M field sets into a line character string by using the configured separator, obtaining M line character strings, one for each field set;

Each of the M field sets corresponds to one XML record, and the DataXXMLReader joins the fields of each record with the configured separator to form a line character string (lineString).
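As a small illustration of this conversion, a field set can be joined into a line character string as follows. The function name `to_line_string` and the comma separator are assumptions made for the example, and the sketch assumes that no field value contains the separator itself.

```python
def to_line_string(field_set, separator=","):
    """Join the N fields of one XML record into a single line string."""
    return separator.join(field_set)


field_sets = [["1", "A"], ["2", "B"]]
line_strings = [to_line_string(fs) for fs in field_sets]
print(line_strings)
```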

Step S405, the DataXXMLReader puts the M line character strings obtained by conversion into a blocking queue A one by one;

A blocking queue is a queue that supports blocking operations, including blocking addition. Blocking addition means that when the queue is full, the thread inserting an element is blocked until the queue is no longer full and the element can be added.

The converted XML records are put into the blocking queue one by one because DataX must read and convert each record individually when converting the read records into the DataX protocol. If the converted XML records were not put into blocking queue A one by one, all of them would be placed in memory at once to wait for conversion, and in that case DataX could not properly convert the records into the DataX protocol.

Because the XML records are put into blocking queue A one by one, they can subsequently be read from queue A one at a time. To the reading process of DataX this is equivalent to reading data from a txt file, which adapts the XML input to the requirements of DataX.
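The behaviour of a bounded blocking queue can be sketched with Python's standard `queue.Queue`, whose `put` blocks while the queue is full and whose `get` blocks while it is empty. The capacity of 2, the thread layout, and the name `produce` are assumptions chosen for the illustration; they are not taken from DataX.

```python
import queue
import threading


def produce(line_strings, queue_a):
    """Put line strings into the queue one by one, blocking while it is full."""
    for line in line_strings:
        queue_a.put(line)  # blocks until a slot is free


queue_a = queue.Queue(maxsize=2)  # holds at most two line strings at a time
lines = ["1,A", "2,B", "3,C", "4,D", "5,E"]
producer = threading.Thread(target=produce, args=(lines, queue_a))
producer.start()

consumed = []
for _ in lines:
    consumed.append(queue_a.get())  # blocks until a line string is available
producer.join()
print(consumed)
```

With a bounded capacity, the number of records held in memory at any time is limited by the queue size rather than by the size of the XML file, and the records leave the queue in the order in which they were added.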

Step S406, when a line character string exists in the blocking queue A, the DataXXMLReader takes out one line character string and serializes it into a byte sequence character string;

The reading thread of DataX continuously checks whether scanning of the XML file has finished and whether blocking queue A is empty. If scanning has finished and the blocking queue is empty, the flow ends; otherwise, a parsed XML record is taken from blocking queue A and consumed. When a line character string exists in blocking queue A, one line character string is taken out and serialized (Serde) into a byte sequence character string.
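The polling loop of the reading thread can be sketched as follows. This is a minimal illustration under stated assumptions: a `threading.Event` named `scan_finished` stands in for the "scanning finished" flag, UTF-8 encoding stands in for the serialization step, and the names `read_loop` and `scan` are hypothetical.

```python
import queue
import threading


def read_loop(queue_a, scan_finished, poll_seconds=0.05):
    """Drain queue A until the scan has finished and the queue is empty."""
    serialized = []
    while not (scan_finished.is_set() and queue_a.empty()):
        try:
            line = queue_a.get(timeout=poll_seconds)
        except queue.Empty:
            continue  # nothing available yet; re-check the exit condition
        serialized.append(line.encode("utf-8"))  # line string -> byte sequence
    return serialized


def scan(lines, queue_a, scan_finished):
    """Stand-in for the XML scan: enqueue each line string, then signal the end."""
    for line in lines:
        queue_a.put(line)
    scan_finished.set()  # set only after the last line string is enqueued


queue_a = queue.Queue(maxsize=2)
scan_finished = threading.Event()
scanner = threading.Thread(target=scan, args=(["1,A", "2,B", "3,C"], queue_a, scan_finished))
scanner.start()
result = read_loop(queue_a, scan_finished)
scanner.join()
print(result)
```

The flag is set only after the last line string has been enqueued, so the loop cannot exit while unread records remain in the queue.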

Step S407, the DataXXMLReader converts the byte sequence character string into a character string meeting the DataX protocol and writes the character string into a blocking queue B to wait for consumption by a writing thread of DataX.

The new character string is converted into the DataX protocol and written to blocking queue B, from which the write thread of DataX consumes it and writes it to the destination data source.
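The hand-over to the write side can be sketched in the same style. The `Record` class below is a hypothetical stand-in for a record that satisfies the DataX protocol; it is not the DataX API, and the comma separator and UTF-8 decoding are assumptions carried over for the illustration.

```python
import queue
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical stand-in for a DataX record: an ordered list of column values."""
    columns: list


def to_record(byte_string, separator=","):
    """Decode a serialized line string and split it back into column values."""
    return Record(columns=byte_string.decode("utf-8").split(separator))


queue_b = queue.Queue(maxsize=2)
for byte_string in [b"1,A", b"2,B"]:
    queue_b.put(to_record(byte_string))  # would block if queue B were full

# Stand-in for the write thread: take the records out of queue B in order.
written = []
while not queue_b.empty():
    written.append(queue_b.get().columns)
print(written)
```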

According to the method and apparatus of the present application, a blocking queue for consumption by the reading thread is added to DataX, and the records parsed from the XML file by using the event stream of the SAXReader are stored in that blocking queue. DataX can therefore read one record of the XML file at a time and convert it into the DataX protocol, without loading the whole XML file into memory at once. Because the XML is extracted by using XPath, element nodes and attribute nodes can be conveniently extracted, and complex XML data can be processed.

Based on the foregoing embodiments, the present application provides an apparatus for reading an XML file, where the apparatus includes modules and units included in the modules, and the apparatus can be implemented by a processor in a device for reading an XML file; of course, the implementation can also be realized through a specific logic circuit; in implementation, the processor may be a Central Processing Unit (CPU), a Microprocessor (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like.

Fig. 5 is a schematic structural diagram of an apparatus for reading an XML document according to an embodiment of the present application, and as shown in fig. 5, the apparatus 500 includes: an obtaining module 501, a reading module 502 and a determining module 503, wherein:

the obtaining module 501 is configured to obtain an XML file comprising M XML records, where M is an integer greater than or equal to 1;

the reading module 502 is configured to sequentially read, according to the sequence of the M XML records, the fields in each XML record by using N xpaths, so as to obtain N fields correspondingly, where N is an integer greater than or equal to 1;

the determining module 503 is configured to determine, according to the sequence of the M XML records, a line character string corresponding to the N fields read from each XML record, and convert each line character string into a character string meeting a DataX protocol, so as to be processed by a DataX write thread.

In some embodiments, the reading module further includes an assembling unit and a reading unit, where the assembling unit is configured to assemble N xpaths according to the reading requirement of the XML file; the reading unit is configured to, when the M XML records are scanned in a stream scanning manner, sequentially read the fields in each XML record by using N xpaths according to the sequence of the M XML records, and correspondingly obtain N fields.

In some embodiments, the assembly unit further includes a determination subunit and an assembly subunit, where the determination subunit is configured to determine, according to a reading requirement of the XML file, N cyclic paths and N extraction fields corresponding to the N cyclic paths one to one; and the assembly subunit is configured to assemble an XPath according to each of the N cyclic paths and its corresponding extraction field, so as to obtain the N XPaths.

In some embodiments, the determining module further includes a determining unit, a first converting unit, a storing unit, an extracting unit, a serializing unit, and a second converting unit, where the determining unit is configured to determine, according to the sequence of the M XML records, the N fields read from each of the XML records; the first converting unit is configured to convert each set of N fields into a corresponding line character string by using separators; the storing unit is configured to store the M line character strings in a first blocking queue in the order in which the line character strings were obtained, where the first blocking queue is a queue that supports blocking operations; the extracting unit is configured to sequentially extract each of the line character strings from the first blocking queue when it is determined that the first blocking queue is not empty; the serializing unit is configured to serialize each line character string into a byte sequence character string; and the second converting unit is configured to convert each byte sequence character string into a character string meeting the DataX protocol, so as to be processed by a write thread of DataX.

In some embodiments, the second conversion unit further comprises a conversion subunit and a storage subunit, wherein the conversion subunit is configured to convert each of the byte sequence strings into a string satisfying a DataX protocol; and the storing subunit is configured to store the character strings meeting the DataX protocol into a second blocking queue, so that the DataX write thread sequentially extracts each of the character strings meeting the DataX protocol from the second blocking queue for processing.

Here, it should be noted that: the above description of the apparatus embodiments, similar to the above description of the method embodiments, has similar beneficial effects as the method embodiments. For technical details not disclosed in the embodiments of the apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.

It should be noted that, in the embodiments of the present application, if the method for reading an XML file is implemented in the form of a software functional module and is sold or used as a standalone product, it may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling an apparatus for reading an XML file to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk.

Correspondingly, an embodiment of the present application provides an apparatus for reading an XML document, such as a computer apparatus or a server. Fig. 6 is a schematic diagram of a hardware entity of the apparatus for reading an XML document. As shown in fig. 6, the hardware entity of the apparatus 600 for reading an XML document includes a memory 601 and a processor 602, the memory 601 storing a computer program operable on the processor 602, and the processor 602 implementing the steps of the method for reading an XML file provided in the above embodiments when executing the program.

The memory 601 is configured to store instructions and applications executable by the processor 602, and may also cache data to be processed or already processed by the processor 602 and by each module in the apparatus 600 for reading XML files; it may be implemented by a flash memory (FLASH) or a Random Access Memory (RAM).

Here, it should be noted that: the above description of the storage medium and device embodiments is similar to the description of the method embodiments above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the storage medium and apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.

It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application. The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program codes, such as a removable Memory device, a Read Only Memory (ROM), a magnetic disk, or an optical disk.

Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling a computer device (which may be a mobile phone, a tablet computer, a notebook computer, a desktop computer) and a server to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a magnetic or optical disk, or other various media that can store program code.

The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.

Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.

The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.

The above description covers only embodiments of the present application, but the protection scope of the present application is not limited thereto; any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
