Method for analyzing Excel file to multiple tables and improving data precision based on SAX

文档序号:68865 发布日期:2021-10-01 浏览:16次 中文

阅读说明:本技术 基于SAX解析Excel文件至多表并提高数据精度的方法 (Method for analyzing Excel file to multiple tables and improving data precision based on SAX ) 是由 张庆兵 于 2021-07-20 设计创作,主要内容包括:本发明公开一种基于SAX解析Excel文件至多表并提高数据精度的方法,涉及数据处理技术领域,包括:首先获取excel文件,excel文件包含多个映射,每个映射为excel文件的表头,配置目标数据表与表头的映射关系;然后比对excel文件中表头信息与配置目标数据表的映射关系,通过SAX解析excel文件每一个数据单元,并进行映射关系匹配验证,最后将excel文件按照匹配成功的映射关系进行数据导入拆分,当遍历数据到达设定的数据阈值或为新表头时,启动多线程和数据源的连接池进行数据导入处理。本发明旨在将excel文件通过SAX解析方式导入多个数据库表中,同时解决数据会丢失精度问题、时间格式无法正常转换问题和传统DOM解析方式带来的问题。(The invention discloses a method for analyzing Excel files to multiple tables and improving data precision based on SAX, which relates to the technical field of data processing and comprises the following steps: firstly, acquiring an excel file, wherein the excel file comprises a plurality of mappings, each mapping is a header of the excel file, and a mapping relation between a target data table and the header is configured; and then comparing the mapping relation between header information in the excel file and a configuration target data table, analyzing each data unit of the excel file through SAX, performing mapping relation matching verification, finally performing data import and splitting on the excel file according to the successfully matched mapping relation, and starting a connecting pool of multiple threads and a data source to perform data import processing when traversal data reaches a set data threshold or is a new header. The invention aims to introduce excel files into a plurality of database tables in an SAX (document object model) analysis mode, and simultaneously solves the problems that data can lose precision, time formats cannot be normally converted and the problems caused by the traditional DOM analysis mode.)

1. a method for analyzing Excel files to multiple tables and improving data accuracy based on SAX is characterized in that Excel files are obtained firstly, each Excel file comprises a plurality of mappings, each mapping is a table head of each Excel file, and a mapping relation between a target data table and the table head is configured; and then comparing the mapping relation between header information in the excel file and a configuration target data table, analyzing each data unit of the excel file through SAX, performing mapping relation matching verification, finally performing data import and splitting on the excel file according to the successfully matched mapping relation, and starting a connecting pool of multiple threads and a data source to perform data import processing when traversal data reaches a set data threshold or is a new header.

2. The method for parsing an Excel file into multiple tables and improving data accuracy based on SAX according to claim 1, wherein the obtained Excel file includes two versions 03 and 07.

3. The method according to claim 2, wherein for a 03 version of Excel file, it is necessary to write an abstract XLSReader that implements hssflisterer interface, and implement a processRecord method for reading rows and columns, and process according to record.

4. The method for parsing Excel file into multiple tables and improving data accuracy based on SAX as claimed in claim 2, wherein for version 07 Excel file, abstract class XLSXReader inheriting DefaultHandler needs to be written, rewriting startElement and endElement for reading rank.

5. The method as claimed in claim 2 or 3, wherein during reading the rows and columns, the default of date type in the number is converted into yyyy-MM-dd HH: MM: ss: SSS format for uniform processing.

6. The method for parsing Excel file into multiple tables and improving data accuracy based on SAX according to claim 1, wherein a data threshold needs to be set according to the thread number of a deployment machine, a memory, and Excel data size before comparing the mapping relationship between header information in the Excel file and a configuration target data table.

7. The method for analyzing Excel files to multiple tables and improving data accuracy based on SAX according to claim 1, wherein the mapping relationship between header information in an Excel file and a configuration target data table is compared, each data unit of an Excel file is analyzed by SAX, and mapping relationship matching verification is performed, specifically:

traversing each row of the excel file in an SAX mode, matching the rows with the mapping relation, confirming relevant information of the matching mapping relation in the excel file, counting the number of the heads of the lists meeting the mapping, and recording the sheet page and row and column information meeting the mapping which is successfully matched.

8. The method of claim 7, wherein a mapping relationship comprises a mapping of headers and columns of data.

Technical Field

The invention relates to the technical field of data processing, in particular to a method for analyzing Excel files to multiple tables and improving data accuracy based on SAX.

Background

For the analysis and import of the excel file into the database table, the traditional method has the following problems:

(1) in a traditional DOM (document object model) analysis mode, one excel file of several M is obtained, and the analysis result occupies hundreds of M of memories, so that the JVM memory overflows;

(2) each excel file only has one header information, and when a plurality of header information exist in one excel file, the excel file is required to be divided into a plurality of excels with different headers (one header corresponds to one excel data file) and is imported one by one;

(3) the mapping relation between the excel header and the data table must be completely matched, and partial matching cannot be carried out according to the specified requirement;

(4) the SAX analysis data loses precision and time format and cannot be converted normally.

For the four problems, a method for analyzing Excel files to multiple tables and improving data accuracy based on SAX is designed and developed, so that the traditional method for analyzing Excel is converted into the SAX method for analysis, a plurality of different mapping relations between Excel table headers and data tables are configured, SAX abstract classes are rewritten, the four problems are successfully solved, and the introduction of the Excel files is efficient and convenient.

Disclosure of Invention

Aiming at the requirements and the defects of the prior art development, the invention provides a method for analyzing an Excel file to multiple tables and improving the data precision based on SAX, aiming at importing the Excel file into a plurality of database tables in an SAX analysis mode and solving the problems that the precision of the data is lost, the time format cannot be normally converted and the problems caused by the traditional DOM analysis mode.

The invention discloses a method for analyzing Excel files to multiple tables and improving data accuracy based on SAX, which adopts the following technical scheme for solving the technical problems:

a method for analyzing Excel files to multiple tables and improving data accuracy based on SAX comprises the steps of firstly obtaining Excel files, wherein each Excel file comprises a plurality of mappings, each mapping is a table header of each Excel file, and a mapping relation between a target data table and the table header is configured; and then comparing the mapping relation between header information in the excel file and a configuration target data table, analyzing each data unit of the excel file through SAX, performing mapping relation matching verification, finally performing data import and splitting on the excel file according to the successfully matched mapping relation, and starting a connecting pool of multiple threads and a data source to perform data import processing when traversal data reaches a set data threshold or is a new header.

Optionally, the obtained excel file includes version 03 and version 07.

Further optionally, for the 03 version of excel file, it is necessary to write an abstract XLSReader that implements an hssflisterner interface, implement a processRecord method for reading a row and a column, and process according to record.

Further optionally, for the version 07 excel file, an abstract XLSXReader inheriting the DefaultHandler needs to be written, and startElement and endElement need to be rewritten for reading the line and row.

Preferably, during the reading of the row and column, the default of date type in the number is converted into yyyy-MM-dd HH: MM: ss: SSS format for unified processing.

Optionally, before comparing the mapping relationship between the header information in the EXCEL file and the configuration target data table, a data threshold needs to be set according to the thread number of the deployment machine, the memory, and the EXCEL data size.

Optionally, comparing the mapping relationship between the header information in the excel file and the configuration target data table, analyzing each data unit of the excel file through the SAX, and performing mapping relationship matching verification, specifically:

traversing each row of the excel file in an SAX mode, matching the rows with the mapping relation, confirming relevant information of the matching mapping relation in the excel file, counting the number of the heads of the lists meeting the mapping, and recording the sheet page and row and column information meeting the mapping which is successfully matched.

Further optionally, a mapping relationship includes a mapping of a plurality of headers and data columns.

Compared with the prior art, the method for analyzing the Excel file to the multi-table and improving the data precision based on the SAX has the beneficial effects that:

(1) the invention aims to introduce excel files into a plurality of database tables in an SAX (document object model) analysis mode, and simultaneously solves the problems that data can lose precision, time formats cannot be normally converted and the problems caused by the traditional DOM analysis mode;

(2) the method adopts an SAX mode to analyze the excel file, configures a plurality of mapping relations between the excel header and the data sheet, and rewrites an SAX abstract class to improve the flexibility, efficiency and accuracy of the introduction of the excel file; in addition, the data precision and the date data processing capacity are improved through data processing of abstract XLSXReader and XLSReader.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

Detailed Description

In order to make the technical scheme, the technical problems to be solved and the technical effects of the present invention more clearly apparent, the following technical scheme of the present invention is clearly and completely described with reference to the specific embodiments.

The first embodiment is as follows:

the embodiment provides a method for analyzing an Excel file to multiple tables and improving data accuracy based on SAX, which includes:

firstly, an excel file is obtained, the excel file comprises a plurality of mappings, each mapping is a header of the excel file, and a mapping relation between a target data table and the header is configured.

And setting a data threshold according to the thread number of the deployment machine, the memory and the EXCEL data size.

Then, comparing the mapping relation between the header information in the excel file and the configuration target data table, analyzing each data unit of the excel file through SAX, and performing mapping relation matching verification, wherein the mapping relation specifically comprises the following steps:

traversing each row of the excel file in an SAX mode, matching the rows with mapping relations, and confirming related information of the matched mapping relations in the excel file, wherein one mapping relation comprises mapping of a plurality of headers and data columns; and counting the number of the headers meeting the mapping, and recording the sheet page and row-column information meeting the mapping which is successfully matched.

And finally, carrying out data import and splitting on the excel file according to the successfully matched mapping relation, and starting a multithreading and data source connection pool to carry out data import processing when the traversal data reaches a set data threshold value or is a new header.

In this embodiment, the obtained excel file is 03 version, at this time, an abstract XLSReader for implementing an hssflisterner interface needs to be written, a processRecord method is implemented for reading rows and columns, and processing is performed according to record. During the reading of the row and column, the default of date type in the number is converted into yyyy-MM-dd HH: MM: ss: SSS format for unified processing, in particular personalized m/d/yy,/m, m/,/d, d/,/year, year/,/month, month/,/day, day/and the like.

The concrete implementation code for writing the abstract XLSREADer is as follows:

example two:

the embodiment provides a method for analyzing an Excel file to multiple tables and improving data accuracy based on SAX, which includes:

firstly, an excel file is obtained, the excel file comprises a plurality of mappings, each mapping is a header of the excel file, and a mapping relation between a target data table and the header is configured.

And setting a data threshold according to the thread number of the deployment machine, the memory and the EXCEL data size.

Then, comparing the mapping relation between the header information in the excel file and the configuration target data table, analyzing each data unit of the excel file through SAX, and performing mapping relation matching verification, wherein the mapping relation specifically comprises the following steps:

traversing each row of the excel file in an SAX mode, matching the rows with mapping relations, and confirming related information of the matched mapping relations in the excel file, wherein one mapping relation comprises mapping of a plurality of headers and data columns; and counting the number of the headers meeting the mapping, and recording the sheet page and row-column information meeting the mapping which is successfully matched.

And finally, carrying out data import and splitting on the excel file according to the successfully matched mapping relation, and starting a multithreading and data source connection pool to carry out data import processing when the traversal data reaches a set data threshold value or is a new header.

In this embodiment, the obtained excel file is 07 version, and at this time, it is necessary to write an abstract XLSXReader that inherits DefaultHandler, and rewrite startElement and endElement for reading a row and a column. During the reading of the row and column, the default of date type in the number is converted into yyyy-MM-dd HH: MM: ss: SSS format for unified processing, in particular personalized m/d/yy,/m, m/,/d, d/,/year, year/,/month, month/,/day, day/and the like.

The concrete implementation code for writing the abstract XLSXReader is as follows:

13页详细技术资料下载
上一篇:一种医用注射器针头装配设备
下一篇:自动生成摘要文档的方法和装置

网友询问留言

已有0条留言

还没有人留言评论。精彩留言会获得点赞!

精彩留言,会给你点赞!