Data preprocessing system supporting multiple file formats

Document No.: 153095 Published: 2021-10-26

Abstract: A data preprocessing system supporting multiple file formats, designed by 李冬萍 and 杨迎春 on 2021-07-22. The invention discloses a data preprocessing system supporting multiple file formats, comprising a central processing module, a configuration file management module, a parsing mode processing module, a separator mode processing module, an EXCEL mode processing module, a dynamic link mode processing module, a normalization processing module, a sorting processing module, a text file output module, a mysql output module, a kafka output module and a log management module. Through its system design, incomplete, inconsistent and irregular data encountered in data analysis and processing can be normalized into regular data that meets system requirements, and sorting can output specified content to the required storage medium, which improves the quality of data analysis and processing and reduces the time and difficulty of the actual processing. The system produces the standard data required by subsequent systems, supports many source data formats and a rich set of output media, offers a high degree of intelligence, can be put into use through configuration alone, and is suitable for wide adoption.

1. A data preprocessing system supporting multiple file formats, characterized in that: the system comprises a central processing module, a configuration file management module, a parsing mode processing module, a separator mode processing module, an EXCEL mode processing module, a dynamic link mode processing module, a normalization processing module, a sorting processing module, a text file output module, a mysql output module, a kafka output module and a log management module;

the central processing module uses the channel number passed as a parameter to call the configuration file management module, looks up the configuration information for that channel number, obtains the data source file directory, processing mode and output mode to be used, scans the data source file directory, and calls the processing module corresponding to the processing mode to process the files that meet the conditions;

the configuration file management module provides the project configuration information and the access methods required for the program to run, and verifies the integrity of the configuration information; the configuration files are placed in the conf directory under the directory where the program is located and use the YAML format;

the parsing mode processing module defines the fields of each record of the source file by key-value position; a configuration file describes information such as the file type of the file to be processed, whether a file header exists, the length of each record, the data type of each field, and the start position and length of each field within the record, and the processing module reads the specified fields of each record according to this configuration information for subsequent processing;

the separator mode processing module defines the fields of each record of the source file by a specified inter-field separator character; a configuration file describes information such as the separator character between the fields of the file to be processed and whether leading and trailing spaces are to be removed after a field value is extracted, and the processing module reads the specified fields of each record according to this configuration information for subsequent processing;

the EXCEL mode processing module is used, when the source file is an EXCEL file with defined content, to read the EXCEL file directly and obtain the specified fields for subsequent processing;

the dynamic link mode processing module is used for source files that are data files in a specific format: a program written for that format is compiled into a dynamic link library, and the system calls the library to extract the field information from the file;

the normalization processing module is used for generating the standard data file format defined by the system and for generating the related attributes of the data according to different rules;

the sorting processing module is mainly used for outputting data to different output channels according to the configuration rule of a user in a configuration file;

the text file output module is mainly used for writing the field content output by the sorting processing module into a specified text file according to the configuration requirement;

the mysql output module has the main function of outputting the field content output by the sorting processing module to a specified mysql database according to the configuration requirement;

the kafka output module has the main function of writing the field content output by the sorting processing module into a specified kafka message queue according to the configuration requirement;

the log management module is used for writing information about the processed files into a database and writing the abnormal information produced during processing into a log file; this gives the system a basis for flow monitoring and allows statistics to be compiled on abnormal data, so that a user can conveniently analyze and understand how the service is running;

the data preprocessing system processes the data in the original data file through modules such as normalization and sorting, organizes the data by user-defined methods, and stores the processed data in a specified medium to provide standard data for a subsequent system.

2. The data preprocessing system supporting multiple file formats as recited in claim 1, wherein: the central processing module calls the corresponding processing module according to the processing mode, and the supported processing modes for source files are the parsing mode, the separator mode, the EXCEL mode and the dynamic link mode.

3. The data preprocessing system supporting multiple file formats as recited in claim 1, wherein: the sorting processing module outputs data to different output channels; the normalization and sorting of the original data file allow specified content to be output to the required storage medium, and the supported storage media comprise text files, mysql databases and kafka message systems.

4. The data preprocessing system supporting multiple file formats as recited in claim 1, wherein: configuration file management is organized by channel number, with module numbers under each channel number; one channel has a plurality of modules, and each module is configured with a data source file directory, a processing mode and an output mode; the module is the minimum unit of system operation; one channel corresponds to one process and each module under the channel corresponds to one thread, so that multiple data files can be processed in parallel.

Technical Field

The invention belongs to the technical field of big data processing, and relates to a data preprocessing system supporting multiple file formats.

Background

The big data processing flow can be summarized in four steps: collection, import and preprocessing, statistics and analysis, and mining. Import and preprocessing must solve the data processing problems caused by large data volumes and by data arriving in different formats, and source data must be processed into relatively regular data. Files in different formats may call for different processing methods; however, whatever the data, some steps and methods are common to the whole data processing flow. A data file preprocessing system can carry out these common steps, which reduces data processing time and simplifies a complicated processing flow.

Existing data preprocessing schemes can process only source files in certain specific formats and have difficulty supporting source files in a new format; they cannot handle varied source files flexibly through configuration files, cannot sort dynamically, and can output only text files after preprocessing.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention aims to provide a data preprocessing system supporting multiple file formats, which normalizes and sorts source data files in multiple formats and provides data in a standard format to subsequent data processing and analysis modules.

The invention adopts the following technical scheme:

a data preprocessing system supporting multiple file formats comprises a central processing module, a configuration file management module, a parsing mode processing module, a separator mode processing module, an EXCEL mode processing module, a dynamic link mode processing module, a normalization processing module, a sorting processing module, a text file output module, a mysql output module, a kafka output module and a log management module;

the central processing module uses the channel number passed as a parameter to call the configuration file management module, looks up the configuration information for that channel number, obtains the data source file directory, processing mode and output mode to be used, scans the data source file directory, and calls the processing module corresponding to the processing mode to process the files that meet the conditions;

the configuration file management module provides the project configuration information and the access methods required for the program to run, and verifies the integrity of the configuration information; the configuration files are placed in the conf directory under the directory where the program is located and use the YAML format;

the parsing mode processing module defines the fields of each record of the source file by key-value position; a configuration file describes information such as the file type of the file to be processed, whether a file header exists, the length of each record, the data type of each field, and the start position and length of each field within the record, and the processing module reads the specified fields of each record according to this configuration information for subsequent processing;

the separator mode processing module defines the fields of each record of the source file by a specified inter-field separator character; a configuration file describes information such as the separator character between the fields of the file to be processed and whether leading and trailing spaces are to be removed after a field value is extracted, and the processing module reads the specified fields of each record according to this configuration information for subsequent processing;

the EXCEL mode processing module is used, when the source file is an EXCEL file with defined content, to read the EXCEL file directly and obtain the specified fields for subsequent processing;

the dynamic link mode processing module is used for source files that are data files in a specific format: a program written for that format is compiled into a dynamic link library, and the system calls the library to extract the field information from the file;

the normalization processing module is used for generating the standard data file format defined by the system and for generating the related attributes of the data according to different rules;

the sorting processing module is mainly used for outputting data to different output channels according to the configuration rule of a user in a configuration file;

the text file output module is mainly used for writing the field content output by the sorting processing module into a specified text file according to the configuration requirement;

the mysql output module has the main function of outputting the field content output by the sorting processing module to a specified mysql database according to the configuration requirement;

the kafka output module has the main function of writing the field content output by the sorting processing module into a specified kafka message queue according to the configuration requirement;

the log management module is used for writing information about the processed files into a database and writing the abnormal information produced during processing into a log file; this gives the system a basis for flow monitoring and allows statistics to be compiled on abnormal data, so that a user can conveniently analyze and understand how the service is running;

the data preprocessing system processes the data in the original data file through modules such as normalization and sorting, organizes the data by user-defined methods, and stores the processed data in a specified medium to provide standard data for a subsequent system.

Further, the central processing module calls the corresponding processing module according to the processing mode, and the supported processing modes for source files are the parsing mode, the separator mode, the EXCEL mode and the dynamic link mode.

Furthermore, the normalization processing module generates the standard data file format defined by the system, and the sorting processing module outputs data to different output channels; the normalization and sorting of the original data file allow specified content to be output to the required storage medium, and the supported storage media comprise text files, mysql databases and kafka message systems;

furthermore, configuration file management is organized by channel number, with module numbers under each channel number; one channel has a plurality of modules, and each module is configured with a data source file directory, a processing mode and an output mode; the module is the minimum unit of system operation; one channel corresponds to one process and each module under the channel corresponds to one thread, so that multiple data files can be processed in parallel.

The invention has the beneficial effects that:

the invention comprises a central processing module, a configuration file management module, a parsing mode processing module, a separator mode processing module, an EXCEL mode processing module, a dynamic link mode processing module, a normalization processing module, a sorting processing module, a text file output module, a mysql output module, a kafka output module and a log management module. Through its system design, the invention can normalize the incomplete, inconsistent and irregular data encountered in data analysis and processing into regular data that meets system requirements, and can output specified content to the required storage medium through sorting, which improves the quality of data analysis and processing and reduces the time and difficulty of the actual processing. Source data is processed into the standard data required by subsequent systems; many source data formats are supported, the output media are rich, the degree of intelligence is high, the system can be put into use through configuration alone, and it is suitable for wide adoption.

Drawings

FIG. 1 is a schematic diagram of the system of the present invention.

Detailed Description

The present invention will be described in detail with reference to specific examples.

As shown in FIG. 1, the data preprocessing system supporting multiple file formats according to the present invention includes a central processing module, a configuration file management module, a parsing mode processing module, a separator mode processing module, an EXCEL mode processing module, a dynamic link mode processing module, a normalization processing module, a sorting processing module, a text file output module, a mysql output module, a kafka output module, and a log management module;

the central processing module uses the channel number passed as a parameter to call the configuration file management module, looks up the configuration information for that channel number, obtains items such as the data source file directory, processing mode and output mode to be used, scans the data source file directory, and calls the processing module corresponding to the processing mode to process the files that meet the conditions.
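The dispatch step described above can be sketched as follows. This is only an illustration under stated assumptions: the configuration keys (`source_dir`, `mode`, `pattern`), the `HANDLERS` registry and the function names are hypothetical and do not come from the patent, and only a trivial stand-in handler is registered.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical registry: processing mode name -> function that reads one file.
# A fuller version would also register "parsing", "excel" and "dynamic_link".
HANDLERS: Dict[str, Callable[[Path], List[str]]] = {
    "separator": lambda p: p.read_text(encoding="utf-8").splitlines(),
}

def process_module(module_cfg: dict) -> Dict[str, List[str]]:
    """Scan the configured source directory and run the handler for the mode."""
    mode = module_cfg["mode"]
    if mode not in HANDLERS:
        raise ValueError(f"unsupported processing mode: {mode}")
    handler = HANDLERS[mode]
    pattern = module_cfg.get("pattern", "*")  # files "meeting the conditions"
    results: Dict[str, List[str]] = {}
    for path in sorted(Path(module_cfg["source_dir"]).glob(pattern)):
        if path.is_file():
            results[path.name] = handler(path)
    return results

def demo() -> Dict[str, List[str]]:
    # Build a throwaway source directory with one matching and one ignored file.
    with tempfile.TemporaryDirectory() as d:
        Path(d, "a.txt").write_text("1|x\n2|y\n", encoding="utf-8")
        Path(d, "skip.dat").write_text("ignored\n", encoding="utf-8")
        cfg = {"source_dir": d, "mode": "separator", "pattern": "*.txt"}
        return process_module(cfg)
```

Keeping the modes in a lookup table means a new format can be supported by registering one more handler, which matches the configuration-driven intent of the description.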

The configuration file management module provides the project configuration information and the access methods required for the program to run, and verifies the integrity of the configuration information; the configuration file is placed in the conf directory under the directory where the program is located and uses the YAML format. When the central processing module calls the configuration file management module, the module finds the configuration information of the corresponding channel according to the parameter, and processing continues only after the configuration has been verified as free of errors;

the configuration file management module is the basis of system operation. Configuration is organized by channel number, with module numbers under each channel number; one channel has a plurality of modules, and each module is configured with items such as a data source file directory, a processing mode and an output mode. The module is the minimum unit of system operation; one channel corresponds to one process and each module under the channel corresponds to one thread, so that multiple data files can be processed in parallel. For data files that are to be processed for the first time, the configuration file must be set up first: key information such as the processing mode, source data format, sorting configuration, input directory, output mode, backup directory and error directory is configured according to the type of the source data file. When processing starts, the input channel directory information is read from the configuration file and the corresponding processing module is looked up in the configuration file according to that information, where one input channel directory corresponds to one set of processing modules;
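As a rough sketch of the channel and module layout and the integrity check described above, the example below uses a plain dictionary standing in for the parsed YAML file. All key names, module numbers, paths and function names are assumptions made for illustration; the patent does not specify the schema. Loading the YAML itself is omitted because it would need a third-party parser.

```python
import threading
from typing import Callable, List

# Assumed shape of the configuration after the YAML file has been loaded.
CONFIG = {
    "channels": {
        "1": {
            "modules": {
                "101": {"source_dir": "/data/in/a", "mode": "separator", "output": "text"},
                "102": {"source_dir": "/data/in/b", "mode": "parsing", "output": "kafka"},
            }
        }
    }
}
REQUIRED_KEYS = ("source_dir", "mode", "output")
MODES = {"parsing", "separator", "excel", "dynamic_link"}
OUTPUTS = {"text", "mysql", "kafka"}

def validate_channel(config: dict, channel_no: str) -> List[str]:
    """Return a list of integrity problems; an empty list means the config is usable."""
    channel = config.get("channels", {}).get(channel_no)
    if channel is None:
        return [f"channel {channel_no} is not configured"]
    errors: List[str] = []
    modules = channel.get("modules", {})
    if not modules:
        errors.append(f"channel {channel_no} has no modules")
    for module_no, mod in modules.items():
        for key in REQUIRED_KEYS:
            if key not in mod:
                errors.append(f"module {module_no}: missing {key}")
        if "mode" in mod and mod["mode"] not in MODES:
            errors.append(f"module {module_no}: unknown mode {mod['mode']}")
        if "output" in mod and mod["output"] not in OUTPUTS:
            errors.append(f"module {module_no}: unknown output {mod['output']}")
    return errors

def run_channel(config: dict, channel_no: str, worker: Callable[[str, dict], None]) -> None:
    """Validate the channel, then run one thread per module and wait for all of them."""
    errors = validate_channel(config, channel_no)
    if errors:
        raise ValueError("; ".join(errors))
    modules = config["channels"][channel_no]["modules"]
    threads = [threading.Thread(target=worker, args=(no, cfg)) for no, cfg in modules.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

In this sketch one invocation of `run_channel` plays the role of the per-channel process, and each module gets its own thread, mirroring the one-process-per-channel, one-thread-per-module arrangement in the text.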

the parsing mode processing module defines the fields of each record of the source file by key-value position; a configuration file describes information such as the file type of the file to be processed, whether a file header exists, the length of each record, the data type of each field, and the start position and length of each field within the record, and the processing module reads the specified fields of each record according to this configuration information for subsequent processing;
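A minimal sketch of position-based field extraction follows, assuming a text file with fixed-length records. The field layout, the type names and the function `parse_fixed_width` are hypothetical; the patent only states that the header flag, record length, field type, start position and length come from configuration.

```python
from typing import Dict, List

# Hypothetical field layout that would normally come from the configuration file.
FIELDS = [
    {"name": "id", "start": 0, "length": 4, "type": "int"},
    {"name": "name", "start": 4, "length": 6, "type": "str"},
    {"name": "amount", "start": 10, "length": 8, "type": "float"},
]
CASTS = {"int": int, "float": float, "str": str}

def parse_fixed_width(text: str, fields: List[dict], record_length: int,
                      has_header: bool = False) -> List[Dict[str, object]]:
    """Cut each record at the configured start/length and cast to the configured type."""
    lines = text.splitlines()
    if has_header:
        lines = lines[1:]  # skip the file header line
    records = []
    for line_no, line in enumerate(lines, start=1):
        if len(line) != record_length:
            raise ValueError(f"record {line_no}: expected {record_length} chars, got {len(line)}")
        record = {}
        for f in fields:
            raw = line[f["start"]:f["start"] + f["length"]].strip()
            record[f["name"]] = CASTS[f["type"]](raw)
        records.append(record)
    return records

SAMPLE = "HEADER\n0001alice 00012.50\n0002bob   00003.00\n"
```

Calling `parse_fixed_width(SAMPLE, FIELDS, 18, has_header=True)` yields one dictionary per record with `id`, `name` and `amount` already typed.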

if a processing module is found, data is acquired using that module's data acquisition method and the configured source data format, and each acquired record is normalized, sorted and output in turn; for a data file in a specific format, a program written for that format is compiled into a dynamic link library, which the system calls to extract the field information from the file.

The separator mode processing module defines the fields of each record of the source file by a specified inter-field separator character. A configuration file describes information such as the separator character between the fields of the file to be processed and whether leading and trailing spaces are to be removed after a field value is extracted, and the processing module reads the specified fields of each record according to this configuration information for subsequent processing;
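The separator mode can be illustrated with a short sketch. The parameter names (`delimiter`, `names`, `strip`, `wanted`) are assumptions chosen for the example, standing in for whatever the configuration file actually provides.

```python
from typing import Dict, List, Optional

def parse_delimited(text: str, delimiter: str, names: List[str],
                    strip: bool = True,
                    wanted: Optional[List[str]] = None) -> List[Dict[str, str]]:
    """Split each record on the delimiter, optionally trim spaces, keep the wanted fields."""
    records = []
    for line in text.splitlines():
        if not line:
            continue  # ignore blank lines
        values = line.split(delimiter)
        if strip:
            values = [v.strip() for v in values]  # remove leading and trailing spaces
        record = dict(zip(names, values))
        if wanted is not None:
            record = {k: record[k] for k in wanted}  # read only the specified fields
        records.append(record)
    return records

SAMPLE = "1| alice |10\n2|bob| 20 \n"
```

With `strip=True` the value `" alice "` becomes `"alice"`; with `strip=False` the surrounding spaces are preserved, which corresponds to the configurable trimming option in the description.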

the EXCEL mode processing module is used for directly reading the EXCEL file to acquire a designated field for subsequent processing when the source file is the EXCEL file with the defined content;

the dynamic link mode processing module is used for source files that are data files in a specific format: a program written for that format is compiled into a dynamic link library, and the system calls the library to extract the field information from the file;

the normalization processing module is used for generating the standard data file format defined by the system and for generating the related attributes of the data according to different rules;

the sorting processing module is mainly used for outputting data to different output channels according to the configuration rule of a user in a configuration file;

the text file output module is mainly used for writing the field content output by the sorting processing module into a specified text file according to the configuration requirement;

the mysql output module has the main function of outputting the field content output by the sorting processing module to a specified mysql database according to the configuration requirement;

the kafka output module has the main function of writing the field content output by the sorting processing module into a specified kafka message queue according to the configuration requirement;

the log management module is used for writing information about the processed files into a database and writing the abnormal information produced during processing into a log file; this gives the system a basis for flow monitoring and allows statistics to be compiled on abnormal data, so that a user can conveniently analyze and understand how the service is running.
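The logging behaviour can be sketched as below. This is an assumption-laden illustration: the patent does not name the database or its schema, so an in-memory SQLite database from the Python standard library is swapped in, and the table name `processed_files`, its columns and the logger name `preprocess` are all hypothetical. Attaching a file handler to the logger is left to the caller.

```python
import logging
import sqlite3
from typing import List, Tuple

def open_log_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the (stand-in) log database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_files "
        "(name TEXT, records INTEGER, errors INTEGER, status TEXT)"
    )
    return conn

def record_file(conn: sqlite3.Connection, name: str, records: int, errors: int) -> None:
    """Store one processed file; abnormal files are also reported to the logger."""
    status = "ok" if errors == 0 else "error"
    conn.execute("INSERT INTO processed_files VALUES (?, ?, ?, ?)",
                 (name, records, errors, status))
    conn.commit()
    if errors:
        logging.getLogger("preprocess").error("%s: %d abnormal records", name, errors)

def error_summary(conn: sqlite3.Connection) -> List[Tuple[str, int, int]]:
    """Per-status file count and total abnormal records, for monitoring."""
    return conn.execute(
        "SELECT status, COUNT(*), SUM(errors) FROM processed_files "
        "GROUP BY status ORDER BY status"
    ).fetchall()
```

A summary query of this kind is one way the stored information could support the flow monitoring and abnormal-data statistics mentioned in the text.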

The supported processing modes for source files are the parsing mode, the separator mode, the EXCEL mode and the dynamic link mode, so files in various formats can be processed.

The normalization and sorting of the original data file allow specified content to be output to the required storage medium, and the supported storage media comprise text files, mysql databases and kafka message systems; the output mode is read from the configuration file, so that the processed standard data can be output selectively to a text file, a mysql database or a kafka message system.

The data preprocessing system processes the data in the original data file through modules such as normalization and sorting, organizes the data by user-defined methods, and stores the processed data in a specified medium to provide standard data for a subsequent system. The system also provides a basis for flow monitoring and allows statistics to be compiled on abnormal data, so that a user can conveniently analyze and understand how the service is running;

the program is started with a parameter, namely the channel number. The central processing module calls the configuration file management module, finds the configuration of that channel in the configuration file according to the channel number, and reads the input directory, output information, backup directory, error directory and other information; the configuration of the corresponding data processing module is then found according to the input directory.

The program then searches for source data files in the specified input path, and if none is found it continues searching. If a data file is found, the corresponding data processing module is called according to the data processing type to process the file. If processing fails, the error is written to the log and the program continues searching for new files;

when a source data file is processed, the file name is first prefixed with tmp to indicate that the file is being processed. Data is then acquired using the processing module's data acquisition function and the configured source data format, and each acquired record is normalized and sorted in turn. If any module reports a problem, the input data file is removed and the file being processed is placed in the error file directory as an error file; if the file content has been read completely, the data file is backed up to the backup directory as normal, and the program then returns to searching for data files;
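The file lifecycle above (mark as in progress, process, then move to the backup or error directory) might look like the following sketch. The directory layout, the `tmp` prefix handling and the function names are assumptions for illustration, and moving the failed file under its original name is one possible reading of the error-handling step.

```python
import logging
import shutil
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple

def process_file(path: Path, handler: Callable[[Path], None],
                 backup_dir: Path, error_dir: Path) -> bool:
    """Prefix the file with tmp while it is processed, then back it up or quarantine it."""
    working = path.with_name("tmp" + path.name)
    path.rename(working)  # marks the file as being processed
    try:
        handler(working)
    except Exception:
        logging.getLogger("preprocess").exception("failed to process %s", path.name)
        shutil.move(str(working), str(error_dir / path.name))
        return False
    shutil.move(str(working), str(backup_dir / path.name))
    return True

def demo() -> Tuple[List[str], List[str], List[str]]:
    with tempfile.TemporaryDirectory() as root:
        base = Path(root)
        inbox, backup, error = base / "in", base / "backup", base / "error"
        for d in (inbox, backup, error):
            d.mkdir()
        (inbox / "good.txt").write_text("ok\n", encoding="utf-8")
        (inbox / "bad.txt").write_text("corrupt\n", encoding="utf-8")

        def handler(p: Path) -> None:
            # Stand-in for acquire/normalize/sort: fail on a marker string.
            if "corrupt" in p.read_text(encoding="utf-8"):
                raise ValueError("bad record")

        for f in sorted(inbox.iterdir()):
            if not f.name.startswith("tmp"):  # skip files already in progress
                process_file(f, handler, backup, error)
        return ([p.name for p in backup.iterdir()],
                [p.name for p in error.iterdir()],
                [p.name for p in inbox.iterdir()])
```

Skipping names that start with `tmp` during the scan keeps a second scan from picking up a file that is still being processed.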

data normalization generates the standard data format defined by the system and generates the related attributes of the data according to different rules; all normalization is driven by the normalization configuration information and the normalization script file configured in the system configuration file;
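As an illustration of rule-driven normalization, the sketch below maps a raw record to a standard record and derives one extra attribute. The rule table, field names, date format and threshold are hypothetical stand-ins for the normalization configuration and script file, whose actual form the patent does not describe.

```python
from datetime import datetime
from typing import Callable, Dict

# Hypothetical per-field rules: standard field name -> function of the raw record.
RULES: Dict[str, Callable[[dict], object]] = {
    "id": lambda r: int(r["id"]),
    "name": lambda r: r.get("name", "").strip().upper() or "UNKNOWN",
    "amount": lambda r: round(float(r.get("amount") or 0), 2),  # missing value -> 0.0
    "event_date": lambda r: datetime.strptime(r["date"], "%Y%m%d").date().isoformat(),
}

def normalize(raw: dict) -> dict:
    """Apply each rule, then add a derived attribute computed from the standard fields."""
    record = {field: rule(raw) for field, rule in RULES.items()}
    record["amount_band"] = "high" if record["amount"] >= 100 else "low"
    return record
```

Here an incomplete record (empty amount) and an irregular one (padded name, compact date) both come out in the same standard shape.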

the data sorting module is mainly used for outputting the standard data files to different output channels according to the rules in the configuration file, so that the data content is filtered dynamically. All filtering and splitting is performed by the sorting-output script file configured in the system configuration file;
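The sorting step can be sketched as a set of rules that each pair a filter with an output channel and a field selection. The rule structure (`channel`, `when`, `fields`) is an assumption for the example; in the described system these rules would come from the configured sorting script.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical sorting rules: which records go to which channel, with which fields.
SORT_RULES = [
    {"channel": "text", "when": lambda r: True, "fields": ["id", "name"]},
    {"channel": "kafka", "when": lambda r: r["amount"] >= 100, "fields": ["id", "amount"]},
]

def sort_records(records: List[dict], rules: List[dict]) -> Dict[str, List[dict]]:
    """Route every record to each channel whose filter accepts it."""
    routed: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        for rule in rules:
            if rule["when"](record):
                routed[rule["channel"]].append({k: record[k] for k in rule["fields"]})
    return dict(routed)

RECORDS = [
    {"id": 1, "name": "A", "amount": 50},
    {"id": 2, "name": "B", "amount": 150},
]
```

A record may match several rules, so the same source record can be sent to more than one output channel with a different field selection each time.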

the sorted standard data is then passed to the output module selected by the configuration and written to the specified storage medium. The output storage medium may be a text file, a mysql database, or a kafka message system;
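A sketch of output selection follows. Only the text file writer is implemented; the mysql and kafka writers are deliberately left as placeholders because they depend on external client libraries that the patent does not name. The `type` and `target` keys and all function names are assumptions.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

def write_text(records: List[dict], target: str, sep: str = "|") -> None:
    """Append one line per record, joining field values with the separator."""
    with open(target, "a", encoding="utf-8") as fh:
        for record in records:
            fh.write(sep.join(str(v) for v in record.values()) + "\n")

def write_mysql(records: List[dict], target: str) -> None:
    raise NotImplementedError("placeholder: requires a MySQL client library")

def write_kafka(records: List[dict], target: str) -> None:
    raise NotImplementedError("placeholder: requires a Kafka client library")

WRITERS: Dict[str, Callable[[List[dict], str], None]] = {
    "text": write_text, "mysql": write_mysql, "kafka": write_kafka,
}

def emit(records: List[dict], output_cfg: dict) -> None:
    """Pick the writer named in the output configuration and hand it the records."""
    kind = output_cfg["type"]
    if kind not in WRITERS:
        raise ValueError(f"unsupported output type: {kind}")
    WRITERS[kind](records, output_cfg["target"])

def demo() -> str:
    with tempfile.TemporaryDirectory() as d:
        target = Path(d) / "out.txt"
        emit([{"id": 1, "name": "A"}, {"id": 2, "name": "B"}],
             {"type": "text", "target": str(target)})
        return target.read_text(encoding="utf-8")
```

Because each writer has the same signature, the sorting step does not need to know which storage medium a channel ends up in.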

the log module writes the processing status into a database and a file, which provides a basis for flow monitoring and allows statistics to be compiled on abnormal data, so that a user can conveniently analyze and understand how the service is running;

through its system design, the invention processes source data into the standard data required by subsequent systems, supports many source data formats, offers a rich set of output media and a high degree of intelligence, can be put into use by writing configuration and scripts, and is suitable for wide adoption.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
