Data acquisition method and device

Document No.: 1324303 Publication date: 2020-07-14

Reading note: This technology, "Data acquisition method and device", was designed by 赖太平 and 林晓光 on 2020-03-11. Main content: The invention discloses a data acquisition method and device, and relates to the technical field of information processing. One embodiment of the method comprises: configuring crawler rules through a configuration file; based on the configured crawler rules, acquiring data files from one or more websites to be crawled according to the resource locators corresponding to those websites; parsing each data file into a text file according to the type of the data file; and extracting, from the text file, data corresponding to one or more target words in a preset target word library. This embodiment makes crawler rule configuration more flexible and avoids modifying large amounts of code, saving time and effort.

1. A method of data acquisition, comprising:

configuring a crawler rule through a configuration file;

based on the configured crawler rule, acquiring a data file from one or more websites to be crawled according to resource locators corresponding to the websites to be crawled;

parsing the data file into a text file according to the type of the data file;

and extracting data corresponding to one or more target words in the target word library from the text file according to a preset target word library.

2. The data acquisition method according to claim 1,

wherein the crawler rules include: a crawler name, a crawling path, a crawling element, a keyword corresponding to the crawling element, and a crawling length corresponding to the crawling element.

3. The data acquisition method according to claim 1,

and extracting data corresponding to one or more target words in the target word bank from the text file by using a regular matching rule according to a preset target word bank.

4. The data acquisition method according to claim 3,

and extracting data corresponding to one or more target words in the target word library from the text file by using a natural language processing technology according to a preset target word library.

5. The data acquisition method according to claim 4, wherein the extracting data corresponding to one or more target words in the target lexicon from the text file by using a natural language processing technology according to a preset target lexicon comprises:

performing word segmentation processing on the text file;

acquiring one or more synonyms corresponding to the target word from the word segmentation processing result corresponding to the text file;

disambiguating the one or more synonyms;

and extracting data corresponding to the target word from the text file according to the target word and one or more synonyms corresponding to the target word after the ambiguity is eliminated.

6. The data acquisition method according to claim 5, further comprising:

calculating the data extraction rate and the data accuracy rate corresponding to the target words or synonyms according to the extracted data corresponding to the target words or synonyms;

and adjusting the target word bank and the regular matching rule according to the data extraction rate and the data accuracy rate.

7. The data acquisition method according to claim 1,

wherein the data file is one or more of an HTML file, a DOC file, a DOCX file, a PDF file and a picture file.

8. A data acquisition apparatus, comprising: a crawler rule configuration module, a data file acquisition module, a data file parsing module and a data extraction module; wherein

the crawler rule configuration module is used for configuring crawler rules through configuration files;

the data file acquisition module is used for acquiring data files from the websites to be crawled according to the resource locators corresponding to one or more websites to be crawled on the basis of the configured crawler rules;

the data file parsing module is used for parsing the data file into a text file according to the type of the data file;

and the data extraction module is used for extracting data corresponding to one or more target words in the target word bank from the text file according to a preset target word bank.

9. A data acquisition electronic device, comprising:

one or more processors;

a storage device for storing one or more programs,

the one or more programs, when executed by the one or more processors, implement the method of any of claims 1-7.

10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.

Technical Field

The present invention relates to the field of information processing technologies, and in particular, to a data acquisition method and apparatus.

Background

With the development of information technology, more and more data are published or disclosed through a website platform, and in order to obtain various public data in time and make decisions based on data analysis, data need to be collected from various websites one by one. At present, a commonly used method for collecting data from each website is to manually collect data or crawl data from each website by using a general crawler technology, that is, a crawler entry is configured through a crawler technology framework, and crawler rules are written into crawler codes to run so as to crawl data from each website.

Because the data structures of websites differ, and different websites present their data in different forms, the crawler rules need to be adapted to each website. However, because the crawler rules are generally set in the crawler code, a large amount of code must be modified whenever the crawler rules change, which is time-consuming and labor-intensive, and the crawler rules cannot be flexibly configured for different websites.

Disclosure of Invention

In view of this, the present invention provides a data acquisition method, which configures a crawler rule through a configuration file, so that the crawler rule can be flexibly configured according to different websites without modifying a large number of codes.

To achieve the above object, according to a first aspect of the present invention, there is provided a data acquisition method including:

configuring a crawler rule through a configuration file;

based on the configured crawler rule, acquiring a data file from one or more websites to be crawled according to resource locators corresponding to the websites to be crawled;

parsing the data file into a text file according to the type of the data file;

and extracting data corresponding to one or more target words in the target word library from the text file according to a preset target word library.

Optionally, the crawler rules include: a crawler name, a crawling path, a crawling element, a keyword corresponding to the crawling element, and a crawling length corresponding to the crawling element.

Optionally, according to a preset target word library, extracting data corresponding to one or more target words in the target word library from the text file by using a regular matching rule.

Optionally, according to a preset target word library, extracting data corresponding to one or more target words in the target word library from the text file by using a natural language processing technology.

Optionally, the extracting, according to a preset target word bank, data corresponding to one or more target words in the target word bank from the text file by using a natural language processing technology includes:

performing word segmentation processing on the text file;

acquiring one or more synonyms corresponding to the target word from the word segmentation processing result corresponding to the text file;

disambiguating the one or more synonyms;

and extracting data corresponding to the target word from the text file according to the target word and one or more synonyms corresponding to the target word after the ambiguity is eliminated.

Optionally, the method further comprises:

calculating the data extraction rate and the data accuracy rate corresponding to the target words or synonyms according to the extracted data corresponding to the target words or synonyms;

and adjusting the target word bank and the regular matching rule according to the data extraction rate and the data accuracy rate.

Optionally, the data file is one or more of an HTML file, a DOC file, a DOCX file, a PDF file, and a picture file.

To achieve the above object, according to a second aspect of the present invention, there is provided a data acquisition apparatus comprising: a crawler rule configuration module, a data file acquisition module, a data file parsing module and a data extraction module; wherein

the crawler rule configuration module is used for configuring crawler rules through configuration files;

the data file parsing module is used for parsing the data file into a text file according to the type of the data file;

and the data extraction module is used for extracting data corresponding to one or more target words in the target word bank from the text file according to a preset target word bank.

To achieve the above object, according to a third aspect of the present invention, there is provided a data acquisition electronic device including: one or more processors; a storage device to store one or more programs which, when executed by the one or more processors, implement the method as in any of the data acquisition methods described above.

To achieve the above object, according to a fourth aspect of the present invention, there is provided a computer-readable medium having stored thereon a computer program which, when executed by a processor, implements any one of the data acquisition methods described above.

The invention has the following advantages or beneficial effects: because the technical means of configuring the crawler rules through the configuration file is adopted, the technical problem that a large amount of codes are required to be modified to adapt to the data structure of the website when data are acquired from each website is solved, the technical effect of flexibly configuring the crawler rules according to the data structure of the website is achieved, and time and labor are saved. In addition, after the data file is acquired from the website and is analyzed into the pure text file, the data is extracted from the text file by combining the regular matching rule and the natural language processing technology, so that the extracted data is more comprehensive and more accurate.

Further effects of the optional implementations mentioned above are described below in connection with the embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a schematic diagram of a main flow of a data acquisition method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a main flow of a method for extracting data by natural language processing technology according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a main flow of another data acquisition method according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the main modules of a data acquisition device according to an embodiment of the present invention;

FIG. 5 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;

FIG. 6 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

Fig. 1 is a schematic diagram of a main flow of a data acquisition method according to an embodiment of the present invention, and as shown in fig. 1, the data acquisition method may specifically include the following steps:

Step S101, configuring the crawler rules through a configuration file. The crawler rules include, but are not limited to: a crawler name, a crawling path, a crawling element, a keyword corresponding to the crawling element, and a crawling length corresponding to the crawling element. For example, to acquire data from the websites of various local governments, the crawler rules are configured as follows:

the crawler is named "universal", and the crawling path comprises a resource locator (i.e., URL) http://www.dh.gov.cn/czj/Web/search.aspx.
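As a minimal sketch of what such a configuration file might look like, the example below stores one crawler rule as JSON and loads it with a small validator. The key names (`name`, `path`, `elements`, `element`, `keyword`, `length`), the element value `title`, and the function `load_crawler_rule` are illustrative assumptions; the source does not specify a file format or field names.

```python
import json

# Hypothetical configuration file content. Only the crawler name "universal"
# and the URL come from the example above; every key name is an assumption.
CONFIG_TEXT = """
{
  "name": "universal",
  "path": "http://www.dh.gov.cn/czj/Web/search.aspx",
  "elements": [
    {"element": "title",
     "keyword": "total regional production value",
     "length": 10}
  ]
}
"""

REQUIRED_RULE_KEYS = {"name", "path", "elements"}
REQUIRED_ELEMENT_KEYS = {"element", "keyword", "length"}


def load_crawler_rule(text):
    """Parse a crawler rule from JSON text and check that all fields exist."""
    rule = json.loads(text)
    missing = REQUIRED_RULE_KEYS - rule.keys()
    if missing:
        raise ValueError(f"missing crawler rule keys: {sorted(missing)}")
    for element in rule["elements"]:
        missing = REQUIRED_ELEMENT_KEYS - element.keys()
        if missing:
            raise ValueError(f"missing element keys: {sorted(missing)}")
    return rule


rule = load_crawler_rule(CONFIG_TEXT)
print(rule["name"], rule["path"])
```

Under this sketch, adapting the crawler to another website means editing the JSON text rather than the crawler code, which is the flexibility the method aims for.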

It can be understood that, to ensure the accuracy and comprehensiveness of the acquired data, a target word library must be prepared before the crawler rules are configured. The target word library includes one or more target words. Taking the acquisition of the total regional production value of each city as an example, the corresponding target words include "total regional production value", and the data to be acquired is the numerical value corresponding to the "total regional production value" of each city. In addition, the URLs of the websites on which each city may disclose its "total regional production value" must also be prepared.

For example, if the URL is http://tjj.xm.gov.cn/tjzl/ndgb, acquiring the data file from the website to be crawled means using Scrapy with the URL as the entry point, entering the corresponding website through the configured crawler rules, and acquiring the relevant content disclosed in the "20XX Xiamen Statistical Communique on National Economic and Social Development".

In addition, to increase the data acquisition speed, the title of the corresponding web page can be extracted in advance according to the URL, and whether the page may contain the data to be acquired is judged from the title. If so, the corresponding data file is acquired; if not, processing moves directly to the next URL. Still taking the total regional production value as an example, it can be preliminarily judged whether the title extracted via the URL contains words such as "communique", "financial report", "economy", or "budget", and thereby whether to go on to acquire the data file corresponding to that URL.
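The title pre-filter can be sketched as follows. The hint words are the ones named above; the function names, the case-insensitive matching, and the example.com URLs are assumptions made for illustration.

```python
# Hint words taken from the example above.
TITLE_HINTS = ("communique", "financial report", "economy", "budget")


def title_may_contain_target(title, hints=TITLE_HINTS):
    """Return True if the page title contains at least one hint word."""
    lowered = title.lower()
    return any(hint in lowered for hint in hints)


def select_urls(titles_by_url, hints=TITLE_HINTS):
    """Keep only the URLs whose page title passes the pre-filter."""
    return [
        url
        for url, title in titles_by_url.items()
        if title_may_contain_target(title, hints)
    ]


# Hypothetical URLs and titles, used only to exercise the filter.
candidates = {
    "http://example.com/a": "20XX Statistical Communique",
    "http://example.com/b": "Notice on office relocation",
    "http://example.com/c": "Municipal Budget Report",
}
selected = select_urls(candidates)
print(selected)
```

Only the selected URLs would then be passed on to the slower step of downloading and parsing the data files.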

Step S103, parsing the data file into a text file according to the type of the data file.

The data file is one or more of an HTML file, a DOC file, a DOCX file, a PDF file, and a picture file. Each type of data file is parsed into a plain text file by calling the parsing method corresponding to that type.
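One way to organize type-specific parsing is a dispatch table keyed by file type, sketched below. Only the HTML branch is implemented, using the Python standard library; the DOC, DOCX, PDF, and picture branches are placeholders, because they would need third-party parsers or OCR that the source does not name. All function names here are hypothetical.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML document (scripts included)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def parse_html(raw):
    """Turn HTML bytes into plain text by dropping the tags."""
    collector = _TextCollector()
    collector.feed(raw.decode("utf-8", errors="replace"))
    collector.close()
    return "".join(collector.parts)


def _placeholder(kind):
    """Build a stub parser for a type this sketch does not implement."""
    def parser(raw):
        raise NotImplementedError(f"plug in a {kind} parser here")
    return parser


# Dispatch table: file type -> parsing function.
PARSERS = {
    "html": parse_html,
    "htm": parse_html,
    "doc": _placeholder("DOC"),
    "docx": _placeholder("DOCX"),
    "pdf": _placeholder("PDF"),
    "png": _placeholder("picture (OCR)"),
    "jpg": _placeholder("picture (OCR)"),
}


def parse_to_text(raw, file_type):
    """Select the parser by file type and return plain text."""
    key = file_type.lower().lstrip(".")
    if key not in PARSERS:
        raise ValueError(f"unsupported file type: {file_type}")
    return PARSERS[key](raw)


print(parse_to_text(b"<p>Hello <b>world</b></p>", "HTML"))
```

Supporting a new file type then only requires registering one more entry in `PARSERS`.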

Furthermore, to improve the accuracy of extracting data from the parsed plain text file, the plain text file is cleaned after parsing to remove interference symbols such as superscripts, subscripts, "\n", and "\r".
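A minimal cleaning sketch is shown below. It assumes that the interference symbols are Unicode superscript and subscript characters plus carriage returns and line feeds; the exact symbol set and the function name `clean_text` are assumptions, not taken from the source.

```python
import re

# Unicode superscripts and subscripts: U+00B2, U+00B3, U+00B9 and the
# "Superscripts and Subscripts" block U+2070 to U+209F.
_SUPER_SUB = re.compile("[\u00b2\u00b3\u00b9\u2070-\u209f]")
_LINE_BREAKS = re.compile(r"[\r\n]+")


def clean_text(text):
    """Remove superscripts, subscripts, and line breaks from plain text."""
    text = _SUPER_SUB.sub("", text)
    # Line breaks are dropped outright, which suits text without spaces
    # between words; use " " as the replacement for space-delimited text.
    return _LINE_BREAKS.sub("", text)


print(clean_text("GDP\u00b9\r\n100\u2082"))
```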

Step S104, extracting data corresponding to one or more target words in the target word library from the text file according to a preset target word library. It can be understood that, because data files are diverse, the text files obtained by parsing may be classified by data content or by the form in which the data is presented, so that the extracted data can be analyzed in a more targeted way. For example, financial report text files may be grouped into one class and communique text files into another.

In an optional implementation manner, according to a preset target word library, data corresponding to one or more target words in the target word library is extracted from the text file by using a regular matching rule. Specifically, still taking the "total regional production value" as an example, possible expressions include "the total regional production value reached ……" and "the total regional production value for the whole year reached ……", and the corresponding regular matching rule may be "total regional production value: XXXX yuan", i.e., extracting content that contains "total regional production value" followed by one or more digits.
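The regular matching rule described above can be sketched with Python's `re` module. The pattern below looks for the target word, allows a short run of non-digit characters, and then captures a number. The gap limit of 20 characters, the function names, and the sample sentence with its figure are all made up for illustration.

```python
import re


def build_pattern(target_word, max_gap=20):
    """Target word, then up to max_gap non-digit characters, then a number."""
    return re.compile(
        re.escape(target_word) + r"\D{0,%d}?(\d[\d,]*(?:\.\d+)?)" % max_gap
    )


def extract_values(text, target_words):
    """Return the first number found after each target word, if any."""
    found = {}
    for word in target_words:
        match = build_pattern(word).search(text)
        if match:
            found[word] = match.group(1)
    return found


# Invented sample sentence; the figure is not real data.
sample = "In 20XX the total regional production value reached 1234.5 yuan."
values = extract_values(sample, ["total regional production value", "population"])
print(values)
```

Target words that do not occur in the text, such as "population" here, are simply absent from the result.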

Specifically, the extracting data corresponding to one or more target words in a target word bank from the text file by using a natural language processing technology according to a preset target word bank includes: performing word segmentation processing on the text file; acquiring one or more synonyms corresponding to the target word from the word segmentation processing result corresponding to the text file; disambiguating the one or more synonyms; and extracting data corresponding to the target word from the text file according to the target word and one or more synonyms corresponding to the target word after the ambiguity is eliminated.

Word segmentation refers to the process of recombining a continuous character sequence into a word sequence according to a given specification. Taking the target word "population" as an example, synonym expansion may yield results such as "number of people". Disambiguation of a synonym refers to resolving the multiple meanings of a word according to its context. After the synonyms are obtained, the window size is determined, that is, the length (e.g., 10) of the character string to be extracted before and after the target word or synonym in the text file, and the corresponding data is then extracted from the text file according to the synonym or the target word.
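The first three steps (segmentation, synonym lookup, disambiguation) can be illustrated with a toy sketch. Everything here is a simplifying assumption: the segmenter is a regular expression that only suits space-delimited text, the synonym table and context cue words are hypothetical, and disambiguation is reduced to checking whether a cue word appears near the synonym. A real system would use a proper word segmenter and a thesaurus or language model.

```python
import re

# Hypothetical synonym table and context cues, for illustration only.
SYNONYMS = {"population": {"residents", "inhabitants"}}
CONTEXT_CUES = {"total", "million", "thousand", "number"}


def segment(text):
    """Toy word segmentation: lowercase words and numbers."""
    return re.findall(r"[a-z]+|\d+(?:\.\d+)?", text.lower())


def find_synonyms(tokens, target_word):
    """Synonyms of the target word that actually occur in the tokens."""
    present = set(tokens)
    return sorted(s for s in SYNONYMS.get(target_word, ()) if s in present)


def disambiguate(tokens, synonyms, radius=2):
    """Keep a synonym only if a cue word appears within `radius` tokens."""
    kept = []
    for synonym in synonyms:
        for i, token in enumerate(tokens):
            if token != synonym:
                continue
            context = tokens[max(0, i - radius): i + radius + 1]
            if CONTEXT_CUES & set(context):
                kept.append(synonym)
                break
    return kept


text = "The city had 4.29 million residents. Local inhabitants enjoyed the festival."
tokens = segment(text)
candidates = find_synonyms(tokens, "population")
confirmed = disambiguate(tokens, candidates)
print(candidates, confirmed)
```

In this invented sentence both synonyms are found, but only "residents" sits next to a numeric cue, so "inhabitants" is discarded as not referring to a population figure.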

In an optional embodiment, the method further comprises: calculating the data extraction rate and the data accuracy rate corresponding to the target words or synonyms according to the extracted data corresponding to the one or more target words; and adjusting the target word library and the regular matching rule according to the data extraction rate and the data accuracy rate. The data extraction rate corresponding to a target word or synonym is the ratio of the number of times a character string is obtained by using the target word or synonym to the total number of times data extraction is attempted with it; the data accuracy rate corresponding to a target word or synonym is the ratio of the number of times correct data is extracted by using the target word or synonym to the total number of times data extraction is attempted with it. Specifically, taking the extraction of the population figure by using the target word "population" as an example, if data extraction with "population" is attempted 100 times in total, a character string containing "population" is obtained 80 times, and the character string contains a specific population value 50 times, then the data extraction rate corresponding to the target word "population" is 80% and the data accuracy rate is 50%.
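The worked example can be reproduced with a short calculation. Following the figures in the example, both rates are taken over the total number of extraction attempts; this is one reading of the definitions, and the function and parameter names are hypothetical.

```python
def extraction_metrics(total_attempts, strings_obtained, correct_values):
    """Return (data extraction rate, data accuracy rate) as fractions."""
    if total_attempts <= 0:
        raise ValueError("total_attempts must be positive")
    extraction_rate = strings_obtained / total_attempts
    accuracy_rate = correct_values / total_attempts
    return extraction_rate, accuracy_rate


# Figures from the "population" example: 100 attempts, 80 strings, 50 values.
extraction_rate, accuracy_rate = extraction_metrics(100, 80, 50)
print(f"{extraction_rate:.0%} {accuracy_rate:.0%}")
```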

Based on this, the data extraction rate and the data accuracy rate corresponding to a target word or synonym reflect, to a certain extent, the quality of the target word or synonym and of the corresponding data extraction rule, so the target word library or the regular matching rule can be adjusted according to them. It can be understood that the higher the data extraction rate or the data accuracy rate, the higher the quality of the data obtained by using the target word or synonym, so the target word can be retained in the target word library, or the synonym can be added to it. If the data extraction rate is high but the data accuracy rate is low, the extraction range is too wide or the target word or synonym is not suitable for extracting data, and priority should be given to shortening the length of the character string matched by the regular matching rule or by the natural language processing technology. If the data extraction rate or the data accuracy rate is low, the corresponding target word is deleted from the target word library directly.
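These adjustment rules can be expressed as a small decision function. The source gives no numeric thresholds, so the `high` and `low` cut-offs, the action labels, and the extra "review" outcome for middling values are all assumptions of this sketch.

```python
def suggest_adjustment(extraction_rate, accuracy_rate, high=0.7, low=0.3):
    """Map the two rates to a suggested action for a target word or synonym.

    The thresholds are hypothetical; the source does not specify any.
    """
    if extraction_rate >= high and accuracy_rate >= high:
        return "keep"            # retain the word, or add the synonym
    if extraction_rate >= high and accuracy_rate < low:
        return "shorten_match"   # extraction range too wide: shrink length
    if extraction_rate < low or accuracy_rate < low:
        return "delete"          # remove the word from the library
    return "review"              # neither clearly good nor clearly bad


for rates in [(0.9, 0.9), (0.9, 0.2), (0.1, 0.9), (0.5, 0.5)]:
    print(rates, suggest_adjustment(*rates))
```

The high-extraction, low-accuracy case is checked before the deletion case so that a word that matches often but imprecisely gets its match length shortened first rather than being removed.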

It should be noted that, in addition to the data extraction rate and the data accuracy rate, other user-defined indexes may be used to adjust the target word library or the regular matching rule. In this way, the target word library and the matching rules can be continuously improved, and the accuracy and effectiveness of data extraction can be improved.

Based on the embodiment, the technical means of configuring the crawler rules through the configuration file is adopted, so that the technical problem that a large amount of codes are required to be modified to adapt to the data structure of the website when data are acquired from each website is solved, the technical effect of flexibly configuring the crawler rules according to the data structure of the website is achieved, and time and labor are saved. In addition, after the data file is acquired from the website and is analyzed into the pure text file, the data is extracted from the text file by combining the regular matching rule and the natural language processing technology, so that the extracted data is more comprehensive and more accurate.

Referring to fig. 2, on the basis of the foregoing embodiment, an embodiment of the present invention provides a method for extracting data by using a natural language processing technology, where the method may specifically include the following steps:

Step S201, performing word segmentation processing on the text file. In this way, the continuous character sequences in the text file are recombined into word sequences according to a given specification.

Step S202, one or more synonyms corresponding to the target word are obtained from the word segmentation processing result corresponding to the text file. Therefore, the target words can be expanded as much as possible through the word segmentation result of the text file based on the existing target words, and therefore comprehensiveness and accuracy of data extraction can be improved.

Step S203, performing disambiguation on the one or more synonyms. Specifically, the false synonyms caused by word ambiguity are eliminated in the text file by using the context of the synonyms, so that the accuracy of the expanded synonyms can be ensured.

Step S204, extracting data corresponding to the target word from the text file according to the target word and one or more synonyms corresponding to the target word after disambiguation. Specifically, the window size, that is, the length (e.g., 10) of the character string to be extracted before and after the target word or synonym in the text file, may be determined first, and the corresponding data is then extracted from the text file according to the synonym or the target word.
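The window-based extraction in step S204 can be sketched as follows: for every occurrence of the target word or a synonym, the function returns the term together with a fixed number of characters on each side. The function name and the sample sentence are assumptions; the window length of 10 follows the example in the text.

```python
def extract_windows(text, terms, window=10):
    """Return (term, context) pairs for every occurrence of each term.

    The context is the term plus up to `window` characters on each side.
    """
    results = []
    for term in terms:
        start = text.find(term)
        while start != -1:
            lo = max(0, start - window)
            hi = min(len(text), start + len(term) + window)
            results.append((term, text[lo:hi]))
            start = text.find(term, start + 1)
    return results


# Invented sample sentence; the figure is not real data.
sample = "By year end the population was 4290000 in total."
windows = extract_windows(sample, ["population", "residents"])
print(windows)
```

A larger window captures more context but also more noise, which is why the window length is one of the parameters the method adjusts based on the data extraction rate and the data accuracy rate.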

Referring to fig. 3, on the basis of the foregoing embodiment, an embodiment of the present invention provides another data acquisition method, which may specifically include the following steps:

Step S301, configuring crawler rules through a configuration file.

Step S302, based on the configured crawler rule, obtaining a data file from one or more websites to be crawled according to resource locators corresponding to the websites to be crawled.

Step S303, parsing the data file into a text file according to the type of the data file.

Step S304, extracting data corresponding to one or more target words in the target word library from the text file according to a preset target word library. On one hand, extracting data corresponding to one or more target words in a target word library from the text file by using a regular matching rule; on the other hand, data corresponding to one or more target words in the target word bank is extracted from the text file by using a natural language processing technology.

More specifically, extracting data corresponding to one or more target words in a target word library from a text file by using a natural language processing technology comprises the following steps: performing word segmentation processing on the text file; acquiring one or more synonyms corresponding to the target word from the word segmentation processing result corresponding to the text file; disambiguating one or more synonyms; and extracting data corresponding to the target words from the text file according to the target words and one or more synonyms corresponding to the target words after the ambiguity is eliminated.

Step S305, calculating the data extraction rate and the data accuracy rate corresponding to the target word or the synonym according to the data corresponding to the extracted target word or the synonym; and adjusting the target word library and the regular matching rule according to the data extraction rate and the data accuracy rate. In this way, the target word library and the regular matching rule can be continuously refined. It can be understood that, in addition to adjusting the regular matching rule, the length of the character string extracted by the natural language processing technology may also be adjusted according to the data extraction rate and the data accuracy rate corresponding to the target word or the synonym.

Referring to fig. 4, on the basis of the foregoing embodiment, an embodiment of the present invention provides a data acquisition apparatus 400, which includes a crawler rule configuration module 401, a data file acquisition module 402, a data file parsing module 403, and a data extraction module 404; wherein

the crawler rule configuration module 401 is configured to configure a crawler rule through a configuration file;

the data file acquisition module 402 is configured to acquire a data file from one or more websites to be crawled according to resource locators corresponding to the websites to be crawled based on the configured crawler rule;

the data file parsing module 403 is configured to parse the data file into a text file according to the type of the data file;

the data extraction module 404 is configured to extract, from the text file, data corresponding to one or more target words in a target word bank according to a preset target word bank.
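The division into four modules can be mirrored in code as four pluggable callables wired together by a small container class. This is only a structural sketch: the class name, attribute names, and the stub functions used in the usage example are hypothetical, and the stubs stand in for real configuration loading, crawling, parsing, and extraction.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DataAcquisitionApparatus:
    """Wires the four modules (401 to 404) into one pipeline."""

    configure_rules: Callable[[str], dict]          # module 401
    acquire_files: Callable[[dict], List[bytes]]    # module 402
    parse_file: Callable[[bytes], str]              # module 403
    extract_data: Callable[[str], Dict[str, str]]   # module 404

    def run(self, config_text):
        """Run configuration, acquisition, parsing, extraction in order."""
        rules = self.configure_rules(config_text)
        results = []
        for raw in self.acquire_files(rules):
            text = self.parse_file(raw)
            results.append(self.extract_data(text))
        return results


# Stub modules, used only to show how the pieces connect.
apparatus = DataAcquisitionApparatus(
    configure_rules=lambda text: {"name": text.strip()},
    acquire_files=lambda rules: [b"population 100", b"population 200"],
    parse_file=lambda raw: raw.decode("utf-8"),
    extract_data=lambda text: {"population": text.split()[-1]},
)
output = apparatus.run("universal")
print(output)
```

Because each module is passed in as a callable, any one of them can be replaced, for example to support a new file type, without touching the others.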

In an alternative embodiment, the crawler rules include: a crawler name, a crawling path, a crawling element, a keyword corresponding to the crawling element, and a crawling length corresponding to the crawling element.

In an optional implementation manner, the data extraction module 404 is configured to extract, according to a preset target word library, data corresponding to one or more target words in the target word library from the text file by using a regular matching rule.

In an optional implementation manner, the extracting, according to a preset target word library, data corresponding to one or more target words in the target word library from the text file by using a natural language processing technology includes: performing word segmentation processing on the text file; acquiring one or more synonyms corresponding to the target word from the word segmentation processing result corresponding to the text file; disambiguating the one or more synonyms; and extracting data corresponding to the target word from the text file according to the target word and one or more synonyms corresponding to the target word after the ambiguity is eliminated.

In an optional implementation manner, the data extraction module 404 is further configured to calculate a data extraction rate and a data accuracy rate corresponding to the target word or the synonym according to the data corresponding to the extracted target word or the synonym; and adjusting the target word bank and the regular matching rule according to the data extraction rate and the data accuracy rate.

In an alternative embodiment, the data file is one or more of an HTML file, a DOC file, a DOCX file, a PDF file, and a picture file.

Fig. 5 illustrates an exemplary system architecture 500 to which the data acquisition method or data acquisition apparatus of embodiments of the present invention may be applied.

As shown in fig. 5, the system architecture 500 may include terminal devices 501, 502, 503, a network 504, and a server 505. The network 504 serves as the medium providing communication links between the terminal devices 501, 502, 503 and the server 505. The network 504 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

The user may use the terminal devices 501, 502, 503 to interact with a server 505 over a network 504 to receive or send messages or the like. The terminal devices 501, 502, 503 may have various communication client applications installed thereon, such as a shopping application, a web browser application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.

The terminal devices 501, 502, 503 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.

The server 505 may be a server that provides various services, such as a background management server that supports shopping websites browsed by users using the terminal devices 501, 502, 503. The background management server can analyze and process the received data such as the product information inquiry request and feed back the processing result to the terminal equipment.

It should be noted that the data obtaining method provided by the embodiment of the present invention is generally executed by the server 505, and accordingly, the data obtaining apparatus is generally disposed in the server 505.

It should be understood that the number of terminal devices, networks, and servers in fig. 5 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring now to FIG. 6, a block diagram of a computer system 600 suitable for use with a terminal device implementing an embodiment of the invention is shown. The terminal device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.

As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU) 601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.

The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, and the like. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as necessary. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as necessary, so that a computer program read out therefrom is installed into the storage section 608 as necessary.

In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 601.

It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor comprising a crawler rule configuration module, a data file acquisition module, a data file analysis module and a data extraction module. In some cases the names of these modules do not limit the modules themselves; for example, the crawler rule configuration module may also be described as a "module for configuring a crawler rule through a configuration file".

As another aspect, the present invention also provides a computer-readable medium that may be contained in the device described in the above embodiments, or may exist separately without being incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the following: configuring a crawler rule through a configuration file; based on the configured crawler rule, acquiring a data file from one or more websites to be crawled according to resource locators corresponding to the websites to be crawled; analyzing the data file into a text file according to the type of the data file; and extracting data corresponding to one or more target words in the target word library from the text file according to a preset target word library.
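The four operations listed above can be illustrated with a minimal sketch. It is not the implementation of the invention: the configuration field names (`crawler_name`, `urls`), the use of JSON, the choice of the URL suffix as the file type, and the in-memory `FAKE_SITE` standing in for a real HTTP client are all assumptions made so that the example runs offline.

```python
import json
from html.parser import HTMLParser

# Hypothetical configuration file content; the field names are illustrative.
CONFIG_TEXT = """
{
  "crawler_name": "demo",
  "urls": ["https://example.com/report.html", "https://example.com/notes.txt"]
}
"""


class _TextCollector(HTMLParser):
    """Collects the non-empty text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def parse_to_text(url, raw):
    """Step 3: choose a parser from the file type (assumed here: URL suffix)."""
    if url.endswith((".html", ".htm")):
        collector = _TextCollector()
        collector.feed(raw)
        collector.close()
        return "\n".join(collector.parts)
    return raw  # anything else is treated as plain text already


def extract(text, target_words):
    """Step 4: for each target word, keep the lines that mention it."""
    return {
        word: [line for line in text.splitlines() if word in line]
        for word in target_words
    }


def run(config_text, fetch, target_words):
    rules = json.loads(config_text)  # step 1: rules come from the config file
    results = {}
    for url in rules["urls"]:
        raw = fetch(url)  # step 2: acquire the data file for this locator
        text = parse_to_text(url, raw)
        results[url] = extract(text, target_words)
    return results


# Stand-in for a real HTTP client so the sketch needs no network access.
FAKE_SITE = {
    "https://example.com/report.html":
        "<html><body><p>Revenue: 12 million</p><p>Other</p></body></html>",
    "https://example.com/notes.txt": "Profit: 3 million\nmisc",
}

results = run(CONFIG_TEXT, FAKE_SITE.__getitem__, ["Revenue", "Profit"])
```

Because the fetcher is passed in as a callable, adapting the sketch to another website would mean editing only the configuration text, which mirrors the flexibility the description attributes to configuration-driven crawler rules.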

According to the technical scheme of the embodiment of the invention, because the crawler rule is configured through a configuration file, the technical problem that a large amount of code must be modified to adapt to the data structure of each website when acquiring data from it is solved, and the technical effect of flexibly configuring the crawler rule according to the data structure of the website is achieved, which saves time and labor. In addition, after the data file is acquired from the website and parsed into a plain text file, the data is extracted from the text file by combining regular-expression matching rules with natural language processing technology, so that the extracted data is more comprehensive and more accurate.
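The regular-expression part of the extraction can be sketched as follows, using the keyword and crawling length that the claims associate with each crawling element. The rule dictionary layout, the function name `extract_by_rules`, and the sample text are hypothetical, and the natural language processing step is not shown.

```python
import re


def extract_by_rules(text, rules):
    """For each rule, capture up to `length` characters following the keyword."""
    found = {}
    for rule in rules:
        # re.escape keeps keyword characters from being read as regex syntax;
        # an optional ASCII or full-width colon may follow the keyword.
        pattern = re.compile(
            re.escape(rule["keyword"]) + r"\s*[:：]?\s*(.{1,%d})" % rule["length"]
        )
        found[rule["element"]] = [
            match.group(1).strip() for match in pattern.finditer(text)
        ]
    return found


# Illustrative rule: element name, keyword to locate, and crawling length.
rules = [
    {"element": "registered_capital", "keyword": "Registered capital", "length": 10},
]
text = "Company A. Registered capital: 5 million yuan. Founded 2010."

result = extract_by_rules(text, rules)
print(result)
```

A fixed-length window is a blunt instrument: it can cut a value short or run past its end, which is one reason a real system would likely pair it with further processing of the captured text.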

The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
