Information crawling system and method for specified experts

Document No.: 1520988 Publication date: 2020-02-11

Reading note: this technology, "Information crawling system and method for specified experts", was designed by 智强霍东云李燕茜张永锋李振华 on 2019-10-22. Main content: The invention relates to the field of Internet technology, and in particular to an information crawling system and method for specified experts, comprising a user interaction module, a candidate list crawling module, an expert information crawling module, a data cleaning module, a data structuring module and an expert portrait module. In existing web crawler technology, expert data on the network is crawled in bulk and then stored in a database and the cloud; this occupies considerable storage space and server resources and forces a trade-off between the comprehensiveness of the data and the maintenance cost. The invention addresses this problem: it accurately returns the information of the expert specified by the user without the user having to search and confirm in a search engine, which saves the user's time; it integrates expert data from multiple data sources, which avoids the case where the corresponding expert cannot be found in a single data source; and, compared with storing large volumes of expert data in a database and the cloud, it saves considerable database space and maintenance cost.

1. An information crawling system and method for specified experts, characterized by comprising a user interaction module (1), a candidate list crawling module (2), an expert information crawling module (3), a data cleaning module (4), a data structuring module (5) and an expert portrait module (6).

2. The information crawling system and method for specified experts according to claim 1, characterized by comprising the following steps:

S1, the user inputs the name of the expert to be searched and may optionally add the institution where the expert works to improve the specificity of the data; the candidate list crawling module (2) reads the user's input, crawls candidate expert lists from academic expert websites, and returns a certain number of candidates to the user interaction module (1); finally, the user confirms the specified expert;

S2, the expert information crawling module (3) receives the expert information specified by the user as input, crawls search engines and academic search websites using related keywords, and passes the crawled data stream directly to the data cleaning module (4) and the data structuring module (5);

S3, the data cleaning module (4) receives the data stream from the expert information crawling module (3) and performs noise reduction and cleaning by means of regular expressions, the strip function, the replace function and the like; the data structuring module (5) parses the data stream, using a parsing method formulated for each source website and page type, and stores the parsed data in a database;

S4, after receiving the structured data, the expert portrait module (6) fills in the expert portrait according to a canvas template prepared in advance and returns the filled portrait to the user interaction module (1) for the user to confirm; the process ends after confirmation.

3. The information crawling system and method for specified experts according to claim 2, wherein: the academic expert websites in S1 include Google Scholar, Baidu Scholar, ResearchGate and the like.

4. The information crawling system and method for specified experts according to claim 1, wherein: the crawling in S2 uses Scrapy, an open-source framework for Python, together with a proxy pool and a User-Agent pool.

5. The information crawling system and method for specified experts according to claim 2, wherein: the cleaning in S3 includes removing meaningless escape characters, markup symbols, blank lines, carriage returns and similar data.

6. The information crawling system and method for specified experts according to claim 1, wherein: the data parsed in S3 includes: expert name, expert institution, source website, expert ID, research fields, papers, total citations, number of research outputs, h-index, g-index, etc.

Technical Field

The invention relates to the field of Internet technology, and in particular to an information crawling system and method for specified experts.

Background

The Internet is a vast network formed by interconnecting individual networks. These networks are linked by a set of common protocols into a logically single, global network that contains network devices such as switches and routers, connection links of many kinds, servers, and countless computers and terminals. The Internet can deliver information instantly to people thousands of miles away and is a foundation of the information society. With the rise and development of the Internet, the amount of information on the network keeps growing, and in academia the volume of data about experts and scholars is becoming ever larger. People mainly look up expert data through search engines, but it is difficult to find all the information about an expert of interest directly through a general-purpose search engine, and some academic search engines also suffer from incomplete expert libraries. In existing web crawler technology, the common approach is to crawl expert data on the network in bulk and store it in a database and the cloud; however, this approach occupies considerable storage space and server resources and requires a trade-off between the comprehensiveness of the data and the maintenance cost.

In summary, the present invention provides an information crawling system and method for a specific expert to solve the existing problems.

Disclosure of Invention

The invention aims to provide an information crawling system and method for a specified expert to solve the problems in the background technology.

In order to achieve the purpose, the invention provides the following technical scheme:

An information crawling system and method for a specified expert comprises a user interaction module, a candidate list crawling module, an expert information crawling module, a data cleaning module, a data structuring module and an expert portrait module.

Preferably, the method comprises the following steps:

S1, the user inputs the name of the expert to be searched and may optionally add the institution where the expert works to improve the specificity of the data; the candidate list crawling module reads the user's input, crawls candidate expert lists from academic expert websites, and returns a certain number of candidates to the user interaction module; finally, the user confirms the specified expert;

S2, the expert information crawling module receives the expert information specified by the user as input, crawls search engines and academic search websites using related keywords, and passes the crawled data stream directly to the data cleaning module and the data structuring module;

S3, the data cleaning module receives the data stream from the expert information crawling module and performs noise reduction and cleaning by means of regular expressions, the strip function, the replace function and the like; the data structuring module parses the data stream, using a parsing method formulated for each source website and page type, and stores the parsed data in a database;

S4, after receiving the structured data, the expert portrait module fills in the expert portrait according to a canvas template prepared in advance and returns the filled portrait to the user interaction module for the user to confirm; the process ends after confirmation.
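As a rough illustration of step S1, the sketch below merges candidate lists returned by several academic sites, removes duplicates, and places candidates whose institution matches the optional user input first. The `Candidate` class, the `rank_candidates` function and the `limit` parameter are hypothetical; the source does not say how candidates are matched, ranked or how many are returned.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Candidate:
    """One candidate expert as listed by a source site (assumed shape)."""
    name: str
    institution: str
    source: str  # e.g. "Google Scholar"


def _norm(text: str) -> str:
    # Lower-case and collapse whitespace so comparisons are tolerant.
    return " ".join(text.lower().split())


def rank_candidates(
    candidates: Iterable[Candidate],
    name: str,
    institution: Optional[str] = None,
    limit: int = 10,
) -> List[Candidate]:
    """Keep candidates whose name matches, drop duplicates, and put
    institution matches first. `limit` is an assumed cap on list size."""
    seen = set()
    matches = []
    for cand in candidates:
        if _norm(cand.name) != _norm(name):
            continue
        key = (_norm(cand.name), _norm(cand.institution), cand.source)
        if key in seen:
            continue
        seen.add(key)
        matches.append(cand)
    if institution:
        wanted = _norm(institution)
        # sorted() is stable, so the crawl order is kept within each group.
        matches = sorted(matches, key=lambda c: wanted not in _norm(c.institution))
    return matches[:limit]
```

The returned list is what the user interaction module would display so that the user can confirm the intended expert.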

Preferably, the academic expert websites in S1 include Google Scholar, Baidu Scholar, ResearchGate and the like.

Preferably, the crawling in S2 uses Scrapy, an open-source framework for Python, together with a proxy pool and a User-Agent pool.
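Assuming the Python framework referred to above is Scrapy, a proxy pool and a User-Agent pool could be applied through a downloader middleware along the following lines. This is only a sketch: the class name, the pool contents and the example hosts are placeholders that do not come from the source, and the class is written without importing Scrapy so that it can be read and tried in isolation.

```python
import random
from typing import Optional, Sequence

# Placeholder pools; real values would come from configuration or a proxy service.
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
]
PROXY_POOL = [
    "http://proxy-a.example.com:8080",
    "http://proxy-b.example.com:8080",
]


class RotatingIdentityMiddleware:
    """Scrapy-style downloader middleware (hypothetical name).

    Scrapy calls process_request() for each outgoing request; returning
    None lets the request continue through the remaining middlewares.
    """

    def __init__(self, user_agents: Sequence[str] = USER_AGENT_POOL,
                 proxies: Sequence[str] = PROXY_POOL,
                 rng: Optional[random.Random] = None):
        self.user_agents = list(user_agents)
        self.proxies = list(proxies)
        self.rng = rng or random.Random()

    def process_request(self, request, spider=None):
        # Pick a fresh User-Agent and proxy for every request.
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        # Scrapy's built-in HttpProxyMiddleware reads the proxy from request.meta.
        request.meta["proxy"] = self.rng.choice(self.proxies)
        return None
```

In a Scrapy project, a class like this would be enabled through the `DOWNLOADER_MIDDLEWARES` setting, with the module path depending on the project layout.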

Preferably, the cleaning in S3 includes removing meaningless escape characters, markup symbols, empty lines, carriage returns and the like.
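The cleaning described above might look roughly like the following, which combines regular expressions with Python's `str.strip` and `str.replace`. The function name `clean_text` and the exact rules are assumptions made for illustration; in particular, the tag pattern is deliberately crude and would also remove legitimate text between `<` and `>`.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")      # markup tags such as <p> or </div>
_ESCAPE_RE = re.compile(r"\\[nrt]")   # literal "\n", "\r", "\t" sequences left in scraped text
_SPACES_RE = re.compile(r"[ \t]+")    # runs of spaces and tabs


def clean_text(raw: str) -> str:
    """Remove markup, stray escape sequences, carriage returns and empty lines."""
    text = html.unescape(raw)                               # &amp; -> &, &nbsp; -> \xa0
    text = text.replace("\xa0", " ")                        # non-breaking space -> space
    text = text.replace("\r\n", "\n").replace("\r", "\n")   # normalise carriage returns
    text = _ESCAPE_RE.sub(" ", text)
    text = _TAG_RE.sub("", text)
    # Collapse inner whitespace, strip each line, and drop lines that end up empty.
    lines = (_SPACES_RE.sub(" ", line).strip() for line in text.split("\n"))
    return "\n".join(line for line in lines if line)
```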

Preferably, the data parsed in S3 includes: expert name, expert institution, source website, expert ID, research fields, papers, total citations, number of research outputs, h-index, g-index, etc.
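One possible way to hold the parsed fields, and to select a parsing method per source website and page type, is sketched below. The `ExpertRecord` field names, the registry, and the `example-scholar` source with its input keys are all hypothetical; for brevity the parser receives a dictionary of already extracted values, whereas a real parser would work on the crawled page itself.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class ExpertRecord:
    """Structured expert data; field names are illustrative."""
    name: str
    institution: str
    source_site: str
    expert_id: str
    research_fields: List[str] = field(default_factory=list)
    papers: List[str] = field(default_factory=list)
    total_citations: Optional[int] = None
    output_count: Optional[int] = None
    h_index: Optional[int] = None
    g_index: Optional[int] = None


Parser = Callable[[dict], ExpertRecord]
_PARSERS: Dict[Tuple[str, str], Parser] = {}


def register_parser(site: str, page_type: str):
    """Decorator that registers a parser for one (site, page type) pair."""
    def decorator(func: Parser) -> Parser:
        _PARSERS[(site, page_type)] = func
        return func
    return decorator


@register_parser("example-scholar", "profile")
def parse_example_profile(page: dict) -> ExpertRecord:
    # Keys such as "author" and "cited_by" belong to the made-up source.
    return ExpertRecord(
        name=page["author"],
        institution=page["affiliation"],
        source_site="example-scholar",
        expert_id=page["id"],
        research_fields=list(page.get("interests", [])),
        total_citations=page.get("cited_by"),
        h_index=page.get("hindex"),
    )


def structure(site: str, page_type: str, page: dict) -> ExpertRecord:
    """Dispatch to the parser registered for this site and page type."""
    try:
        parser = _PARSERS[(site, page_type)]
    except KeyError:
        raise ValueError(f"no parser registered for {site!r}/{page_type!r}") from None
    return parser(page)
```

Keeping one small parser per source and page type means a layout change on one website only affects its own parser.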

Compared with the prior art, the invention has the beneficial effects that:

1. The invention addresses the problem that, in existing web crawler technology, expert data on the network is usually crawled in bulk and then stored in a database and the cloud, which occupies considerable storage space and server resources and forces a trade-off between the comprehensiveness of the data and the maintenance cost. The invention accurately returns the information of the expert specified by the user without requiring the user to search and confirm in a search engine, which saves the user's time.

2. By integrating expert data from multiple data sources, the invention avoids the problem that the corresponding expert cannot be found in a single data source.

3. Compared with the approach of storing large volumes of expert data in a database and the cloud, the invention saves considerable database space and maintenance cost.

Drawings

FIG. 1 is a block diagram of the system of the present invention;

FIG. 2 is a schematic diagram of the system flow structure of the present invention.

In the figures: 1 - user interaction module; 2 - candidate list crawling module; 3 - expert information crawling module; 4 - data cleaning module; 5 - data structuring module; 6 - expert portrait module.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them; all other embodiments obtained by a person of ordinary skill in the art without creative effort on the basis of these embodiments fall within the protection scope of the present invention.

Referring to fig. 1-2, the present invention provides a technical solution:

An information crawling system and method for a specified expert comprises a user interaction module 1, a candidate list crawling module 2, an expert information crawling module 3, a data cleaning module 4, a data structuring module 5 and an expert portrait module 6. The user interaction module 1 processes the user's input and displays the results of background processing to the user. The candidate list crawling module 2 crawls experts that match the expert information entered by the user from different online expert data websites and returns them to the user interaction module 1 for further selection. The expert information crawling module 3 crawls data from multiple academic expert websites according to the specified expert selected by the user. The data cleaning module 4 performs noise reduction on the data acquired by the expert information crawling module 3. The data structuring module 5 parses the web pages obtained by the expert information crawling module 3 into formatted data, applying a different structuring method to each data source website. The expert portrait module 6 fills the different types of structured data into the canvas of a portrait, finally forming the portrait of the expert, and returns the result to the user.
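To make the division of responsibilities concrete, the six modules can be pictured as interchangeable functions wired together in a fixed order. The sketch below is an assumption about how such wiring could look; the `ExpertPipeline` class, its attribute names and the simple data types are not taken from the source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

CandidateInfo = Dict[str, str]


@dataclass
class ExpertPipeline:
    """Wires the six modules together; each attribute stands in for one module."""
    crawl_candidates: Callable[[str, Optional[str]], List[CandidateInfo]]  # module 2
    choose: Callable[[List[CandidateInfo]], CandidateInfo]                 # module 1 (user confirms)
    crawl_expert: Callable[[CandidateInfo], List[str]]                     # module 3
    clean: Callable[[str], str]                                            # module 4
    structure: Callable[[List[str]], Dict[str, object]]                    # module 5
    build_portrait: Callable[[Dict[str, object]], str]                     # module 6

    def run(self, name: str, institution: Optional[str] = None) -> str:
        candidates = self.crawl_candidates(name, institution)
        expert = self.choose(candidates)          # user picks the intended expert
        pages = self.crawl_expert(expert)         # raw pages from several sources
        cleaned = [self.clean(page) for page in pages]
        data = self.structure(cleaned)
        return self.build_portrait(data)          # shown to the user for confirmation
```

Because each module is passed in as a plain callable, a crawler for an additional data source or a different portrait layout can be swapped in without touching the rest of the flow.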

The working process of the invention is as follows:

Step 1, the user inputs the name of the expert to be searched and may optionally add the institution where the expert works to improve the specificity of the data; the candidate list crawling module 2 reads the user's input, crawls candidate expert lists from websites such as Google Scholar, Baidu Scholar and ResearchGate, and returns a certain number of candidates to the user interaction module 1; finally, the user confirms the specified expert;

Step 2, the expert information crawling module 3 receives the expert information specified by the user as input and crawls search engines and academic search websites using related keywords; the crawling uses Scrapy, an open-source framework for Python, together with a proxy pool and a User-Agent pool, and the crawled data stream is passed directly to the data cleaning module 4 and the data structuring module 5;

Step 3, the data cleaning module 4 receives the data stream from the expert information crawling module 3 and performs noise reduction and cleaning by means of regular expressions, the strip function, the replace function and the like, mainly removing meaningless escape characters, markup symbols, empty lines, carriage returns and similar data; the data structuring module 5 parses the data stream, using a parsing method formulated for each source website and page type, and stores the parsed data in a database; the parsed data includes: expert name, expert institution, source website, expert ID, research fields, papers, total citations, number of research outputs, h-index, g-index, and the like;

Step 4, after receiving the structured data, the expert portrait module 6 fills in the expert portrait according to a canvas template prepared in advance and returns the filled portrait to the user interaction module 1 for confirmation by the user; the process ends after confirmation. This process addresses the problem that, in existing web crawler technology, expert data on the network is crawled in bulk and then stored in a database and the cloud, which occupies considerable storage space and server resources and forces a trade-off between the comprehensiveness of the data and the maintenance cost. The expert information specified by the user can be returned accurately without searching and confirming in a search engine, which saves the user's time. Expert data from multiple data sources can be integrated, which avoids the problem that the corresponding expert cannot be found in a single data source. Compared with storing large volumes of expert data in a database and the cloud, considerable database space and maintenance cost can be saved.
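The template filling in step 4 can be sketched with Python's standard `string.Template`. The source does not describe the format of the canvas template, so the plain-text layout, the field names and the `fill_portrait` helper below are assumptions chosen only to show the idea of filling fixed slots with structured data.

```python
from string import Template
from typing import Mapping

# Hypothetical plain-text "canvas"; a real one might be HTML or an image layout.
PORTRAIT_TEMPLATE = Template(
    "Name: $name\n"
    "Institution: $institution\n"
    "Research fields: $research_fields\n"
    "Total citations: $total_citations\n"
    "h-index: $h_index | g-index: $g_index"
)

_FIELDS = ("name", "institution", "research_fields",
           "total_citations", "h_index", "g_index")


def fill_portrait(data: Mapping[str, object], missing: str = "n/a") -> str:
    """Fill every slot of the template, marking absent values as `missing`."""
    values = {}
    for key in _FIELDS:
        value = data.get(key)
        if value is None or (isinstance(value, (list, tuple)) and not value):
            values[key] = missing
        elif isinstance(value, (list, tuple)):
            values[key] = ", ".join(str(v) for v in value)
        else:
            values[key] = str(value)
    return PORTRAIT_TEMPLATE.substitute(values)
```

Filling missing slots with a visible marker rather than failing lets the user see which sources did not supply a value when confirming the portrait.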

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
