Video acquisition method and system based on browser

Document No.: 1395823    Publication date: 2020-02-28

Abstract: "Video acquisition method and system based on browser" (一种基于浏览器的视频获取方法及系统) was devised by 夏光升 and 孙涛 on 2019-10-31. The invention belongs to the technical field of the Internet, and particularly relates to a browser-based video acquisition method and system. The video acquisition method comprises: obtaining non-video network address data, and downloading the video data in network pages according to the non-video network address data; and de-duplicating the downloaded video data and storing it. Videos in both static pages and dynamic pages can thereby be downloaded.

1. A video acquisition method based on a browser is characterized by comprising the following steps:

obtaining non-video network address data;

capturing video network address data in a network page by using a browser automation testing method according to the non-video network address data;

and downloading video data and storing the video data according to the video network address data.

2. The video acquisition method according to claim 1, wherein said downloading video data comprises:

carrying out IO stream processing on the video network address data;

and downloading the video data in the webpage through the IO stream.

3. The video acquisition method according to claim 1,

capturing by using the browser automation testing method comprises the following steps:

calling ChromeOptions to set a rendering mode;

and calling ChromeDriver to load a browser and acquiring a video network address.

4. The video acquisition method of claim 3, wherein said capturing the video network address data in the network page further comprises:

and capturing video network address data in the network page, and capturing video network address data in a link network page of the network page.

5. The video acquisition method according to claim 1, wherein said downloading video data further comprises:

and performing video de-duplication operation on the downloaded video.

6. The video acquisition method according to claim 1,

if capturing the video network address data in the network page fails, capturing is automatically retried;

capturing is stopped if the number of automatic re-captures exceeds a re-crawl threshold;

and storing the network address data corresponding to the network page into a database server, and waiting to be called again for execution.

7. The video acquisition method according to claim 1,

if the downloading of the video data fails, automatically downloading again;

stopping downloading if the number of automatic re-downloads exceeds a re-download threshold;

and storing the video network address data corresponding to the video data into a database server, and waiting to be called again for execution.

8. The video acquisition method according to claim 1,

capturing the video network address data in the page and downloading the video data are both performed automatically.

9. The video acquisition method according to claim 1, wherein the video acquisition method further comprises:

automatically logging in a website;

capturing video network address data in a network page after logging in;

and downloading the video data after logging in.

10. A browser-based video capture system, the video capture system comprising:

the task generating unit is used for generating a video capturing task according to the obtained non-video network address data;

the video capturing unit is used for executing the video capturing task and capturing a video network address in a webpage;

the video downloading unit is used for downloading the video data in the webpage according to the video network address;

and the database server is used for storing the video data which is subjected to the de-duplication processing, storing the network address data corresponding to the uncompleted webpage grabbing task, and storing the video network address data corresponding to the video data downloading failure.

11. The video acquisition system of claim 10, further comprising:

the video duplication removing unit is used for carrying out duplication removing processing on the downloaded video data;

the network address data sorting unit is used for sorting the network address data and the video network address data;

and the login unit is used for automatically logging in the website.

Technical Field

The invention belongs to the technical field of internet, and particularly relates to a video acquisition method and system based on a browser.

Background

With the rapid development of computer, network and multimedia technology in recent years, the ways in which media is acquired, transmitted and published have changed greatly, and cloud computing and multimedia are widely applied. More and more users can conveniently communicate by network video and subscribe to and play multimedia data on the network. These applications make security monitoring of network multimedia necessary. Multimedia security covers many aspects, including politics, economy, technology, culture and daily life. Because multimedia content information security is dominant, other forms of security largely derive from it. However, because multimedia network nodes are dispersed and change dynamically, they are difficult to manage and poorly secured; detecting and analyzing videos that violate regulations is therefore significant.

Before videos on the Internet can be detected, the video data must first be acquired. The Internet holds a huge amount of network address data, including a large amount of video network address data; the corresponding video data is large in volume and varied in type, and different websites are implemented with different technologies. Some different video network addresses also correspond to the same video data, and most websites already employ anti-crawler systems. Automatically downloading Internet video data in large batches is therefore difficult.

Disclosure of Invention

In order to solve the above problem, the present invention provides a video acquisition method based on a browser, where the video acquisition method includes:

obtaining non-video network address data;

capturing video network address data in a network page by using a browser automation testing method according to the non-video network address data;

and downloading video data and storing the video data according to the video network address data.

Further, the downloading the video data includes:

carrying out IO stream processing on the video network address data;

and downloading the video data in the webpage through the IO stream.

Further, the grabbing by using the browser automation testing method comprises the following steps:

calling ChromeOptions to set a rendering mode;

and calling ChromeDriver to load a browser and acquiring a video network address.

Further, the capturing the video network address data in the network page further includes:

and capturing video network address data in the network page, and capturing video network address data in a link network page of the network page.

Further, downloading the video data further includes performing a video de-duplication operation on the downloaded video.

Further, if capturing the video network address data in the network page fails, capturing is automatically retried;

capturing is stopped if the number of automatic re-captures exceeds a re-crawl threshold;

and storing the network address data corresponding to the network page into a database server, and waiting to be called again for execution.

Further, if the downloading of the video data fails, the video data is automatically downloaded again;

stopping downloading if the number of automatic re-downloads exceeds a re-download threshold;

and storing the video network address data corresponding to the video data into a database server, and waiting to be called again for execution.

Further, capturing the video network address data in the page and downloading the video data are both performed automatically.

Further, the video acquisition method further includes:

automatically logging in a website;

capturing video network address data in a network page after logging in;

and downloading the video data after logging in.

The invention also provides a video acquisition system based on the browser, which is characterized by comprising the following components:

the task generating unit is used for generating a video capturing task according to the obtained non-video network address data;

the video capturing unit is used for executing the video capturing task and capturing a video network address in a webpage;

the video downloading unit downloads video data in a webpage according to the video network address;

and the database server is used for storing the video data which is subjected to the de-duplication processing, storing the network address data corresponding to the uncompleted webpage grabbing task, and storing the video network address data corresponding to the video data downloading failure.

Further, the video acquisition system further comprises:

the video duplication removing unit is used for carrying out duplication removing processing on the downloaded video data;

the network address data sorting unit is used for sorting the network address data and the video network address data;

and the login unit is used for automatically logging in the website.

The invention can download massive amounts of video data, with strong processing capability and a reliable system. The system can adaptively log in automatically to download as actually required, and can download videos in both static pages and dynamic pages.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

Fig. 1 is a flowchart illustrating a video acquisition method based on a browser according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram illustrating a video acquisition system based on a browser according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention discloses a video acquisition method based on a browser, which comprises the following specific steps as shown in figure 1:

Step one: obtaining non-video network address data.

Specifically, the steps of obtaining the non-video network address data are as follows:

Step 1.1, obtaining network address data.

Illustratively, there are various methods of obtaining network address data. The following methods may be used, without limitation.

Method one: the access logs of an Internet Data Center (IDC) automatically record the Uniform Resource Locators (URLs) that public users browse and access; this URL data is network address data. The network address data can be extracted by obtaining the log files from the Internet data center.

Method two: the Internet is a global computer network and a collection of global information resources. The World Wide Web (WWW) is a global information resource network on the Internet that integrates multimedia information such as text, sound, images and video, and is an important component of the Internet. The World Wide Web uses HyperText Markup Language (HTML), and text written in HTML is hypertext. Hypertext contains not only text but also multimedia information such as graphics, sound, images and video, as well as links to other hypertext, called hyperlinks. All hyperlinks in a page on the Internet are collected by accessing that page. New pages are then opened according to the collected hyperlinks, and the hyperlinks in those pages are collected. Collection repeats in this way until the number of collected hyperlinks meets the requirement. The set of collected hyperlinks is the network address data.

Method three: the Domain Name System (DNS) records a large amount of website information. Address information is collected from the domain name system, and all hyperlinks of the website corresponding to each address are captured. The set of website information is network address data; the set of collected hyperlinks is more detailed network address data.

Method four: the national Internet security management service platform records a large amount of website information. Website information is collected from the platform, and hyperlinks are captured from the website corresponding to each address. The set of website information is network address data; the set of collected hyperlinks is more detailed network address data.
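The loop described in method two (collect the hyperlinks of a page, open them, and repeat until enough have been gathered) can be sketched as a breadth-first traversal. This is a minimal illustration, not the patented implementation: the `fetch` callback, the `limit` parameter and the in-memory `PAGES` table are assumptions introduced so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def collect_hyperlinks(start_url, fetch, limit):
    """Breadth-first collection of hyperlinks until `limit` are gathered.

    `fetch` is a caller-supplied function mapping a URL to its HTML text
    (hypothetical; it could wrap urllib or a browser driver).
    """
    ordered = [start_url]       # addresses in discovery order
    seen = {start_url}          # fast membership test
    queue = deque([start_url])
    while queue and len(ordered) < limit:
        page_url = queue.popleft()
        parser = LinkParser()
        parser.feed(fetch(page_url))
        for href in parser.links:
            absolute = urljoin(page_url, href)
            if absolute not in seen:
                seen.add(absolute)
                ordered.append(absolute)
                queue.append(absolute)
                if len(ordered) >= limit:
                    break
    return ordered


# Tiny in-memory "web" (made-up addresses) standing in for real pages.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/clip.mp4">clip</a>',
    "http://example.test/b": '<a href="/">home</a>',
}
links = collect_hyperlinks("http://example.test/", lambda u: PAGES.get(u, ""), limit=10)
```

The `limit` argument plays the role of "the number of collected hyperlinks reaches the requirement"; a production crawler would also need politeness delays and error handling, which are omitted here.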

Step 1.2, finding and recording the video network address data in the network address data; the remaining address data is non-video network address data.

Specifically, the suffix of video network address data is a video file type, for example .mp4, .avi or .rm. The video network address data is found in the network address data according to the suffix of each address.

Illustratively, the video network address data may be found in, but is not limited to, the following manner. Suffixes of video network address data such as .mp4, .avi and .rm are collected in advance and stored in a dictionary table. The suffix of each network address is compared with the entries in the dictionary table. When the suffix is in the dictionary table, the network address is considered to be video network address data.

For example, a certain page of a certain website contains the following network address: https://p3.img.cctvpic.com/uploadimg/mp4/2019/10/15/PtTfw8OhZsarGy5xyXo191015.mp4. The suffix of this network address is .mp4. The suffix is compared with the entries in the dictionary table; because it exists in the table, the network address is considered a video network address, and the corresponding video file type is mp4.
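The dictionary-table lookup can be sketched in a few lines. This is an illustrative sketch only: the table holds just the three suffixes named in the text, and the function name `video_suffix` is an assumption.

```python
import posixpath
from urllib.parse import urlparse

# Dictionary table of video suffixes. The text names .mp4, .avi and .rm;
# any further entries would be assumptions.
VIDEO_SUFFIXES = {".mp4", ".avi", ".rm"}


def video_suffix(url):
    """Return the video suffix of `url` if it is in the table, else None."""
    path = urlparse(url).path                      # ignore query string and fragment
    suffix = posixpath.splitext(path)[1].lower()   # ".mp4", ".html", or ""
    return suffix if suffix in VIDEO_SUFFIXES else None


url = "https://p3.img.cctvpic.com/uploadimg/mp4/2019/10/15/PtTfw8OhZsarGy5xyXo191015.mp4"
print(video_suffix(url))                                # .mp4
print(video_suffix("https://example.test/index.html"))  # None
```

Parsing the URL first means that an address such as `clip.mp4?token=1` is still recognized by its path suffix; whether the original system handles such addresses is not stated in the text.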

Step 1.3, sorting the video network address data.

Data that does not conform to the correct format of a video network address is deleted from the video network address data.

The video network address data is sorted, and duplicate video network address data is deleted.

Illustratively, video network addresses use half-width English characters by default. When a full-width character appears in an address, the address can be considered not to conform to the correct format of a video network address, and it is deleted.

Illustratively, video network addresses do not contain certain characters, such as "@". When the "@" symbol appears in an address, the address can be considered not to conform to the correct format of a video network address, and it is deleted.

Specifically, the sorted video network address data is the first video network address data. The first video network address data is deleted from the network address data, and what remains is the non-video network address data.
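Step 1.3 can be sketched as one pass that validates, de-duplicates and partitions the addresses. The sketch makes two assumptions that the text leaves open: full-width characters are detected through their Unicode East Asian Width class, and a malformed video address is simply dropped rather than kept in the non-video list. The function names and the sample addresses are illustrative.

```python
import unicodedata


def is_well_formed(url):
    """Apply the two example rules: no full-width characters and no '@'."""
    if "@" in url:
        return False
    # "F" (fullwidth) and "W" (wide) cover full-width letters and CJK text.
    return all(unicodedata.east_asian_width(ch) not in ("F", "W") for ch in url)


def sort_addresses(urls, is_video):
    """Split `urls` into (first_video_addresses, non_video_addresses).

    `is_video` is a caller-supplied predicate, for example a suffix check.
    """
    first_video, non_video, seen = [], [], set()
    for url in urls:
        if not is_video(url):
            non_video.append(url)
        elif is_well_formed(url) and url not in seen:
            seen.add(url)
            first_video.append(url)
    return first_video, non_video


addresses = [
    "http://a.test/x.mp4",
    "http://a.test/x.mp4",        # duplicate
    "http://a.test/page",         # not a video address
    "http://user@a.test/y.mp4",   # contains '@'
    "http://a.test/\uff5a.mp4",   # contains a full-width letter
]
first_video, non_video = sort_addresses(
    addresses, lambda u: u.lower().endswith((".mp4", ".avi", ".rm"))
)
```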

Step two: capturing video network address data in network pages by using a browser automation testing method according to the non-video network address data.

The non-video network address data is classified by using the Kafka stream processing platform and the Storm distributed real-time computing system to generate webpage grabbing tasks.

Specifically, a producer (the Kafka data source) writes the data into a Topic of the Kafka stream processing platform; the platform distributes the Topic's message stream to each node, where it flows into the Storm cluster and waits to be consumed. Illustratively, the Kafka stream processing platform distributes the non-video network address data to the node servers in the cluster. Each node server acquires the non-video network address data through a Spout of the Storm distributed real-time computing system, and the Spout distributes the data to different Bolts. A Bolt processes the network address data, stores video network address data through an IO stream, and generates webpage grabbing tasks according to the non-video network address data.

Specifically, the Kafka stream processing platform is an open-source distributed stream processing platform and a fault-tolerant messaging system based on publish-subscribe. It provides three main functions:

(1) Publishing and subscribing to message streams, similar to a message queue.

(2) Recording message streams in a fault-tolerant manner; specifically, the Kafka stream processing platform stores message streams as files.

(3) Processing messages as they are published.

Specifically, the Storm distributed real-time computing system is an open-source system that can process streaming data in real time. Its core components are of two types: Spout and Bolt.

A Spout is the source of a stream in the Storm distributed real-time computing system and is the message producer of the topology. Typically, a Spout reads tuple data (for example, webpage grabbing tasks) from an external data source (for example, the message queue output by the Kafka stream processing platform) and feeds it into the topology. All the computation logic in the topology is implemented in Bolts. A Bolt can process any number of input streams and produce any number of output streams (IO streams). A Bolt can perform function processing, filtering, stream merging, aggregation, storage to a database and so on. The Storm distributed real-time computing system can contain a plurality of Spouts and a plurality of Bolts, and information can be passed across them.

Specifically, the Kafka stream processing platform distributes and processes network address data, and the Storm distributed real-time computing system downloads video data.
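The division of labor between a Spout and its Bolts can be illustrated with plain Python generators. This is only an analogy for the data flow: it does not use the Kafka or Storm APIs, and the function names, the tuple layout and the task dictionary are assumptions made for the illustration.

```python
from collections import defaultdict

VIDEO_SUFFIXES = (".mp4", ".avi", ".rm")


def spout(messages):
    """Stands in for a Spout: emits one tuple per incoming address."""
    for address in messages:
        yield (address,)


def classify_bolt(tuples):
    """Stands in for a Bolt: tags each address as a video or a page."""
    for (address,) in tuples:
        kind = "video" if address.lower().endswith(VIDEO_SUFFIXES) else "page"
        yield (kind, address)


def task_bolt(tagged):
    """Second Bolt: stores video addresses and builds grabbing tasks."""
    out = defaultdict(list)
    for kind, address in tagged:
        if kind == "video":
            out["video_addresses"].append(address)
        else:
            out["grab_tasks"].append({"url": address, "attempts": 0})
    return dict(out)


# A plain list stands in for the message stream of a Kafka topic.
topic = ["http://a.test/news", "http://a.test/clip.mp4"]
result = task_bolt(classify_bolt(spout(topic)))
```

In a real deployment each stage would run as a distributed component on the cluster; the chained generators only show which stage is responsible for which step.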

By way of example, and not limitation, video network address data in a network page may be captured in the following manner.

Capturing by using a browser automation testing method.

Specifically, the browser automation test framework (Selenium) is a tool for testing Web applications. The browser automation testing method is a method that uses Selenium to run directly in a browser, and it can imitate the way a user operates the browser. Browsers supported by Selenium include IE, Mozilla Firefox, Safari, Google Chrome, Opera and the like.

Illustratively, the browser automation testing method uses, but is not limited to, the Selenium + Google Chrome technology.

The method uses the webpage information capturing technology of Selenium together with the Google Chrome browser to acquire the video network address data in a page.

Specifically, most websites use anti-crawler technology to prevent crawlers from crawling their pages.

Illustratively, when a user accesses a website with a browser, the User-Agent field of the user's request carries information about the browser being used. If the website finds no browser information in the User-Agent field, it may conclude that the user is using crawler technology and deny access.

Specifically, the browser automation test framework (Selenium) can drive the browser to execute requests and operations; the signals for these requests and operations originate not from a mouse or keyboard but from the Application Program Interface (API) of Selenium. By driving the browser, the framework can perform all the operations that a human user performs on a network page through computer input devices, including typing, clicking, sliding and verification. The Selenium + Google Chrome approach simulates human operation of the website well, so that the website treats the visit as coming from a person rather than a crawler and does not deny access.

Illustratively, the browser automation test framework (Selenium) runs on the Google Chrome browser. According to a webpage grabbing task, the framework opens the corresponding network page in the browser. If video data exists on the network page, the video network address data corresponding to the page is acquired. If no video data exists on the network page, the framework automatically navigates the browser to the pages linked from that page. If video data exists on a linked page, the video network address data corresponding to the linked page is acquired. If the linked page has no video data, the framework automatically navigates the browser to the pages linked from it. By repeating this cycle, the webpage information capturing technology of Selenium + Google Chrome acquires the video network address data in the pages.

Video network address data is acquired using two objects, ChromeOptions and ChromeDriver. Selenium calls ChromeOptions to set the rendering mode, such as JS and CSS; it then calls ChromeDriver to load the browser and uses the getPageSource function of ChromeDriver to acquire the video network address.

The acquired video network address data is sorted, data that does not conform to the correct format of a network address is deleted, and the remaining video network address data is the second video network address data.

The first video network address data and the second video network address data are de-duplicated together, and the result is the third video network address data.
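The ChromeOptions and ChromeDriver steps can be sketched with Selenium's Python bindings, where `driver.page_source` is the counterpart of `getPageSource`. This is a hedged sketch rather than the patented implementation: headless mode, the regular expression and both function names are assumptions, and `fetch_rendered_source` needs the `selenium` package and a local Chrome installation, so only the extraction step is exercised on a sample string.

```python
import re
from urllib.parse import urljoin

# Matches src/href attributes whose value ends in a known video suffix.
VIDEO_URL = re.compile(
    r"""(?:src|href)\s*=\s*["']([^"']+\.(?:mp4|avi|rm))["']""", re.IGNORECASE
)


def extract_video_urls(page_source, base_url):
    """Find video addresses in rendered HTML and make them absolute."""
    found = []
    for match in VIDEO_URL.finditer(page_source):
        absolute = urljoin(base_url, match.group(1))
        if absolute not in found:
            found.append(absolute)
    return found


def fetch_rendered_source(url):
    """Load `url` in Chrome and return the rendered page source."""
    from selenium import webdriver  # lazy import: needs selenium and Chrome

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")   # assumption: run without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source        # counterpart of getPageSource()
    finally:
        driver.quit()


# Offline sample; a real run would pass fetch_rendered_source(url) instead.
sample = '<video src="/media/a.mp4"></video><a href="http://cdn.test/b.AVI">b</a>'
urls = extract_video_urls(sample, "http://site.test/page")
```

Because the page source is read after the browser has rendered the page, addresses injected by scripts on dynamic pages are visible to the extraction step. Addresses that carry a query string after the suffix are not matched by this simple pattern.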

During page capturing, some pages cannot be opened for various reasons, and some cannot be captured even after they are opened. Such pages are automatically re-crawled until the re-crawl threshold is exceeded. Network address data that exceeds the re-crawl threshold is archived for later iterative capturing.

Illustratively, the browser automation test framework captures the content of a network page in the browser. If capturing fails, the framework automatically captures again, and if that also fails, it captures once more. Assuming that the re-crawl threshold of the network page is 4, after four consecutive re-crawl failures the network address data of the page is archived in the database server, where it waits to be called again later for iterative capturing of the page content.
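The retry-then-archive behavior can be sketched as a small wrapper. The sketch assumes that a failed capture raises an exception and uses a plain list in place of the database server; the function and parameter names are illustrative.

```python
def run_with_retries(task, action, threshold, archive):
    """Run `action(task)`; on failure retry up to `threshold` more times.

    If every attempt fails, `task` is appended to `archive` (a stand-in
    for the database server) and None is returned.
    """
    for _ in range(1 + threshold):   # first attempt plus the re-crawls
        try:
            return action(task)
        except Exception:
            continue                 # failed attempt, try again
    archive.append(task)
    return None


calls = []


def always_fails(url):
    """Simulated capture that never succeeds."""
    calls.append(url)
    raise RuntimeError("page could not be opened")


archive = []
result = run_with_retries("http://a.test/page", always_fails, threshold=4, archive=archive)
```

With a threshold of 4 the page is attempted five times in total (one initial attempt and four re-crawls) before its address is archived.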

Specifically, for a part of websites, capturing video network address data in a network page or downloading video data requires logging in.

The login steps are as follows:

Step 2.1, obtaining an account and password.

Illustratively, there are various ways to obtain an account and password. The following may be used, without limitation: registering and recording them manually, or registering and recording them automatically by computer.

Step 2.2, obtaining the login page structure.

Specifically, the login page structure includes various items of information required for login.

Illustratively, many websites design login verification methods to prevent automated login by machine, and the login page structure includes the login verification method. Some websites verify by picture recognition, and others by a calculation. In this case, the specific login verification method needs to be obtained.

Step 2.3, logging in with the account and password according to the obtained login page structure.

Specifically, login is performed with the account and password according to the obtained login page. Automatic login verification may be accomplished using, but is not limited to, the following: CAPTCHA recognition technology and cookie technology.

Illustratively, the login may be performed in, but is not limited to, the following manner: using Selenium, a WebElement object is called to simulate login.

Step 2.4, after logging in, performing further operations on the website.

Specifically, the further operation includes capturing video network address data in the network page and downloading the video data.
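A simulated login through WebElement objects can be sketched as follows. The function takes any Selenium-style driver; the field names `username` and `password`, the submit selector and the login address are hypothetical and differ from site to site, and login verification is not handled. The locator strings `"name"` and `"css selector"` are the values of Selenium's `By.NAME` and `By.CSS_SELECTOR`. A small recording stand-in is used for a dry run so that the sketch does not need a browser.

```python
def login(driver, login_url, username, password,
          user_field="username", pass_field="password",
          submit_selector="button[type=submit]"):
    """Fill in and submit a login form through a Selenium-style driver."""
    driver.get(login_url)
    driver.find_element("name", user_field).send_keys(username)    # By.NAME
    driver.find_element("name", pass_field).send_keys(password)
    driver.find_element("css selector", submit_selector).click()   # By.CSS_SELECTOR


class RecordingElement:
    """Dry-run stand-in for a WebElement that records what was done to it."""

    def __init__(self, log, locator):
        self.log, self.locator = log, locator

    def send_keys(self, text):
        self.log.append(("send_keys", self.locator, text))

    def click(self):
        self.log.append(("click", self.locator))


class RecordingDriver:
    """Dry-run stand-in exposing the two driver methods used by login()."""

    def __init__(self):
        self.log = []

    def get(self, url):
        self.log.append(("get", url))

    def find_element(self, by, value):
        return RecordingElement(self.log, (by, value))


driver = RecordingDriver()
login(driver, "http://site.test/login", "alice", "secret")
```

With a real `webdriver.Chrome` instance in place of `RecordingDriver`, the same session can then be reused for capturing and downloading after login.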

Step three: downloading video data according to the video network address data and storing the video data.

Downloading the video data includes downloading the video data and de-duplicating the downloaded videos.

The video data is downloaded by performing IO stream processing on the third video network address data.

Illustratively, IO stream processing may be performed in, but is not limited to, the following manner.

The Storm distributed real-time computing system establishes a connection to the target address, reads the binary data of the video from the connection as an IO stream, and then writes the binary data into a local file.

Further, a browser page is rendered by using the browser automation test framework (Selenium), and the video data corresponding to the third video network address data is downloaded through an IO stream.

During video file downloading, some videos cannot start downloading for various reasons, and some fail after downloading has started. Such videos are automatically downloaded again until the re-download threshold is exceeded. Video network address data that exceeds the re-download threshold is archived for later iterative downloading.

Illustratively, the video data corresponding to the video network address data is downloaded. If the download fails, it is automatically retried. Assuming that the re-download threshold of the video data is 5, after five consecutive re-download failures the video network address data of the video is archived in the database server, where it waits to be called again later for iterative downloading.

The browser automation test framework technology is not limited to downloading videos in webpages; it can also be used to download audio, pictures, text and other data in webpages.
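Reading the binary data from the connection and writing it to a local file can be sketched with Python's standard library. This is a minimal sketch under assumptions: `urllib` stands in for whatever client the system actually uses, and the chunk size, timeout and function names are illustrative. The copy loop is exercised with in-memory streams so that no network access is needed.

```python
import io
import urllib.request


def copy_stream(source, target, chunk_size=64 * 1024):
    """Copy binary data from `source` to `target` in fixed-size chunks."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:                # empty read means end of stream
            break
        target.write(chunk)
        total += len(chunk)
    return total


def download_video(url, path, timeout=30):
    """Open a connection to `url` and stream the response body into `path`."""
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(path, "wb") as out:
        return copy_stream(response, out)


# Offline check of the copy loop with in-memory streams.
payload = b"\x00\x01" * 100_000
src, dst = io.BytesIO(payload), io.BytesIO()
copied = copy_stream(src, dst)
```

Copying in chunks keeps memory use bounded regardless of video size, which matters when many downloads run in parallel.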

The downloaded video files are stored in mp4, avi, rm or other video formats in a database server, which may use, without limitation, the following systems: the Hadoop Distributed File System (HDFS), the data warehouse tool Hive, the HBase distributed storage system, and other Hadoop cluster servers.

Illustratively, website A publishes a video, and website B downloads the video and publishes it on its own site. Two different video network addresses then exist, one on website A and one on website B, and both correspond to the same video. Acquiring the videos of website A and website B yields the same video twice, so the downloaded videos need to be de-duplicated.

The following method may be used, without limitation, to de-duplicate video files.

Illustratively, the de-duplication method is as follows:

Step 3.1, recording the following data of each video file: file name, suffix name and file size.

Step 3.2, comparing the file name, suffix name and file size; if all three are the same, the video is considered a duplicate. (Further, an MD5 check value of the video may be calculated; the video is then considered a duplicate when all three items and the MD5 check value are the same.)

Step 3.3, deleting the duplicate video, and recording the video network address data corresponding to the deleted video.
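Steps 3.1 to 3.3 can be sketched as follows, using the stricter variant in which the MD5 check value must also match. The sketch only reports duplicates and leaves deletion and record keeping to the caller; the function names and the temporary files are illustrative assumptions.

```python
import hashlib
import os
import tempfile


def md5_of(path, chunk_size=1 << 20):
    """MD5 check value of a file, computed in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Return the paths that duplicate an earlier file in `paths`."""
    kept = {}            # (file name, suffix, size) -> kept paths
    duplicates = []
    for path in paths:
        stem, suffix = os.path.splitext(os.path.basename(path))
        key = (stem, suffix.lower(), os.path.getsize(path))
        candidates = kept.setdefault(key, [])
        # MD5 is only computed when the three cheap fields already match.
        if any(md5_of(other) == md5_of(path) for other in candidates):
            duplicates.append(path)
        else:
            candidates.append(path)
    return duplicates


# Three downloads with the same name and size; only "b" repeats "a".
with tempfile.TemporaryDirectory() as root:
    paths = []
    for site, data in [("a", b"same"), ("b", b"same"), ("c", b"diff")]:
        os.makedirs(os.path.join(root, site))
        path = os.path.join(root, site, "clip.mp4")
        with open(path, "wb") as handle:
            handle.write(data)
        paths.append(path)
    duplicate_sites = [
        os.path.basename(os.path.dirname(p)) for p in find_duplicates(paths)
    ]
```

Checking name, suffix and size first is cheap and filters out most files, so the more expensive MD5 computation runs only on likely duplicates.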

In order to realize the video acquisition method, the invention also provides a video acquisition system based on the browser. The video acquisition system may employ, but is not limited to, the following structure, as shown in fig. 2. The video acquisition system includes: the system comprises a task generating unit, a video capturing unit, a video downloading unit, a video duplicate removal unit, a database server, a network address data sorting unit and a login unit.

The task generating unit receives the non-video network address data and outputs a video capturing task to the video capturing unit; and the video capture task is generated according to the acquired non-video network address data.

The video capturing unit receives the video capturing task transmitted by the task generating unit and outputs second video network address data to the network address data sorting unit; it is used for executing the video capturing task and capturing video network addresses in webpages. The video capturing unit also outputs the network address data corresponding to uncompleted webpage grabbing tasks to the database server.

The network address data sorting unit receives the first video network address data and the second video network address data transmitted by the video capturing unit, and outputs third video network address data to the video downloading unit; it is used for sorting the network address data and the video network address data. Sorting includes deleting duplicate data and deleting data that does not conform to the correct format of a network address.

The video downloading unit receives the third video network address data transmitted by the network address data sorting unit and outputs video data to the video duplication removing unit; it is used for downloading the video data in webpages according to the video network address data. The video downloading unit also outputs the video network address data corresponding to failed video downloads to the database server.

The video duplication removing unit receives the video data transmitted by the video downloading unit and outputs the de-duplicated video data to the database server; it is used for de-duplicating the downloaded video data.

The database server receives the data transmitted by the video capturing unit, the video downloading unit and the video duplication removing unit; it is used for storing the de-duplicated video data, the network address data corresponding to uncompleted webpage grabbing tasks, and the video network address data corresponding to failed video downloads.

The login unit is used as required: it is called by the video capturing unit or the video downloading unit to log in to a website automatically.

The video acquisition system provided by the invention can download with a single host or with a plurality of servers. Using a plurality of servers for distributed downloading of massive numbers of videos offers good scalability and strong reliability. Running on a plurality of servers greatly improves the reliability of the system, and the scale of the system can be expanded efficiently and dynamically as demand grows or shrinks, so that the number of servers can be adjusted in time.

The video acquisition system uses the browser automation test framework (Selenium) as a console. The search sources to be crawled are divided into a plurality of webpage grabbing tasks according to a strategy, and the console transmits the webpage grabbing tasks to the cloud crawler servers in the form of files. Each server crawls after receiving its webpage grabbing task and stores the crawling results in the database cluster server. The console aggregates the data crawled by the crawler servers and provides a foreground data display for the user.

The video acquisition system can store downloaded videos on a single host or a file storage server, or in cloud storage. Cloud storage refers to a system that, through functions such as cluster applications, network technology or a distributed file system, uses application software to make a large number of storage devices of various types in a network work together and provides data storage and service access to the outside. Cloud storage can improve the stability and security of system data.

Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
