
Web Scraping Ajax and Javascript Sites

2015-05-06 10:09
Reposted from: http://blog.databigbang.com/web-scraping-ajax-and-javascript-sites/


Introduction

Most crawling frameworks used for scraping cannot handle Javascript or Ajax. Their scope is limited to sites that serve their main content without scripting. One might be tempted to wire a specific crawler to a Javascript engine, but it's not easy to do: browser behavior is too complex for a simple connection between a crawler and a Javascript engine to work, so you need a fully functional browser with good DOM support. There is a list of resources at the end of this article for exploring the alternatives in more depth.

There are several ways to scrape a site that contains Javascript:

Embed a web browser within an application and simulate a normal user.

Remotely connect to a web browser and automate it from a scripting language.

Use special-purpose add-ons to automate the browser.

Use a framework/library to simulate a complete browser.

Each one of these alternatives has its pros and cons. For example using a complete browser consumes a lot of resources, especially if we need to scrape websites with a lot of pages.

In this post we'll give a simple example of how to scrape a web site that uses Javascript. We will use the htmlunit library to simulate a browser. Since htmlunit runs on a JVM, we will use Jython, an excellent implementation of Python for the JVM. The resulting code is very clear and focuses on solving the problem instead of on programming-language plumbing.


Setting up the environment


Prerequisites

JRE or JDK.

Download the latest version of Jython from http://www.jython.org/downloads.html.

Run the .jar file and install it in your preferred directory (e.g: /opt/jython).

Download the htmlunit compiled binaries from http://sourceforge.net/projects/htmlunit/files/.

Unzip the htmlunit to your preferred directory.


Crawling example

We will scrape the Gartner Magic Quadrant pages at http://www.gartner.com/it/products/mq/mq_ms.jsp.
If you look at the list of documents, the links are Javascript code instead of hyperlinks with http urls. This may be to discourage crawling, or just to open a popup window. It's a very convenient page to illustrate the solution.
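To see why such links defeat an ordinary crawler, here is a minimal sketch in standard CPython (separate from the Jython example below) that collects hrefs the way a naive crawler would. The HTML fragment is made up, shaped after the Gartner list's javascript: pseudo-links; `open_pop` is a hypothetical function name.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values from anchor tags, as a naive crawler would."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)

# Made-up fragment in the style of the Gartner list: the "link" is script, not a URL.
html = '<a href="javascript:open_pop(\'doc_1\')">Magic Quadrant</a>'
collector = LinkCollector()
collector.feed(html)

# The only href is a javascript: pseudo-URL -- nothing a plain crawler can fetch.
crawlable = [h for h in collector.hrefs if h.startswith("http")]
```

The `crawlable` list comes back empty: a crawler that only follows http urls finds nothing here, which is exactly why we need a simulated browser that can click the links instead.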

gartner.py

import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def main():
    webclient = WebClient(BrowserVersion.FIREFOX_3_6)  # creating a new webclient object
    url = "http://www.gartner.com/it/products/mq/mq_ms.jsp"
    page = webclient.getPage(url)  # getting the url
    articles = page.getByXPath("//table[@id='mqtable']//tr/td/a")  # getting all the hyperlinks

    for article in articles:
        print "Clicking on:", article
        subpage = article.click()  # click on the article link
        title = subpage.getByXPath("//div[@class='title']")  # get title
        summary = subpage.getByXPath("//div[@class='summary']")  # get summary
        if len(title) > 0 and len(summary) > 0:
            print "Title:", title[0].asText()
            print "Summary:", summary[0].asText()
        # break

if __name__ == '__main__':
    main()
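The `getByXPath` queries in gartner.py rely only on a small XPath subset, and the same selections can be tried outside the JVM. A rough sketch using standard CPython's `xml.etree.ElementTree`, whose limited XPath support covers `.//tag[@attr='value']` paths; the markup is made up, shaped like the subpages gartner.py visits:

```python
import xml.etree.ElementTree as ET

# Made-up markup shaped like the subpages gartner.py clicks through to.
doc = ET.fromstring(
    "<html><body>"
    "<div class='title'>Magic Quadrant for X</div>"
    "<div class='summary'>Vendors compared...</div>"
    "</body></html>"
)

# ElementTree's XPath subset: .// descendant search plus an attribute predicate,
# mirroring subpage.getByXPath("//div[@class='title']") in the htmlunit code.
titles = doc.findall(".//div[@class='title']")
summaries = doc.findall(".//div[@class='summary']")

if len(titles) > 0 and len(summaries) > 0:
    title_text = titles[0].text
    summary_text = summaries[0].text
```

The difference, of course, is that ElementTree only sees static markup, while htmlunit evaluates the page's Javascript before you query it.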


run.sh

/opt/jython/jython -J-classpath "htmlunit-2.8/lib/*" gartner.py


Final notes

This article is just a starting point for moving beyond simple crawlers and points the way for further research. Because this is a simple page, it is a good choice for a clear example of how Javascript scraping works. You must do your homework to crawl more complex web pages or to add multithreading for better performance. In a demanding crawling scenario a lot of things must be taken into account, but that is a subject for future articles.
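As a starting point for the multithreading mentioned above, the per-page work can be farmed out to a thread pool. A minimal sketch in standard CPython using `concurrent.futures`; the `fetch` function is a placeholder standing in for real page retrieval (the equivalent of `webclient.getPage(url)` in gartner.py), and the urls are made up:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Placeholder for a real page fetch and scrape; here we just echo the url.
    return "scraped:" + url

urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/c",
]

# pool.map fans the urls out over worker threads and preserves input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fetch, urls))
```

Threads pay off here because scraping is I/O-bound: while one worker waits on the network, the others keep fetching.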

If you want to be polite, don't forget to read the robots.txt file before crawling…
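Checking robots.txt doesn't have to be manual: the standard CPython library ships a parser for it. A small sketch using `urllib.robotparser`; the rules and urls are made up (in practice you would fetch the site's real robots.txt first):

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt body; normally fetched from http://<site>/robots.txt.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Consult the rules before requesting each page.
allowed = rp.can_fetch("mybot", "http://example.com/research/doc.jsp")
blocked = rp.can_fetch("mybot", "http://example.com/private/secret.html")
```

Calling `can_fetch` before every `getPage` keeps the crawler on the right side of the site's stated policy.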


If you like this article, you might also be interested in

Distributed Scraping With Multiple Tor Circuits

Precise Scraping with Google Chrome

Running Your Own Anonymous Rotating Proxies

Automated Browserless OAuth Authentication for Twitter


Resources

HtmlUnit

ghost.py is a webkit web client written in python

Crowbar web scraping environment

Google Chrome remote debugging shell from Python

Selenium web application testing system

Watir

Sahi

Windmill Testing Framework

Internet Explorer automation

jSSh Javascript Shell Server for Mozilla

http://trac.webkit.org/wiki/QtWebKit

Embedding Gecko

Opera Dragonfly

PyAuto: Python Interface to Chromium's automation framework

Related questions on Stack Overflow

Scrapy

EnvJS: Simulated browser environment written in Javascript

Setting up Headless XServer and CutyCapt on Ubuntu

CutyCapt: Capture WebKit's rendering of a web page

Google webmaster blog: A spider's view of Web 2.0

OpenQA

Python Webkit DOM Bindings

Berkelium Browser

uBrowser

Using HtmlUnit on .NET for Headless Browser Automation (using IKVM)

Zombie.js

PhantomJS

PyPhantomJS

CasperJS

Web Inspector Remote

Offscreen/Headless Mozilla Firefox (via @brutuscat)

Web Scraping with Google Spreadsheets and XPath

Web Scraping with YQL and Yahoo Pipes