Web Scraping Ajax and Javascript Sites
2015-05-06 10:09
Reposted from: http://blog.databigbang.com/web-scraping-ajax-and-javascript-sites/
Introduction
Most crawling frameworks used for scraping cannot handle Javascript or Ajax. Their scope is limited to sites that show their main content without scripting. One might be tempted to connect a crawler to a Javascript engine, but it's not easy to do: you need a fully functional browser with good DOM support, because browser behavior is too complex for a simple connection between a crawler and a Javascript engine to work. There is a list of resources at the end of this article to explore the alternatives in more depth.
There are several ways to scrape a site that contains Javascript:
Embed a web browser within an application and simulate a normal user.
Remotely connect to a web browser and automate it from a scripting language.
Use special-purpose add-ons to automate the browser.
Use a framework/library to simulate a complete browser.
Each of these alternatives has its pros and cons. For example, using a complete browser consumes a lot of resources, especially if we need to scrape websites with many pages.
In this post we'll give a simple example of how to scrape a web site that uses Javascript. We will use the htmlunit library to simulate a browser. Since htmlunit runs on a JVM, we will use Jython, an excellent programming language that is a Python implementation on the JVM. The resulting code is very clear and focuses on solving the problem instead of on programming-language details.
Setting up the environment
Prerequisites
A JRE or JDK.
Download the latest version of Jython from http://www.jython.org/downloads.html.
Run the .jar file and install it in your preferred directory (e.g. /opt/jython).
Download the htmlunit compiled binaries from http://sourceforge.net/projects/htmlunit/files/.
Unzip htmlunit to your preferred directory.
Crawling example
We will scrape the Gartner Magic Quadrant pages at http://www.gartner.com/it/products/mq/mq_ms.jsp. If you look at the list of documents, the links are Javascript code instead of hyperlinks with http urls. This may be to discourage crawling, or just to open a popup window. Either way, it's a very convenient page to illustrate the solution.
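To see why such links stop an ordinary crawler, here is a minimal sketch in plain Python 3 (CPython with only the standard library, not Jython). The markup and the openDoc function name are invented for illustration; they merely mimic the kind of javascript: pseudo-URLs the Gartner page uses:

```python
from html.parser import HTMLParser

# Hypothetical snippet resembling the listing page: one anchor carries a
# javascript: pseudo-URL, the other an ordinary http hyperlink.
SAMPLE = """
<table id="mqtable">
  <tr><td><a href="javascript:openDoc('mq123')">Magic Quadrant A</a></td></tr>
  <tr><td><a href="http://example.com/static.html">Static link</a></td></tr>
</table>
"""

class LinkAuditor(HTMLParser):
    """Collects href values and separates the ones a plain crawler cannot follow."""
    def __init__(self):
        super().__init__()
        self.js_links = []
        self.http_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href":
                if value.startswith("javascript:"):
                    self.js_links.append(value)    # needs a JS-capable browser
                else:
                    self.http_links.append(value)  # an ordinary crawler can fetch this

auditor = LinkAuditor()
auditor.feed(SAMPLE)
print(auditor.js_links)
print(auditor.http_links)
```

A plain crawler sees only the javascript: string; there is no http url to enqueue, which is exactly the gap that a simulated browser like htmlunit fills by actually executing the click.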
gartner.py
import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def main():
    webclient = WebClient(BrowserVersion.FIREFOX_3_6)  # creating a new webclient object
    url = "http://www.gartner.com/it/products/mq/mq_ms.jsp"
    page = webclient.getPage(url)  # getting the url
    articles = page.getByXPath("//table[@id='mqtable']//tr/td/a")  # getting all the hyperlinks

    for article in articles:
        print "Clicking on:", article
        subpage = article.click()  # click on the article link
        title = subpage.getByXPath("//div[@class='title']")  # get title
        summary = subpage.getByXPath("//div[@class='summary']")  # get summary
        if len(title) > 0 and len(summary) > 0:
            print "Title:", title[0].asText()
            print "Summary:", summary[0].asText()
        # break

if __name__ == '__main__':
    main()
run.sh
/opt/jython/jython -J-classpath "htmlunit-2.8/lib/*" gartner.py
Final notes
This article is just a starting point for moving beyond simple crawlers, and it points the way for further research. As this is a simple page, it is a good choice for a clear example of how Javascript scraping works. You must do your homework to learn to crawl more web pages or add multithreading for better performance. In a demanding crawling scenario a lot of things must be taken into account, but this is a subject for future articles.
If you want to be polite, don't forget to read the robots.txt file before crawling…
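The robots.txt check is easy to automate with Python's standard library. A small sketch, using a made-up rule set and user-agent name; in a real crawl you would fetch the file from http://&lt;host&gt;/robots.txt before requesting any page:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Only crawl a URL when can_fetch() allows it for your user agent.
print(rp.can_fetch("mybot", "http://www.example.com/it/products/mq/mq_ms.jsp"))
print(rp.can_fetch("mybot", "http://www.example.com/private/report.html"))
```

In production you would call rp.set_url() with the site's real robots.txt address and rp.read() to fetch it, then gate every request on can_fetch().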
If you like this article, you might also be interested in
Distributed Scraping With Multiple Tor Circuits
Precise Scraping with Google Chrome
Running Your Own Anonymous Rotating Proxies
Automated Browserless OAuth Authentication for Twitter
Resources
HtmlUnit
ghost.py is a webkit web client written in python
Crowbar web scraping environment
Google Chrome remote debugging shell from Python
Selenium web application testing system – Watir – Sahi – Windmill Testing Framework
Internet Explorer automation
jSSh Javascript Shell Server for Mozilla
http://trac.webkit.org/wiki/QtWebKit
Embedding Gecko
Opera Dragonfly
PyAuto: Python Interface to Chromium's automation framework
Related questions on Stack Overflow
Scrapy
EnvJS: Simulated browser environment written in Javascript
Setting up Headless XServer and CutyCapt on Ubuntu
CutyCapt: Capture WebKit's rendering of a web page
Google webmaster blog: A spider's view of Web 2.0
OpenQA
Python Webkit DOM Bindings
Berkelium Browser
uBrowser
Using HtmlUnit on .NET for Headless Browser Automation (using IKVM)
Zombie.js
PhantomJS
PyPhantomJS
CasperJS
Web Inspector Remote
Offscreen/Headless Mozilla Firefox (via @brutuscat)
Web Scraping with Google Spreadsheets and XPath
Web Scraping with YQL and Yahoo Pipes