
Scraping Dynamically Loaded Page Data with Java

2016-12-03 21:31

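There are two ways to scrape data from a dynamically loaded page:

1. Simulate the requests the page makes and read the data returned by the backend API directly.

2. Render the page in an embedded browser, then extract the data from the rendered result.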

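Analysis

Simulating the page's network requests by assembling the parameters it would send, and then reading the API response, does work, but it is tedious. This article takes the simple, brute-force route instead: render the page in an embedded browser and scrape the result.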
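
For comparison, the first approach can be sketched roughly as follows. This is only an illustration: the endpoint in the usage note, the parameter names, and the two request headers are assumptions and do not come from any real site's API. The class assembles a query string and sends a plain GET with `HttpURLConnection`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Map;

class ApiRequestSketch {

    /** URL-encodes a single key or value as UTF-8. */
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM.
            throw new IllegalStateException(e);
        }
    }

    /** Appends URL-encoded query parameters to a base URL that has no query yet. */
    static String buildUrl(String base, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(base);
        char separator = '?';
        for (Map.Entry<String, String> entry : params.entrySet()) {
            sb.append(separator)
              .append(encode(entry.getKey()))
              .append('=')
              .append(encode(entry.getValue()));
            separator = '&';
        }
        return sb.toString();
    }

    /** Sends a GET request and returns the response body as text. */
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        // Assumption: the endpoint expects requests that look like XHR calls.
        conn.setRequestProperty("X-Requested-With", "XMLHttpRequest");
        conn.setRequestProperty("Accept", "application/json");
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }
}
```

A call such as `fetch(buildUrl("http://example.com/api/pins", params))` would then return the raw response. The tedious part is working out which parameters and headers a given site actually expects, which is why the rest of this article uses a browser instead.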

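So how do we embed a browser?

Anyone familiar with automated testing will know Selenium, a tool that drives a browser for automated tests. Selenium provides a set of APIs for interacting with a real browser engine. It is cross-language, with bindings for Java, C#, Python and others, and it supports multiple browsers, including Chrome, Firefox and IE.

Implementation

We'll write the demo in Java.

Add dependencies

Add the Selenium dependencies, using Maven as an example: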

<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>3.0.1</version>
</dependency>
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-chrome-driver</artifactId>
    <version>3.0.1</version>
</dependency>
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-server</artifactId>
    <version>2.18.0</version>
</dependency>


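Download the driver

Using Chrome as an example:
https://sites.google.com/a/chromium.org/chromedriver/


After downloading, it is best to add the driver's location to your PATH environment variable. Alternatively, you can set the system property in code before using the driver: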

System.getProperties().setProperty("webdriver.chrome.driver",
"/Users/zhenguo/Documents/chrome/chromedriver");


Note: on macOS, make sure the chromedriver file is executable.
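
As one possible way to guard against this, the check can be done from Java before the property is set. The helper below is a hypothetical sketch, not part of Selenium: it verifies that the file exists, tries to mark it executable (much like `chmod +x` for the owner), and only then registers the path.

```java
import java.io.File;

class ChromeDriverSetup {

    /**
     * Verifies that the driver binary exists and is executable, then
     * registers it under the webdriver.chrome.driver system property.
     */
    static void useDriver(String path) {
        File driver = new File(path);
        if (!driver.isFile()) {
            throw new IllegalArgumentException("chromedriver not found: " + path);
        }
        // setExecutable(true) sets the owner's execute bit; it returns false
        // if the current user is not allowed to change the permission.
        if (!driver.canExecute() && !driver.setExecutable(true)) {
            throw new IllegalStateException("cannot make chromedriver executable: " + path);
        }
        System.setProperty("webdriver.chrome.driver", driver.getAbsolutePath());
    }
}
```

Calling `ChromeDriverSetup.useDriver("/Users/zhenguo/Documents/chrome/chromedriver")` before `new ChromeDriver()` would then fail fast with a readable message instead of a driver start-up error.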

Install the Chrome browser

Test Selenium with the following code:

@Ignore("need chrome driver")
@Test
public void testSelenium() {
    System.getProperties().setProperty("webdriver.chrome.driver", "/Users/zhenguo/Documents/chrome/chromedriver");
    WebDriver webDriver = new ChromeDriver();
    try {
        webDriver.get("http://huaban.com/");
        // Read the root element so we get the DOM as rendered by the browser.
        WebElement webElement = webDriver.findElement(By.xpath("/html"));
        System.out.println(webElement.getAttribute("outerHTML"));
    } finally {
        // quit() closes every window and ends the session; close() only closes the current window.
        webDriver.quit();
    }
}


If you see output similar to the following, WebDriver is configured correctly:

Starting ChromeDriver 2.25.426935 (820a95b0b81d33e42712f9198c215f703412e1a1) on port 2052
Only local connections are allowed.
Nov 07, 2016 12:35:11 AM org.openqa.selenium.remote.ProtocolHandshake createSession
INFO: Attempting bi-dialect session, assuming Postel's Law holds true on the remote end
Nov 07, 2016 12:35:13 AM org.openqa.selenium.remote.ProtocolHandshake createSession
INFO: Detected dialect: OSS


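PS: Every call to new ChromeDriver() makes Selenium start a new Chrome process and talk to it from Java over a random port. We need to shut that process down when we are done: webDriver.quit() does this, whereas webDriver.close() only closes the current window. For a web crawler, it is best to use a thread pool and reuse the drivers.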
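
The idea of sharing a few browsers among worker threads can be sketched without Selenium. In the example below, `Browser` is a hypothetical stand-in for `WebDriver` so that the code runs on the JDK alone, and `renderAll` is an illustrative name. Each task borrows a browser, returns it in a `finally` block, and every browser is quit at the end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

class BrowserPoolSketch {

    /** Hypothetical stand-in for WebDriver so the sketch runs without Selenium. */
    interface Browser {
        String render(String url);
        void quit();
    }

    /** Renders all URLs on `threads` workers that share `threads` browsers. */
    static List<String> renderAll(List<String> urls, int threads, Supplier<Browser> factory) {
        BlockingQueue<Browser> idle = new ArrayBlockingQueue<>(threads);
        List<Browser> all = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Browser browser = factory.get();
            all.add(browser);
            idle.add(browser);
        }
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                futures.add(executor.submit(() -> {
                    Browser browser = idle.take(); // borrow an idle browser
                    try {
                        return browser.render(url);
                    } finally {
                        idle.add(browser);         // always give it back
                    }
                }));
            }
            List<String> pages = new ArrayList<>();
            for (Future<String> future : futures) {
                pages.add(future.get());           // results in submission order
            }
            return pages;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            executor.shutdownNow();
            for (Browser browser : all) {
                browser.quit();                    // one quit per started browser
            }
        }
    }
}
```

With real drivers, the factory would create a `ChromeDriver` and `render` would load the page and return its HTML. The point of the sketch is only the lifecycle: a fixed number of browsers, borrow and return around each page, and a guaranteed `quit()` at shutdown.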

Implementing the crawler

With the steps above in place, implementing a crawler based on webmagic is fairly simple. The code is as follows:

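import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.selenium.SeleniumDownloader;
import us.codecraft.webmagic.pipeline.FilePipeline;
import us.codecraft.webmagic.processor.PageProcessor;

public class HuabanProcessor implements PageProcessor {

    private Site site;

    @Override
    public void process(Page page) {
        // Queue every link on the same site for crawling.
        page.addTargetRequests(
                page.getHtml().links().regex("http://huaban\\.com/.*").all());
        if (page.getUrl().toString().contains("pins")) {
            // Pin pages carry the image we want; extract its src attribute.
            page.putField("img", page.getHtml().
                    xpath("//div[@id='baidu_image_holder']/img/@src").toString());
        } else {
            // Other pages are only used for link discovery.
            page.getResultItems().setSkip(true);
        }
    }

    @Override
    public Site getSite() {
        if (null == site) {
            site = Site.me().setDomain("huaban.com").setSleepTime(0);
        }
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new HuabanProcessor()).thread(5)
                .addPipeline(new FilePipeline("/Users/zhenguo/Documents/chrome/webmagic/test/"))
                .setDownloader(new SeleniumDownloader("/Users/zhenguo/Documents/chrome/chromedriver"))
                .addUrl("http://huaban.com/explore/gufenghaibao/")
                .runAsync();
    }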
}
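
To make the two decisions in `process` easier to see, here is a small sketch of the same filtering logic using only `java.util.regex`. It is an approximation for illustration: it is not webmagic's selector implementation (it requires the whole link to match), and the sample URLs in the usage note are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class LinkFilterSketch {

    // Same pattern as in HuabanProcessor; the escaped dot matches a literal '.'.
    private static final Pattern SAME_SITE = Pattern.compile("http://huaban\\.com/.*");

    /** Keeps only the links that the crawler would follow. */
    static List<String> followable(List<String> links) {
        List<String> result = new ArrayList<>();
        for (String link : links) {
            if (SAME_SITE.matcher(link).matches()) {
                result.add(link);
            }
        }
        return result;
    }

    /** True for pages whose URL contains "pins", i.e. pages that yield an "img" field. */
    static boolean isPinPage(String url) {
        return url.contains("pins");
    }
}
```

For example, a link such as `http://huaban.com/pins/1/` is both followed and treated as a pin page, while `http://example.com/` is dropped before it ever reaches the scheduler.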


HuabanProcessor above uses SeleniumDownloader, whose code is as follows:

package us.codecraft.webmagic.downloader.selenium;

import org.apache.log4j.Logger;
import org.openqa.selenium.By;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.downloader.Downloader;
import us.codecraft.webmagic.selector.Html;
import us.codecraft.webmagic.selector.PlainText;
import us.codecraft.webmagic.utils.UrlUtils;

import java.io.Closeable;
import java.io.IOException;
import java.util.Map;

/**
 * Renders pages by driving a browser through Selenium. Currently only Chrome is supported.<br>
 * Requires a downloaded Selenium driver.<br>
 *
 * @author code4crafter@gmail.com <br>
 *         Date: 13-7-26 <br>
 *         Time: 1:37 PM <br>
 */
public class SeleniumDownloader implements Downloader, Closeable {

    private volatile WebDriverPool webDriverPool;

    private Logger logger = Logger.getLogger(getClass());

    private int sleepTime = 0;

    private int poolSize = 1;

    private static final String DRIVER_PHANTOMJS = "phantomjs";

    /**
     * Creates a downloader that uses the given chromedriver binary.
     *
     * @param chromeDriverPath chromeDriverPath
     */
    public SeleniumDownloader(String chromeDriverPath) {
        System.getProperties().setProperty("webdriver.chrome.driver",
                chromeDriverPath);
    }

    /**
     * Constructor without any field. Construct PhantomJS browser
     *
     * @author bob.li.0718@gmail.com
     */
    public SeleniumDownloader() {
        // System.setProperty("phantomjs.binary.path",
        // "/Users/Bingo/Downloads/phantomjs-1.9.7-macosx/bin/phantomjs");
    }

    /**
     * set sleep time to wait until load success
     *
     * @param sleepTime sleepTime
     * @return this
     */
    public SeleniumDownloader setSleepTime(int sleepTime) {
        this.sleepTime = sleepTime;
        return this;
    }
    @Override
    public Page download(Request request, Task task) {
        checkInit();
        WebDriver webDriver;
        try {
            webDriver = webDriverPool.get();
        } catch (InterruptedException e) {
            logger.warn("interrupted", e);
            // Restore the interrupt flag so the caller can see it.
            Thread.currentThread().interrupt();
            return null;
        }
        try {
            logger.info("downloading page " + request.getUrl());
            webDriver.get(request.getUrl());
            try {
                // Give the page's scripts time to finish loading data.
                Thread.sleep(sleepTime);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            WebDriver.Options manage = webDriver.manage();
            Site site = task.getSite();
            if (site.getCookies() != null) {
                for (Map.Entry<String, String> cookieEntry : site.getCookies()
                        .entrySet()) {
                    Cookie cookie = new Cookie(cookieEntry.getKey(),
                            cookieEntry.getValue());
                    manage.addCookie(cookie);
                }
            }

            /*
             * TODO You can add mouse event or other processes
             *
             * @author: bob.li.0718@gmail.com
             */

            // Read the rendered DOM rather than the raw HTTP response.
            WebElement webElement = webDriver.findElement(By.xpath("/html"));
            String content = webElement.getAttribute("outerHTML");
            Page page = new Page();
            page.setRawText(content);
            page.setHtml(new Html(UrlUtils.fixAllRelativeHrefs(content,
                    request.getUrl())));
            page.setUrl(new PlainText(request.getUrl()));
            page.setRequest(request);
            return page;
        } finally {
            // Return the driver even if rendering fails, so it is not lost to the pool.
            webDriverPool.returnToPool(webDriver);
        }
    }

    private void checkInit() {
        if (webDriverPool == null) {
            synchronized (this) {
                // Re-check under the lock so only one thread creates the pool.
                if (webDriverPool == null) {
                    webDriverPool = new WebDriverPool(poolSize);
                }
            }
        }
    }

    @Override
    public void setThread(int thread) {
        this.poolSize = thread;
    }

    @Override
    public void close() throws IOException {
        // The pool is created lazily, so it may not exist yet.
        if (webDriverPool != null) {
            webDriverPool.closeAll();
        }
    }
}
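
The downloader waits a fixed `sleepTime` after loading each page, which is either too long or too short for most pages. A common alternative is to poll for a condition with a timeout. Selenium provides `WebDriverWait` for this; the sketch below is a plain-JDK stand-in with hypothetical names (`PollingWait`, `waitUntil`) that shows the idea without needing a browser.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

class PollingWait {

    /**
     * Polls `condition` every `intervalMillis` until it is true or
     * `timeoutMillis` has elapsed. Returns whether the condition was met.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt and stop waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

Inside `download`, the condition could be something like `() -> !webDriver.findElements(By.id("baidu_image_holder")).isEmpty()`, so that the page is read as soon as the target element appears instead of after a fixed delay.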

The WebDriverPool code is as follows:

package us.codecraft.webmagic.downloader.selenium;

import org.apache.log4j.Logger;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriverService;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.remote.RemoteWebDriver;

import java.io.FileReader;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * @author code4crafter@gmail.com <br>
 *         Date: 13-7-26 <br>
 *         Time: 1:41 PM <br>
 */
class WebDriverPool {
    private Logger logger = Logger.getLogger(getClass());

    private final static int DEFAULT_CAPACITY = 5;

    private final int capacity;

    private final static int STAT_RUNNING = 1;

    private final static int STAT_CLODED = 2;

    private AtomicInteger stat = new AtomicInteger(STAT_RUNNING);

    /*
     * new fields for configuring phantomJS
     */
    private WebDriver mDriver = null;
    private boolean mAutoQuitDriver = true;

    private static final String CONFIG_FILE = "/Users/zhenguo/Documents/develop/github/webmagic/webmagic-selenium/config.ini";
    private static final String DRIVER_FIREFOX = "firefox";
    private static final String DRIVER_CHROME = "chrome";
    private static final String DRIVER_PHANTOMJS = "phantomjs";

    protected static Properties sConfig;
    protected static DesiredCapabilities sCaps;

    /**
     * Configure the GhostDriver, and initialize a WebDriver instance. This part
     * of code comes from GhostDriver.
     * https://github.com/detro/ghostdriver/tree/master/test/java/src/test/java/ghostdriver
     *
     * @author bob.li.0718@gmail.com
     * @throws IOException if the config file cannot be read or a required property is missing
     */
    public void configure() throws IOException {
        // Read config file; try-with-resources closes the reader afterwards
        sConfig = new Properties();
        try (FileReader reader = new FileReader(CONFIG_FILE)) {
            sConfig.load(reader);
        }

        // Prepare capabilities
        sCaps = new DesiredCapabilities();
        sCaps.setJavascriptEnabled(true);
        sCaps.setCapability("takesScreenshot", false);

        String driver = sConfig.getProperty("driver", DRIVER_PHANTOMJS);

        // Fetch PhantomJS-specific configuration parameters
        if (driver.equals(DRIVER_PHANTOMJS)) {
            // "phantomjs_exec_path"
            if (sConfig.getProperty("phantomjs_exec_path") != null) {
                sCaps.setCapability(
                        PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY,
                        sConfig.getProperty("phantomjs_exec_path"));
            } else {
                throw new IOException(
                        String.format(
                                "Property '%s' not set!",
                                PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY));
            }
            // "phantomjs_driver_path"
            if (sConfig.getProperty("phantomjs_driver_path") != null) {
                System.out.println("Test will use an external GhostDriver");
                sCaps.setCapability(
                        PhantomJSDriverService.PHANTOMJS_GHOSTDRIVER_PATH_PROPERTY,
                        sConfig.getProperty("phantomjs_driver_path"));
            } else {
                System.out
                        .println("Test will use PhantomJS internal GhostDriver");
            }
        }

        // Disable "web-security", enable all possible "ssl-protocols" and
        // "ignore-ssl-errors" for PhantomJSDriver
        // sCaps.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS, new
        // String[] {
        // "--web-security=false",
        // "--ssl-protocol=any",
        // "--ignore-ssl-errors=true"
        // });

        ArrayList<String> cliArgsCap = new ArrayList<String>();
        cliArgsCap.add("--web-security=false");
        cliArgsCap.add("--ssl-protocol=any");
        cliArgsCap.add("--ignore-ssl-errors=true");
        sCaps.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS,
                cliArgsCap);

        // Control LogLevel for GhostDriver, via CLI arguments
        sCaps.setCapability(
                PhantomJSDriverService.PHANTOMJS_GHOSTDRIVER_CLI_ARGS,
                new String[] { "--logLevel="
                        + (sConfig.getProperty("phantomjs_driver_loglevel") != null ? sConfig
                                .getProperty("phantomjs_driver_loglevel")
                                : "INFO") });

        // String driver = sConfig.getProperty("driver", DRIVER_PHANTOMJS);

        // Start appropriate Driver
        if (isUrl(driver)) {
            sCaps.setBrowserName("phantomjs");
            mDriver = new RemoteWebDriver(new URL(driver), sCaps);
        } else if (driver.equals(DRIVER_FIREFOX)) {
            mDriver = new FirefoxDriver(sCaps);
        } else if (driver.equals(DRIVER_CHROME)) {
            mDriver = new ChromeDriver(sCaps);
        } else if (driver.equals(DRIVER_PHANTOMJS)) {
            mDriver = new PhantomJSDriver(sCaps);
        } else {
            // Fail loudly instead of silently reusing a stale or null mDriver.
            throw new IllegalArgumentException("Unsupported driver: " + driver);
        }
    }

    /**
     * check whether input is a valid URL
     *
     * @author bob.li.0718@gmail.com
     * @param urlString urlString
     * @return true means yes, otherwise no.
     */
    private boolean isUrl(String urlString) {
        try {
            new URL(urlString);
            return true;
        } catch (MalformedURLException mue) {
            return false;
        }
    }

    /**
     * store webDrivers created
     */
    private List<WebDriver> webDriverList = Collections
            .synchronizedList(new ArrayList<WebDriver>());

    /**
     * store webDrivers available
     */
    private BlockingDeque<WebDriver> innerQueue = new LinkedBlockingDeque<WebDriver>();

    public WebDriverPool(int capacity) {
        this.capacity = capacity;
    }

    public WebDriverPool() {
        this(DEFAULT_CAPACITY);
    }

    /**
     * Takes an idle WebDriver from the pool, creating a new one if the pool
     * has not reached capacity; otherwise blocks until one is returned.
     *
     * @return an idle WebDriver
     * @throws InterruptedException if interrupted while waiting for a driver
     */
    public WebDriver get() throws InterruptedException {
        checkRunning();
        WebDriver poll = innerQueue.poll();
        if (poll != null) {
            return poll;
        }
        if (webDriverList.size() < capacity) {
            synchronized (webDriverList) {
                if (webDriverList.size() < capacity) {

                    // add new WebDriver instance into pool
                    try {
                        configure();
                        innerQueue.add(mDriver);
                        webDriverList.add(mDriver);
                    } catch (IOException e) {
                        // Without a driver, take() below could block forever, so fail fast.
                        throw new IllegalStateException("Failed to create WebDriver", e);
                    }

                    // ChromeDriver e = new ChromeDriver();
                    // WebDriver e = getWebDriver();
                    // innerQueue.add(e);
                    // webDriverList.add(e);
                }
            }

        }
        return innerQueue.take();
    }

    public void returnToPool(WebDriver webDriver) {
        checkRunning();
        innerQueue.add(webDriver);
    }

    protected void checkRunning() {
        if (stat.get() != STAT_RUNNING) {
            throw new IllegalStateException("Already closed!");
        }
    }

    public void closeAll() {
        boolean b = stat.compareAndSet(STAT_RUNNING, STAT_CLODED);
        if (!b) {
            throw new IllegalStateException("Already closed!");
        }
        for (WebDriver webDriver : webDriverList) {
            logger.info("Quit webDriver " + webDriver);
            webDriver.quit();
        }
    }

}
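
`configure()` reads its settings from a `config.ini` file in `java.util.Properties` format. The article does not show that file, so the sketch below uses made-up values: only the key names (`driver`, `phantomjs_exec_path`, `phantomjs_driver_path`, `phantomjs_driver_loglevel`) come from the code above. It shows how such a file is parsed and how the defaults apply.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class DriverConfigSketch {

    /** Parses config text written in java.util.Properties format. */
    static Properties parse(String text) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    public static void main(String[] args) {
        // Hypothetical file contents; lines starting with '#' are comments.
        String sample =
                "# which browser to start: chrome, firefox, phantomjs, or a remote URL\n"
                + "driver=chrome\n"
                + "phantomjs_driver_loglevel=DEBUG\n";
        Properties config = parse(sample);
        // Same lookups and defaults as in configure().
        System.out.println(config.getProperty("driver", "phantomjs"));
        System.out.println(config.getProperty("phantomjs_exec_path"));
    }
}
```

With `driver=chrome`, the `phantomjs_exec_path` key can be left out, because `configure()` only requires it when the driver is `phantomjs`. Note that in this format a backslash is an escape character, so Windows-style paths would need doubled backslashes or forward slashes.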


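The code above is adapted from https://github.com/code4craft/webmagic

That completes scraping data from dynamically loaded pages. This article used Selenium for rendering, but there are many other options, such as PhantomJS and HtmlUnit. Try them when you have time. I hope this article helps.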