A Crawler in Practice with crawler4j, jsoup, and javacsv
2017-10-23 16:21
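1. crawler4j basics
crawler4j is an open-source web crawler written in Java. Its official site is http://code.google.com/p/crawler4j/
Using crawler4j takes two main steps:
implement a crawler class that extends WebCrawler;
run that crawler class through CrawlController.
WebCrawler is the base class for crawlers. A subclass normally overrides two methods, shouldVisit and visit:
shouldVisit decides whether a given URL should be crawled (visited);
visit processes the page that the URL points to. Its parameter, Page, is an object that wraps all the data of that web page.
WebCrawler also has other methods that can be overridden, named in a style similar to Android's callback naming. For example, getMyLocalData returns the data held by the WebCrawler, and onBeforeExit is called just before the WebCrawler finishes, which makes it a suitable place to release resources.
Calling CrawlController, by comparison, is fairly formulaic. Typical calling code looks like this: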
String crawlStorageFolder = "data/crawl/root";
int numberOfCrawlers = 7;
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder(crawlStorageFolder);
/*
* Instantiate the controller for this crawl.
*/
PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
/*
* For each crawl, you need to add some seed urls. These are the first
* URLs that are fetched and then the crawler starts following links
* which are found in these pages
*/
controller.addSeed("http://www.ics.uci.edu/~welling/");
controller.addSeed("http://www.ics.uci.edu/~lopes/");
controller.addSeed("http://www.ics.uci.edu/");
/*
* Start the crawl. This is a blocking operation, meaning that your code
* will reach the line after this only when crawling is finished.
*/
controller.start(MyCrawler.class, numberOfCrawlers);
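CrawlController has multithreading built in: the second argument to start, numberOfCrawlers, is the number of crawler threads that run concurrently.
Because CrawlController instantiates the WebCrawler class through reflection (the last line of the code above), the WebCrawler implementation must have a no-argument constructor, and any constructor that takes arguments is never called. Private members of the implementation therefore have to be set through static methods; see the ImageCrawler example that ships with crawler4j.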
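The reflection constraint can be illustrated without crawler4j. The following is a minimal sketch using only the JDK; DemoCrawler, configure, and instantiate are hypothetical names made up for this illustration, and the code is not crawler4j's internal implementation. It shows why a no-argument constructor is required and how a static method can pass configuration to an instance that the caller never constructs directly.

```java
import java.lang.reflect.Constructor;

public class ReflectiveCrawlerDemo {

    /** Hypothetical stand-in for a WebCrawler subclass; not part of crawler4j. */
    public static class DemoCrawler {
        // Shared configuration, set once through a static method before the crawl starts.
        private static String outputPath = "default.csv";

        public static void configure(String path) {
            outputPath = path;
        }

        // Reflective instantiation below looks up exactly this no-argument constructor.
        public DemoCrawler() {
        }

        public String getOutputPath() {
            return outputPath;
        }
    }

    /** Creates an instance from a Class object alone, as a controller might do. */
    public static <T> T instantiate(Class<T> type) {
        try {
            Constructor<T> ctor = type.getDeclaredConstructor();
            return ctor.newInstance();
        } catch (ReflectiveOperationException e) {
            // Thrown, for example, when the class has no no-argument constructor.
            throw new IllegalStateException("Cannot instantiate " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        DemoCrawler.configure("data/crawl/data.csv");
        DemoCrawler crawler = instantiate(DemoCrawler.class);
        System.out.println(crawler.getOutputPath());
    }
}
```

Because the configuration lives in a static field, every instance created this way sees the same value, which is also why it should be set before the crawl starts.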
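For more information, see crawler4j's source code and examples.
2. jsoup basics
jsoup is an open-source HTML parser written in Java. Its official site is http://jsoup.org/
jsoup's most distinctive feature, and the reason it is a better choice than DOM4J for parsing HTML, is that it supports jQuery-style selector syntax.
For example: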
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
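The code above fetches the <a> tags inside <b> tags under the element whose id is mp-itn on the page http://en.wikipedia.org/, which matches what the same jQuery selector would return.
For more on using jsoup, see the jsoup examples (under Cookbook Content on the right side of the home page).
Note that jsoup works with three main kinds of object: Document, Elements, and Element:
Document extends Element. It holds all the data of the page and is obtained through static methods of the Jsoup class;
Elements is the collection class for Element;
Element is the entity class for a page element. It has many methods for working with the element: besides the jQuery-like select method, there are many methods modeled on JavaScript and jQuery DOM operations, such as getElementById, text, and addClass.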
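The three object types can be seen together in a small offline sketch. It assumes jsoup is on the classpath (for example the 1.7.3 dependency used later in this article); the HTML fragment, the id news, and the class name visited are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupObjectsDemo {

    // Made-up HTML fragment, parsed from a string so that no network access is needed.
    public static final String HTML =
            "<div id=\"news\"><b><a href=\"/a\">First</a></b>"
            + "<b><a href=\"/b\">Second</a></b></div>";

    /** Document -> Elements: counts the links matched by a jQuery-style selector. */
    public static int countLinks(String html) {
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("#news b a");
        return links.size();
    }

    /** Elements -> Element: returns the text of the first matched link, or "" if none. */
    public static String firstHeadline(String html) {
        Element first = Jsoup.parse(html).select("#news b a").first();
        return first == null ? "" : first.text();
    }

    /** DOM-style calls on Element: looks up by id, adds a class, then checks it. */
    public static boolean markVisited(String html) {
        Element news = Jsoup.parse(html).getElementById("news");
        if (news == null) {
            return false;
        }
        news.addClass("visited");
        return news.hasClass("visited");
    }

    public static void main(String[] args) {
        System.out.println(countLinks(HTML));
        System.out.println(firstHeadline(HTML));
        System.out.println(markVisited(HTML));
    }
}
```

Note that first() returns null when nothing matches, so checking the result before calling text() avoids a NullPointerException on pages with an unexpected layout.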
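3. javacsv basics
javacsv is an open-source Java library for reading and writing CSV files. Its official site is http://www.csvreader.com/java_csv.php
Reading and writing CSV files is simple enough to implement yourself, and there are many examples online. The reason to use javacsv is that its API is concise and easy to use.
For javacsv usage, see the official samples:
http://www.csvreader.com/java_csv_samples.php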
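To give a sense of what implementing CSV output by hand involves, here is a minimal sketch that uses only the JDK. It is not javacsv's implementation; SimpleCsv, escape, and toRecord are hypothetical names, and the quoting follows the common convention of wrapping a field in double quotes when it contains a comma, a quote, or a line break, and doubling any embedded quotes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SimpleCsv {

    /** Quotes a field only when it contains a comma, a double quote, or a line break. */
    public static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        // Embedded double quotes are doubled, then the whole field is wrapped in quotes.
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Joins the escaped fields into one CSV record (without a line terminator). */
    public static String toRecord(List<String> fields) {
        return fields.stream().map(SimpleCsv::escape).collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // prints: title,"a,b","say ""hi"""
        System.out.println(toRecord(Arrays.asList("title", "a,b", "say \"hi\"")));
    }
}
```

A library such as javacsv handles these quoting rules, plus reading, so that the crawler code does not have to.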
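Note that when the CSV data contains Chinese text, prefer the character-based FileReader (for reading) and FileWriter (for writing) over the byte-based FileInputStream and FileOutputStream, to avoid garbled characters. FileReader and FileWriter use the platform default charset, so this works only when the file is actually in that encoding.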
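One way to avoid depending on the platform default is to name the charset explicitly. The sketch below is an alternative to the FileReader/FileWriter advice, not what the article's code does: it writes a line containing Chinese text as UTF-8 with java.nio and reads it back. Utf8CsvRoundTrip and roundTrip are hypothetical names, and the sample text is written with Unicode escapes so that the source file compiles under any compiler encoding.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8CsvRoundTrip {

    /** Writes one line to a temporary file as UTF-8 and reads it back as UTF-8. */
    public static String roundTrip(String line) {
        try {
            Path file = Files.createTempFile("csv-demo", ".csv");
            try {
                // The charset is stated on both sides, so the platform default is irrelevant.
                try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                    writer.write(line);
                    writer.newLine();
                }
                try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                    return reader.readLine();
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "\u5b9d\u9a6c" is a car brand name and "\u4e07" is the unit "ten thousand".
        String line = "\u5b9d\u9a6c,12.5\u4e07";
        System.out.println(line.equals(roundTrip(line)));
    }
}
```

Since the article's code constructs CsvWriter from a Writer and a delimiter, a writer opened this way could be passed to it in place of the FileWriter.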
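4. The crawler in practice
The goal of this exercise is to crawl all used-car listings on Souche (souche.com) and write them to a CSV file. The code follows.
Maven pom.xml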
<dependency>
    <groupId>edu.uci.ics</groupId>
    <artifactId>crawler4j</artifactId>
    <version>3.5</version>
    <type>jar</type>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.3</version>
</dependency>
<dependency>
    <groupId>net.sourceforge.javacsv</groupId>
    <artifactId>javacsv</artifactId>
    <version>2.0</version>
</dependency>
MyCrawler.java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.csvreader.CsvWriter;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    private final static Pattern FILTERS = Pattern
            .compile(".*(\\.(css|js|bmp|gif|jpe?g|ico"
                    + "|png|tiff?|mid|mp2|mp3|mp4"
                    + "|wav|avi|mov|mpeg|ram|m4v|pdf"
                    + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");
    private final static String URL_PREFIX = "http://www.souche.com/pages/onsale/sale_car_list.html?";
    private final static Pattern URL_PARAMS_PATTERN = Pattern
            .compile("carbrand=brand-\\d+(&index=\\d+)?");
    private final static String CSV_PATH = "data/crawl/data.csv";

    private CsvWriter cw;
    private File csv;

    public MyCrawler() throws IOException {
        csv = new File(CSV_PATH);
        if (csv.isFile()) {
            csv.delete();
        }
        cw = new CsvWriter(new FileWriter(csv, true), ',');
        cw.write("title");
        cw.write("brand");
        cw.write("newPrice");
        cw.write("oldPrice");
        cw.write("mileage");
        cw.write("age");
        cw.write("stage");
        cw.endRecord();
        cw.close();
    }

    /**
     * You should implement this function to specify whether the given url
     * should be crawled or not (based on your crawling logic).
     */
    @Override
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        if (FILTERS.matcher(href).matches() || !href.startsWith(URL_PREFIX)) {
            return false;
        }
        String[] strs = href.split("\\?");
        if (strs.length < 2) {
            return false;
        }
        if (!URL_PARAMS_PATTERN.matcher(strs[1]).matches()) {
            return false;
        }
        return true;
    }

    /**
     * This function is called when a page is fetched and ready to be processed
     * by your program.
     */
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String html = htmlParseData.getHtml();
            Document doc = Jsoup.parse(html);
            String brand = doc.select("div.choose_item").first().text();
            Elements contents = doc.select("div.list_content");
            if (contents.size() == 20 && !url.contains("index=")) {
                return;
            } else {
                System.out.println("URL: " + url);
            }
            for (Element c : contents) {
                Element info = c.select(".list_content_carInfo").first();
                String title = info.select("h1").first().text();
                Elements prices = info.select(".list_content_price div");
                String newPrice = prices.get(0).text();
                String oldPrice = prices.get(1).text();
                Elements others = info.select(".list_content_other div");
                String mileage = others.get(0).select("ins").first().text();
                String age = others.get(1).select("ins").first().text();
                String stage = "unknown";
                if (c.select("i.car_tag_zhijian").size() != 0) {
                    stage = c.select("i.car_tag_zhijian").text();
                } else if (c.select("i.car_tag_yushou").size() != 0) {
                    stage = "presell";
                }
                try {
                    cw = new CsvWriter(new FileWriter(csv, true), ',');
                    cw.write(title);
                    cw.write(brand);
                    cw.write(newPrice.replaceAll("[¥万]", ""));
                    cw.write(oldPrice.replaceAll("[¥万]", ""));
                    cw.write(mileage);
                    cw.write(age);
                    cw.write(stage);
                    cw.endRecord();
                    cw.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
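The URL filtering in shouldVisit can be tried out in isolation. The sketch below copies the three constants from MyCrawler and applies the same checks to a plain String instead of a WebURL, so it runs without crawler4j. UrlFilterDemo is a hypothetical class name, and the brand and index values in the sample URLs are made up.

```java
import java.util.regex.Pattern;

public class UrlFilterDemo {

    // Same patterns and prefix as in MyCrawler.
    private static final Pattern FILTERS = Pattern
            .compile(".*(\\.(css|js|bmp|gif|jpe?g|ico"
                    + "|png|tiff?|mid|mp2|mp3|mp4"
                    + "|wav|avi|mov|mpeg|ram|m4v|pdf"
                    + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");
    private static final String URL_PREFIX = "http://www.souche.com/pages/onsale/sale_car_list.html?";
    private static final Pattern URL_PARAMS_PATTERN = Pattern
            .compile("carbrand=brand-\\d+(&index=\\d+)?");

    /** Mirrors MyCrawler.shouldVisit, taking the URL as a plain string. */
    public static boolean shouldVisit(String url) {
        String href = url.toLowerCase();
        // Reject static resources and anything outside the listing page.
        if (FILTERS.matcher(href).matches() || !href.startsWith(URL_PREFIX)) {
            return false;
        }
        String[] parts = href.split("\\?");
        if (parts.length < 2) {
            return false;
        }
        // Accept only "carbrand=brand-N", optionally followed by "&index=M".
        return URL_PARAMS_PATTERN.matcher(parts[1]).matches();
    }

    public static void main(String[] args) {
        String base = "http://www.souche.com/pages/onsale/sale_car_list.html";
        System.out.println(shouldVisit(base + "?carbrand=brand-15"));             // true
        System.out.println(shouldVisit(base + "?carbrand=brand-15&index=2"));     // true
        System.out.println(shouldVisit(base + "?carbrand=brand-15&index=2&x=1")); // false
        System.out.println(shouldVisit("http://www.souche.com/logo.png"));        // false
    }
}
```

Because the pattern is applied with matches(), the whole query string has to fit it, which is why a URL with any extra parameter is rejected.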
Controller.java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {

    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "data/crawl/root";
        int numberOfCrawlers = 7;
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);
        /*
         * Instantiate the controller for this crawl.
         */
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        /*
         * For each crawl, you need to add some seed urls. These are the first
         * URLs that are fetched and then the crawler starts following links
         * which are found in these pages
         */
        // MyCrawler only follows URLs that start with its URL_PREFIX, so the seeds
        // should be listing pages of the target site. The URLs below are the ones
        // from the crawler4j sample.
        controller.addSeed("http://www.ics.uci.edu/~welling/");
        controller.addSeed("http://www.ics.uci.edu/~lopes/");
        controller.addSeed("http://www.ics.uci.edu/");
        /*
         * Start the crawl. This is a blocking operation, meaning that your code
         * will reach the line after this only when crawling is finished.
         */
        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}