A demo of the open-source web crawler WebCollector
2016-02-03 16:15
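1. Environment: JDK 7 + Eclipse Mars
2. WebCollector is open source; the repository is at https://github.com/CrawlScript/WebCollector
Download webcollector-2.26-bin.zip, unzip it, and add all the jar files from the extracted folder to the project.
3. Demo source code: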
/**
 * Demo of crawling the web with WebCollector
 * @author fjs
 */
package com;

import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;
import org.jsoup.nodes.Document;

public class demo extends BreadthCrawler {

    /**
     * @param crawlPath path of the directory that maintains the
     *                  information of this crawler
     * @param autoParse if true, BreadthCrawler automatically extracts
     *                  links that match the regex rules from each page
     */
    public demo(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        /* start page */
        this.addSeed("http://guangzhou.qfang.com");
        /* fetch URLs that match this regex rule (here: every URL) */
        this.addRegex(".*");
        /* do not fetch jpg|png|gif */
        this.addRegex("-.*\\.(jpg|png|gif).*");
        /* do not fetch URLs that contain # */
        this.addRegex("-.*#.*");
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.getUrl();
        Document doc = page.getDoc();
        System.out.println(url);
        System.out.println(doc.title());
        /* If you want to add URLs to crawl, add them to next. */
        /* WebCollector automatically filters links that have been fetched before. */
        /* If autoParse is true and a link you add to next does not match
           the regex rules, the link will also be filtered. */
        // next.add("http://gz.house.163.com/");
    }

    public static void main(String[] args) throws Exception {
        demo crawler = new demo("path", true);
        crawler.setThreads(50);
        crawler.setTopN(100);
        // crawler.setResumable(true);
        /* start crawl with depth 3 */
        crawler.start(3);
    }
}
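The three addRegex rules in the demo combine one inclusion rule with two exclusion rules marked by a leading "-". The sketch below illustrates how such rules can be evaluated using only java.util.regex. It is an assumption about the semantics (a URL is rejected if any "-" rule matches, otherwise accepted if any plain rule matches), not WebCollector's actual implementation; the class UrlRuleSketch and its methods are hypothetical names invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper, NOT part of WebCollector: models "-"-prefixed URL rules. */
class UrlRuleSketch {
    private final List<Pattern> positive = new ArrayList<>();
    private final List<Pattern> negative = new ArrayList<>();

    /** A leading '-' marks an exclusion rule; anything else is an inclusion rule. */
    void addRegex(String rule) {
        if (rule.startsWith("-")) {
            negative.add(Pattern.compile(rule.substring(1)));
        } else {
            positive.add(Pattern.compile(rule));
        }
    }

    /** Assumed semantics: exclusion rules win; otherwise one inclusion rule must match. */
    boolean accepts(String url) {
        for (Pattern p : negative) {
            if (p.matcher(url).matches()) {
                return false;
            }
        }
        for (Pattern p : positive) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlRuleSketch rules = new UrlRuleSketch();
        rules.addRegex(".*");                     // follow everything ...
        rules.addRegex("-.*\\.(jpg|png|gif).*");  // ... except images
        rules.addRegex("-.*#.*");                 // ... and URLs containing '#'

        System.out.println(rules.accepts("http://guangzhou.qfang.com/sale"));      // true
        System.out.println(rules.accepts("http://guangzhou.qfang.com/logo.png"));  // false
        System.out.println(rules.accepts("http://guangzhou.qfang.com/sale#map"));  // false
    }
}
```

Replacing ".*" with a narrower pattern, for example one anchored to the seed host, would keep a crawl on a single site under the same assumed semantics.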
4. In real applications, parse the page to extract the web page content.
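As a rough sketch of what that parsing step might look like, the example below uses jsoup, the library behind the Document returned by page.getDoc() in the demo. The HTML snippet, the example.com base URL, the h1.headline selector, and the class PageParseSketch are all hypothetical and chosen only for illustration; inside visit() you would pass page.getDoc() instead of parsing a string. It assumes the jsoup jar is on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

/** Hypothetical helper for illustration; not part of WebCollector. */
class PageParseSketch {

    /** Returns "link text -> absolute URL" for every <a href> in the document. */
    static List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {
            // "abs:href" resolves relative links against the document's base URI
            links.add(a.text() + " -> " + a.attr("abs:href"));
        }
        return links;
    }

    /** Convenience overload that parses an HTML string first. */
    static List<String> extractLinks(String html, String baseUri) {
        return extractLinks(Jsoup.parse(html, baseUri));
    }

    public static void main(String[] args) {
        // Made-up page content standing in for a fetched page
        String html = "<html><head><title>Listings</title></head><body>"
                + "<h1 class=\"headline\">Homes for sale</h1>"
                + "<a href=\"/sale/1\">Flat A</a><a href=\"/sale/2\">Flat B</a>"
                + "</body></html>";
        Document doc = Jsoup.parse(html, "http://example.com/");

        System.out.println(doc.title());                       // page title
        System.out.println(doc.select("h1.headline").text());  // text of one element
        for (String link : extractLinks(doc)) {
            System.out.println(link);
        }
    }
}
```

CSS selectors keep the extraction logic short, but they are tied to the markup of the target site, so they need to be adapted to whatever page structure the crawler actually encounters.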