Crawler for 5 Major Portal Sites and E-commerce Data, Day 1: Setting Up the Base Environment
2016-02-25 11:20
I've recently been planning to build a crawler that scrapes the five major portal sites (Sohu, Sina, NetEase, Tencent, ifeng) and e-commerce data (Tmall, JD, Jumei, etc.). Today, day one, is for setting up the environment and running a quick test.
The stack is Maven + XPath + HttpClient + regular expressions.
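To illustrate the regular-expression part of this stack, here is a minimal sketch of pulling `href` values out of raw HTML with `java.util.regex`. The class name `RegexLinkDemo`, the pattern, and the sample HTML string are all illustrative assumptions, not part of the project; a naive regex like this breaks on messy real-world markup, which is exactly why HtmlCleaner + XPath is also in the stack.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexLinkDemo {

    // Naive href extractor (assumption for illustration): matches href="..." or href='...'.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']");

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // group 1 is the URL between the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://money.163.com/a.html\">A</a> <a href='/b.html'>B</a>";
        System.out.println(extractLinks(html)); // prints [http://money.163.com/a.html, /b.html]
    }
}
```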
Maven pom.xml configuration:
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>1.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.10</artifactId>
    <version>1.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>1.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.10</artifactId>
    <version>1.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-graphx_2.10</artifactId>
    <version>1.6.0</version>
</dependency>
<!-- httpclient 4.4 -->
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.4</version>
</dependency>
<!-- htmlcleaner -->
<dependency>
    <groupId>net.sourceforge.htmlcleaner</groupId>
    <artifactId>htmlcleaner</artifactId>
    <version>2.10</version>
</dependency>
<!-- json -->
<dependency>
    <groupId>org.json</groupId>
    <artifactId>json</artifactId>
    <version>20140107</version>
</dependency>
<!-- hbase -->
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>0.96.1.1-hadoop2</version>
</dependency>
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-server</artifactId>
    <version>0.96.1.1-hadoop2</version>
</dependency>
<!-- redis 2.7.0 -->
<dependency>
    <groupId>redis.clients</groupId>
    <artifactId>jedis</artifactId>
    <version>2.7.0</version>
</dependency>
<!-- slf4j -->
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>
    <version>1.7.10</version>
</dependency>
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-log4j12</artifactId>
    <version>1.7.10</version>
</dependency>
<!-- quartz 1.8.4 -->
<dependency>
    <groupId>org.quartz-scheduler</groupId>
    <artifactId>quartz</artifactId>
    <version>1.8.4</version>
</dependency>
<!-- curator -->
<dependency>
    <groupId>org.apache.curator</groupId>
    <artifactId>curator-framework</artifactId>
    <version>2.7.1</version>
</dependency>
Create a test class: SpiderTest
import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.util.EntityUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SpiderTest {

    private static final Logger logger = LoggerFactory.getLogger(SpiderTest.class);

    /**
     * URL entry point: downloads the page at the given URL.
     *
     * @param url the page URL to fetch
     * @return the page body as a string, or null if the download failed
     */
    public static String downLoadCrawlurl(String url) {
        String context = null;
        CloseableHttpClient client = HttpClientBuilder.create().build();
        HttpGet httpGet = new HttpGet(url);
        // try-with-resources ensures the response is closed after reading
        try (CloseableHttpResponse response = client.execute(httpGet)) {
            HttpEntity entity = response.getEntity();
            context = EntityUtils.toString(entity);
            System.out.println("context:" + context);
        } catch (ClientProtocolException e) {
            logger.error("protocol error downloading " + url, e);
        } catch (IOException e) {
            logger.error("I/O error downloading " + url, e);
        }
        return context;
    }

    public static void main(String[] args) {
        String url = "http://money.163.com/";
        downLoadCrawlurl(url);
    }
}
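Once a page has been downloaded, the XPath part of the stack comes into play. Since HtmlCleaner is already in the pom, here is a minimal sketch of extracting links from a page fragment with `TagNode.evaluateXPath`. The class name `XPathDemo`, the sample HTML string, and the `class='news'` selector are illustrative assumptions, not part of the real portal pages.

```java
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;

public class XPathDemo {
    public static void main(String[] args) throws XPatherException {
        // A small in-memory fragment standing in for a downloaded page (hypothetical data).
        String html = "<html><body>"
                + "<div class='news'>"
                + "<a href='http://money.163.com/a.html'>A</a>"
                + "<a href='http://money.163.com/b.html'>B</a>"
                + "</div></body></html>";

        // Clean the (possibly malformed) HTML into a DOM-like tree.
        HtmlCleaner cleaner = new HtmlCleaner();
        TagNode root = cleaner.clean(html);

        // evaluateXPath returns an Object[]; for an element path the items are TagNodes.
        Object[] links = root.evaluateXPath("//div[@class='news']//a");
        for (Object o : links) {
            TagNode a = (TagNode) o;
            System.out.println(a.getAttributeByName("href") + " -> " + a.getText());
        }
    }
}
```

In the real crawler, the `html` string would be the value returned by `downLoadCrawlurl`, and the XPath expression would be tailored to each portal's markup.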