An example of synchronizing crawl results in a multithreaded crawler

2017-01-24 14:13

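For the past few days I have been writing crawlers in Java. The earlier jobs were simple, mostly single-page crawls, but this one has to cover more than 100 pages, so it clearly needs several threads. The pages contain overlapping information, and our team lead required us to deduplicate it, because the number of database modifications is limited. It took me a few days to get the program working.

The idea first: use ConcurrentHashMap, one of Java's built-in thread-safe collections, to do the deduplication. In a multithreaded program, though, thread synchronization still needs attention. A thread-safe collection only guarantees that each individual operation, such as a single put, is safe. It does not make a sequence such as "check whether the key exists, then insert it" safe as a whole. Code that has to run as one unit needs a lock of its own.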
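To make the check-then-insert point concrete, here is a minimal, self-contained sketch. It is not the crawler itself: the class name `CheckThenActDemo`, the `lock` object, and the `shop-N` IDs are made up for illustration. Several threads try to register the same IDs, and the lock around the check and the insert ensures that each ID is accepted exactly once.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class CheckThenActDemo {
    // Shared "already seen" map, keyed by a hypothetical shop ID.
    static final ConcurrentHashMap<String, Integer> seen = new ConcurrentHashMap<>();
    // One lock shared by every thread that touches `seen`.
    static final Object lock = new Object();

    /** Returns true only for the first caller that registers this ID. */
    static boolean registerIfNew(String shopId) {
        // get() and put() are each thread-safe, but the pair is not atomic
        // unless both calls happen under the same lock.
        synchronized (lock) {
            if (seen.get(shopId) == null) {
                seen.put(shopId, 1);
                return true;
            }
            return false;
        }
    }

    /** Starts `threads` workers that all offer the same IDs; returns how many were accepted. */
    static int run(int threads, int uniqueIds) {
        seen.clear();
        AtomicInteger accepted = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < uniqueIds; i++) {
                    if (registerIfNew("shop-" + i)) {
                        accepted.incrementAndGet();
                    }
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return accepted.get();
    }

    public static void main(String[] args) {
        // 8 threads each offer the same 1000 IDs; only 1000 registrations succeed.
        System.out.println("accepted = " + run(8, 1000));
    }
}
```

Without the `synchronized` block, two threads could both see `null` from `get` for the same ID and both count it as new, even though the map itself would still end up with a single entry.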

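// Fragment of a page-processor class; the enclosing class declaration is not shown.
// The imports assume the WebMagic crawler framework, whose Page/Selectable API this code matches.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.selector.Selectable;

// Shared by all crawler threads: key is the shop uid, value is a marker (always 1).
public static ConcurrentHashMap<String, Integer> hashMap = new ConcurrentHashMap<String, Integer>();

// Counts the unique shops accepted so far.
AtomicInteger integer = new AtomicInteger();

public void process(Page page) {
    // Compare string contents with equals(); == only compares object references.
    if ("https://www.toryburch.com/stores-viewal".equals(page.getUrl().toString())) {
        // `url` is the collection of store-page URLs, defined elsewhere in the class (not shown).
        for (String a : url) {
            page.addTargetRequest(a);
        }
    } else {
        String shopId = "";
        String branchName = "";
        StringBuffer address = new StringBuffer();
        String crawlUrl = "";
        String region = "";
        String city = "";
        String country = "the United States";
        String shopName = "LanCome";
        String sourceName = "LanCome";
        String phone = "";
        String openTime = "";
        double lat = 0.0;
        double lng = 0.0;

        List<Selectable> infoList = page.getHtml().xpath("poi").nodes();
        List<BrandPoiDto> brandPoiDtos = new ArrayList<BrandPoiDto>();
        // Guard against pages where no shop information was extracted.
        if (infoList.size() >= 1) {
            for (Selectable b : infoList) {
                phone = b.xpath("//phone/text()").toString();
                shopId = b.xpath("//uid/text()").toString();
                city = b.xpath("//city/text()").toString();
                branchName = b.xpath("//name/text()").toString();
                lat = Double.parseDouble(b.xpath("//latitude/text()").toString());
                lng = Double.parseDouble(b.xpath("//longitude/text()").toString());

                // Build the address from address1 and address2.
                address.append(b.xpath("//address1/text()").toString()).append(b.xpath("//address2/text()").toString());

                BrandPoiDto dto = new BrandPoiDto();
                dto.setBranchName(branchName);
                dto.setAddress(address.toString());
                dto.setCrawlUrl(page.getUrl().toString());
                dto.setCity(city);
                dto.setCountry(country);
                dto.setPhone(phone);
                dto.setLat(lat);
                dto.setLng(lng);
                dto.setShopName(shopName);
                dto.setSourceName(sourceName);
                dto.setShopId(shopId);

                // The check and the put must run as one unit. Locking on the shared static map
                // (rather than on `this`) keeps the guard valid even if several processor instances exist.
                synchronized (hashMap) {
                    if (hashMap.get(shopId) == null) {
                        hashMap.put(shopId, 1);
                        brandPoiDtos.add(dto);
                        System.out.println(dto);
                        integer.incrementAndGet();
                    }
                }

                // Reset the per-shop fields before the next iteration.
                address.delete(0, address.length());
                phone = "";
                lat = 0.0;
                lng = 0.0;
                shopId = "";
                branchName = "";
                city = "";
            }
        }
        // The database insert would follow here.
        System.out.println("Total records crawled: " + integer);
    }
}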

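In the code above, a synchronized block first checks whether hashMap already contains the record (the key is the shop uid, which is unique). If it does not, the record is stored with a marker value of 1. If it does, the record is neither stored nor printed for verification. Combining the synchronized block with the hash map solved the problem of duplicate data when several threads crawl at the same time.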
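As a possible alternative to the explicit lock, `ConcurrentHashMap.putIfAbsent` performs the check and the insert as a single atomic operation and returns `null` only to the caller that actually inserted the key. The sketch below is an illustration under assumptions, not the code from this post: the class `PutIfAbsentDedup`, the method names, and the `uid-N` keys are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class PutIfAbsentDedup {
    // Shared map: key is a hypothetical shop uid, value is a marker.
    static final ConcurrentHashMap<String, Integer> seen = new ConcurrentHashMap<>();

    /** True only for the one thread whose call actually inserted the uid. */
    static boolean firstSighting(String shopId) {
        // putIfAbsent returns the previous value, or null if there was none.
        return seen.putIfAbsent(shopId, 1) == null;
    }

    /** Lets `threads` workers offer the same uids; returns the number accepted. */
    static int crawlWith(int threads, int uniqueIds) {
        seen.clear();
        AtomicInteger accepted = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < uniqueIds; i++) {
                    if (firstSighting("uid-" + i)) {
                        accepted.incrementAndGet();
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return accepted.get();
    }

    public static void main(String[] args) {
        System.out.println("accepted = " + crawlWith(8, 500));
    }
}
```

This covers only the "is this uid new?" decision. If other shared state has to change together with the map, a synchronized block like the one in this post is still the simpler choice.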
