
A Simplified Multithreaded Crawler in Java

2015-12-13 11:46
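I previously wrote a single-threaded crawler and spent some time thinking about how to turn it into a multithreaded one. I now have an approach in mind, and the simplified example below is meant to test whether the idea holds up.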

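The problem is as follows. There is a task queue holding integers, and a thread pool. Each task run by a pool thread takes one integer from the task queue, puts it into a list of integers already taken, computes 2, 3 and 4 times that integer, and adds the three results to the task queue. The pool keeps dispatching such tasks until 20 integers have been taken from the queue.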
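Before adding threads, it can help to see the task as a plain loop. The following is a minimal single-threaded sketch of the problem statement above; the class name `SequentialBaseline` and its `run` method are hypothetical and are not part of the original code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical single-threaded baseline for the task described above.
final class SequentialBaseline {

    // Takes integers from the queue until `limit` of them have been taken,
    // enqueueing 2x, 3x and 4x of each integer along the way.
    static List<Integer> run(int seed, int limit) {
        Queue<Integer> queue = new ArrayDeque<>();
        List<Integer> taken = new ArrayList<>();
        queue.add(seed);
        while (taken.size() < limit && !queue.isEmpty()) {
            int n = queue.poll();
            taken.add(n);
            queue.add(2 * n);
            queue.add(3 * n);
            queue.add(4 * n);
        }
        return taken;
    }

    public static void main(String[] args) {
        System.out.println(run(2, 20));
    }
}
```

With a seed of 2, the first integers taken are 2, 4, 6, 8, 8. In the multithreaded version the order generally differs from run to run, because each task sleeps for a random time before it enqueues its results.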

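The shared resources in this example are one queue and one list. Every operation that reads or writes them must hold a lock, so all shared resources are encapsulated in the SharedResources class.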

package threadtest;

import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Wraps the two shared structures. Every method that touches them is
// synchronized on this object, including the read-only ones.
public class SharedResources {
    private final Queue<Integer> queue; // task queue
    private final List<Integer> list;   // integers already taken

    public SharedResources(Queue<Integer> queue) {
        this.queue = queue;
        this.list = new ArrayList<Integer>();
    }

    public synchronized int getListSize() {
        return this.list.size();
    }

    public synchronized int getQueueSize() {
        return this.queue.size();
    }

    // Blocks until the queue is non-empty, then removes and returns its head.
    public synchronized int getNextNumber() {
        while (this.queue.isEmpty()) {
            try {
                wait();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        int nextNumber = this.queue.poll();
        notifyAll();
        return nextNumber;
    }

    // Adds a task and wakes up any thread waiting in getNextNumber().
    public synchronized void addNumberToQueue(int number) {
        queue.add(number);
        notifyAll();
    }

    public synchronized void addNumberToList(int number) {
        list.add(number);
        notifyAll();
    }

    public synchronized boolean isQueueEmpty() {
        return this.queue.isEmpty();
    }

    // Returns the queue contents as "a,b,c," or a warning if it is empty.
    public synchronized String queueToString() {
        if (queue.isEmpty()) {
            return "Warning! the Queue is empty";
        }
        StringBuilder result = new StringBuilder();
        for (Integer ele : queue) {
            result.append(ele).append(",");
        }
        return result.toString();
    }

    // Returns the list contents as "a,b,c," or a warning if it is empty.
    public synchronized String listToString() {
        if (list.isEmpty()) {
            return "Warning! the List is empty";
        }
        StringBuilder result = new StringBuilder();
        for (Integer ele : list) {
            result.append(ele).append(",");
        }
        return result.toString();
    }
}
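The locking rule described above can also be delegated to the thread-safe collections in java.util.concurrent. The following is only a sketch of that alternative, not the approach this post uses: the class `ConcurrentResources` and its method names are hypothetical, and it swaps the hand-written wait()/notifyAll() for a LinkedBlockingQueue, whose take() blocks while the queue is empty.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical alternative to SharedResources built on java.util.concurrent.
final class ConcurrentResources {
    private final BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
    private final List<Integer> visited = new CopyOnWriteArrayList<>();

    ConcurrentResources(int seed) {
        queue.add(seed);
    }

    // Blocks until an element is available; interruption is passed to the caller.
    int takeNext() throws InterruptedException {
        return queue.take();
    }

    void addToQueue(int number) {
        queue.add(number);
    }

    void addToVisited(int number) {
        visited.add(number);
    }

    int queueSize() {
        return queue.size();
    }

    int visitedSize() {
        return visited.size();
    }

    public static void main(String[] args) throws InterruptedException {
        ConcurrentResources r = new ConcurrentResources(2);
        int n = r.takeNext();
        r.addToVisited(n);
        r.addToQueue(2 * n);
        r.addToQueue(3 * n);
        r.addToQueue(4 * n);
        System.out.println(r.queueSize() + " queued, " + r.visitedSize() + " visited");
    }
}
```

One trade-off of this sketch: each collection is thread-safe on its own, but a step that touches both (take from the queue, then add to the list) is no longer a single atomic action. That is acceptable here because the original code does not make that pair atomic either.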
The Crawler class takes an integer from the task queue and puts it into the list.

package threadtest;

import java.util.Date;
import java.util.Random;

// One task: take an integer, record it, then enqueue its 2x, 3x and 4x values.
public class Crawler implements Runnable {
    private SharedResources sharedResources;
    private String id;

    public Crawler(SharedResources sharedResources, int id) {
        this.sharedResources = sharedResources;
        this.id = "crawler-" + String.valueOf(id);
    }

    public void run() {
        System.out.printf("Thread: %s, Crawler: %s, Status: started, Time: %s\n", Thread.currentThread().getName(), id, new Date());
        System.out.println("|");

        // Blocks here while the task queue is empty.
        int nextNumber = sharedResources.getNextNumber();
        System.out.printf("%s has removed %d from the Queue\n", Thread.currentThread().getName(), nextNumber);
        System.out.printf("After removing, Elements in Queue: %s\n", sharedResources.queueToString());
        System.out.println("|");

        sharedResources.addNumberToList(nextNumber);
        System.out.printf("%s has added %d to the List\n", Thread.currentThread().getName(), nextNumber);
        System.out.printf("After adding, Elements in List: %s\n", sharedResources.listToString());
        System.out.println("|");

        try { // sleep for a while to simulate the thread doing some work
            Thread.sleep(new Random().nextInt(10000));
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        int number1 = 2 * nextNumber;
        int number2 = 3 * nextNumber;
        int number3 = 4 * nextNumber;

        try { // sleep for a while to simulate the thread doing some work
            Thread.sleep(new Random().nextInt(10000));
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        sharedResources.addNumberToQueue(number1);
        sharedResources.addNumberToQueue(number2);
        sharedResources.addNumberToQueue(number3);
        System.out.printf("%s has added %d, %d, %d to the Queue\n", Thread.currentThread().getName(), number1, number2, number3);
        System.out.printf("After adding, Elements in Queue: %s\n", sharedResources.queueToString());
        System.out.printf("Thread: %s, Task: %s, Status: ended, Time: %s\n", Thread.currentThread().getName(), id, new Date());
        System.out.println("|");
    }
}
The Schedule class is responsible for scheduling threads to run the crawler above.

package threadtest;

import java.util.LinkedList;
import java.util.Queue;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

public class Schedule {
    private static int maxNumberThreads = 4;

    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor executor = (ThreadPoolExecutor) Executors.newFixedThreadPool(maxNumberThreads);
        Queue<Integer> queue = new LinkedList<Integer>();
        queue.add(2);
        SharedResources sharedResources = new SharedResources(queue);
        System.out.printf("Before executing tasks, Elements in Queue: %s\n", sharedResources.queueToString());
        System.out.printf("Before executing tasks, Elements in List: %s\n", sharedResources.listToString());
        int id = 0;
        System.out.printf("Before executing tasks, number of completed tasks: %d\n", executor.getCompletedTaskCount());
        System.out.println("-------------------------------------");

        while (sharedResources.getQueueSize() != 0) {
            id = id + 1; // running total of submitted tasks
            executor.execute(new Crawler(sharedResources, id));
            if (id == 20) break; // once 20 tasks have been submitted, stop dispatching and wait for them to finish
            // Pause while the queue is empty but tasks are still running,
            // or while all pool threads are busy.
            do {
                Thread.sleep(10);
            } while ((sharedResources.getQueueSize() == 0 && executor.getActiveCount() != 0) ||
                    (executor.getActiveCount() == maxNumberThreads));
            System.out.println("-------------------------------------");
        }
        // The main thread sleeps until all tasks have completed.
        do {
            Thread.sleep(1000);
        } while (executor.getActiveCount() > 0);

        System.out.printf("Finally, Elements in Queue: %s\n", sharedResources.queueToString());
        System.out.printf("Finally, Elements in List: %s\n", sharedResources.listToString());
        executor.shutdown();
    }
}
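The last loop in Schedule polls getActiveCount(), which the ThreadPoolExecutor documentation describes as an approximate number. A possible alternative for the "wait until every task is done" step is to call shutdown() and then awaitTermination(). The sketch below shows only that step with trivial placeholder tasks; the class `AwaitCompletion`, the method `runAll` and the one-minute timeout are assumptions made for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: waiting for a pool to drain without polling.
final class AwaitCompletion {

    // Submits `tasks` placeholder jobs to a pool of `threads` workers
    // and returns how many of them ran to completion.
    static int runAll(int threads, int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.execute(done::incrementAndGet);
        }
        pool.shutdown(); // no new tasks accepted; already submitted ones still run
        try {
            // Timeout chosen arbitrarily for this sketch.
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println(runAll(4, 20));
    }
}
```

Because shutdown() rejects later submissions, this only fits the point in Schedule where no more crawlers will be dispatched, that is, after the dispatch loop has exited.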
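The key to the multithreaded crawler is the inner do { } while () loop in Schedule.main. Two situations force the main thread to pause after the pool dispatches one task and before it dispatches the next.

Case 1: at the very start, the first crawler takes the only integer from the task queue, so the queue is empty and the outer loop would exit early. The fix: while the queue is empty and tasks are still running, the main thread sleeps until a task finishes and the queue becomes non-empty, at which point the pool can dispatch a new task.

Case 2: when the number of pool threads running tasks at the same moment reaches the maximum of 4, the pool cannot start a new task right away. The fix: the main thread sleeps until one of the running tasks completes.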
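The two cases can be written as one small predicate, which makes the loop condition easier to read and to check in isolation. This is a sketch only: the class `DispatchGate` and the method `shouldWait` are hypothetical names, and the expression mirrors the condition of the inner do { } while () loop.

```java
// Hypothetical helper that names the two waiting conditions.
final class DispatchGate {

    // Returns true when the main thread should keep sleeping before
    // dispatching the next task.
    static boolean shouldWait(int queueSize, int activeCount, int maxThreads) {
        // Case 1: nothing to take yet, but running tasks may still add work.
        boolean queueEmptyButWorkPending = queueSize == 0 && activeCount != 0;
        // Case 2: every pool thread is already busy.
        boolean poolSaturated = activeCount == maxThreads;
        return queueEmptyButWorkPending || poolSaturated;
    }

    public static void main(String[] args) {
        System.out.println(shouldWait(0, 1, 4)); // case 1
        System.out.println(shouldWait(3, 4, 4)); // case 2
        System.out.println(shouldWait(3, 2, 4)); // free thread and work available
        System.out.println(shouldWait(0, 0, 4)); // queue empty and nothing running
    }
}
```

The last call covers the situation where the queue is empty and no task is running: the predicate returns false, the inner loop ends, and the outer loop then exits because the queue size is 0.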

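I ran the code above on my own machine without any problems. With a little more time it should be possible to turn it into a multithreaded crawler that performs real crawling tasks.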