
Building an Obedient Movie Torrent Digger

2016-06-07 10:36
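Every time I get back to the dorm and feel like watching a movie, I realize I haven't hunted for torrents on the BT sites in a long while. Checking a site every day for films that match my taste is tedious and time-consuming, so I spent a weekend writing a configurable crawler. It downloads torrent files automatically according to each person's preferences, and it skips films that show up in more than one category.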

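Later I plan to add a download queue that starts BT downloads when nobody in the dorm is using the network and pauses them as soon as someone connects to the wifi.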

Project: https://github.com/hwding/btDigger . Many hands make light work, so contributions are welcome...

(The crawler depends on the torrent site 'BT天堂', i.e. BT Tiantang.)

Here is what it looks like in action, starting with the method that collects films from each category page:

// Needs: java.io.IOException, java.net.MalformedURLException, java.net.URL,
// org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element, org.jsoup.select.Elements
void parseCategoryPage() {
    System.out.println("[o] You are banning films from " + configLoader.getRegions_banned().toString());
    System.out.println("[o] You want to dig into each category with the depth of: " + configLoader.getDepth());
    if (configLoader.getDepth() < 1) {
        System.out.println("[x] Depth can not be smaller than 1");
        System.exit(1); // non-zero status: this is a configuration error
    } else if (configLoader.getDepth() > 5) {
        System.out.println("[i] Depth may be too large");
    }
    System.out.print("[o] Collecting films into each category...");
    int counterDuplicated = 0;
    int counterBanned = 0;
    for (String each : targetCategoriesSubURLs) {
        // Visit pages 1..depth of every category
        for (int i = 1; i <= configLoader.getDepth(); i++) {
            try {
                URL url = new URL(HOST + each + i);
                Document document = Jsoup.parse(url, 5000);
                Elements filmTitles = document.select("div[class=\"title\"]");
                for (Element eachFilmTitle : filmTitles) {
                    String title = eachFilmTitle.select("font").text();
                    if ("".equals(title)) {
                        continue; // entry without a title, nothing to collect
                    }
                    boolean isBanned = false;
                    boolean isDuplicated = false;
                    String description = eachFilmTitle.select("p[class=\"des\"]").text();
                    for (String eachBannedLocation : configLoader.getRegions_banned()) {
                        if (description.contains(eachBannedLocation)) {
                            counterBanned++;
                            isBanned = true;
                            break; // count each film once, even if several regions match
                        }
                    }
                    for (String eachValidFilmTitle : validFilmTitles) {
                        if (title.contains(eachValidFilmTitle)) {
                            counterDuplicated++;
                            isDuplicated = true;
                            break; // one match is enough to call it a duplicate
                        }
                    }
                    if (!isBanned && !isDuplicated) {
                        validFilmTitles.add(title);
                        validFilmSubURLs.add(eachFilmTitle.select("a").first().attr("href"));
                    }
                }
            } catch (MalformedURLException e) {
                System.out.println("\n[x] Internal error: MalformedURL");
            } catch (IOException e) {
                System.out.println("\n[x] An error occurred when trying to read the page");
            }
        }
    }
    System.out.println("OK");
    System.out.println("[o] \t" + counterBanned + " films banned");
    System.out.println("[o] \t" + counterDuplicated + " films dropped due to duplication");
    parseFilmPage();
}
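The ban and duplicate checks can also be looked at in isolation, without any networking. The following is only a minimal sketch: the class FilmFilter and its method names are hypothetical and not part of btDigger, and the region strings are placeholders. It mirrors the substring checks used in parseCategoryPage().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not in btDigger): isolates the ban and duplicate checks.
final class FilmFilter {
    private final List<String> bannedRegions;
    private final List<String> acceptedTitles = new ArrayList<>();

    FilmFilter(List<String> bannedRegions) {
        this.bannedRegions = bannedRegions;
    }

    /** Returns true and remembers the title if the film passes both checks. */
    boolean accept(String title, String description) {
        for (String region : bannedRegions) {
            if (description.contains(region)) {
                return false; // a banned region appears in the description
            }
        }
        for (String known : acceptedTitles) {
            if (title.contains(known)) {
                return false; // already collected, e.g. from another category
            }
        }
        acceptedTitles.add(title);
        return true;
    }

    List<String> acceptedTitles() {
        return acceptedTitles;
    }

    public static void main(String[] args) {
        FilmFilter filter = new FilmFilter(Arrays.asList("RegionA"));
        System.out.println(filter.accept("Film One", "RegionB 2015"));
        System.out.println(filter.accept("Film One", "RegionB 2015"));
        System.out.println(filter.accept("Film Two", "RegionA 2014"));
    }
}
```

Because the duplicate test is a substring match, a new title that merely contains an already accepted title is dropped as well; whether that is desirable depends on how the site names sequels.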
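Note that the crawler starts from the pagination bar at the bottom of each category page and uses a loop to visit as many pages as the configured depth specifies.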

Here the depth is 2, so the first and second pages are visited.
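As a small illustration of how depth maps to page URLs, the sketch below builds the list of addresses for one category. The class name CategoryPages and the host http://example.com are placeholders, not taken from the project; the concatenation itself follows the HOST + each + i pattern used in parseCategoryPage().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not in btDigger): one URL per page, from 1 up to depth.
final class CategoryPages {
    static List<String> pageUrls(String host, String categorySubUrl, int depth) {
        if (depth < 1) {
            throw new IllegalArgumentException("depth must be at least 1");
        }
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= depth; page++) {
            urls.add(host + categorySubUrl + page);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Placeholder host; the category path is the one from the pagination bar sample.
        for (String url : pageUrls("http://example.com", "/category.php?/%E7%8A%AF%E7%BD%AA/", 2)) {
            System.out.println(url);
        }
    }
}
```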

In the page source, the pagination bar at the bottom of a category page looks like this:

<li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/1/'>首页</a></li>
<li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/-1/'>上一页</a></li>
<li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/1/'>1</a></li>
<li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/2/'>2</a></li>
<li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/3/'>3</a></li>
<li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/4/'>4</a></li>
<li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/5/'>5</a></li>
<li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/6/'>6</a></li>
<li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/7/'>7</a></li>
<li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/8/'>8</a></li>
<li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/9/'>9</a></li>
<li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/10/'>10</a></li>
<li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/11/'>11</a></li>
<li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/1/'>下一页</a></li>
<li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/102/'>末页</a></li>


It is parsed the same way as above.
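One thing the pagination bar makes available is the number of the last page, which could be used to cap the depth. The project parses pages with Jsoup; the sketch below deliberately swaps in java.util.regex so that it has no dependencies, and the class PaginationBar is a hypothetical name. It assumes every page link ends in /<number>/' as in the sample above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not in btDigger): reads page numbers out of the nav bar markup.
final class PaginationBar {
    // Matches the trailing "/<digits>/'" of an href such as '/category.php?/.../102/'
    private static final Pattern PAGE = Pattern.compile("/(\\d+)/'");

    /** Returns the largest page number linked from the bar, or 0 if none is found. */
    static int lastPage(String navHtml) {
        int max = 0;
        Matcher matcher = PAGE.matcher(navHtml);
        while (matcher.find()) {
            max = Math.max(max, Integer.parseInt(matcher.group(1)));
        }
        return max;
    }

    public static void main(String[] args) {
        String nav = "<li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/-1/'>prev</a></li>"
                + "<li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/2/'>2</a></li>"
                + "<li><a href='/category.php?/%E7%8A%AF%E7%BD%AA/102/'>last</a></li>";
        System.out.println(lastPage(nav));
    }
}
```

The /-1/ link used for the previous page is ignored, because the pattern requires the digits to follow a slash directly.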

Once the page links of all qualifying films have been collected, we visit each film's detail page and pick the best torrent to download.

This is done by the parseFilmPage() method.

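Every film detail page has a torrent list offering different definitions (720p, 1080p, BluRay 720p, BluRay 1080p and so on). To avoid sacrificing quality without filling up the hard drive, the configuration file sets the preferred definition to 1080p.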

In the page source the torrent list looks like this, and it is parsed the same way as before...

<div class="tinfo">
<a href="/download.php?n=%E6%B4%9B%E5%9F%8E%E5%B1%A0%E6%89%8Bbt%E7%A7%8D%E5%AD%90%E4%B8%8B%E8%BD%BD.720p%E9%AB%98%E6%B8%85.torrent&temp=yes&id=27808&uhash=b57db4fed7d35c8d0924033f" title="【720p高清】洛城屠手 /L.A. Slasher .2015.1.02GBBT种子下载" target="_blank"><p class="torrent"><img border="0" src="/style/torrent.gif" style="vertical-align:middle" alt="">【720p高清】洛城屠手<i>/L.A. Slasher</i>.2015.<em>1.02GB</em>.torrent</p></a>
<ul class="btTree treeview"><li><span class="file"><font color="#999">本torrent文件由BT天堂(www.BTtiantang.com)提供!</font></span></li><li><span class="video">L.A.Slasher.2015.720p.BluRay.H264.AAC-RARBG.mp4<small>1.02GB</small></span></li><li><span class="file">L.A.Slasher.2015.720p.BluRay.H264.AAC-RARBG.nfo<small>3.97KB</small></span></li><li><span class="video">RARBG.mp4<small>992.93KB</small></span></li></ul>
</div>
<div class="tinfo">
<a href="/download.php?n=%E6%B4%9B%E5%9F%8E%E5%B1%A0%E6%89%8Bbt%E7%A7%8D%E5%AD%90%E4%B8%8B%E8%BD%BD.1080p%E9%AB%98%E6%B8%85.torrent&temp=yes&id=27808&uhash=04645321cb7afdbdee192d1d" title="【1080p高清】洛城屠手 /L.A. Slasher .2015.1.61GBBT种子下载" target="_blank"><p class="torrent"><img border="0" src="/style/torrent.gif" style="vertical-align:middle" alt="">【1080p高清】洛城屠手<i>/L.A. Slasher</i>.2015.<em>1.61GB</em>.torrent</p></a>
<ul class="btTree treeview"><li><span class="file"><font color="#999">本torrent文件由BT天堂(www.BTtiantang.com)提供!</font></span></li><li><span class="video">L.A.Slasher.2015.1080p.BluRay.H264.AAC-RARBG.mp4<small>1.61GB</small></span></li><li><span class="file">L.A.Slasher.2015.1080p.BluRay.H264.AAC-RARBG.nfo<small>3.97KB</small></span></li><li><span class="video">RARBG.mp4<small>992.93KB</small></span></li></ul>
</div>


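This film only offers 720p and 1080p torrents. When several definitions are available, the crawler stops iterating over the torrent list as soon as it finds a 1080p entry; otherwise it downloads the highest-definition one, which the loop takes to be the last entry in the list.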

for (Element eachBtFileLink : btFileLinks) {
    // Remember this entry as the fallback; the last one visited wins if nothing matches
    targetBtFileLinkSuffix = eachBtFileLink.select("a").first().attr("href");
    Element info = eachBtFileLink.select("span[class=\"video\"]").first();
    // Stop as soon as the preferred definition (e.g. 1080p) is found;
    // the null check guards against entries without a video file
    if (info != null && info.text().contains(configLoader.getDefinition())) {
        break;
    }
}
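The same selection rule can be tried out on plain strings. This is a minimal sketch with a hypothetical class DefinitionPicker, not code from the project; it assumes, as the loop above does, that the site lists torrents in ascending quality, so the last entry is the best fallback.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not in btDigger): preferred definition first, last entry otherwise.
final class DefinitionPicker {
    /** Returns the first name containing preferred, else the last name, else null if the list is empty. */
    static String pick(List<String> fileNames, String preferred) {
        String chosen = null;
        for (String name : fileNames) {
            chosen = name; // current fallback
            if (name.contains(preferred)) {
                break; // preferred definition found, stop looking
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "L.A.Slasher.2015.720p.BluRay.H264.AAC-RARBG.mp4",
                "L.A.Slasher.2015.1080p.BluRay.H264.AAC-RARBG.mp4");
        System.out.println(pick(names, "1080p"));
        System.out.println(pick(names, "720p"));
    }
}
```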


As soon as a target torrent is found, the crawler visits its download page.

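By intercepting the POST request and inspecting the page source, we find that every torrent's download page carries two values, arcid and uhash, and both must be included in the POST request.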

First, collect these two values from the page:

String arcid = null;
String uhash = null;
boolean hasArcid = false;
boolean hasUhash = false;
// Scan the page line by line for the two JavaScript variable declarations
while ((temp = bufferedReader.readLine()) != null) {
    if (temp.contains("var _arcid")) {
        // The value is whatever sits between the first and last double quote on the line
        arcid = temp.substring(temp.indexOf("\"") + 1, temp.lastIndexOf("\""));
        hasArcid = true;
    }
    if (temp.contains("var _uhash")) {
        uhash = temp.substring(temp.indexOf("\"") + 1, temp.lastIndexOf("\""));
        hasUhash = true;
    }
}
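The quote-index approach works when the declaration sits alone on its line. A regular expression is one alternative that tolerates other quoted text on the same line. The sketch below is not the project's method: the class PageVars is hypothetical, and it assumes the page declares the values in the form var _arcid = "..."; which is inferred from the snippet above rather than from the actual page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not in btDigger): pulls a quoted JavaScript variable value out of page source.
final class PageVars {
    /** Returns the double-quoted value assigned to varName, or null if no such declaration exists. */
    static String extract(String pageSource, String varName) {
        Pattern pattern = Pattern.compile(
                "var\\s+" + Pattern.quote(varName) + "\\s*=\\s*\"([^\"]*)\"");
        Matcher matcher = pattern.matcher(pageSource);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Assumed page fragment; the values are the ones visible in the sample links.
        String page = "var _arcid = \"27808\";\nvar _uhash = \"b57db4fed7d35c8d0924033f\";";
        System.out.println(extract(page, "_arcid"));
        System.out.println(extract(page, "_uhash"));
    }
}
```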


If both values were found, we can submit the POST request and download the torrent file:

if (hasArcid && hasUhash) {
    URL requestUrl = new URL(REQUEST_URL);
    HttpURLConnection httpUrlConnection = (HttpURLConnection) requestUrl.openConnection();
    httpUrlConnection.setInstanceFollowRedirects(false);
    httpUrlConnection.setDoOutput(true);
    httpUrlConnection.setRequestMethod("POST");
    // Form fields sent with the download request
    String outputData = "action=download" + "&id=" + arcid + "&uhash=" + uhash;
    // try-with-resources flushes and closes the writer
    try (OutputStreamWriter outputStreamWriter = new OutputStreamWriter(
            httpUrlConnection.getOutputStream())) {
        outputStreamWriter.write(outputData);
    }
    System.out.print("\r");
    System.out.print("[o] Downloading torrent files...(" + i + "/" + validFilmSubURLs.size() + ")");
    // Save the response body as <uhash>.torrent
    File file = new File(uhash + ".torrent");
    try (InputStream inputStream = httpUrlConnection.getInputStream();
         FileOutputStream fileOutputStream = new FileOutputStream(file)) {
        byte[] buffer = new byte[1024];
        int length;
        while ((length = inputStream.read(buffer)) != -1) {
            fileOutputStream.write(buffer, 0, length);
        }
    } finally {
        httpUrlConnection.disconnect();
    }
}
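The request body above is assembled by plain string concatenation, which is fine as long as arcid and uhash contain only letters and digits. A more defensive variant URL-encodes every field. The following sketch uses the standard java.net.URLEncoder; the class FormBody and its methods are hypothetical names, and the field names action, id and uhash are the ones from the snippet above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not in btDigger): builds an application/x-www-form-urlencoded body.
final class FormBody {
    /** Joins the fields as key=value pairs separated by '&', URL-encoding keys and values. */
    static String encode(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(urlEncode(field.getKey())).append('=').append(urlEncode(field.getValue()));
        }
        return body.toString();
    }

    /** Builds the three fields used by the download request, in a fixed order. */
    static String downloadBody(String arcid, String uhash) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("action", "download");
        fields.put("id", arcid);
        fields.put("uhash", uhash);
        return encode(fields);
    }

    private static String urlEncode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available on the JVM
        }
    }

    public static void main(String[] args) {
        System.out.println(downloadBody("27808", "b57db4fed7d35c8d0924033f"));
    }
}
```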


At this point the torrent collection module has taken its initial shape.

Tags: Java, HTML, GitHub