A First Look at jsoup
2014-01-05 13:34
jsoup is a Java library for processing HTML that I came across by chance. Its official site is: http://jsoup.org/
1. Write a trial program (the project only needs to reference jsoup-1.3.3.jar):
import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Test {

    public static void main(String[] args) {
        Test t = new Test();
        t.parseFile();
    }

    // Parse an HTML string held in memory.
    public void parseString() {
        String html = "<html><head><title>blog</title></head><body onload='test()'><p>Parsed HTML into a doc.</p></body></html>";
        Document doc = Jsoup.parse(html);
        System.out.println(doc);
        // getAllElements() returns the body itself plus all of its descendants.
        Elements es = doc.body().getAllElements();
        System.out.println(es.attr("onload"));
        System.out.println(es.select("p"));
    }

    // Fetch a page over HTTP and list its links.
    public void parseUrl() {
        try {
            Document doc = Jsoup.connect("http://www.baidu.com/").get();
            // All anchors that carry an href attribute.
            Elements hrefs = doc.select("a[href]");
            System.out.println(hrefs);
            System.out.println("------------------");
            // Only the anchors whose href starts with "http".
            System.out.println(hrefs.select("[href^=http]"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Parse a local HTML file.
    public void parseFile() {
        try {
            File input = new File("input.html");
            Document doc = Jsoup.parse(input, "UTF-8");
            // Extract all of the codes
            Elements codes = doc.body().select("td[title^=IA] > a[href^=javascript:view]");
            System.out.println(codes);
            System.out.println("------------------");
            System.out.println(codes.html());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
2. Output of parseString:
<html>
<head>
<title>blog</title>
</head>
<body onload="test()">
<p>Parsed HTML into a doc.</p>
</body>
</html>
test()
<p>Parsed HTML into a doc.</p>
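
The `test()` line above comes from `es.attr("onload")`. As far as I understand the jsoup API, `Elements.attr` returns the value from the first matched element that has the attribute, and `getAllElements()` includes the body itself, so the call reads the body's `onload`. The following is a minimal sketch that compares this with reading the attribute directly from the body; the class name `AttrDemo` and its method names are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Hypothetical demo class; not part of the original program.
public class AttrDemo {
    static final String HTML = "<html><head><title>blog</title></head>"
            + "<body onload='test()'><p>Parsed HTML into a doc.</p></body></html>";

    // Reads onload through the element collection, as parseString() does.
    static String onloadViaCollection() {
        Document doc = Jsoup.parse(HTML);
        Elements es = doc.body().getAllElements(); // body plus its descendants
        return es.attr("onload"); // first element in the set that has the attribute
    }

    // Reads onload straight from the body element.
    static String onloadDirect() {
        Document doc = Jsoup.parse(HTML);
        return doc.body().attr("onload");
    }

    public static void main(String[] args) {
        System.out.println(onloadViaCollection());
        System.out.println(onloadDirect());
    }
}
```

Both calls should yield the same string here; the direct form is probably the clearer choice when only the body's attribute is wanted.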
3. Output of parseUrl:
<a href="/gaoji/preferences.html">设置</a>
<a href="http://passport.baidu.com/?login&tpl=mn">登录</a>
<a href="http://news.baidu.com">新 闻</a>
<a href="http://tieba.baidu.com">贴 吧</a>
<a href="http://zhidao.baidu.com">知 道</a>
<a href="http://mp3.baidu.com">MP3</a>
<a href="http://image.baidu.com">图 片</a>
<a href="http://video.baidu.com">视 频</a>
<a href="http://map.baidu.com">地 图</a>
<a href="#" name="ime_hw">手写</a>
<a href="#" name="ime_py">拼音</a>
<a href="#" name="ime_cl">关闭</a>
<a href="http://hi.baidu.com">空间</a>
<a href="http://baike.baidu.com">百科</a>
<a href="http://www.hao123.com">hao123</a>
<a href="/more/">更多>></a>
<a id="st" onclick="this.style.behavior='url(#default#homepage)';this.setHomePage('http://www.baidu.com')" href="http://utility.baidu.com/traf/click.php?id=215&url=http://www.baidu.com">把百度设为主页</a>
<a href="http://e.baidu.com/?refer=888">加入百度推广</a>
<a href="http://top.baidu.com">搜索风云榜</a>
<a href="http://home.baidu.com">关于百度</a>
<a href="http://ir.baidu.com">About Baidu</a>
<a href="/duty/">使用百度前必读</a>
<a href="http://www.miibeian.gov.cn" target="_blank">京ICP证030173号</a>
------------------
<a href="http://passport.baidu.com/?login&tpl=mn">登录</a>
<a href="http://news.baidu.com">新 闻</a>
<a href="http://tieba.baidu.com">贴 吧</a>
<a href="http://zhidao.baidu.com">知 道</a>
<a href="http://mp3.baidu.com">MP3</a>
<a href="http://image.baidu.com">图 片</a>
<a href="http://video.baidu.com">视 频</a>
<a href="http://map.baidu.com">地 图</a>
<a href="http://hi.baidu.com">空间</a>
<a href="http://baike.baidu.com">百科</a>
<a href="http://www.hao123.com">hao123</a>
<a id="st" onclick="this.style.behavior='url(#default#homepage)';this.setHomePage('http://www.baidu.com')" href="http://utility.baidu.com/traf/click.php?id=215&url=http://www.baidu.com">把百度设为主页</a>
<a href="http://e.baidu.com/?refer=888">加入百度推广</a>
<a href="http://top.baidu.com">搜索风云榜</a>
<a href="http://home.baidu.com">关于百度</a>
<a href="http://ir.baidu.com">About Baidu</a>
<a href="http://www.miibeian.gov.cn" target="_blank">京ICP证030173号</a>
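
The second listing drops relative links such as `/more/` because their `href` does not start with `http`. If the aim is to keep those links as full URLs instead, jsoup can resolve them against a base URI with `absUrl`. The sketch below assumes a small hand-written HTML snippet and passes the base URI explicitly to `Jsoup.parse`, so it runs without network access; the class name `AbsLinkDemo` and the snippet are illustrative only. A document fetched with `Jsoup.connect(...).get()` already carries its own base URI.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; not part of the original program.
public class AbsLinkDemo {

    // Returns every a[href] in the HTML, resolved against baseUri.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {
            String abs = a.absUrl("href"); // empty string if it cannot be resolved
            if (!abs.isEmpty()) {
                urls.add(abs);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up snippet with one relative and one absolute link.
        String html = "<a href=\"/more/\">more</a>"
                + "<a href=\"http://news.baidu.com\">news</a>";
        for (String url : absoluteLinks(html, "http://www.baidu.com/")) {
            System.out.println(url);
        }
    }
}
```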
4. Output of parseFile:
<a href="javascript:view('67530','67530','0');">IA100908-002</a>
<a href="javascript:view('67529','67529','0');">IA100908-001</a>
<a href="javascript:view('67544','67544','0');">IA100908-016</a>
<a href="javascript:view('67364','67364','0');">IA100903-008</a>
<a href="javascript:view('67363','67363','0');">IA100903-007</a>
<a href="javascript:view('66104','66104','0');">IA100710-013</a>
<a href="javascript:view('57916','57916','0');">IA100515-013</a>
<a href="javascript:view('56962','56962','0');">IA100430-022</a>
<a href="javascript:view('66958','66958','0');">IA100830-001</a>
<a href="javascript:view('66319','66319','0');">IA100713-003</a>
<a href="javascript:view('66317','66317','0');">IA100713-001</a>
<a href="javascript:view('66321','66321','0');">IA100713-005</a>
<a href="javascript:view('66967','66967','0');">IA100830-010</a>
<a href="javascript:view('66999','66999','0');">IA100831-001</a>
<a href="javascript:view('67377','67377','0');">IA100904-004</a>
<a href="javascript:view('67378','67378','0');">IA100904-005</a>
<a href="javascript:view('3271','3271','0');">IA080115-031</a>
------------------
IA100908-002
IA100908-001
IA100908-016
IA100903-008
IA100903-007
IA100710-013
IA100515-013
IA100430-022
IA100830-001
IA100713-003
IA100713-001
IA100713-005
IA100830-010
IA100831-001
IA100904-004
IA100904-005
IA080115-031
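
`codes.html()` returns the inner HTML of all matched links as one string. When the codes are needed one by one, iterating over the matches and calling `text()` on each may be more convenient. The sketch below uses a tiny inline table in place of input.html; the table layout is an assumption modeled on the selector and output above, and the class name `CodeListDemo` is invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; not part of the original program.
public class CodeListDemo {

    // Collects the link text of every match into a list.
    static List<String> extractCodes(String html) {
        Document doc = Jsoup.parse(html);
        List<String> codes = new ArrayList<>();
        // Same selector as parseFile(): links that are direct children of a
        // td whose title starts with "IA" and whose href starts with "javascript:view".
        for (Element a : doc.select("td[title^=IA] > a[href^=javascript:view]")) {
            codes.add(a.text());
        }
        return codes;
    }

    public static void main(String[] args) {
        // Assumed structure; the real input.html is not reproduced here.
        String html = "<table><tr>"
                + "<td title='IA100908-002'><a href=\"javascript:view('67530','67530','0');\">IA100908-002</a></td>"
                + "<td title='IA100908-001'><a href=\"javascript:view('67529','67529','0');\">IA100908-001</a></td>"
                + "<td title='other'><a href=\"javascript:view('1','1','0');\">skipped</a></td>"
                + "</tr></table>";
        for (String code : extractCodes(html)) {
            System.out.println(code);
        }
    }
}
```

The third cell is there to show the filter at work: its title does not start with `IA`, so its link is not selected.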
As a supplement, the basic structure of input.html is shown in the figure:
![](http://dl.iteye.com/upload/picture/pic/74238/2572f928-80ce-3e5b-acdb-c03b85c9d008.jpg)