Showdown - Java HTML Parsing Comparison
February 2, 2008 at 4:10 pm · Filed under Java
I had to do some HTML parsing today, but unfortunately most HTML on the web is not well-formed like the markup created here at Lumidant. Missing end tags and other broken syntax throws a wrench into the situation. Luckily, others have already addressed this issue. Many times over in fact, leaving many to wonder which solution to implement.
Once you parse HTML, you can do some cool stuff with it like transform it or extract some information. For that reason it is sometimes used for screen scraping. So, to test the parsing libraries, I decided to do exactly that and see if I could parse the HTML well enough to extract links from it using an XQuery. The contenders were NekoHTML, HtmlCleaner, TagSoup, and jTidy. I know that there are many others I could have chosen from as well, but this seemed to be a good sampling and there’s only so much time in the day. I also chose 10 URLs to parse. Being a true Clevelander I picked the sites of a number of local attractions. I’m right near all of the stadiums, so the Quicken Loans Arena website was my first target. I sometimes jokingly refer to my city as the “Mistake on the Lake” and the pure awfulness of the HTML from my city did not fail me. The ten URLs I chose are:
http://www.theqarena.com
http://cleveland.indians.mlb.com
http://www.clevelandbrowns.com
http://www.cbgarden.org
http://www.clemetzoo.com
http://www.cmnh.org
http://www.clevelandart.org
http://www.mocacleveland.org
http://www.glsc.org
http://www.rockhall.com
I gave each library an InputStream created from a URL (referred to as urlIS in the code samples below) and expected an org.w3c.dom.Node in return once the parse operation was completed. I implemented each library in its own class extending from an AbstractScraper implementing a Scraper interface I created. This was a design tip fresh in my mind from reading my all-time favorite technical book: Effective Java by Josh Bloch. The implementation-specific code for each library is below:
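The post doesn't show the Scraper interface or AbstractScraper base class themselves, so the following is a hypothetical sketch of that design: the method name, the demo JdkXmlScraper subclass, and its use of the JDK's strict XML parser are my assumptions, not the author's code.

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;

// Hypothetical sketch of the design described above: each library gets
// its own subclass, and shared plumbing (opening the URL stream,
// timing, logging) would live in the abstract base class.
interface Scraper {
    Node parse(InputStream urlIS);
}

abstract class AbstractScraper implements Scraper {
    // Common scaffolding would go here; subclasses supply only parsing.
}

// Toy stand-in for a per-library subclass (e.g. a NekoScraper),
// using the JDK's strict XML parser for illustration.
class JdkXmlScraper extends AbstractScraper {
    public Node parse(InputStream urlIS) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(urlIS).getDocumentElement();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Each library's snippet below would slot into a subclass like this, so the calling code only ever sees the Scraper interface.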
NekoHTML:
final DOMParser parser = new DOMParser();
try {
    parser.parse(new InputSource(urlIS));
    document = parser.getDocument();
} catch (SAXException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
TagSoup:
final Parser parser = new Parser();
SAX2DOM sax2dom = null;
try {
    sax2dom = new SAX2DOM();
    parser.setContentHandler(sax2dom);
    parser.setFeature(Parser.namespacesFeature, false);
    parser.parse(new InputSource(urlIS));
    document = sax2dom.getDOM(); // inside the try: sax2dom may be null if construction failed
} catch (Exception e) {
    e.printStackTrace();
}
jTidy:
final Tidy tidy = new Tidy();
tidy.setQuiet(true);
tidy.setShowWarnings(false);
tidy.setForceOutput(true);
document = tidy.parseDOM(urlIS, null);
HtmlCleaner:
final HtmlCleaner cleaner = new HtmlCleaner(urlIS);
try {
    cleaner.clean();
    document = cleaner.createDOM();
} catch (Exception e) {
    e.printStackTrace();
}
Finally, to judge the ability to parse the HTML, I ran the XQuery “//a” to grab all the <a> tags from the document. The only one of these parsing libraries I had used before was jTidy. It was able to extract the links from 5 of the 10 documents. However, the clear winner was HtmlCleaner. It was the only library to successfully clean 10/10 documents. Most of the others were not able to make it past even the very first URL I provided, which was the Quicken Loans Arena site. HtmlCleaner’s full results:
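The post ran the query “//a” with an XQuery engine it doesn't name; as a rough stand-in, the same query can be run with the JDK's built-in XPath engine against any org.w3c.dom.Document these libraries return. The helper names and the sample markup below are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class LinkCounter {
    // Parse a well-formed markup string into a DOM, standing in for
    // the cleaned documents the libraries above produce.
    static Document parseXml(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes()));
    }

    // Evaluate "//a" with the JDK's XPath engine and count the matches,
    // mirroring the link-extraction test described in the post.
    static int countLinks(Document doc) throws Exception {
        NodeList links = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//a", doc, XPathConstants.NODESET);
        return links.getLength();
    }
}
```

Because all four libraries hand back a standard W3C DOM, the same query code works unchanged regardless of which cleaner produced the document.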
Found 87 links at http://www.theqarena.com/
Found 156 links at http://cleveland.indians.mlb.com/
Found 96 links at http://www.clevelandbrowns.com/
Found 106 links at http://www.cbgarden.org/
Found 70 links at http://www.clemetzoo.com/
Found 23 links at http://www.cmnh.org/site/
Found 27 links at http://www.clevelandart.org/
Found 51 links at http://www.mocacleveland.org/
Found 27 links at http://www.glsc.org/
Found 90 links at http://www.rockhall.com/
One disclaimer that I will make is that I did not go out of my way to improve the performance of any of these libraries. Some of them had additional options that could be set to possibly improve performance. I did not wade through the documentation to figure out what those options were, and simply used the plain-vanilla incantations. HtmlCleaner seems to offer me everything I need and was quick and easy to implement.
Reposted from: http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/