
HTMLParser Usage Guide (4) - Accessing Content with a Visitor

2014-05-29 13:16

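Once HTMLParser has traversed a page, it stores the result as a tree (strictly speaking, a forest). There are two ways to access that result: with a Filter or with a Visitor.

This part describes how to access the content with a Visitor.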

    4.1 NodeVisitor

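Put simply, a Filter picks out the Nodes that satisfy some condition so that you can process them afterwards, whereas a Visitor walks every node of the content tree and processes the ones that match as it goes.
In practice the two are different routes to the same result.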
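
To make the contrast concrete, here is a minimal stand-alone sketch in plain Java. It does not use HTMLParser at all: the `Node` class and the `filter` and `visit` helpers are hypothetical names invented for this illustration, and the node names only imitate the test page used in this series.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

public class FilterVsVisitor {

    // Hypothetical stand-in for a parsed node: a name plus its child nodes.
    static final class Node {
        final String name;
        final List<Node> children;

        Node(String name, Node... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    // Filter style: collect the matching nodes first, process the list afterwards.
    static List<Node> filter(Node root, Predicate<Node> condition) {
        List<Node> matches = new ArrayList<>();
        collect(root, condition, matches);
        return matches;
    }

    private static void collect(Node node, Predicate<Node> condition, List<Node> out) {
        if (condition.test(node)) out.add(node);
        for (Node child : node.children) collect(child, condition, out);
    }

    // Visitor style: walk every node and hand each one to the callback as it is reached.
    static void visit(Node node, Consumer<Node> callback) {
        callback.accept(node);
        for (Node child : node.children) visit(child, callback);
    }

    static Node sampleTree() {
        return new Node("html",
                new Node("head", new Node("title")),
                new Node("body",
                        new Node("div#top_main",
                                new Node("div#logoindex", new Node("a")))));
    }

    // Find the div nodes by filtering, then processing the matches.
    static List<String> divsViaFilter(Node root) {
        List<String> names = new ArrayList<>();
        for (Node match : filter(root, node -> node.name.startsWith("div"))) {
            names.add(match.name);
        }
        return names;
    }

    // Find the same nodes by visiting everything and testing inside the callback.
    static List<String> divsViaVisitor(Node root) {
        List<String> names = new ArrayList<>();
        visit(root, node -> {
            if (node.name.startsWith("div")) names.add(node.name);
        });
        return names;
    }

    public static void main(String[] args) {
        System.out.println("filter : " + divsViaFilter(sampleTree()));
        System.out.println("visitor: " + divsViaVisitor(sampleTree()));
    }
}
```

Both helpers return the same two div names. The only difference is where the condition lives: in the selection step, or inside the callback.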

Below is a typical NodeVisitor example.

Test code:

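import java.net.HttpURLConnection;
import java.net.URL;
import org.htmlparser.Parser;
import org.htmlparser.Remark;
import org.htmlparser.Tag;
import org.htmlparser.Text;
import org.htmlparser.visitors.NodeVisitor;

// The class name is a placeholder; the original listing shows only main().
public class NodeVisitorTest {

    // message() is not shown in the original listing; it is assumed here to be a plain print helper.
    private static void message(String text) { System.out.println(text); }

    public static void main(String[] args) {
        try {
            Parser parser = new Parser((HttpURLConnection)
                    (new URL("http://127.0.0.1:8080/HTMLParserTester.html")).openConnection());
            // Constructor arguments: recurseChildren, recurseSelf
            NodeVisitor visitor = new NodeVisitor(false, false) {
                public void visitTag(Tag tag) { message("This is Tag:" + tag.getText()); }
                // Concatenates the Text node itself, so its toString() form is printed
                public void visitStringNode(Text string) { message("This is Text:" + string); }
                public void visitRemarkNode(Remark remark) { message("This is Remark:" + remark.getText()); }
                public void beginParsing() { message("beginParsing"); }
                public void visitEndTag(Tag tag) { message("visitEndTag:" + tag.getText()); }
                public void finishedParsing() { message("finishedParsing"); }
            };
            parser.visitAllNodesWith(visitor);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}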

Output:

beginParsing

This is Tag:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"

This is Text:Txt (121[0,121],123[1,0]): \n

This is Text:Txt (244[1,121],246[2,0]): \n

finishedParsing

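As you can see, beginParsing is called before the traversal of the nodes starts, the Nodes in between are then processed, and finishedParsing is called last, once the traversal is over.
Because I set both recurseChildren and recurseSelf to false, the Visitor neither descends into child nodes nor reports the top-level container tags themselves (head and html on this page). The two \n text nodes in the output are the two top-level line breaks discussed in HTMLParser Usage Guide (1) - Initializing the Parser.
    Let us first set recurseSelf to true and see what happens.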

NodeVisitor visitor = new NodeVisitor( false, true) {

Output:

beginParsing

This is Tag:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"

This is Text:Txt (121[0,121],123[1,0]): \n

This is Tag:head

This is Text:Txt (244[1,121],246[2,0]): \n

This is Tag:html xmlns="http://www.w3.org/1999/xhtml"

finishedParsing

As you can see, every first-level node of the HTML page is now visited.

Now let us try the following call:

NodeVisitor visitor = new NodeVisitor( true, false) {

Output:

beginParsing

This is Tag:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"

This is Text:Txt (121[0,121],123[1,0]): \n

This is Tag:meta http-equiv="Content-Type" content="text/html; charset=gb2312"

This is Text:Txt (204[1,81],229[1,106]): 白泽居-title-www.baizeju.com

visitEndTag:/title

visitEndTag:/head

This is Text:Txt (244[1,121],246[2,0]): \n

This is Text:Txt (289[2,43],291[3,0]): \n

This is Text:Txt (298[3,7],300[4,0]): \n

This is Text:Txt (319[4,19],322[5,1]): \n\t

This is Text:Txt (342[5,21],346[6,2]): \n\t\t

This is Remark:这是注释白泽居-www.baizeju.com

This is Text:Txt (378[6,34],408[8,0]): \n\t\t白泽居-字符串1-www.baizeju.com\n

This is Text:Txt (441[8,33],465[8,57]): 白泽居-链接文本-www.baizeju.com

visitEndTag:/a

This is Text:Txt (469[8,61],472[9,1]): \n\t

visitEndTag:/div

This is Text:Txt (478[9,7],507[11,0]): \n\t白泽居-字符串2-www.baizeju.com\n

visitEndTag:/div

This is Text:Txt (513[11,6],515[12,0]): \n

visitEndTag:/body

This is Text:Txt (522[12,7],524[13,0]): \n

visitEndTag:/html

finishedParsing

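As you can see, all the child nodes now appear, but the two top-level nodes from the previous example, This is Tag:head and This is Tag:html xmlns="http://www.w3.org/1999/xhtml", are missing. The nested container tags (title, body, div, a) are not reported through visitTag either; only their end tags show up.

To make them all appear, just use: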

     NodeVisitor visitor = new NodeVisitor( true, true) {

Output:

beginParsing

This is Tag:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"

This is Text:Txt (121[0,121],123[1,0]): \n

This is Tag:head

This is Tag:meta http-equiv="Content-Type" content="text/html; charset=gb2312"

This is Tag:title

This is Text:Txt (204[1,81],229[1,106]): 白泽居-title-www.baizeju.com

visitEndTag:/title

visitEndTag:/head

This is Text:Txt (244[1,121],246[2,0]): \n

This is Tag:html xmlns="http://www.w3.org/1999/xhtml"

This is Text:Txt (289[2,43],291[3,0]): \n

This is Tag:body

This is Text:Txt (298[3,7],300[4,0]): \n

This is Tag:div id="top_main"

This is Text:Txt (319[4,19],322[5,1]): \n\t

This is Tag:div id="logoindex"

This is Text:Txt (342[5,21],346[6,2]): \n\t\t

This is Remark:这是注释白泽居-www.baizeju.com

This is Text:Txt (378[6,34],408[8,0]): \n\t\t白泽居-字符串1-www.baizeju.com\n

This is Tag:a href="http://www.baizeju.com"

This is Text:Txt (441[8,33],465[8,57]): 白泽居-链接文本-www.baizeju.com

visitEndTag:/a

This is Text:Txt (469[8,61],472[9,1]): \n\t

visitEndTag:/div

This is Text:Txt (478[9,7],507[11,0]): \n\t白泽居-字符串2-www.baizeju.com\n

visitEndTag:/div

This is Text:Txt (513[11,6],515[12,0]): \n

visitEndTag:/body

This is Text:Txt (522[12,7],524[13,0]): \n

visitEndTag:/html

finishedParsing

Ha, now the calling order is clear. Just add your own code to the callbacks wherever you need to process something.
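
The four runs above can be summarized in a small stand-alone model. This is a sketch inferred from the printed output, not the library's source code: `MiniVisitor`, `MiniNode` and the factory methods are hypothetical names, and the tiny node forest only imitates the shape of the test page (a DOCTYPE, then head and html as top-level siblings).

```java
import java.util.ArrayList;
import java.util.List;

public class RecurseFlagsModel {

    // Hypothetical visitor that records each callback in a log.
    static class MiniVisitor {
        final boolean recurseChildren;
        final boolean recurseSelf;
        final List<String> log = new ArrayList<>();

        MiniVisitor(boolean recurseChildren, boolean recurseSelf) {
            this.recurseChildren = recurseChildren;
            this.recurseSelf = recurseSelf;
        }

        void visitTag(String name) { log.add("Tag:" + name); }
        void visitEndTag(String name) { log.add("EndTag:/" + name); }
        void visitText(String text) { log.add("Text:" + text); }
    }

    interface MiniNode {
        void accept(MiniVisitor visitor);
    }

    // Leaf nodes (text, or childless tags such as meta) always report themselves.
    static MiniNode text(String content) {
        return visitor -> visitor.visitText(content);
    }

    static MiniNode simpleTag(String name) {
        return visitor -> visitor.visitTag(name);
    }

    // A container tag reports itself only when recurseSelf is set, and hands
    // its children and its end tag to the visitor only when recurseChildren is set.
    static MiniNode container(String name, MiniNode... children) {
        return visitor -> {
            if (visitor.recurseSelf) visitor.visitTag(name);
            if (visitor.recurseChildren) {
                for (MiniNode child : children) child.accept(visitor);
                visitor.visitEndTag(name);
            }
        };
    }

    // Runs one traversal over a fixed three-node forest and returns the callback log.
    static List<String> run(boolean recurseChildren, boolean recurseSelf) {
        MiniNode[] topLevel = {
            simpleTag("!DOCTYPE"),
            container("head", simpleTag("meta"), container("title", text("t"))),
            container("html", container("body", text("b")))
        };
        MiniVisitor visitor = new MiniVisitor(recurseChildren, recurseSelf);
        visitor.log.add("beginParsing");
        for (MiniNode node : topLevel) node.accept(visitor);
        visitor.log.add("finishedParsing");
        return visitor.log;
    }

    public static void main(String[] args) {
        boolean[] flags = {false, true};
        for (boolean children : flags) {
            for (boolean self : flags) {
                System.out.println("(" + children + ", " + self + ") -> " + run(children, self));
            }
        }
    }
}
```

In this model, as in the output above, (false, true) reports Tag:head and Tag:html but nothing inside them, while (true, false) reports everything inside them but not the container tags themselves.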

    4.2 Other Visitors

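HTMLParser also defines several other Visitors: HtmlPage, ObjectFindingVisitor, StringFindingVisitor, TagFindingVisitor, TextExtractingVisitor and UrlModifyingVisitor.
They are all subclasses of NodeVisitor that implement some specific function. My personal impression is that they are not very useful: if you need a particular feature, you are better off writing a visitor yourself, because hunting through these for one that fits may well take longer.
If you read their source you will find that none of them has more than a few lines of code that actually do anything.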
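
As a rough illustration of how little code a purpose-built visitor needs, here is a stand-alone sketch in plain Java. It does not use HTMLParser: `BaseVisitor`, `walk`, `TextCollector` and `TagCounter` are hypothetical names, and the "event" array is an invented stand-in for a parsed page in which entries starting with `<` are tags and everything else is text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CustomVisitorSketch {

    // Hypothetical base class: every callback is a no-op,
    // so a subclass overrides only the callbacks it cares about.
    static abstract class BaseVisitor {
        void visitTag(String name) {}
        void visitText(String text) {}
    }

    // Feeds each event to the matching callback.
    static void walk(String[] events, BaseVisitor visitor) {
        for (String event : events) {
            if (event.startsWith("<")) visitor.visitTag(event.substring(1));
            else visitor.visitText(event);
        }
    }

    // Collects all non-blank text, one trimmed entry per line.
    static class TextCollector extends BaseVisitor {
        final StringBuilder text = new StringBuilder();

        @Override
        void visitText(String t) {
            if (!t.trim().isEmpty()) text.append(t.trim()).append('\n');
        }
    }

    // Counts how often each tag name occurs, in order of first appearance.
    static class TagCounter extends BaseVisitor {
        final Map<String, Integer> counts = new LinkedHashMap<>();

        @Override
        void visitTag(String name) {
            counts.merge(name, 1, Integer::sum);
        }
    }

    static String[] sampleEvents() {
        return new String[] {
            "<html", "<body", "<div", "\n\t", "<div", "string 1", "<a", "link text", "\n"
        };
    }

    static String extractText(String[] events) {
        TextCollector collector = new TextCollector();
        walk(events, collector);
        return collector.text.toString();
    }

    static Map<String, Integer> countTags(String[] events) {
        TagCounter counter = new TagCounter();
        walk(events, counter);
        return counter.counts;
    }

    public static void main(String[] args) {
        System.out.print(extractText(sampleEvents()));
        System.out.println(countTags(sampleEvents()));
    }
}
```

Each visitor here boils down to one overridden callback. That is the pattern the author is pointing at: the useful logic is usually only a few lines.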
