
Web Crawlers: Fetching and Parsing HTML (Java)

       Recently a classmate asked me to help write a program to analyze web pages and extract certain content from them. After two days of work, I ended up with a program that extracts table content from a web page. It is fairly simple, but it should still be of some help to anyone who wants to write a web crawler: page analysis and content extraction are indispensable steps in any crawler.
       A complete web crawler involves the following steps:
       1. Fetch the HTML content for a given URL.
       2. Parse the HTML content and extract the links it contains.
       3. Keep iterating the first two steps until a stopping condition is reached.
       It is not hard to see that the real work lies in the last two steps: parsing the HTML content, and designing the iteration, i.e. the actual crawl strategy (a minimal sketch of such a loop follows below). Here we mainly discuss the first part, HTML parsing, and introduce a fairly well-known jar, htmlparser, showing how to use it to analyze HTML and pull out the content you want. Note that this article does not describe htmlparser in any depth; it only shows how to use it to get your HTML parsed. For deeper usage you will have to explore and dig on your own.
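       As an illustration of step 3, here is a minimal sketch of such a crawl loop. It is my own addition rather than part of the program below; it assumes the HtmlRetrieve class developed later in this article for step 1, and a hypothetical extractLinks method standing in for step 2. The stop condition here is simply a page limit.

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class CrawlLoop {

    // A minimal breadth-first crawl loop: fetch, extract links, repeat.
    public static void crawl(String seedUrl, int maxPages) {
        HtmlRetrieve retriever = new HtmlRetrieve();
        Queue<String> frontier = new ArrayDeque<String>();
        Set<String> visited = new HashSet<String>();
        frontier.add(seedUrl);

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already crawled this page
            }
            String html = retriever.GetContentOfHtml(url); // step 1
            if (html == null) {
                continue; // fetch failed, move on
            }
            for (String link : extractLinks(html)) {       // step 2
                if (!visited.contains(link)) {
                    frontier.add(link);
                }
            }
        }
    }

    // Hypothetical placeholder for step 2: extract absolute URLs from the page.
    // Any link-extraction routine (e.g. one built on htmlparser) would do.
    private static List<String> extractLinks(String html) {
        return java.util.Collections.emptyList();
    }
}

       The queue gives breadth-first order, and the visited set keeps the crawler from looping over pages it has already seen.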
       Before diving in, here are the wiki and download addresses for htmlparser:
              wiki:  http://htmlparser.sourceforge.net/
       download: http://sourceforge.net/projects/htmlparser/files/Integration-Builds/2.0-20060923/
       After downloading and unpacking it, you will find that htmlparser ships several jars: filterbuilder.jar, htmllexer.jar, htmlparser.jar, sitecapturer.jar and thumbelina.jar.
       As I understand it, htmllexer.jar is responsible for the lexical side of HTML: htmlparser treats every HTML element, such as html, p, div or table, as a Tag, and the lexer layer can be seen as tokenizing the whole page into these tags. htmlparser.jar then does the actual parsing, letting you apply filter conditions to pull out exactly the content you want.
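       To make the Tag idea concrete, here is a small sketch of my own (not from the original program) that parses an HTML fragment and prints each top-level node:

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.ParserException;

public class TagDemo {
    public static void main(String[] args) throws ParserException {
        String html = "<html><body><p>hello</p><table></table></body></html>";
        Parser parser = Parser.createParser(html, "UTF-8");
        // elements() walks the top-level nodes; each element is a tag node.
        for (NodeIterator it = parser.elements(); it.hasMoreNodes(); ) {
            Node node = it.nextNode();
            System.out.println(node.getClass().getSimpleName() + ": " + node.getText());
        }
    }
}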
       With that brief overview, here is what I put together over those two days.
       First, fetching the HTML content for the seed URL. The JDK's URL and HttpURLConnection classes handle this for us; see the following code:
      
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class HtmlRetrieve {

    /**
     * @param html_url the url of the html page.
     * @return null if the url cannot be fetched, otherwise the content of the page.
     */
    public String GetContentOfHtml(String html_url) {
        try {
            URL url = new URL(html_url);
            HttpURLConnection urlConn = (HttpURLConnection) url.openConnection();
            urlConn.connect();
            // The pages in this example are GBK-encoded; adjust to match the site.
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(urlConn.getInputStream(), HtmlEncoding.gbk_encoding));
            StringBuffer strBuffer = new StringBuffer();
            String line;
            while ((line = reader.readLine()) != null) {
                strBuffer.append(line);
            }
            reader.close();
            urlConn.disconnect();
            return strBuffer.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }

    /**
     * Saves html content into a file.
     * @param filePath the path of the target file.
     * @param html_content the html content to be saved.
     */
    public void SaveToFile(String filePath, String html_content) {
        try {
            BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter(filePath));
            bufferedWriter.write(html_content);
            bufferedWriter.close(); // flush and release the file handle
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        HtmlRetrieve htR = new HtmlRetrieve();
        String content = htR.GetContentOfHtml(
                "http://www.szse.cn/szseWeb/FrontController.szse?ACTIONID=7&CATALOGID=1265_xyjy&txtKsrq=2000-11-08&txtZzrq=2014-11-20&TABKEY=tab1&REPORT_ACTION=navigate&tab1PAGENUM=5");
        System.out.println(content);
    }
}


        Two methods: one fetches the content of the seed url, the other saves the fetched content into a file. Both are fairly simple; if anything is unclear, the Java API documentation covers it. The code below extracts table content from the HTML. Extracting a table really amounts to filtering the page content: htmlparser supports setting filter conditions and also allows filters to be combined; the various filter classes in its filters package are worth a look. The Parser class is the heart of htmlparser, and there are several ways to construct one: you can pass a url to the Parser constructor, or hand the HTML content itself to the factory method. In the reference code below I pass the HTML content; remember to set the character encoding used for parsing. You can also pass just an HTML fragment to parse that fragment on its own.
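        To make the construction options and filter combination concrete before the full program, here is a small self-contained sketch of my own (the fragment and the cls-data-tr class value are made-up examples):

import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class ParserConstructionDemo {
    public static void main(String[] args) throws ParserException {
        // Style 1: construct from a url; the parser fetches the page itself.
        Parser fromUrl = new Parser("http://istock.jrj.com.cn/list,600071.html");
        fromUrl.setEncoding("GBK");

        // Style 2: construct from html content (a full page or just a fragment).
        String fragment = "<table><tr class=\"cls-data-tr\"><td>cell</td></tr></table>";
        Parser fromContent = Parser.createParser(fragment, "GBK");

        // Filters can be combined: here, <tr> tags whose class is cls-data-tr.
        AndFilter filter = new AndFilter(
                new TagNameFilter("tr"),
                new HasAttributeFilter("class", "cls-data-tr"));
        NodeList rows = fromContent.extractAllNodesThatMatch(filter);
        System.out.println(rows.size() + " matching row(s)");
    }
}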
        
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.TableTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

/**
 * @author kalvin
 * This class helps you extract content from html.
 */
public class HtmlParse {

    /**
     * Retrieves the table content of an html page. You can add filter conditions;
     * the htmlparser jar provides many filters to narrow down the content.
     * @param html_content the html to parse
     * @param encoding the character encoding of the html
     * @param className the node class to match (e.g. TableTag.class)
     * @param filter_str the prefix the tag text must start with
     * @return the matching nodes rendered back as html, or null if parsing failed
     */
    public String ParseHtmlTableFromHtml(String html_content, String encoding,
            Class className, final String filter_str) {
        Parser parser = Parser.createParser(html_content, encoding);
        if (parser == null) {
            return null;
        }
        // Match nodes of the given class whose tag text starts with filter_str.
        // super.accept() keeps the class check; the override adds the text check.
        NodeClassFilter nodeClassFilter = new NodeClassFilter(className) {

            private static final long serialVersionUID = 1L;

            public boolean accept(Node node) {
                return super.accept(node) && node.getText().startsWith(filter_str);
            }
        };

        StringBuffer strBuffer = new StringBuffer();
        try {
            NodeList nodeList = parser.extractAllNodesThatMatch(nodeClassFilter);
            if (nodeList != null) {
                int size = nodeList.size();
                for (int i = 0; i < size; i++) {
                    Node node = nodeList.elementAt(i);
                    if (node != null) {
                        strBuffer.append(node.toHtml());
                    }
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
        return strBuffer.toString();
    }

    public static void main(String[] args) {
        HtmlParse htmlParse = new HtmlParse();
        HtmlRetrieve htmlRetrieve = new HtmlRetrieve();
        String html_content = htmlRetrieve.GetContentOfHtml("http://istock.jrj.com.cn/list,600071.html");
        // The tag text of the target table, used as a startsWith filter.
        String filter_str = "table class=\"table\" id=\"topiclisttitle\"";

        String table_content = htmlParse.ParseHtmlTableFromHtml(
                html_content, HtmlEncoding.gbk_encoding, TableTag.class, filter_str);
        if (table_content != null) {
            System.out.println(table_content);
        }
    }
}


The code above uses page encodings, which are defined in a small helper class:
/**
 * @author kalvin
 * This class defines some common page encodings.
 */
public class HtmlEncoding {
    public static final String gbk_encoding = "GBK";
    public static final String utf8_encoding = "utf-8";
    public static final String utf16_encoding = "utf-16";
}


Finally, a test class that parses the extracted table content.
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.nodes.TagNode;
import org.htmlparser.tags.TableTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

/**
 * @author kalvin
 * Html pages are constructed from many tags, e.g.:
 *   <html><head><title></title></head><body><table></table></body></html>
 * html, head, title, body and table are all tags.
 */
public class ParseTable {

    /**
     * Walks the rows and cells of a table and collects their text and links.
     * @param xml_table the table html
     * @param encoding the character encoding of the html
     * @return the cell texts and href values, one table row per output line
     */
    public String ParseTableOfHtml(String xml_table, String encoding) {
        // Parser is the core class of the htmlparser jar.
        Parser parser = Parser.createParser(xml_table, encoding);
        if (parser == null) {
            return null;
        }
        // A tag-name filter condition; htmlparser offers many other filters,
        // see its documentation for details.
        TagNameFilter tagNameFilter = new TagNameFilter("tr");

        StringBuffer strBuffer = new StringBuffer();
        try {
            NodeList nodeList = parser.extractAllNodesThatMatch(tagNameFilter);
            if (nodeList != null) {
                int trNodeListSize = nodeList.size();
                for (int i = 0; i < trNodeListSize; i++) {
                    Node trNode = nodeList.elementAt(i);
                    if (trNode == null) {
                        continue;
                    }
                    NodeList tdNodeList = trNode.getChildren();
                    if (tdNodeList == null) {
                        continue;
                    }
                    int tdNodeListSize = tdNodeList.size();
                    for (int j = 0; j < tdNodeListSize; j++) {
                        Node tdNode = tdNodeList.elementAt(j);
                        if (tdNode == null) {
                            continue;
                        }
                        // The plain text of the cell...
                        strBuffer.append(tdNode.toPlainTextString());
                        strBuffer.append(" ");
                        // ...plus the href of any link among its children.
                        Node child = tdNode.getFirstChild();
                        while (child != null) {
                            if (child instanceof TagNode) {
                                TagNode tagNode = (TagNode) child;
                                String value = tagNode.getAttribute("href");
                                if (value != null && !value.equals("")) {
                                    strBuffer.append(value);
                                    strBuffer.append(" ");
                                }
                            }
                            child = child.getNextSibling();
                        }
                    }
                    strBuffer.append("\n");
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
        return strBuffer.toString();
    }

    public static void main(String[] args) {
        HtmlParse htmlParse = new HtmlParse();
        HtmlRetrieve htmlRetrieve = new HtmlRetrieve();
        String html_content1 = htmlRetrieve.GetContentOfHtml(
                "http://www.szse.cn/szseWeb/FrontController.szse?ACTIONID=7&CATALOGID=1265_xyjy&txtKsrq=2000-11-08&txtZzrq=2014-11-20&TABKEY=tab1&REPORT_ACTION=navigate&tab1PAGENUM=1&txtkey2=000001");
        // The tag text of the target table, used as a startsWith filter.
        String filter_str1 = "table   bgcolor='#E0E0E0' id=\"REPORTID_tab1\" class='cls-data-table'";
        String table_content = htmlParse.ParseHtmlTableFromHtml(
                html_content1, HtmlEncoding.gbk_encoding, TableTag.class, filter_str1);

        ParseTable parseTable = new ParseTable();
        System.out.println(parseTable.ParseTableOfHtml(table_content, HtmlEncoding.gbk_encoding));
    }
}