Parsing a page with HttpClient + jsoup

2012-03-24 15:03

Steps:

1. Set the URL: HttpPost httpPost = new HttpPost(url);

// When the parameters are already part of the URL, use HttpGet instead: HttpGet httpGet = new HttpGet(url);

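2. Set the parameters (not needed when using HttpGet):

List<NameValuePair> params = new ArrayList<NameValuePair>();

// arg0 is the parameter name, arg0Value is its value (both Strings)
params.add(new BasicNameValuePair(arg0, arg0Value));

// ... further params.add(...) calls as needed

// Encode the form body with the charset the target server expects (GB2312 here)
httpPost.setEntity(new UrlEncodedFormEntity(params, "GB2312"));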
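
To make the effect of the "GB2312" argument concrete, here is a minimal standard-library sketch of roughly what a URL-encoded form body looks like on the wire. It is not HttpClient's implementation. The class name FormEncodingSketch, the method encodeForm, and the parameter names q and page are assumptions made for illustration only.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class FormEncodingSketch {

    // Builds an application/x-www-form-urlencoded body: name=value pairs
    // joined by '&', each name and value percent-encoded in the given charset.
    static String encodeForm(Map<String, String> params, String charset) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> p : params.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(p.getKey(), charset))
                    .append('=')
                    .append(URLEncoder.encode(p.getValue(), charset));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charset, e);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("q", "\u4e2d\u6587"); // the two characters "中文"; hypothetical parameter
        params.put("page", "1");         // hypothetical parameter
        System.out.println(encodeForm(params, "GB2312")); // q=%D6%D0%CE%C4&page=1
        System.out.println(encodeForm(params, "UTF-8"));  // q=%E4%B8%AD%E6%96%87&page=1
    }
}
```

The same characters become different bytes under each charset, so the charset passed to UrlEncodedFormEntity should match what the target server expects.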

3. Execute the request:

HttpClient httpClient = new DefaultHttpClient();

HttpResponse rps0 = httpClient.execute(httpPost);

// The status code can be used to check whether the request succeeded; the following steps then go inside the if block

int resStatu = rps0.getStatusLine().getStatusCode(); // status code
if (resStatu == HttpStatus.SC_OK) {

}

4. Get the HTML:

HttpEntity entity0 = rps0.getEntity();

String html = EntityUtils.toString(entity0);
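
As far as I know, in HttpClient 4.x EntityUtils.toString(entity0) uses the charset from the response's Content-Type header and falls back to ISO-8859-1 when none is declared, which is how GBK or GB2312 pages end up garbled. The standard-library sketch below simulates that fallback and shows why it can be undone: ISO-8859-1 maps every byte to exactly one character, so the raw bytes can be recovered. The class and method names are assumptions for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Latin1RoundTripSketch {

    // Simulates decoding a response body with the ISO-8859-1 fallback.
    static String decodeAsLatin1(byte[] body) {
        return new String(body, StandardCharsets.ISO_8859_1);
    }

    // Recovers the raw bytes from the mis-decoded string and decodes them
    // again with the charset the page was really written in.
    static String redecode(String misdecoded, String realCharset) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName(realCharset));
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587";                       // the two characters "中文"
        byte[] wire = original.getBytes(Charset.forName("GBK")); // bytes as sent by a GBK page
        String garbled = decodeAsLatin1(wire);
        System.out.println(garbled.equals(original));                  // false
        System.out.println(redecode(garbled, "GBK").equals(original)); // true
    }
}
```

When the page encoding is known in advance, the overload EntityUtils.toString(entity0, "GBK") supplies the default charset directly and avoids the round trip.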

5. Close the connection:

httpClient.getConnectionManager().shutdown();

6. Parse the HTML:

Document doc = Jsoup.parse(html);

7. Other

If the HTML comes back garbled, it has to be re-decoded with the page's actual charset:

Document doc = Jsoup.parse(html);
// First <meta> tag; this assumes it is the one that declares the encoding
Element e = doc.getElementsByTag("meta").first();
String text;
if (e != null) {
    String charset;
    // Element.attr() returns an empty string (never null) when the attribute is absent,
    // and strings must be compared with isEmpty()/equals(), not != ""
    if (!e.attr("content").isEmpty()) {
        // e.g. content="text/html; charset=gb2312" -> "gb2312"
        String content = e.attr("content");
        charset = content.substring(content.indexOf("=") + 1).trim();
    } else if (!e.attr("charset").isEmpty()) {
        charset = e.attr("charset"); // e.g. <meta charset="gbk">
    } else {
        charset = "GBK";
    }
    System.out.println(charset);
    // Assumes html was decoded as ISO-8859-1: recover the raw bytes and decode them again
    text = new String(html.getBytes("ISO-8859-1"), charset);
} else {
    // If the page's original encoding cannot be determined, default to GBK
    text = new String(html.getBytes("ISO-8859-1"), "GBK");
}
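
One caveat with taking everything after the first "=" in the content attribute: the first meta tag is not always the one that declares the encoding. A tag such as <meta http-equiv="X-UA-Compatible" content="IE=edge"> would yield "edge", which is not a charset. The sketch below is one possible, more defensive way to pull the charset out of a content value using only the standard library. The class name, method name, and regular expression are assumptions, not part of jsoup or HttpClient.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharsetSketch {

    // Matches "charset=xxx", optionally quoted, case-insensitively.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s\"';]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a meta content value such as
    // "text/html; charset=gb2312", or the fallback when no usable
    // charset is present.
    static String charsetFromContent(String content, String fallback) {
        if (content == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(content);
        if (!m.find()) {
            return fallback;
        }
        String name = m.group(1);
        try {
            // Only accept names this JVM can actually decode.
            return Charset.isSupported(name) ? name : fallback;
        } catch (IllegalArgumentException ex) {
            // Thrown for syntactically illegal charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(charsetFromContent("text/html; charset=utf-8", "GBK")); // utf-8
        System.out.println(charsetFromContent("IE=edge", "GBK"));                  // GBK
        System.out.println(charsetFromContent("text/html", "GBK"));                // GBK
    }
}
```

The returned name can then be passed to new String(html.getBytes("ISO-8859-1"), charset) as in the snippet above.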
