
Parsing a page with HttpClient + jsoup

2012-03-24 15:03
Steps:

1. Set the URL: HttpPost httpPost = new HttpPost(url);

// If the parameters are already part of the URL, use a GET request instead: HttpGet httpGet = new HttpGet(url);

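2. Set the parameters (not needed when using HttpGet):

List<NameValuePair> params = new ArrayList<NameValuePair>();

params.add(new BasicNameValuePair(name, value)); // name and value are both Strings

// ...add any further parameters the same way

httpPost.setEntity(new UrlEncodedFormEntity(params, "GB2312")); // charset used to encode the form values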
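As a rough illustration of what the form entity in step 2 puts on the wire, the sketch below URL-encodes name/value pairs with the JDK's own java.net.URLEncoder rather than HttpClient. The class FormEncodingSketch, the method encodeForm, and the sample field names are hypothetical. The point is only that the charset argument ("GB2312" above) decides the percent-encoded bytes, so it should match what the server expects.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of HttpClient: builds an
// application/x-www-form-urlencoded body by hand.
final class FormEncodingSketch {

    // Joins the pairs as name=value&name=value, percent-encoding
    // both sides with the given charset.
    static String encodeForm(Map<String, String> params, String charset) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> p : params.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(enc(p.getKey(), charset))
                .append('=')
                .append(enc(p.getValue(), charset));
        }
        return body.toString();
    }

    private static String enc(String s, String charset) {
        try {
            return URLEncoder.encode(s, charset);
        } catch (UnsupportedEncodingException e) {
            // Surface an unknown charset name as an unchecked error
            throw new IllegalArgumentException("Unsupported charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("keyword", "\u4e2d\u6587"); // made-up sample field
        params.put("page", "1");
        System.out.println(encodeForm(params, "GB2312")); // keyword=%D6%D0%CE%C4&page=1
        System.out.println(encodeForm(params, "UTF-8"));  // keyword=%E4%B8%AD%E6%96%87&page=1
    }
}
```

The same two characters produce different bytes under GB2312 and UTF-8, which is why a mismatched charset makes the server see the wrong parameter values.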

3. Execute the request:

HttpClient httpClient = new DefaultHttpClient();

HttpResponse rps0 = httpClient.execute(httpPost);

  // The status code tells you whether the request succeeded; do the following steps inside the if block

   int resStatus = rps0.getStatusLine().getStatusCode(); // status code
    if (resStatus == HttpStatus.SC_OK) {

    }

4. Get the HTML:

HttpEntity entity0 = rps0.getEntity();

  String html = EntityUtils.toString(entity0);
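One detail worth knowing about step 4: when the response's Content-Type header carries no charset, EntityUtils.toString(entity) in HttpClient 4.x decodes the body as ISO-8859-1, which is how garbled text usually arises on GBK/GB2312 pages. The sketch below imitates that selection rule with plain JDK classes. CharsetPickSketch, charsetOf and decode are hypothetical names, the header parsing is deliberately simplistic, and the real library may treat unknown charset names differently.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the charset choice made when decoding a body.
final class CharsetPickSketch {

    // Returns the charset named in a Content-Type value, or ISO-8859-1
    // when the header is absent or has no usable charset parameter.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                // Case-insensitive check for a "charset=" prefix
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException ignored) {
                        // Unknown or malformed name: fall through to the default
                    }
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    static String decode(byte[] body, String contentType) {
        return new String(body, charsetOf(contentType));
    }

    public static void main(String[] args) {
        byte[] body = "\u4e2d\u6587".getBytes(Charset.forName("GBK"));
        // Correct text when the header names the charset
        System.out.println(decode(body, "text/html; charset=GBK"));
        // Garbled text when it does not: each GBK byte becomes one Latin-1 char
        System.out.println(decode(body, "text/html"));
    }
}
```

If the header does name a charset, the string returned in step 4 is already decoded correctly and needs no further transcoding.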

5. Close the connection:

  httpClient.getConnectionManager().shutdown();

6. Parse the HTML:

  Document doc = Jsoup.parse(html);

7. Other

If the HTML you get back is garbled, re-decode it using the charset the page declares:

Document doc = Jsoup.parse(html);
Element e = doc.getElementsByTag("meta").first();
String charset = "GBK"; // default to GBK when the page's encoding cannot be found
if (e != null) {
    // attr() returns "" (never null) for a missing attribute; compare with isEmpty(), not != ""
    String content = e.attr("content");
    int i = content.toLowerCase().indexOf("charset=");
    if (i >= 0) {
        // e.g. content="text/html; charset=gb2312"
        charset = content.substring(i + "charset=".length()).trim();
    } else if (!e.attr("charset").isEmpty()) {
        // e.g. <meta charset="gbk">
        charset = e.attr("charset");
    }
    System.out.println(charset);
}
// This only works if html was decoded as ISO-8859-1 in step 4 (the default when the response header names no charset)
String text = new String(html.getBytes("ISO-8859-1"), charset);
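The transcoding trick above relies on ISO-8859-1 mapping every byte to exactly one char, so html.getBytes("ISO-8859-1") recovers the original response bytes unchanged. The following self-contained sketch uses only JDK classes to simulate a GB2312 page that was wrongly decoded as ISO-8859-1 and then repair it. MojibakeRepairSketch, metaCharset and redecode are hypothetical names, and the regex is only a rough stand-in for the jsoup meta lookup.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of the ISO-8859-1 round trip; not the article's exact code.
final class MojibakeRepairSketch {

    // Matches both <meta charset="gbk"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset declared in a meta tag, or the fallback if none is found.
    static String metaCharset(String html, String fallback) {
        Matcher m = META_CHARSET.matcher(html);
        return m.find() ? m.group(1) : fallback;
    }

    // Recovers the raw bytes from a string that was decoded as ISO-8859-1,
    // then decodes them with the charset the page actually uses.
    static String redecode(String garbled, String charset) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName(charset));
    }

    public static void main(String[] args) {
        String page = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=gb2312\">"
                + "<title>\u4e2d\u6587</title></head></html>";
        // Simulate the server sending GB2312 bytes and the client misreading them
        byte[] wire = page.getBytes(Charset.forName("GB2312"));
        String garbled = new String(wire, StandardCharsets.ISO_8859_1);

        String fixed = redecode(garbled, metaCharset(garbled, "GBK"));
        System.out.println(fixed.equals(page)); // true
    }
}
```

The meta tag stays readable in the garbled string because its characters are plain ASCII. The round trip is lossless only for ISO-8859-1; if the body was first decoded with some other charset, bytes may already have been replaced and cannot be recovered this way.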