JSoup: fetching page content with automatic charset detection

2013-06-18 03:29, 1141 views

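import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;

public static String getContent(String url) throws Exception {
    HttpClient hc = new HttpClient();
    HttpMethod hm = new GetMethod(url);
    byte[] result;
    try {
        int statusCode = hc.executeMethod(hm);
        if (statusCode != HttpStatus.SC_OK) { // anything other than 200 OK yields an empty result
            return "";
        }
        // Read the raw page bytes once; hm.getStatusLine() holds the HTTP status line if needed
        result = hm.getResponseBody();
    } finally {
        hm.releaseConnection(); // release the connection on every path, including the early return
    }
    if (result == null) {
        return null;
    }
    String charset = JsoupUtils.getCharset(url); // detect the page charset via jsoup (this fetches the URL again)
    return new String(result, charset); // decode the bytes with the detected charset
}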
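
The decode step `new String(result, charset)` throws `UnsupportedEncodingException` when the detected name is one the JVM does not know, for example a typo in the page's meta tag. Below is a minimal sketch of a more forgiving decode step that uses only the JDK. `SafeDecoder`, `resolve`, `decode`, and the fallback parameter are hypothetical names chosen for illustration; they are not part of the original code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class SafeDecoder {
    private SafeDecoder() {}

    /** Resolves a charset name; returns the fallback if the name is missing, malformed, or unsupported. */
    static Charset resolve(String name, Charset fallback) {
        if (name == null || name.trim().isEmpty()) {
            return fallback;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalArgumentException e) {
            // IllegalCharsetNameException and UnsupportedCharsetException
            // are both subclasses of IllegalArgumentException.
            return fallback;
        }
    }

    /** Decodes the body with the named charset, or with the fallback if the name cannot be resolved. */
    static String decode(byte[] body, String charsetName, Charset fallback) {
        if (body == null) {
            return null;
        }
        return new String(body, resolve(charsetName, fallback));
    }

    public static void main(String[] args) {
        byte[] body = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        // A known name is honoured; an unknown one falls back instead of throwing.
        System.out.println(decode(body, "utf-8", StandardCharsets.ISO_8859_1));
        System.out.println(decode(body, "no-such-charset", StandardCharsets.UTF_8));
    }
}
```

Whether a silent fallback is preferable to an exception depends on the caller; a crawler that must keep going may prefer the fallback, while a one-off tool may prefer to fail loudly.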

[Code] Getting the charset

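import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

/**
 * Returns the charset declared in the page's Content-Type meta tag,
 * or "gb2312" if the page declares none.
 */
public static String getCharset(String siteurl) throws Exception {
    URL url = new URL(siteurl);
    Document doc = Jsoup.parse(url, 6 * 1000); // fetch and parse with a 6-second timeout
    Elements eles = doc.select("meta[http-equiv=Content-Type]");
    if (!eles.isEmpty()) {
        // Only the first matching meta tag is consulted
        return RegularUtils.matchCharset(eles.first().toString());
    }
    return "gb2312";
}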
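
`getCharset` downloads the page a second time just to read its meta tag. One way to avoid the extra request is to inspect the bytes that were already fetched. The sketch below assumes the charset declaration sits within the first few kilobytes of the document. It relies on the fact that ISO-8859-1 maps every byte to exactly one char, so an ASCII-only meta tag survives a provisional decode regardless of the page's real encoding. `ByteCharsetSniffer`, `sniff`, and `SNIFF_LIMIT` are hypothetical names, not part of the original code or of jsoup.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ByteCharsetSniffer {
    // Assumption: the declaration appears near the top of the document.
    private static final int SNIFF_LIMIT = 4096;

    // Matches both <meta http-equiv=... content="...; charset=X"> and <meta charset="X">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    private ByteCharsetSniffer() {}

    /** Returns the charset name declared in the first bytes of the page, or defaultCharset if none is found. */
    static String sniff(byte[] body, String defaultCharset) {
        if (body == null) {
            return defaultCharset;
        }
        int len = Math.min(body.length, SNIFF_LIMIT);
        // Provisional single-byte decode; only the ASCII markup is of interest here.
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : defaultCharset;
    }
}
```

With such a helper, the download method could call `ByteCharsetSniffer.sniff(result, "gb2312")` on its own byte array instead of issuing a second request. This sketch ignores the HTTP `Content-Type` response header and byte-order marks, both of which a fuller implementation would also check.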

[Code] Extracting the page charset with a regular expression

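import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Extracts the charset name that follows "charset=" in the given markup,
 * or returns "gb2312" if none is found.
 */
public static String matchCharset(String content) {
    String chs = "gb2312";
    // Match only charset-name characters so the result cannot run past the closing quote
    Pattern p = Pattern.compile("(?<=charset=)[\\w-]+", Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(content);
    if (m.find())
        return m.group();
    return chs;
}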
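
A lookbehind pattern of the form `(?<=charset=)(.+)(?=")` is sensitive to attribute order: the greedy `.+` runs to the last double quote in the string, so it only yields a clean charset name when `content` is the final attribute of the tag. The short demonstration below compares that greedy form with a bounded character class on two sample tags. `CharsetPatternDemo` and its members are hypothetical names used for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharsetPatternDemo {
    // Greedy form: captures everything up to the LAST double quote in the input.
    static final Pattern GREEDY = Pattern.compile("(?<=charset=)(.+)(?=\")");
    // Bounded form: captures only word characters and hyphens after "charset=".
    static final Pattern BOUNDED =
            Pattern.compile("(?<=charset=)[\\w-]+", Pattern.CASE_INSENSITIVE);

    private CharsetPatternDemo() {}

    /** Returns the first match of the pattern in s, or null if there is none. */
    static String first(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String contentLast =
                "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">";
        String contentFirst =
                "<meta content=\"text/html; charset=utf-8\" http-equiv=\"Content-Type\">";

        System.out.println(first(GREEDY, contentLast));   // utf-8
        System.out.println(first(GREEDY, contentFirst));  // utf-8" http-equiv="Content-Type
        System.out.println(first(BOUNDED, contentLast));  // utf-8
        System.out.println(first(BOUNDED, contentFirst)); // utf-8
    }
}
```

The bounded form also copes with single-quoted and unquoted attribute values, since it never depends on a closing double quote being present.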


Tags: jsoup, HTML parsing
