
JSoup: Fetching Page Content with Automatic Charset Detection

2013-06-18 03:29
 
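import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;

// Fetch the page body and decode it with the charset declared by the page.
public static String getContent(String url) throws Exception {
    HttpClient hc = new HttpClient();
    HttpMethod hm = new GetMethod(url);
    byte[] result;
    try {
        int statusCode = hc.executeMethod(hm);
        if (statusCode != HttpStatus.SC_OK) { // check the response status
            return "";
        }
        // Raw page bytes; hm.getStatusLine() holds the HTTP status line of the response.
        result = hm.getResponseBody();
    } finally {
        hm.releaseConnection(); // always release the connection, even on early return
    }
    if (result == null) {
        return null;
    }
    String charset = JsoupUtils.getCharset(url); // detect the page charset via jsoup
    return new String(result, charset); // decode the bytes with the detected charset
}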
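
The charset name passed to new String(result, charset) comes from the page itself, so it may be empty, misspelled, or unsupported by the JVM, in which case decoding throws. The following is a minimal sketch of a guard around that step, using only the standard library. The class name PageDecoder and its methods are hypothetical and not part of the original code; the fallback charset is whatever the caller chooses.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper (not from the original post): decode page bytes
// with a declared charset name, falling back when the name is unusable.
final class PageDecoder {
    private PageDecoder() {}

    // Resolve a declared charset name to a Charset, or return the fallback
    // when the name is null, blank, syntactically illegal, or unsupported.
    static Charset resolve(String declared, Charset fallback) {
        if (declared == null || declared.trim().isEmpty()) {
            return fallback;
        }
        String name = declared.trim();
        try {
            return Charset.isSupported(name) ? Charset.forName(name) : fallback;
        } catch (IllegalCharsetNameException e) {
            return fallback; // name contains characters not allowed in charset names
        }
    }

    // Decode the body with the resolved charset; a null body yields null.
    static String decode(byte[] body, String declared, Charset fallback) {
        if (body == null) {
            return null;
        }
        return new String(body, resolve(declared, fallback));
    }
}
```

Under these assumptions, getContent could end with something like return PageDecoder.decode(result, charset, Charset.forName("gb2312")), provided the JVM supports gb2312.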


[Code] Getting the charset

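import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Get the charset declared by the page at the given URL.
 */
public static String getCharset(String siteurl) throws Exception {
    URL url = new URL(siteurl);
    Document doc = Jsoup.parse(url, 6 * 1000); // fetch and parse with a 6-second timeout
    Elements eles = doc.select("meta[http-equiv=Content-Type]");
    // Only the first matching <meta> element is inspected.
    Element first = eles.first();
    if (first != null) {
        return RegularUtils.matchCharset(first.toString());
    }
    return "gb2312"; // default when the page has no Content-Type meta tag
}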


[Code] Extracting the page charset with a regular expression

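import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Extract the charset name from the text of a meta tag.
 */
public static String matchCharset(String content) {
    String chs = "gb2312"; // default charset
    // Match the token after "charset=", stopping at a quote, semicolon, whitespace, or '>'.
    Pattern p = Pattern.compile("(?<=charset=)[^\"';\\s>]+", Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(content);
    if (m.find()) {
        return m.group();
    }
    return chs;
}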
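
A charset declaration can show up in several shapes: inside the content attribute with the attributes in either order, or as the shorter HTML5 form <meta charset="...">. The following is a minimal sketch, using only java.util.regex, of a pattern that tolerates these variants. The class name CharsetRegexDemo and the two-argument matchCharset are hypothetical and only illustrate the idea; they are not the original RegularUtils.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class (not from the original post).
final class CharsetRegexDemo {
    // "charset", optional spaces around '=', an optional opening quote,
    // then the charset token up to a quote, whitespace, ';', '/' or '>'.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\"'\\s;/>]+)",
            Pattern.CASE_INSENSITIVE);

    private CharsetRegexDemo() {}

    // Return the first declared charset token, or the fallback if none is found.
    static String matchCharset(String content, String fallback) {
        Matcher m = CHARSET.matcher(content);
        return m.find() ? m.group(1) : fallback;
    }
}
```

For example, CharsetRegexDemo.matchCharset("<meta charset=\"UTF-8\">", "gb2312") yields the token UTF-8, while a tag with no charset declaration yields the fallback.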


 
Tags: jsoup, HTML parsing