
How to get the charset from a string and from a file

2014-04-11 22:10
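a. get charset from string
// Requires: java.io.BufferedInputStream, java.io.ByteArrayInputStream, java.io.IOException
public String getCharsetFromString(String srcString) throws IOException {
    // Caveat: getBytes() encodes with the platform default charset, so this
    // inspects the first two bytes of that encoding, not of the original source.
    BufferedInputStream bin = new BufferedInputStream(
            new ByteArrayInputStream(srcString.getBytes()));
    int first = bin.read();
    int second = bin.read();
    if (first == -1 || second == -1) {
        // Fewer than two bytes: nothing to inspect, use the fallback
        return "ISO-8859-1";
    }
    int p = (first << 8) + second;
    String code;
    // The leading bytes 0xefbb, 0xfffe, 0xfeff and 0x5c75 can be used to identify the charset
    switch (p) {
    case 0xefbb: // start of the UTF-8 BOM (EF BB BF)
        code = "UTF-8";
        break;
    case 0xfffe: // UTF-16 little-endian BOM (FF FE)
        code = "UTF-16LE";
        break;
    case 0xfeff: // UTF-16 big-endian BOM (FE FF)
        code = "UTF-16BE";
        break;
    case 0x5c75: // a backslash followed by 'u', i.e. escaped ASCII text
        code = "ANSI|ASCII";
        break;
    default:
        code = "ISO-8859-1";
    }
    return code;
}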
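A Java String carries no charset of its own, so a byte order mark can only be checked on the raw bytes before they are decoded. The following is a minimal sketch of that idea, not the original author's code; the class name BomSniffer and the method detectBom are hypothetical, and only the UTF-8 and UTF-16 marks are covered.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not from the original post): BOM sniffing on raw bytes.
class BomSniffer {
    /** Returns the charset name implied by a leading BOM, or null if none is found. */
    static String detectBom(byte[] data) {
        // UTF-8 BOM is the three bytes EF BB BF
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (data.length >= 2) {
            int p = ((data[0] & 0xFF) << 8) | (data[1] & 0xFF);
            if (p == 0xFEFF) {
                return "UTF-16BE"; // FE FF
            }
            if (p == 0xFFFE) {
                return "UTF-16LE"; // FF FE
            }
        }
        return null; // no BOM: the encoding has to be determined some other way
    }

    public static void main(String[] args) {
        // U+FEFF encoded as UTF-16LE yields the bytes FF FE
        byte[] withBom = "\uFEFFhello".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(detectBom(withBom));
        System.out.println(detectBom("hello".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

A null result only means that no mark was present; many UTF-8 files are written without one.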

b. get charset from file (not sure)
// Requires: java.io.FileInputStream, java.io.InputStreamReader, java.io.IOException
public String getCharsetFromFile(String filePath) throws IOException {
    FileInputStream fis = null;
    InputStreamReader isr = null;
    String s = null;
    try {
        // create a new input stream reader
        fis = new FileInputStream(filePath);
        isr = new InputStreamReader(fis);
        // name of the character encoding used by the reader
        s = isr.getEncoding();
    } catch (Exception e) {
        // print error
        System.out.print("Failed to read the encoding: " + e.getMessage());
    } finally {
        // close the streams and release the associated resources
        if (isr != null)
            isr.close();
        if (fis != null)
            fis.close();
    }
    return s;
}
You cannot determine the encoding of an arbitrary byte stream. This is the nature of encodings. An encoding is a mapping between byte values and the characters they represent, so every encoding "could" be the right one.
The getEncoding() method will return the encoding which was set up (read the JavaDoc) for the stream. It will not guess the encoding for you.
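To illustrate that point, here is a small sketch (the class GetEncodingDemo and its method are hypothetical names) that wraps the same bytes in two readers. Each reader reports the charset it was constructed with, whatever the bytes actually contain.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original post.
class GetEncodingDemo {
    /** Opens a reader over data with the given charset and returns what getEncoding() reports. */
    static String encodingReportedFor(byte[] data, Charset charset) {
        try (InputStreamReader reader =
                     new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            return reader.getEncoding();
        } catch (IOException e) {
            // only close() can throw here; rethrow unchecked to keep the demo short
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] utf8Bytes = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        // Same bytes, two different answers: the reader echoes its configuration.
        System.out.println(encodingReportedFor(utf8Bytes, StandardCharsets.UTF_8));
        System.out.println(encodingReportedFor(utf8Bytes, StandardCharsets.ISO_8859_1));
    }
}
```

The two printed names differ even though the input is identical, which is why method b above cannot be relied on for detection.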
Some streams tell you which encoding was used to create them: XML, HTML. But not an arbitrary byte stream.
Anyway, you could try to guess an encoding on your own if you have to. Every language has a typical frequency for each character. In English the character e appears very often, but ê appears very seldom. In an ISO-8859-1 stream there are usually no 0x00 bytes, but a UTF-16 stream has a lot of them.
Or: you could ask the user. I've already seen applications which present you a snippet of the file in different encodings and ask you to select the "correct" one.
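The 0x00 observation above can be turned into a rough heuristic. The sketch below is only an illustration under stated assumptions: the class and method names are hypothetical, the 0.2 threshold is an arbitrary choice, and the check is only meaningful for text made mostly of Latin characters (UTF-16 text in other scripts contains far fewer zero bytes).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical heuristic, not part of the original post.
class ZeroByteHeuristic {
    /** Fraction of bytes equal to 0x00; returns 0.0 for empty input. */
    static double zeroByteRatio(byte[] data) {
        if (data.length == 0) {
            return 0.0;
        }
        int zeros = 0;
        for (byte b : data) {
            if (b == 0) {
                zeros++;
            }
        }
        return (double) zeros / data.length;
    }

    /** Rough guess only; the 0.2 threshold is an assumption chosen for this sketch. */
    static boolean looksLikeUtf16(byte[] data) {
        return zeroByteRatio(data) > 0.2;
    }

    public static void main(String[] args) {
        String text = "hello world";
        System.out.println(looksLikeUtf16(text.getBytes(StandardCharsets.UTF_16LE)));
        System.out.println(looksLikeUtf16(text.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

A guess like this is best treated as a hint to combine with other evidence, or as a default to offer when asking the user.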

c. get charset from file (sure)
// Requires: java.io.IOException, java.io.InputStream,
// org.mozilla.universalchardet.UniversalDetector (juniversalchardet)
private String getCharsetByInputStream(InputStream ins) {
    String charset = "";
    if (null != ins) {
        UniversalDetector detector = new UniversalDetector(null);
        try {
            // Fixed-size buffer: available() may return 0, which would leave no room to read into
            byte[] buf = new byte[4096];
            int nread;

            while ((nread = ins.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
        } catch (IOException e) {
            LOG.error("--getCharsetByInputStream: error happened while getting charset from input stream. ", e);
            charset = "utf-8";
            return charset;
        }
        detector.dataEnd();
        charset = detector.getDetectedCharset();
        // fall back to UTF-8 when the detector cannot decide
        if (charset == null || "".equals(charset)) {
            charset = "utf-8";
        }

        detector.reset();
    } else {
        charset = "utf-8";
    }

    return charset;
}
link to http://code.google.com/p/juniversalchardet/
Then read the InputStream as a string with the detected charset:
private String parseTruncaredSizeBinaryResourceToString(Integer resourceKey, Integer limitedSize) {
    String truncaredResourceText = null;
    InputStream ins = proactiveAnalysisService.retrieveGlobalResourceBinary(resourceKey, null, limitedSize);
    if (null != ins) {
        String charset = getCharsetByInputStream(ins);
        InputStreamReader reader = null;

        try {
            // Charset detection reads to the end of the stream, so rewind to the beginning.
            // reset() only works on streams that support it (markSupported() returns true).
            ins.reset();
            //ins.skip(ins.available());
        } catch (IOException e) {
            LOG.error("-- parseTruncaredSizeBinaryResourceToString: error happened while resetting the resource input stream. ", e);
        }

        try {
            reader = new InputStreamReader(ins, charset);
        } catch (UnsupportedEncodingException e) {
            LOG.error("-- parseTruncaredSizeBinaryResourceToString: unsupported charset for the resource input stream. ", e);
            return truncaredResourceText;
        }

        // Collect the decoded characters in a StringBuilder. Writing them to a
        // ByteArrayOutputStream would keep only the low 8 bits of each character.
        StringBuilder out = new StringBuilder();
        try {
            int i;
            while ((i = reader.read()) != -1) {
                out.append((char) i);
            }

            truncaredResourceText = out.toString();
        } catch (IOException e) {
            LOG.error("-- parseTruncaredSizeBinaryResourceToString: error happened while reading the input stream. ", e);
        } finally {
            try {
                // closing the reader also closes the underlying input stream
                reader.close();
            } catch (IOException e) {
                LOG.error("-- parseTruncaredSizeBinaryResourceToString: error happened while closing the input stream. ", e);
            }
        }
    }
    return truncaredResourceText;
}
link to http://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream
and how to change the index position of an InputStream: http://stackoverflow.com/questions/3474911/changing-the-index-positioning-in-inputstream
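Rewinding a stream after inspecting it depends on mark() having been called first, on a stream that supports it. The sketch below shows one way to do that with the standard mark/reset API; the class MarkResetDemo and the method peek are hypothetical names, and the caller is assumed to wrap the source in a BufferedInputStream when it does not support marking on its own.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical helper, not part of the original post.
class MarkResetDemo {
    /** Reads up to limit bytes for inspection, then rewinds so the stream can be read again. */
    static byte[] peek(InputStream in, int limit) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(limit); // remember the current position for at most limit bytes
            byte[] buf = new byte[limit];
            int total = 0;
            int n;
            while (total < limit && (n = in.read(buf, total, limit - total)) > 0) {
                total += n;
            }
            in.reset(); // back to the marked position
            return Arrays.copyOf(buf, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream(new byte[]{10, 20, 30, 40, 50}));
        System.out.println(Arrays.toString(peek(in, 3)));
        // The second call sees the same bytes because the stream was rewound.
        System.out.println(Arrays.toString(peek(in, 3)));
    }
}
```

The bytes returned by peek could be handed to a charset detector, after which the same stream can still be decoded from its first byte.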

This article comes from the "六度空间" blog; please keep this attribution: http://jasonwalker.blog.51cto.com/7020143/1394395