how to get charset from string and file
2014-04-11 22:10
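a. Get charset from string
// needs: import java.io.*;
public String getCharsetFromString(String srcString) throws IOException {
    // Caveat: getBytes() re-encodes the string with the platform default charset,
    // so this only inspects the bytes produced by that encoding.
    BufferedInputStream bin = new BufferedInputStream(new ByteArrayInputStream(
            srcString.getBytes()));
    // combine the first two bytes into one 16-bit value
    int p = (bin.read() << 8) + bin.read();
    String code = null;
    // the leading bytes 0xefbb, 0xfffe, 0xfeff or 0x5c75 can be used to identify the charset
    switch (p) {
    case 0xefbb: // first two bytes of the UTF-8 byte order mark (EF BB BF)
        code = "UTF-8";
        break;
    case 0xfffe: // FF FE is the UTF-16 little-endian byte order mark
        code = "Unicode";
        break;
    case 0xfeff: // FE FF is the UTF-16 big-endian byte order mark
        code = "UTF-16BE";
        break;
    case 0x5c75: // the ASCII characters '\' and 'u', i.e. the text starts with a \u escape
        code = "ANSI|ASCII";
        break;
    default: // no known marker found; this is a fallback, not a detection result
        code = "ISO-8859-1";
    }
    return code;
}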
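Since a Java String is already decoded, the byte order mark check is more useful on the raw bytes themselves. The following is only a minimal sketch under that assumption; the class `BomSniffer` and the method `detectBom` are hypothetical names that do not come from the original snippet, and it checks all three bytes of the UTF-8 mark instead of two.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original post.
class BomSniffer {
    /** Returns the charset implied by a leading byte order mark, or null if there is none. */
    static String detectBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        // Note: FF FE followed by 00 00 would be a UTF-32LE mark; this sketch ignores UTF-32.
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null; // no mark: the encoding is unknown, not necessarily ISO-8859-1
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detectBom(utf8WithBom));
        System.out.println(detectBom("hi".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Returning null instead of a default makes it explicit that a missing mark says nothing about the encoding; the caller can then decide on a fallback.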
b. Get charset from file (not reliable)
// needs: import java.io.*;
public String getCharsetFromFile(String filePath) throws IOException {
    FileInputStream fis = null;
    InputStreamReader isr = null;
    String s = null;
    try {
        // create a new input stream reader (it uses the platform default charset)
        fis = new FileInputStream(filePath);
        isr = new InputStreamReader(fis);
        // returns the name of the encoding the reader was set up with, not a detected one
        s = isr.getEncoding();
    } catch (Exception e) {
        // print error
        System.out.print("Could not open the file: " + e.getMessage());
    } finally {
        // close the streams and release associated resources
        if (isr != null)
            isr.close();
        if (fis != null)
            fis.close();
    }
    return s;
}
You cannot determine the encoding of an arbitrary byte stream. This is the nature of encodings. An encoding is a mapping between byte values and the characters they represent, so every encoding "could" be the right one.
The getEncoding() method will return the encoding which was set up (read the JavaDoc) for the stream. It will not guess the encoding for you.
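To see this behaviour, here is a small sketch (the class `GetEncodingDemo` and its method are hypothetical names, not from the original post). It opens a reader as ISO-8859-1 over two differently encoded byte arrays; getEncoding() should report the same configured encoding both times, regardless of the content.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original post.
class GetEncodingDemo {
    /** Returns what getEncoding() reports for a reader opened as ISO-8859-1 over the given bytes. */
    static String reportedEncoding(byte[] data) {
        try (InputStreamReader reader =
                 new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.ISO_8859_1)) {
            return reader.getEncoding();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = "h\u00e9llo".getBytes(StandardCharsets.UTF_16);
        // Both lines print the same name: the reader never looks at the bytes to decide.
        System.out.println(reportedEncoding(utf8));
        System.out.println(reportedEncoding(utf16));
    }
}
```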
Some streams tell you which encoding was used to create them: XML, HTML. But not an arbitrary byte stream.
Anyway, you could try to guess an encoding on your own if you have to. Every language has a typical frequency for each character. In English the char e appears very often, but ê appears very seldom. In an ISO-8859-1 stream there are usually no 0x00 bytes, but a UTF-16 stream of Western text has a lot of them.
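The 0x00 observation can be turned into a rough check. This is only a sketch of the idea: the class `NulByteHeuristic`, its method names and the 0.2 threshold are assumptions chosen for illustration, and the result is a guess, not a detection.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original post.
class NulByteHeuristic {
    /** Fraction of bytes equal to 0x00; 0.0 for empty input. */
    static double nulRatio(byte[] data) {
        if (data.length == 0) {
            return 0.0;
        }
        int zeros = 0;
        for (byte b : data) {
            if (b == 0) {
                zeros++;
            }
        }
        return (double) zeros / data.length;
    }

    /** Rough guess only: the 0.2 threshold is an arbitrary assumption. */
    static boolean looksLikeUtf16(byte[] data) {
        return nulRatio(data) > 0.2;
    }

    public static void main(String[] args) {
        String text = "Hello, world";
        System.out.println(looksLikeUtf16(text.getBytes(StandardCharsets.UTF_16LE)));
        System.out.println(looksLikeUtf16(text.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

This works for mostly-Latin text, where every second UTF-16 byte is zero; for text in other scripts the ratio can be much lower, so the check can miss.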
Or: you could ask the user. I've already seen applications which present you with a snippet of the file in different encodings and ask you to select the "correct" one.
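Such a preview could be produced roughly like this. It is a sketch only; the class `EncodingPreview`, the method `previews` and the choice of candidate charsets are assumptions for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the original post.
class EncodingPreview {
    /** Decodes the first maxBytes bytes once per candidate charset, keyed by charset name. */
    static Map<String, String> previews(byte[] data, int maxBytes, Charset... candidates) {
        int n = Math.min(maxBytes, data.length);
        Map<String, String> result = new LinkedHashMap<>();
        for (Charset cs : candidates) {
            // Cutting at a byte count may split a multi-byte character at the very end of the preview.
            result.put(cs.name(), new String(data, 0, n, cs));
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] data = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        Map<String, String> shown =
            previews(data, 64, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        // Show each decoding so the user can pick the one that reads correctly.
        shown.forEach((name, text) -> System.out.println(name + ": " + text));
    }
}
```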
c. Get charset from file (more reliable)
// needs juniversalchardet: import org.mozilla.universalchardet.UniversalDetector;
private String getCharsetByInputStream(InputStream ins) {
    String charset = "";
    if (null != ins) {
        UniversalDetector detector = new UniversalDetector(null);
        try {
            // use a fixed-size buffer: available() may return 0, which would give an empty buffer
            byte[] buf = new byte[4096];
            int nread;
            // feed the detector until it is confident or the stream ends
            while ((nread = ins.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
        } catch (IOException e) {
            LOG.error("--getCharsetByInputStream: error happened while getting charset from inputstream. ", e);
            charset = "utf-8";
            return charset;
        }
        detector.dataEnd();
        charset = detector.getDetectedCharset();
        if (charset == null || "".equals(charset)) {
            // fall back to UTF-8 when nothing was detected
            charset = "utf-8";
        }
        detector.reset();
    } else {
        charset = "utf-8";
    }
    return charset;
}
Link: http://code.google.com/p/juniversalchardet/
Then read the InputStream as a string using the detected charset:
private String parseTruncaredSizeBinaryResourceToString(Integer resourceKey, Integer limitedSize) {
    String truncaredResourceText = null;
    InputStream ins = proactiveAnalysisService.retrieveGlobalResourceBinary(resourceKey, null, limitedSize);
    if (null != ins) {
        // reset() only works after mark(); wrap the stream if it does not support marking
        if (!ins.markSupported()) {
            ins = new BufferedInputStream(ins);
        }
        // the resource is size-limited, so keeping it buffered for reset() is acceptable
        ins.mark(Integer.MAX_VALUE);
        String charset = getCharsetByInputStream(ins);
        InputStreamReader reader = null;
        try {
            // rewind to the beginning, because charset detection has consumed the stream
            ins.reset();
        } catch (IOException e) {
            LOG.error("-- parseTruncaredSizeBinaryResourceToString: error happened while resetting resource inputstream. ", e);
        }
        try {
            reader = new InputStreamReader(ins, charset);
        } catch (UnsupportedEncodingException e) {
            LOG.error("-- parseTruncaredSizeBinaryResourceToString: error happened while reading content from resource inputstream. ", e);
            return truncaredResourceText;
        }
        try {
            // collect decoded chars in a StringBuilder; writing them to a byte stream
            // would cut every char down to its low 8 bits
            StringBuilder sb = new StringBuilder();
            int i = -1;
            while ((i = reader.read()) != -1) {
                sb.append((char) i);
            }
            truncaredResourceText = sb.toString();
        } catch (IOException e) {
            LOG.error("-- parseTruncaredSizeBinaryResourceToString: error happened when reading inputstream. ", e);
        } finally {
            try {
                // closing the reader also closes the underlying stream
                reader.close();
            } catch (IOException e) {
                LOG.error("-- parseTruncaredSizeBinaryResourceToString: error happened while closing inputstream. ", e);
            }
        }
    }
    return truncaredResourceText;
}
Link: http://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream
And how to change the index position of an InputStream: http://stackoverflow.com/questions/3474911/changing-the-index-positioning-in-inputstream
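For rewinding a stream after inspecting its beginning, mark() and reset() are the standard tools. Below is a minimal sketch of that pattern; the class `MarkResetDemo` and its method names are hypothetical and not taken from the linked answers. It assumes the stream supports marking, which a BufferedInputStream wrapper provides.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical demo class, not part of the original post.
class MarkResetDemo {
    /** Reads up to peekSize bytes, then rewinds so the next read starts at the beginning again. */
    static byte[] peek(InputStream in, int peekSize) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("wrap the stream in a BufferedInputStream first");
        }
        in.mark(peekSize); // remember the current position for up to peekSize bytes
        byte[] buf = new byte[peekSize];
        int total = 0;
        int n;
        while (total < peekSize && (n = in.read(buf, total, peekSize - total)) > 0) {
            total += n;
        }
        in.reset(); // jump back to the marked position
        return Arrays.copyOf(buf, total);
    }

    /** Peeks at the head of the data, then reads everything; returns {head, full content}. */
    static byte[][] peekThenReadAll(byte[] data, int peekSize) {
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(data))) {
            byte[] head = peek(in, peekSize);
            ByteArrayOutputStream all = new ByteArrayOutputStream();
            int b;
            while ((b = in.read()) != -1) {
                all.write(b);
            }
            return new byte[][] {head, all.toByteArray()};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[][] result = peekThenReadAll(new byte[] {1, 2, 3, 4, 5}, 3);
        System.out.println(Arrays.toString(result[0]));
        System.out.println(Arrays.toString(result[1]));
    }
}
```

The read limit passed to mark() must cover everything read before reset(); reading more than that may invalidate the mark.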
This article is from the “六度空间” blog. Please keep this source link: http://jasonwalker.blog.51cto.com/7020143/1394395