
A Simple Way to Automatically Detect a File's Encoding

2012-09-19 15:55

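A while ago I needed to read files for an article. Because the files used different encodings, I kept having to change the charset passed to the reader in my program: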

BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(new File(fileName)), charsetName));
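To illustrate why the charset argument matters, here is a minimal sketch that decodes the same bytes with two different charsets. The class name CharsetMismatchDemo and the helper decode are hypothetical names chosen for this example; decoding a byte array with new String(bytes, charset) stands in for what InputStreamReader does on a stream.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetMismatchDemo {
    // Decode raw bytes with the named charset, much as InputStreamReader would.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Two Chinese characters (U+7F16, U+7801) encoded as UTF-8: 3 bytes each.
        String original = "\u7f16\u7801";
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);

        // Correct charset: the text round-trips.
        System.out.println(decode(raw, "UTF-8").equals(original));
        // Wrong charset: every byte becomes its own character, so the text is garbled.
        System.out.println(decode(raw, "ISO-8859-1").equals(original));
    }
}
```

With the wrong charset nothing fails loudly; the text simply comes out garbled, which is why the encoding has to be known before reading.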

         I found some material online and am summarizing it here so that it is easy to look up when I need it again. My notes are as follows:

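        Unicode (UTF-16, little endian): the first two bytes are FF FE;

        Unicode big endian (UTF-16BE): the first two bytes are FE FF;

        UTF-8: the first three bytes are EF BB BF.

        These leading bytes are the byte order mark (BOM). The BOM is optional, so a file without one can still be in any of these encodings.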
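The BOM checks listed above can be sketched as a small lookup. This is only an illustration: the class name BomSniffer and the method detectBom are hypothetical, and the sketch deliberately ignores UTF-32, whose little-endian BOM also begins with FF FE.

```java
final class BomSniffer {
    /**
     * Returns the charset name implied by a leading BOM,
     * or null when the bytes do not start with a known BOM.
     */
    static String detectBom(byte[] head) {
        // Mask with 0xFF because Java bytes are signed.
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        return null; // no BOM: the encoding has to be guessed some other way
    }
}
```

Returning null rather than a default keeps "no BOM found" separate from "the file is GBK", so the caller can decide what to fall back to.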

        Method 1. Since the encodings I mostly deal with are UTF-8 and GBK, in many cases the following check is all that is needed:

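         File file = new File(fileName);
         byte[] b = new byte[3];
         int n;
         // try-with-resources closes the stream even if read() throws
         try (InputStream ios = new FileInputStream(file)) {
                  n = ios.read(b);
         }
         String encode;
         // EF BB BF (-17, -69, -65 as signed bytes) is the UTF-8 BOM at the start of the file
         if (n == 3 && b[0] == -17 && b[1] == -69 && b[2] == -65) {
                  encode = "UTF-8";
                  System.out.println(file.getName() + ": encoding is UTF-8");
         } else {
                  encode = "GBK";
                  System.out.println(file.getName() + ": may be GBK, or some other encoding.");
         }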
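Once the BOM has been recognized, it should usually be skipped when the file is read, because Java's UTF-8 decoder leaves it in the text as a leading U+FEFF character. A minimal sketch under that assumption follows; BomAwareReader and readText are hypothetical names, and the fallback charset is a parameter so that the caller chooses what "not UTF-8 with BOM" should mean (GBK in the case above).

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class BomAwareReader {
    /**
     * Reads the whole file as text. If it starts with the UTF-8 BOM,
     * the BOM is skipped and the rest is decoded as UTF-8; otherwise
     * the bytes are decoded with the caller-supplied fallback charset.
     */
    static String readText(Path path, Charset fallback) throws IOException {
        byte[] raw = Files.readAllBytes(path);
        boolean hasUtf8Bom = raw.length >= 3
                && raw[0] == (byte) 0xEF
                && raw[1] == (byte) 0xBB
                && raw[2] == (byte) 0xBF;
        if (hasUtf8Bom) {
            // Start decoding after the 3 BOM bytes.
            return new String(raw, 3, raw.length - 3, StandardCharsets.UTF_8);
        }
        return new String(raw, fallback);
    }
}
```

A call such as BomAwareReader.readText(file.toPath(), Charset.forName("GBK")) would mirror the UTF-8 or GBK decision made above. Reading the whole file into memory is only reasonable for small files.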

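Method 2. See http://www.cppblog.com/biao/archive/2009/11/04/100130.aspx
         This one is more thorough: basically anything Method 1 can identify, this method can identify too. When there is no BOM, it scans the file for byte sequences that look like multi-byte UTF-8.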
         // requires: java.io.BufferedInputStream, java.io.File, java.io.FileInputStream
         public static String get_charset(File file) {
             String charset = "GBK"; // default when nothing else matches
             byte[] first3Bytes = new byte[3];
             // try-with-resources closes the stream on every path, including the early return
             try (BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file))) {
                 boolean checked = false;
                 bis.mark(3); // allow the 3 BOM bytes to be re-read after reset()
                 int read = bis.read(first3Bytes, 0, 3);
                 if (read == -1)
                     return charset; // empty file
                 if (first3Bytes[0] == (byte) 0xFF && first3Bytes[1] == (byte) 0xFE) {
                     charset = "UTF-16LE";
                     checked = true;
                 } else if (first3Bytes[0] == (byte) 0xFE && first3Bytes[1] == (byte) 0xFF) {
                     charset = "UTF-16BE";
                     checked = true;
                 } else if (first3Bytes[0] == (byte) 0xEF && first3Bytes[1] == (byte) 0xBB
                         && first3Bytes[2] == (byte) 0xBF) {
                     charset = "UTF-8";
                     checked = true;
                 }
                 bis.reset();
                 if (!checked) {
                     // No BOM: scan from the start for UTF-8-style multi-byte sequences
                     while ((read = bis.read()) != -1) {
                         if (read >= 0xF0)
                             break;
                         if (0x80 <= read && read <= 0xBF) // a lone byte in 0x80-0xBF is also treated as GBK
                             break;
                         if (0xC0 <= read && read <= 0xDF) {
                             read = bis.read();
                             if (0x80 <= read && read <= 0xBF)
                                 // two-byte pattern (0xC0-0xDF)(0x80-0xBF); this can also occur in GB encodings
                                 continue;
                             else
                                 break;
                         } else if (0xE0 <= read && read <= 0xEF) { // can still be wrong, but the odds are small
                             read = bis.read();
                             if (0x80 <= read && read <= 0xBF) {
                                 read = bis.read();
                                 if (0x80 <= read && read <= 0xBF) {
                                     charset = "UTF-8";
                                     break;
                                 } else
                                     break;
                             } else
                                 break;
                         }
                     }
                 }
             } catch (Exception e) {
                 e.printStackTrace();
             }
             return charset;
         }
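The scan in Method 2 decides on UTF-8 as soon as it sees one plausible three-byte sequence. A stricter alternative, which is not from the linked article, is to let the JDK's own decoder validate every byte. The sketch below assumes the data fits in memory; StrictUtf8Check and isValidUtf8 are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8Check {
    /** Returns true only if the whole array decodes as well-formed UTF-8. */
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                // Fail on bad input instead of silently substituting U+FFFD.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Pure ASCII data also passes this check, since ASCII is a subset of UTF-8, so a true result means "can safely be read as UTF-8" rather than "was definitely saved as UTF-8". Text in GBK usually fails, because its byte pairs rarely form valid UTF-8 sequences.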

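Method 3. See http://blog.sina.com.cn/s/blog_904e7b150100zvcv.html

         This method needs the cpdetector jar; I used cpdetector_1.0.10.jar. Using it also requires adding two more jars, antlr and chardet.

         In most cases this library detects the encoding fairly accurately, but in my tests it failed to recognize GBK-encoded files (my testing may not have been thorough enough).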

         // requires: info.monitorenter.cpdetector.io.* (cpdetector), java.io.File, java.io.IOException,
         //           java.net.MalformedURLException, java.nio.charset.Charset
         public static String getFileEncode(File file) {
             CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();
             // Register the detectors to try
             detector.add(new ParsingDetector(false));
             detector.add(JChardetFacade.getInstance());
             detector.add(ASCIIDetector.getInstance());
             detector.add(UnicodeDetector.getInstance());
             Charset charSet = null;
             try {
                 charSet = detector.detectCodepage(file.toURI().toURL());
             } catch (MalformedURLException e) {
                 e.printStackTrace();
             } catch (IOException e) {
                 e.printStackTrace();
             }
             // null means detection failed or threw an exception
             if (charSet != null) {
                 return charSet.name();
             } else {
                 return null;
             }
         }
