您的位置:首页 > 运维架构

hadoop 中文乱码问题解决

2013-03-09 22:22 295 查看
当做分析的时候有时候会接触到中文。因为hadoop默认都是UTF8格式。。读了一下源码,又结合网上的说法重新了一下FileOutPutFormat

其实hadoop原本在操作的时候一直操作的是 字节组的载入。问题出在我们手写部分 Text.toString() 方法,他被默认转换成UTF8格式的了。

public class GbkOutputFormat<K, V> extends FileOutputFormat<K, V> {

protected static class LineRecordWriter<K, V>
extends RecordWriter<K, V> {
private static final String gbk = "gbk";
private static final byte[] newline;
static {
try {
newline = "\n".getBytes(gbk);
} catch (UnsupportedEncodingException uee) {
throw new IllegalArgumentException("can't find " + gbk + " encoding");
}
}

protected DataOutputStream out;
private final byte[] keyValueSeparator;

public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
this.out = out;
try {
this.keyValueSeparator = keyValueSeparator.getBytes(gbk);
} catch (UnsupportedEncodingException uee) {
throw new IllegalArgumentException("can't find " + gbk + " encoding");
}
}

public LineRecordWriter(DataOutputStream out) {
this(out, "\t");
}

/**
* Write the object to the byte stream, handling Text as a special
* case.
* @param o the object to print
* @throws IOException if the write throws, we pass it on
*/
private void writeObject(Object o) throws IOException {
if (o instanceof Text) {

out.write(o.toString().getBytes(gbk));
}
}

public synchronized void write(K key, V value)
throws IOException {

boolean nullKey = key == null || key instanceof NullWritable;
boolean nullValue = value == null || value instanceof NullWritable;
if (nullKey && nullValue) {
return;
}
if (!nullKey) {
writeObject(key);
}
if (!(nullKey || nullValue)) {
out.write(keyValueSeparator);
}
if (!nullValue) {
writeObject(value);
}
out.write(newline);
}

public synchronized
void close(TaskAttemptContext context) throws IOException {
out.close();
}
}

public RecordWriter<K, V>
getRecordWriter(TaskAttemptContext job
) throws IOException, InterruptedException {
Configuration conf = job.getConfiguration();
boolean isCompressed = getCompressOutput(job);
String keyValueSeparator= conf.get("mapred.textoutputformat.separator",
"\t");
CompressionCodec codec = null;
String extension = "";
if (isCompressed) {
Class<? extends CompressionCodec> codecClass =
getOutputCompressorClass(job, GzipCodec.class);
codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
extension = codec.getDefaultExtension();
}
Path file = getDefaultWorkFile(job, extension);
FileSystem fs = file.getFileSystem(conf);
if (!isCompressed) {
FSDataOutputStream fileOut = fs.create(file, false);
return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
} else {
FSDataOutputStream fileOut = fs.create(file, false);
return new LineRecordWriter<K, V>(new DataOutputStream
(codec.createOutputStream(fileOut)),
keyValueSeparator);
}
}
}


手写的mapper 方法中要注意转换成GBK格式,以确保从头到尾都是GBK操作。

String str=new String(value.getBytes[],"GBK") ;

这样中文问题就解决了
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签: