您的位置:首页 > 产品设计 > UI/UE

hadoop SequenceFile介绍 大数据 存储

2015-08-14 16:32 781 查看
SequenceFile是一个由二进制序列化过的key/value的字节流组成的文本存储文件。

基于压缩类型CompressType,共有三种
SequenceFile
Writer:


There are three
SequenceFile
Writer
s
based on the CompressType
 used
to compress key/value pairs:

1、Writer
: Uncompressed records.

在SequenceFile里有Writer的实现,源码如下:

首先SequenceFile文件有个表头,以上是写表头的代码,我们以一个实例结合代码来看一下表头的组成。


存入的文本文件内容如下:

key value

100 One, two, buckle my shoe

99 Three, four, shut the door

98 Five, six, pick up sticks

97 Seven, eight, lay them straight

96 Nine, ten, a big fat hen

95 One, two, buckle my shoe

94 Three, four, shut the door

93 Five, six, pick up sticks

92 Seven, eight, lay them straight

91 Nine, ten, a big fat hen

90 One, two, buckle my shoe

89 Three, four, shut the door

88 Five, six, pick up sticks

87 Seven, eight, lay them straight

86 Nine, ten, a big fat hen

85 One, two, buckle my shoe

84 Three, four, shut the door

83 Five, six, pick up sticks

82 Seven, eight, lay them straight

81 Nine, ten, a big fat hen

80 One, two, buckle my shoe

79 Three, four, shut the door

78 Five, six, pick up sticks

77 Seven, eight, lay them straight

76 Nine, ten, a big fat hen

75 One, two, buckle my shoe

74 Three, four, shut the door

73 Five, six, pick up sticks

72 Seven, eight, lay them straight

71 Nine, ten, a big fat hen

70 One, two, buckle my shoe

69 Three, four, shut the door

68 Five, six, pick up sticks

67 Seven, eight, lay them straight

66 Nine, ten, a big fat hen

65 One, two, buckle my shoe

64 Three, four, shut the door

63 Five, six, pick up sticks

62 Seven, eight, lay them straight

61 Nine, ten, a big fat hen

60 One, two, buckle my shoe

59 Three, four, shut the door

58 Five, six, pick up sticks

57 Seven, eight, lay them straight

56 Nine, ten, a big fat hen

55 One, two, buckle my shoe

54 Three, four, shut the door

53 Five, six, pick up sticks

52 Seven, eight, lay them straight

51 Nine, ten, a big fat hen

50 One, two, buckle my shoe

49 Three, four, shut the door

48 Five, six, pick up sticks

47 Seven, eight, lay them straight

46 Nine, ten, a big fat hen

45 One, two, buckle my shoe

44 Three, four, shut the door

43 Five, six, pick up sticks

42 Seven, eight, lay them straight

41 Nine, ten, a big fat hen

40 One, two, buckle my shoe

39 Three, four, shut the door

38 Five, six, pick up sticks

37 Seven, eight, lay them straight

36 Nine, ten, a big fat hen

35 One, two, buckle my shoe

34 Three, four, shut the door

33 Five, six, pick up sticks

32 Seven, eight, lay them straight

31 Nine, ten, a big fat hen

30 One, two, buckle my shoe

29 Three, four, shut the door

28 Five, six, pick up sticks

27 Seven, eight, lay them straight

26 Nine, ten, a big fat hen

25 One, two, buckle my shoe

24 Three, four, shut the door

23 Five, six, pick up sticks

22 Seven, eight, lay them straight

21 Nine, ten, a big fat hen

20 One, two, buckle my shoe

19 Three, four, shut the door

18 Five, six, pick up sticks

17 Seven, eight, lay them straight

16 Nine, ten, a big fat hen

15 One, two, buckle my shoe

14 Three, four, shut the door

13 Five, six, pick up sticks

12 Seven, eight, lay them straight

11 Nine, ten, a big fat hen

10 One, two, buckle my shoe

9 Three, four, shut the door

8 Five, six, pick up sticks

7 Seven, eight, lay them straight

6 Nine, ten, a big fat hen

5 One, two, buckle my shoe

4 Three, four, shut the door

3 Five, six, pick up sticks

2 Seven, eight, lay them straight

1 Nine, ten, a big fat hen

把以上java 打成jar,执行hadoop jar sfile.jar /test/numbers.seq 1 1代表不压缩

hadoop fs -get hdfs:///test/numbers1.seq /usr/test/ 取出刚刚生成的文件,我们用UE打开看一下





0x53 0x45 0x51

这是SequenceFile Format的magic header「SEQ」,用来区别文本是否是「SequenceFile Format」。

0x06

版本编号,目前最新版为「SEQ6」

以上代码源码如下:

0x20
0x65........0x65

这部份是keyClassName(Key的类名),而第1个Byte(0x20)用来表示此字串的长度,此范例为org.apache.hadoop.io.IntWritable

以上代码源码如下:

0x19
0x6F ..... 0x74

这部份是valueClassName(Value的类名),第1个Byte(0x19)也是用来表示此字串的长度,此范例为org.apache.hadoop.io.BytesWritable

0x00

是否compression?「0x00」=否 (此为Boolean所以佔1个Byte)

0x00

是否blockCompression?「0x00」=否(此为Boolean所以佔1个Byte)

如果是压缩的话接下去还会写压缩类,此范例没有压缩所以没有此类名的写入,源码如下:

0x00 0x00 0x00 0x00

metadata,此范例沒有包含任何metadata, 所以输出「0x00 0x00 0x00 0x00」

0x76 0x61
..... 0xAF

sync标记,用来表示一个「Header」的结束。

至此头部文件写入完毕,可见头部的格式可以归纳如下:

SequenceFile Header


version - 3 bytes of magic header SEQ, followed by 1 byte of actual version number (e.g. SEQ4 or SEQ6)

keyClassName -key class

valueClassName - value class

compression - A boolean which specifies if compression is turned on for keys/values in this file.

blockCompression - A boolean which specifies if block-compression is turned on for keys/values in this file.

compression codec -
CompressionCodec
class which is used for compression of keys and/or values (if compression is enabled).

metadata -
Metadata
for this file.

sync - A sync marker to denote end of the header.

接下去我们看一下数据存储格式:

0x00 0x00 0x00 0x1D

整个Record的size ,一个Record包含「Key、Value」。此处为29个字节,因为 key=100 占4个字节,value=One, two, buckle my shoe 占24字节 还有一个字节存了value的长度 所以
29字节总共。

0x00 0x00 0x00 0x04

Key內容的size~ (此为Int所以占4个字节)

0x00 0x00 0x00 0x64

Key的内容,此处为100,那十六进制就是64

0x18

value内容的size,此处是One, two, buckle my shoe,长度24 所以十六进制就是18

0x4F 0X6E....0x65

value的内容One, two, buckle my shoe

以上代码源代码如下:

当数据达到一个阀值,会写sync,写sync就是调用checkAndWriteSync();
源码如下:

SYNC_INTERVAL定义如下:

可见每2000byte会写一个sync

SequeceFile 无压缩图示如下:





可见格式如下:


Uncompressed SequenceFile Format


Header

Record

Record length

Key length

Key

Value

接下去我们看一下RecordCompressWriter

2、RecordCompressWriter
: Record-compressed files, only compress values.

执行hadoop jar sfile.jar /test/numbers2.seq 2 2代表使用RecordCompressWriter压缩

同样我们用UE看一下生成的文件:





可见表头和不压缩的基本一致,有些小区别如下:

0x01

代表使用了压缩

0x2A 0x6F ....0x63

使用的压缩类org.apache.hadoop.io.compress.DefaultCodec

0x00 ...0x25

整个Record的size ,为37,为啥比不压缩占用的字节更多?

0x00 0x00 0x00 0x04

Key內容的size~ (此为Int所以占4个字节)

0x00 0x00 0x00 0x64

Key的内容,此处为100,那十六进制就是64 可见key没有压缩

以上源代码如下:

SequeceFile 压缩value图示如下:





可见格式如下:

Record-Compressed SequenceFile Format


Header

Record

Record length

Key length

Key

Compressed Value

接下去我们看一下BlockCompressWriter

3、BlockCompressWriter
: Block-compressed files, both keys & values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.

执行hadoop jar sfile.jar /test/numbers3.seq 3 3代表使用BlockCompressWriter压缩

同样我们用UE看一下生成的文件:





可见表头和不压缩的基本一致,有些小区别如下:

0x01

代表使用了压缩

0x01

代表使用了block压缩

对于block压缩的源代码如下:

其中,可见没调用一次append,就会相应累加keyLenBuffer和valLenBuffer的长度

WritableUtils.writeVInt(keyLenBuffer, keyLength);

WritableUtils.writeVInt(valLenBuffer, valLength);

int currentBlockSize = keyBuffer.getLength() + valBuffer.getLength();

if (currentBlockSize >= compressionBlockSize) {

sync();

}

compressionBlockSize = conf.getInt("io.seqfile.compress.blocksize", 1000000);

当超过compressionBlockSize是会调用sync,我们来看看sync的源码,如下:

Sequenc File ,块压缩图示如下:





可见格式如下:

Block-Compressed SequenceFile Format


Header

Record Block

Uncompressed number of records in the block

Compressed key-lengths block-size

Compressed key-lengths block

Compressed keys block-size

Compressed keys block

Compressed value-lengths block-size

Compressed value-lengths block

Compressed values block-size

Compressed values block

A sync-marker every block.

本文出自 “软件开发” 博客,请务必保留此出处http://tangjj.blog.51cto.com/1848040/1532915
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签: