Is HBase suitable for storing blob data?
2011-07-18 11:03
This is a tricky question - it depends on how large the blobs are, what the read/write ratio is, and the overall sizing of the system (disk/memory relative to the read/write requirements).
Let me try to point out some obvious downsides (which may or may not matter, depending on the above):
HBase periodically compacts its entire index on disk. It does so to provide efficient random lookups against this index. Because the index and row data (the blob, in other words) are stored together (what is called an 'index-organized table' in classic RDBMS lingo), compacting the index also requires moving the row data. The larger the row/blob is in proportion to the index itself, the higher the cost of the compactions. If one is storing petabytes of data composed of large blobs (say, multiple megabytes each), the compactions will become very expensive (and wasteful).
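To make the compaction cost concrete, here is a back-of-the-envelope sketch. The row counts and sizes are illustrative assumptions, not HBase measurements; the point is only the ratio between moving index entries alone and moving index entries plus inline blobs.

```python
# Bytes a major compaction must read and rewrite when row data
# (the blob) lives inline with the index. Sizes are assumed.

def compaction_bytes(num_rows, key_bytes, blob_bytes):
    """Total bytes rewritten for num_rows rows during a full compaction."""
    return num_rows * (key_bytes + blob_bytes)

rows = 1_000_000
index_only = compaction_bytes(rows, key_bytes=100, blob_bytes=0)
with_blobs = compaction_bytes(rows, key_bytes=100,
                              blob_bytes=5 * 1024 * 1024)  # 5 MiB blobs

# Inline blobs force the compaction to move tens of thousands of times
# more data than compacting the index entries alone.
amplification = with_blobs / index_only
```

With 100-byte keys and 5 MiB blobs, the compaction moves roughly 50,000x more data than an index-only compaction would, which is the "expensive (and wasteful)" effect described above.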
Committing a (large) blob to disk is unnecessarily expensive: first the entire blob is written once to the commit log, and then it is written out again to the actual HFile. (Each of those writes is itself replicated 3x by HDFS, and you might want to write to another data center too!)
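The write amplification described above can be sketched with simple arithmetic. The blob size and data-center count are assumptions for illustration; the formula just follows the stated write path (commit log plus HFile, each 3x-replicated by HDFS).

```python
# Rough bytes physically written to commit one blob, following the
# write path above: once to the commit log (WAL), once to the HFile,
# each replicated 3x by HDFS, optionally mirrored to a second DC.

def bytes_written(blob_bytes, hdfs_replication=3, datacenters=1):
    wal = blob_bytes * hdfs_replication    # commit-log write, replicated
    hfile = blob_bytes * hdfs_replication  # flush to HFile, replicated
    return (wal + hfile) * datacenters

blob = 10 * 1024 * 1024                    # a 10 MiB blob (assumed size)
local = bytes_written(blob)                # 6x the blob size
mirrored = bytes_written(blob, datacenters=2)  # 12x the blob size
```

Note this ignores the later compaction rewrites from the previous point, so the true lifetime amplification is even higher.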
If the blobs are immutable (for example, image data), the cost of (HDFS) replication becomes very high (3x per data center). For immutable data, one could easily store a single copy per data center (and fall back to the other data centers if one copy goes missing). If you are storing a large amount of data, this can affect cost significantly.
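To put rough numbers on that replication cost (the 1 PB of logical data is an assumed figure, purely for illustration):

```python
# Raw disk needed for immutable blobs: 3 HDFS replicas in every data
# center versus one copy per data center with cross-DC fallback.

PB = 1024 ** 5
logical = 1 * PB                    # 1 PB of immutable blob data (assumed)

hdfs_raw = logical * 3 * 2          # 3 replicas in each of 2 data centers
single_copy_raw = logical * 1 * 2   # one copy per data center

savings = 1 - single_copy_raw / hdfs_raw  # fraction of raw disk saved
```

Under these assumptions, one copy per data center uses a third of the raw disk that full HDFS replication would, which is why this matters at scale.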
Finally, the index itself becomes more fragmented when large blobs are stored inline (HBase will try to compact it periodically, but if you are writing large amounts of data, there will always be some fragmentation). This implies some degradation in the performance of the index itself (hard to quantify, modulo caching strategies).
A simple solution is to use HBase to store only the index itself and keep the blobs elsewhere. As a naive example, one could store the blobs as files in an external file system (which could be HDFS or not) and store the file name(s)/offset in HBase (against the identifier for the blob). See, for example, the Oracle documentation on storing LOBs inline versus in external file systems. One of the main issues with this approach is the lack of atomicity in writing the blob data and updating the index (a problem Oracle takes care of internally). This is usually not a big deal (since errors are infrequent), but it is worth pointing out.
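The "index in HBase, blob elsewhere" pattern above can be sketched as follows. `store_blob` and `make_index_row` are hypothetical helpers invented for this sketch; a real deployment would write the small row into HBase (e.g. via its Thrift or REST gateway) and the file into HDFS or another blob store.

```python
# Minimal sketch of the pattern: blob goes to an external file system,
# only a small pointer row is kept for HBase. Helper names are
# hypothetical, not part of any HBase API.
import hashlib
import os
import tempfile

def store_blob(data: bytes, directory: str) -> str:
    """Write the blob as a content-addressed file; return its path."""
    name = hashlib.sha256(data).hexdigest()
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(data)
    return path

def make_index_row(blob_id: str, path: str, size: int) -> dict:
    """The small record actually stored in HBase against the blob id."""
    return {"row": blob_id, "blob:path": path, "blob:size": str(size)}

# Usage: write the blob first, then the index row. If the index write
# fails, the orphaned file becomes garbage to clean up later -- this is
# the atomicity gap mentioned above.
with tempfile.TemporaryDirectory() as d:
    data = b"example blob payload"
    path = store_blob(data, d)
    row = make_index_row("blob-0001", path, len(data))
```

Writing the blob before the index row means a failure leaves at worst an unreferenced file, never a dangling pointer; a periodic scrubber can reclaim orphans.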