Apache CarbonData
2016-06-12 17:13
Abstract
Apache CarbonData is a new Apache Hadoop native file format for faster interactive query, using advanced columnar storage, index, compression, and encoding techniques to improve computing efficiency; in turn this helps speed up queries by an order of magnitude over petabytes of data.
https://github.com/HuaweiBigData/carbondata
Background
- Support interactive OLAP-style query over big data in seconds.
- Support fast query on individual records, which requires touching all fields.
- Fast data loading, with support for incremental loads within minutes.
- Support HDFS so that customers can leverage existing Hadoop clusters.
- Support time-based data retention.
Rationale
CarbonData contains multiple modules, which fall into two categories:
- CarbonData file format: the core implementation of the format, including columnar storage, indexes, dictionary, encoding and compression, and APIs for reading and writing.
- CarbonData integration with big data processing frameworks such as Apache Spark and Apache Hive. Apache Beam is also planned, to abstract over the execution runtime.
Features
Indexing
1. Multi-dimensional Key (B+ Tree index)
Data blocks are written in sequence to disk, and within each data block every column block is written in sequence. Finally, a metadata block for the file is written with the byte position of each block in the file, the min-max statistics index, and the start and end MDK (Multi-Dimensional Key) of each data block. Since the entire data in the file is in sorted order, the start and end MDK of each data block can be used to construct a B+ tree, and the file can be logically represented as a B+ tree with the data blocks as leaf nodes (on disk) and the remaining non-leaf nodes in memory.
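As a rough sketch of the idea above (not CarbonData's actual implementation): because data blocks are written in globally sorted MDK order, the (start MDK, end MDK) pair of each block forms a sorted range index that a point query can binary-search instead of scanning every block footer. The block layout and keys below are illustrative.

```python
from bisect import bisect_right

# Hypothetical footer entries: (start_mdk, end_mdk, byte_offset) per data block,
# in sorted order because the file itself is sorted by MDK.
blocks = [
    (0, 99, 0),
    (100, 199, 4096),
    (200, 299, 8192),
    (300, 399, 12288),
]

def locate_block(key, blocks):
    """Return the footer entry of the block that may contain `key`, or None."""
    starts = [b[0] for b in blocks]
    i = bisect_right(starts, key) - 1          # last block starting <= key
    if i >= 0 and blocks[i][0] <= key <= blocks[i][1]:
        return blocks[i]
    return None

print(locate_block(250, blocks))   # (200, 299, 8192)
print(locate_block(450, blocks))   # None: key beyond all block ranges
```

An actual B+ tree generalizes this binary search to multiple in-memory levels, but the pruning effect is the same: only one leaf block is read from disk.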
2. Inverted index
Inverted indexes are widely used in search engines. This index helps the processing/query engine filter inside a single HDFS block. Furthermore, combining a bitmap with the inverted index at query time makes it possible to accelerate count-distinct-like operations.
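A minimal sketch of this combination, under the assumption (not from the source) that one bitmap is kept per distinct value inside a block: a filter touches only the rows set in the matching bitmap, and COUNT(DISTINCT col) can be answered from the index keys alone.

```python
# One block's worth of a string column.
column = ["US", "CN", "US", "DE", "CN", "US"]

# Inverted index: value -> bitmap of rows holding it (bitmap as a Python int).
inverted = {}
for row, value in enumerate(column):
    inverted[value] = inverted.get(value, 0) | (1 << row)

# Filter: which rows hold "US"? Read the bitmap instead of rescanning rows.
rows_us = [r for r in range(len(column)) if inverted["US"] >> r & 1]
print(rows_us)                     # [0, 2, 5]

# COUNT(DISTINCT): answered from the index keys alone, no data scan.
print(len(inverted))               # 3
```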
3. MinMax index
A min-max index is created for every column, so that the processing/query engine can skip scanning blocks that are not required.
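The pruning step can be sketched as follows (values are illustrative): each block records a per-column (min, max) pair, and a predicate skips every block whose range excludes the target.

```python
# Hypothetical per-block (min, max) statistics for one column.
block_stats = [(1, 20), (21, 50), (51, 80)]

def blocks_to_scan(target, stats):
    """Return indices of blocks whose [min, max] range could contain target."""
    return [i for i, (lo, hi) in enumerate(stats) if lo <= target <= hi]

print(blocks_to_scan(42, block_stats))   # [1] -- blocks 0 and 2 are skipped
```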
Global Dictionary
Besides I/O reduction, CarbonData accelerates computation by using a global dictionary, which enables processing/query engines to perform all processing on encoded data without having to convert it back (late materialization). We have observed dramatic performance improvements in OLAP analytic scenarios where a table contains many string-typed columns. The data is converted back to user-readable form only just before the processing/query engine returns results to the user.
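A small sketch of late materialization with a global dictionary (the dictionary contents and column are made up for illustration): strings are replaced by integer codes at load time, aggregation runs entirely on the codes, and strings are looked up only in the final step.

```python
# Hypothetical global dictionary for one string column.
dictionary = {"beijing": 0, "shanghai": 1, "shenzhen": 2}
reverse = {v: k for k, v in dictionary.items()}

# The stored column holds compact integer codes, not strings.
encoded = [0, 1, 0, 2, 1, 0]

# GROUP BY runs entirely on integers -- cheap comparisons, small hash keys.
counts = {}
for code in encoded:
    counts[code] = counts.get(code, 0) + 1

# Decode only just before returning results to the user.
result = {reverse[c]: n for c, n in counts.items()}
print(result)   # {'beijing': 3, 'shanghai': 2, 'shenzhen': 1}
```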
Column Group
Sometimes users want to query many columns of one table at once, for example scanning for an individual record in a troubleshooting scenario. In this case a row format is more efficient than a columnar format, since all columns will be touched by the workload. To accelerate this, CarbonData supports storing a group of columns in row format: data in a column group is stored together, enabling fast retrieval.
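The layout difference can be illustrated with made-up data: in a pure columnar layout, fetching one record means one lookup per column, while a column group keeps those fields adjacent so a single contiguous read returns the whole record slice.

```python
# Pure columnar layout: each column stored separately.
columnar = {
    "id":   [1, 2, 3],
    "name": ["a", "b", "c"],
    "city": ["x", "y", "z"],
}

# Column group (name, city): the grouped fields stored together, row by row.
group = [("a", "x"), ("b", "y"), ("c", "z")]

# Fetch record 1: columnar needs one lookup per column...
record_columnar = (columnar["name"][1], columnar["city"][1])
# ...while the column group yields both fields from one contiguous entry.
record_group = group[1]
print(record_columnar == record_group)   # True
```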
Optimized for multiple use cases
CarbonData's indexes and dictionary are highly configurable. To optimize storage for different use cases, users can configure what to index, deciding and tuning the format before loading data into CarbonData.
For example:
| Use Case | Supporting Features |
| --- | --- |
| Interactive OLAP query | Columnar format, Multi-dimensional Key (B+ Tree index), Min-max index, Inverted index |
| High throughput scan | Global dictionary, Min-max index |
| Low latency point query | Multi-dimensional Key (B+ Tree index), Partitioning |
| Individual record query | Column group, Global dictionary |
Big Data Processing Framework Integration
CarbonData provides InputFormat/OutputFormat interfaces for reading and writing data in CarbonData files, and at the same time provides an abstract API for processing data stored in the CarbonData format with a data processing framework.
CarbonData provides deep integration with Apache Spark, including predicate push-down, column pruning, aggregation push-down, etc., so users can use Spark SQL to connect to and query CarbonData.
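To make the push-down terms concrete, here is a hedged, framework-agnostic sketch (the names `scan`, `required_columns`, and `predicate` are illustrative, not CarbonData or Spark API): the engine passes the filter and the projected columns down to the format's reader, which then materializes only what the query needs.

```python
# Toy table standing in for rows decoded from a CarbonData file.
table = [
    {"id": 1, "city": "beijing",  "amount": 10},
    {"id": 2, "city": "shanghai", "amount": 20},
    {"id": 3, "city": "beijing",  "amount": 30},
]

def scan(rows, required_columns, predicate):
    """Apply the pushed-down filter first, then prune to required columns."""
    for row in rows:
        if predicate(row):
            yield {c: row[c] for c in required_columns}

result = list(scan(table, ["id", "amount"], lambda r: r["city"] == "beijing"))
print(result)   # [{'id': 1, 'amount': 10}, {'id': 3, 'amount': 30}]
```

In the real integration the filter and projection are evaluated inside the format using the indexes and dictionary described above, so skipped blocks and pruned columns are never decoded at all.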
CarbonData can integrate with various big data query/processing frameworks in the Hadoop ecosystem, such as Apache Spark and Apache Hive.
Example: https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala