Apache CarbonData

2016-06-12 17:13

Abstract

Apache CarbonData is a new Apache Hadoop native file format for faster interactive
query. It uses advanced columnar storage, index, compression and encoding techniques
to improve computing efficiency, which in turn helps speed up queries by an order of
magnitude over petabytes of data.

https://github.com/HuaweiBigData/carbondata

Background

Support interactive OLAP-style queries over big data in seconds.

Support fast queries on individual records, which require touching all fields.

Support fast data loading, with incremental loads at intervals of minutes.

Support HDFS so that customers can leverage existing Hadoop clusters.

Support time-based data retention.

Rationale

CarbonData contains multiple modules, which fall into two categories:

CarbonData File Format: the core implementation of the file format, including columnar storage, index, dictionary, encoding and compression, and the API for reading and writing.

CarbonData integration with big data processing frameworks such as Apache Spark and Apache Hive. Apache Beam integration is also planned, to abstract the execution runtime.

Features

Indexing

1. Multi-dimensional Key (B+ Tree index)

The data blocks are written to disk in sequence, and within each data block the column blocks are written in sequence. Finally, the metadata block for the file is written. It records the byte position of each block in the file, the Min-Max statistics index, and the start and end MDK (multi-dimensional key) of each data block. Since all data in the file is in sorted order, the start and end MDK of each data block can be used to construct a B+Tree. The file can then be logically represented as a B+Tree with the data blocks as leaf nodes (on disk) and the remaining non-leaf nodes in memory.
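
The lookup this enables can be sketched as follows. This is a minimal illustration, not CarbonData code: the block ranges, the tuple representation of an MDK and the function name are assumptions, and a binary search over the sorted block boundaries stands in for the in-memory non-leaf nodes of the B+Tree.

```python
from bisect import bisect_left, bisect_right

# Hypothetical footer metadata: one (start_mdk, end_mdk) pair per data block,
# listed in file order. Each key is a tuple of encoded dimension values.
BLOCK_RANGES = [
    ((1, 1), (1, 9)),
    ((2, 0), (3, 4)),
    ((3, 5), (5, 2)),
    ((5, 3), (7, 8)),
]

def blocks_for_range(ranges, lo, hi):
    """Return indices of the blocks whose key range overlaps [lo, hi]."""
    starts = [start for start, _ in ranges]
    ends = [end for _, end in ranges]
    first = bisect_left(ends, lo)    # first block whose end key is >= lo
    last = bisect_right(starts, hi)  # one past the last block whose start key is <= hi
    return list(range(first, last))

print(blocks_for_range(BLOCK_RANGES, (3, 0), (3, 9)))  # [1, 2]
print(blocks_for_range(BLOCK_RANGES, (6, 0), (6, 0)))  # [3]
```

Only the blocks returned need to be read from disk. Because the data is sorted by key, the matching blocks always form one contiguous run.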

2. Inverted index

Inverted indexes are widely used in search engines. With this index, a processing/query engine can filter inside a single HDFS block. Furthermore, combining bitmaps with the inverted index at query time makes it possible to accelerate count-distinct-like operations.
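
As a rough sketch of the idea, assume each column in a block keeps an inverted index from value to a bitmap of row ids. The column names, the sample data and the use of Python integers as bitmaps are illustrative assumptions, not CarbonData's on-disk layout.

```python
from collections import defaultdict

def build_inverted_index(column):
    """Map each distinct value to a bitmap (an int) with one bit per row id."""
    index = defaultdict(int)
    for row_id, value in enumerate(column):
        index[value] |= 1 << row_id
    return dict(index)

def rows_of(bitmap):
    """Expand a bitmap back into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical columns of one block.
city = ["SZ", "BJ", "SZ", "SH", "BJ", "SZ"]
device = ["a", "b", "a", "c", "c", "b"]

city_index = build_inverted_index(city)
device_index = build_inverted_index(device)

# Filter: city == "SZ" is a single index lookup, with no scan of the column.
sz_rows = city_index["SZ"]

# COUNT(DISTINCT device) WHERE city == "SZ": a device value counts if its
# bitmap shares at least one row with the filter bitmap.
distinct_devices = sum(1 for bitmap in device_index.values() if bitmap & sz_rows)

print(rows_of(sz_rows))   # [0, 2, 5]
print(distinct_devices)   # 2
```

The count-distinct step touches one bitmap per distinct value rather than one entry per row, which is where the saving would come from on low-cardinality columns.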

3. MinMax index

A min-max index is created for every column, so that the processing/query engine can skip scans that are not required.
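
A minimal sketch of min-max pruning for a range filter follows. The per-block statistics and the function name are made up for illustration.

```python
# Hypothetical (min, max) statistics for one column, one pair per data block.
BLOCK_STATS = [(10, 50), (45, 90), (100, 180), (200, 260)]

def blocks_to_scan(stats, lo, hi):
    """Keep only the blocks whose [min, max] range can contain values in [lo, hi]."""
    return [i for i, (mn, mx) in enumerate(stats) if mx >= lo and mn <= hi]

# Filter: value BETWEEN 60 AND 120. Blocks 0 and 3 are skipped without being read.
print(blocks_to_scan(BLOCK_STATS, 60, 120))  # [1, 2]
```

A block that passes this check may still contain no matching rows. The index only rules blocks out, so the surviving blocks are scanned as usual.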

Global Dictionary

Besides reducing I/O, CarbonData accelerates computation with a global dictionary, which enables processing/query engines to perform all processing on encoded data without having to convert it (late materialization). We have observed dramatic performance improvements in OLAP analytic scenarios where tables contain many columns of string data type. The data is converted back to user-readable form just before the processing/query engine returns results to the user.
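
The following sketch illustrates late materialization with a dictionary-encoded string column. The sample data and helper name are assumptions, and the sorted-value code assignment is just one possible scheme, not necessarily the one CarbonData uses.

```python
from collections import Counter

def build_dictionary(values):
    """Assign a small integer code to each distinct value (sorted for stability)."""
    return {value: code for code, value in enumerate(sorted(set(values)))}

# Hypothetical string column.
country = ["China", "India", "China", "USA", "India", "China"]

dictionary = build_dictionary(country)
encoded = [dictionary[value] for value in country]  # what would be stored on disk

# GROUP BY country, COUNT(*): the aggregation compares integers only.
counts_by_code = Counter(encoded)

# Late materialization: decode once per group, just before returning results.
reverse = {code: value for value, code in dictionary.items()}
result = {reverse[code]: n for code, n in counts_by_code.items()}

print(encoded)  # [0, 1, 0, 2, 1, 0]
print(result)   # {'China': 3, 'India': 2, 'USA': 1}
```

Strings are decoded three times here, once per group, rather than once per row.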

Column Group

Sometimes users want to query multiple columns of one table together, for example when scanning for an individual record in a troubleshooting scenario. In this case, row format is more efficient than columnar format, since the workload touches all columns. To accelerate this, CarbonData supports storing a group of columns in row format, so that the data in a column group is stored together and can be retrieved quickly.
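
The difference between the two layouts can be sketched as below. The table, column names and helper functions are hypothetical and only model the access pattern, not the actual storage format.

```python
# Hypothetical table stored column by column.
columns = {
    "ts": [100, 101, 102],
    "user": ["ann", "bob", "cid"],
    "status": [200, 404, 200],
    "bytes": [512, 0, 2048],
}

def make_column_group(table, names):
    """Store the named columns together, one tuple per row."""
    return list(zip(*(table[name] for name in names)))

def fetch_record_columnar(table, names, row_id):
    """Columnar layout: one lookup per column to assemble the record."""
    return tuple(table[name][row_id] for name in names)

def fetch_record_grouped(group_rows, row_id):
    """Column group layout: the whole record sits in one place."""
    return group_rows[row_id]

group_names = ["user", "status", "bytes"]
group = make_column_group(columns, group_names)

print(fetch_record_columnar(columns, group_names, 1))  # ('bob', 404, 0)
print(fetch_record_grouped(group, 1))                  # ('bob', 404, 0)
```

Both calls return the same record. With the group, the fields are read from one contiguous location instead of one location per column.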

Optimized for multiple use cases

CarbonData indices and dictionary are highly configurable. To optimize storage for different use cases, users can configure what to index, and so decide on and tune the format before loading data into CarbonData.

For example:

Use Case                   Supporting Features
Interactive OLAP query     Columnar format, Multi-dimensional Key (B+ Tree index), Minmax index, Inverted index
High throughput scan       Global dictionary, Minmax index
Low latency point query    Multi-dimensional Key (B+ Tree index), Partitioning
Individual record query    Column group, Global dictionary

Big Data Processing Framework Integration

CarbonData provides InputFormat/OutputFormat interfaces for reading and writing data in CarbonData files, and at the same time provides an abstract API for processing data stored in CarbonData format with data processing frameworks.

CarbonData provides deep integration with Apache Spark, including predicate push down, column pruning, aggregation push down etc., so users can use Spark SQL to connect to and query CarbonData.

CarbonData can integrate with various big data query/processing frameworks in the Hadoop ecosystem, such as Apache Spark and Apache Hive.

Example: https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala