Introduction to Impala (Cloudera Impala: Real-Time Queries in Apache Hadoop, For Real)

· by Marcel Kornacker & Justin Erickson
· October 24, 2012

After a long period of intense engineering effort and user feedback, we are very pleased, and proud, to announce the Cloudera Impala project. This technology is a revolutionary one for Hadoop users, and we do not take that claim


When Google published its Dremel paper in 2010, we were as inspired as the rest of the community by the technical vision to bring real-time, ad hoc query capability to Apache Hadoop, complementing traditional MapReduce batch processing. Today, we are announcing a fully functional, open-sourced codebase that delivers on that vision – and, we believe, a bit more – which we call Cloudera Impala. An Impala binary is now available in public beta form, but if you would prefer to test-drive Impala via a pre-baked VM, we have one of those for you, too. (Links to all downloads and documentation are here.) You can also review the source code and testing harness at GitHub right


Impala raises the bar for query performance while retaining a familiar user experience. With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in real time. Furthermore, it uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive, providing a familiar and unified platform for batch-oriented or real-time queries. (For that reason, Hive users can utilize Impala with little setup overhead.) The first beta drop includes support for text files and SequenceFiles; SequenceFiles can be compressed as Snappy, GZIP, and BZIP (with Snappy recommended for maximum performance). Support for additional formats including Avro, RCFile, LZO text files, and Doug Cutting’s Trevni columnar format is planned for the production drop.

To avoid latency, Impala circumvents MapReduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs. The result is order-of-magnitude faster performance than Hive, depending on the type of query and configuration. (See FAQ below for more details.) Note that this performance improvement has been confirmed by several large companies that have tested Impala on real-world workloads for several months now.


A high-level architectural view is below:


There are many advantages to this approach over alternative approaches for querying Hadoop data, including:


· Thanks to local processing on data nodes, network bottlenecks are avoided.
· A single, open, and unified metadata store can be utilized.
· Costly data format conversion is unnecessary and thus no overhead is incurred.
· All data is immediately query-able, with no delays for ETL.
· All hardware is utilized for Impala queries as well as for MapReduce.
· Only a single machine pool is needed to scale.

We encourage you to read the documentation for further technical details.

Finally, we’d like to answer some questions that we anticipate will be popular:



Is Impala open source?

Yes, Impala is 100% open source (Apache License). You can review the code for yourself at GitHub today.

How is Impala different from Dremel?

The first and principal difference is that Impala is open source and available for everyone to use, whereas Dremel is proprietary to Google.


Technically, Dremel achieves interactive response times over very large data sets through the use of two techniques:


· A novel columnar storage format for nested relational data/data with nested structures
· Distributed scalable aggregation algorithms, which allow the results of a query to be computed on thousands of machines in parallel.

The latter is borrowed from techniques developed for parallel DBMSs, which also inspired the creation of Impala. Unlike Dremel as described in the 2010 paper, which could only handle single-table queries, Impala already supports the full set of join operators that are one of the factors that make SQL so popular.
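The general idea behind distributed aggregation can be illustrated with a small two-phase sketch: each node aggregates its local rows into partial results, and a coordinator merges those partials into the final answer. This is not Impala's or Dremel's actual implementation; the function names (`partial_aggregate`, `merge_partials`, `finalize_avg`) and the sample data are hypothetical, and the "nodes" here are simply Python lists.

```python
from collections import defaultdict


def partial_aggregate(rows):
    """Phase 1 (per node): reduce local (key, value) rows to per-key (sum, count)."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        acc[key][0] += value
        acc[key][1] += 1
    return {key: (total, count) for key, (total, count) in acc.items()}


def merge_partials(partials):
    """Phase 2 (coordinator): combine the partial (sum, count) pairs from all nodes."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (total, count) in partial.items():
            merged[key][0] += total
            merged[key][1] += count
    return {key: (total, count) for key, (total, count) in merged.items()}


def finalize_avg(merged):
    """Turn merged (sum, count) pairs into per-key averages."""
    return {key: total / count for key, (total, count) in merged.items()}


# Hypothetical data: each list stands in for the rows stored on one node.
node_a = [("us", 10.0), ("eu", 4.0)]
node_b = [("us", 20.0), ("eu", 6.0), ("us", 30.0)]

averages = finalize_avg(
    merge_partials([partial_aggregate(node_a), partial_aggregate(node_b)])
)
```

Carrying (sum, count) pairs rather than per-node averages is what makes the merge step correct: averages of averages would weight nodes equally regardless of how many rows each holds.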


In order to realize the full performance benefits demonstrated by Dremel, Hadoop will shortly have an efficient columnar binary storage format called Trevni. But contrary to Dremel, Impala supports a range of popular file formats. This lets users run Impala on their existing data without having to “load” or transform it. It also lets users decide if they want to optimize for flexibility or just pure performance.


To sum it up, Impala plus Trevni will achieve the query performance described in the Dremel paper, but surpass what is described there in SQL functionality.


How much faster are Impala queries than Hive ones, really?

The precise amount of performance improvement is highly dependent on a number of factors:


· Hardware configuration: Impala is generally able to take full advantage of hardware resources and specifically generates less CPU load than Hive, which often translates into higher observed aggregate I/O bandwidth than with Hive. Impala of course cannot go faster than the hardware permits, so any hardware bottlenecks will limit the observed speedup. For purely I/O bound queries, we typically see performance gains in the range of 3-4x.
· Complexity of the query: Queries that require multiple MapReduce phases in Hive or require reduce-side joins will see a higher speedup than, say, simple single-table aggregation queries. For queries with at least one join, we have seen performance gains of 7-45x.
· Availability of main memory as a cache for table data: If the data accessed through the query comes out of the cache, the speedup will be more dramatic thanks to Impala’s superior efficiency. In those scenarios, we have seen speedups of 20x-90x over Hive even on simple aggregation queries.

Is Impala a replacement for MapReduce or Hive – or for traditional data warehouse infrastructure, for that matter?

No. There will continue to be many viable use cases for MapReduce and Hive (for example, for long-running data transformation workloads) as well as traditional data warehouse frameworks (for example, for complex analytics on limited, structured data sets). Impala is a complement to those approaches, supporting use cases where users need to interact with very large data sets, across all data silos, to get focused result sets quickly.


Does the Impala Beta Release have any technical limitations?

As mentioned previously, supported file formats in the first beta drop include text files and SequenceFiles, with many other formats to be supported in the upcoming production release. Furthermore, currently all joins are done in a memory space no larger than that of the smallest node in the cluster; in production, joins will be done in aggregate memory. Lastly, no UDFs are possible at this time.
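To make the join memory limitation concrete, here is a rough back-of-the-envelope sketch comparing the two budgets: the memory of the smallest node (beta) versus the sum across all nodes (planned for production). The function names, the cluster sizes, and the notion of a single "join size" number are all assumptions for illustration; the sketch ignores per-node overhead and whatever Impala actually counts toward a join's memory footprint.

```python
def join_memory_budget_gb(node_memory_gb, aggregate=False):
    """Return the memory available to a join under the stated rule.

    aggregate=False: beta behavior, bounded by the smallest node.
    aggregate=True:  planned production behavior, bounded by total cluster memory.
    """
    if not node_memory_gb:
        raise ValueError("cluster must have at least one node")
    if aggregate:
        return sum(node_memory_gb)
    return min(node_memory_gb)


def join_fits(join_size_gb, node_memory_gb, aggregate=False):
    """Check whether a join of the given (hypothetical) size fits in the budget."""
    return join_size_gb <= join_memory_budget_gb(node_memory_gb, aggregate)


# Hypothetical four-node cluster, memory per node in GB.
cluster = [48, 64, 32, 64]

beta_budget = join_memory_budget_gb(cluster)                        # smallest node
production_budget = join_memory_budget_gb(cluster, aggregate=True)  # sum of all nodes
```

Under these assumed numbers, a join needing 100 GB would not fit in the beta budget but would fit once joins can use aggregate memory.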


What are the technical requirements for the Impala Beta Release?

You will need to have CDH4.1 installed on RHEL/CentOS 6.2. We highly recommend the use of Cloudera Manager (Free or Enterprise Edition) to deploy and manage Impala because it takes care of distributed deployment and monitoring details automatically.

What is the support policy for the Impala Beta Release?

If you are an existing Cloudera customer with a bug, you may raise a Customer Support ticket and we will attempt to resolve it on a best-effort basis. If you are not an existing Cloudera customer, you may use our public JIRA instance or the impala-user mailing list, which will be monitored by Cloudera employees.

When will Impala be generally available for production use?

A production drop is planned for the first quarter of 2013. Customers may obtain commercial support in the form of a Cloudera Enterprise subscription at that time.


We hope that you take the opportunity to review the Impala source code, explore the beta release, download and install the VM, or any combination of the above. Your feedback in all cases is appreciated; we need your help to make Impala even better.


We will bring you further updates about Impala as we get closer to production availability.



Impala source code

Impala downloads (Beta Release and VM)


Public JIRA

Impala mailing list

Free Impala training (Screencast)

Marcel Kornacker is the architect of Impala. Prior to joining Cloudera, he was the lead developer for the query engine of Google’s F1 project.

Justin Erickson is the product manager for Impala.


