eBay open sources a big, fast SQL-on-Hadoop database
2014-10-24 09:54
eBay has open sourced a database technology, called Kylin, that takes advantage of distributed processing and the HBase data store in order to return faster results for SQL queries over Hadoop data.
Online auction site eBay has open sourced a database technology called Kylin that the company says enables fast queries over even petabytes of data stored in Hadoop. eBay isn’t a big data user on par with companies like Google and Facebook, but it does run technologies such as Hadoop at a fairly large scale, and Kylin seems a good example of the type of innovation it’s doing on top of them.
eBay details Kylin in a blog post on Wednesday, citing among other features its REST APIs, ANSI-SQL compatibility, connections to analysis tools Tableau and Excel, and sub-second latency on some queries. However, the most distinctive features of Kylin involve how it deals with scale. eBay says it can query billions of rows of data (on datasets more than 14 terabytes in size) at speeds much faster than using the traditional Apache Hive tool.
The way Kylin works, at a high level, is to take data from Hive, pre-process large queries using MapReduce, and then store those results as key-value “cuboids” in HBase. When a user runs a Kylin query using a particular set of variables, the values are ready to go without requiring them to be processed again. It’s not entirely dissimilar from the cubes that analytic databases have been utilizing for years, but Kylin’s cuboids are designed with HBase’s preferred data structure in mind.
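To make the cuboid idea concrete, here is a minimal Python sketch of pre-aggregating every combination of dimensions and serving queries by key lookup. It is not Kylin's code: the rows, the dimension names (`site`, `category`, `year`), the `sales` measure, and the `build_cuboids` and `query` helpers are all hypothetical, a plain loop stands in for the MapReduce jobs, and an in-memory dict stands in for HBase.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows standing in for a Hive table.
ROWS = [
    {"site": "US", "category": "Toys", "year": 2013, "sales": 10},
    {"site": "US", "category": "Toys", "year": 2014, "sales": 15},
    {"site": "US", "category": "Books", "year": 2014, "sales": 7},
    {"site": "DE", "category": "Toys", "year": 2014, "sales": 4},
]
DIMENSIONS = ("site", "category", "year")
MEASURE = "sales"


def build_cuboids(rows, dimensions, measure):
    """Pre-aggregate the measure for every subset of dimensions.

    Each subset of dimensions is one "cuboid". Results go into a flat
    key-value mapping (a stand-in for HBase) keyed by
    (dimension names, dimension values).
    """
    store = defaultdict(int)
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            for row in rows:
                key = (dims, tuple(row[d] for d in dims))
                store[key] += row[measure]
    return dict(store)


def query(store, dimensions, **filters):
    """Answer an equality-filtered SUM with one lookup, no rescan of ROWS."""
    # Keep the dimension order used at build time so the keys match.
    dims = tuple(d for d in dimensions if d in filters)
    key = (dims, tuple(filters[d] for d in dims))
    return store.get(key, 0)


STORE = build_cuboids(ROWS, DIMENSIONS, MEASURE)

print(query(STORE, DIMENSIONS, site="US"))                   # 32
print(query(STORE, DIMENSIONS, category="Toys", year=2014))  # 19
print(query(STORE, DIMENSIONS))                              # 36
```

The trade-off this sketch exposes is the one the article hints at: with n dimensions there are 2^n cuboids to build and store up front, in exchange for query-time work that is a single key lookup rather than a scan.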
Here’s how eBay says Kylin is used within the company:
At the time of open-sourcing Kylin, we already had several eBay business units using it in production. Our largest use case is the analysis of 12+ billion source records generating 14+ TB cubes. Its 90% query latency is less than 5 seconds. Now, our use cases
target analysts and business users, who can access analytics and get results through the Tableau dashboard very easily – no more Hive query, shell command, and so on.
It would be interesting to know how Kylin stacks up against next-generation versions of Hive, Spark SQL and other options for SQL analysis in Hadoop that have emerged as a result of the YARN resource manager available in the latest versions of Apache Hadoop. My guess is it’s slower but more scalable than in-memory options or those not requiring MapReduce processing, but that it might be a solid option for the large percentage of Hadoop users still running earlier versions of the software.