How-to: Do Statistical Analysis with Impala and R
2018-05-24 16:04
http://blog.cloudera.com/blog/2013/12/how-to-do-statistical-analysis-with-impala-and-r/
The new RImpala package brings the speed and interactivity of Impala to queries from R.
Our thanks to Austin Chungath, Sachin Sudarshana, and Vikas Raguttahalli of Mu Sigma, a Decision Sciences and Big Data analytics company, for the guest post below.
As is well known, Apache Hadoop traditionally relies on the MapReduce paradigm for parallel processing, which is an excellent programming model for batch-oriented workloads. But when ad hoc, interactive querying is required, the batch model fails to meet performance expectations due to its inherent latency.
To overcome this drawback, Cloudera introduced Cloudera Impala, the open source distributed SQL query engine for Hadoop data. Impala brings the necessary speed to queries that were otherwise not interactive when executed by the batch Apache Hive engine; Hive queries that used to take minutes can be executed in a matter of seconds using Impala.
Impala is quite exciting for us at Mu Sigma because existing Hive queries can run interactively with few or no changes. Furthermore, because we do a lot of our statistical computing on R, the popular open source statistical computing language, we considered it worthwhile to bring the speed of Impala to R.
To meet that goal, we have created a new R package, RImpala, which connects Impala to R. RImpala lets you query data residing in HDFS and Apache HBase from R; the results can then be processed further as an R object using R functions. RImpala is now available for download from the Comprehensive R Archive Network (CRAN) under the GNU General Public License (GPL3).
The RImpala architecture is simple: we used the existing Impala JDBC drivers and wrote a Java program to connect and query Impala, which we then called from R using the rJava package. We put them all together in an R package that you can use to easily query Impala from R.
![](https://oscdn.geek-share.com/Uploads/Images/Content/201805/40cb35f4971b045d96973fbd9ebc8e52.png)
Steps for Installing RImpala
Assuming that you have R and Impala already installed, installing the RImpala package is straightforward and is done in a manner similar to any other R package. There are two steps to installing RImpala and getting it working:
Step 1: Install the package from CRAN
You can install RImpala directly using the install.packages() command in R.

    > install.packages("RImpala")

Alternatively, you can download the package archive from CRAN and install it from the shell using the R CMD INSTALL command:

    R CMD INSTALL RImpala_0.1.1.tar.gz

Step 2: Install the JDBC drivers
You need to install Cloudera’s JDBC drivers before you can use the RImpala package installed in Step 1. Cloudera provides the JDBC jars on its website as a zip file that you can download directly.
There are two ways to do this:
If you have Impala installed on the machine running R, then you will have the necessary JDBC jars already (probably in /usr/lib/impala/lib) and you can use them to initiate the connection to Impala.
If the machine running R is a different server than the Impala server, then you need to download the JDBC jars from Cloudera’s website and extract them to a location that can be accessed by the R user.
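Whichever of the two options applies, it can help to confirm that the chosen folder really contains jar files before handing it to RImpala. The following is a minimal sketch in base R; `find_impala_jars` is a hypothetical helper written for this illustration, not part of the RImpala package, and the demonstration uses a throwaway directory with an empty placeholder file rather than real drivers.

```r
# Hypothetical helper (not part of RImpala): list the .jar files in a directory
# so that a wrong path is caught before it is passed to rimpala.init().
find_impala_jars <- function(jar_dir) {
  if (!dir.exists(jar_dir)) {
    stop("JDBC jar directory does not exist: ", jar_dir)
  }
  jars <- list.files(jar_dir, pattern = "\\.jar$", full.names = TRUE)
  if (length(jars) == 0) {
    stop("No .jar files found in: ", jar_dir)
  }
  jars
}

# Demonstration against a temporary directory holding an empty placeholder file.
demo_dir <- file.path(tempdir(), "impala_jars_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "placeholder.jar"))
print(basename(find_impala_jars(demo_dir)))
```

In a real session you would call the helper on the actual jar location, for example /usr/lib/impala/lib, and then pass that same path to rimpala.init().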
After you have installed the JDBC drivers you can start using the RImpala package:
Load the library.

    library(RImpala)

Initialize the package with the location of the JDBC jars.

    rimpala.init("/path/to/impala/jars")

Connect to the Impala server.

    rimpala.connect("IP or Hostname of Impala server", "port")

For example:

    library(RImpala)
    rimpala.init(libs="/tmp/impala/jars/")
    rimpala.connect("192.168.10.1","21050")

IP of the server running impalad service = 192.168.10.1
Port where the impalad service is listening = 21050
The default parameter for the rimpala.init() function is “/usr/lib/impala/lib” and the default parameters for rimpala.connect() function are “localhost” and “21050” respectively.
To run a query on the impalad instance that the client has connected to, you can use the rimpala.query() function. Example:

    result <- rimpala.query("...")

You can also install the RImpala package on a client machine running Microsoft Windows. Since the JDBC jars are platform independent, you can extract them into a folder on a Windows machine (such as “C:\Program Files\impala”) and then this location can be passed as a parameter to the rimpala.init() function.
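One detail to watch when writing a Windows path in R is that the backslash is an escape character in R string literals, so the folder has to be written either with doubled backslashes or with forward slashes. The sketch below only illustrates that the two spellings are equivalent; the variable names are made up for the example.

```r
# Two equivalent ways to write the same Windows folder in an R string literal.
jar_dir_escaped <- "C:\\Program Files\\impala"  # backslashes doubled
jar_dir_forward <- "C:/Program Files/impala"    # forward slashes

# Normalizing the separators shows that both strings name the same location.
same_location <- gsub("\\", "/", jar_dir_escaped, fixed = TRUE) == jar_dir_forward
print(same_location)

# Either string could then be passed on, e.g. rimpala.init(libs = jar_dir_forward)
```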
The following is a simple example that shows you how to use RImpala:

    > library(RImpala)
    Loading required package: rJava

    > rimpala.init(libs="/tmp/impala/jars/") # Adds the impala JDBC jars present in the "/tmp/impala/jars/" folder to the classpath
    [1] "Classpath added successfully"

    > rimpala.connect(IP="192.168.10.1",port="21050") # Establishes a connection to impalad instance running on the machine 192.168.10.1 on the port 21050
    [1] TRUE

    > rimpala.invalidate() # Invalidates the metadata of all the tables present in the Hive metastore
    [1] TRUE

    > rimpala.showdatabases() # Displays all the databases available

    # Output #
          name
    1 airlines
    2     bank
    3  default

    > rimpala.usedatabase("bank") # Changes the current database to "bank"
    Database changed to bank
    [1] TRUE

    > rimpala.showtables() # Displays all the tables present in the current database

    # Output #
                 name
    1 bank_web_clicks
    2     ticker_100m
    3       stock_1gb
    4     weblog_10gb

    > rimpala.describe("bank_web_clicks") # Describes the table "bank_web_clicks"

    # Output #
             Name      type       comment
    1 customer_id       int   Customer ID
    2  session_id       int    Session ID
    3        page    string Web page name
    4   datestamp timestamp          Date

    > result <- rimpala.query("...") # Runs a SQL query and stores the result in an R object
    > result

    # Output #
      customer_id session_id  cnt
    1          32         21 5200
    2          34         12 5100
    3          35         49 4105
    4          32         34 3600
    5          36         32 3218
    6          37         67 3190
    7          31         45 2990
    8          35         75 2300
    9          34         69 2113

    > rimpala.close() # Closes the connection to the impalad instance
    [1] TRUE

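Once the query result is in R, it can be processed with ordinary R functions. The sketch below assumes that the result behaves like a data frame with the columns shown in the sample output; to stay runnable without an Impala server, it rebuilds those nine rows by hand instead of calling rimpala.query().

```r
# Rows copied from the sample output; a live session would obtain them
# from rimpala.query() instead of typing them in.
result <- data.frame(
  customer_id = c(32, 34, 35, 32, 36, 37, 31, 35, 34),
  session_id  = c(21, 12, 49, 34, 32, 67, 45, 75, 69),
  cnt         = c(5200, 5100, 4105, 3600, 3218, 3190, 2990, 2300, 2113)
)

# Total clicks per customer across all sessions
totals <- aggregate(cnt ~ customer_id, data = result, FUN = sum)

# Sort customers from most to least active
totals <- totals[order(-totals$cnt), ]
print(totals)

# Basic summary statistics of the per-session counts
print(summary(result$cnt))
```

The same pattern applies to any other base R or package function that accepts a data frame, which is the main reason to pull Impala results into R in the first place.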
Conclusion
Impala is an exciting new technology that is gaining popularity and will probably grow to be an enterprise asset in the Hadoop world. We hope that RImpala will be a fruitful package for all Big Data analysts to leverage the power of Impala from R.
Impala is an ongoing and thriving effort at Cloudera and will continue to evolve with richer functionality and improved performance, and so will RImpala. We will continue to improve the package over time and incorporate new features into RImpala as and when they are made available in Impala.
Austin Chungath is a Senior Research Analyst with Mu Sigma’s Innovation & Development Team and maintainer of the RImpala project. He does research on various tools in the Hadoop ecosystem and the possibilities that they bring for analytics. He spends his free time contributing to Open Source projects like Apache Tez or building small robots.
Sachin Sudarshana is a Research Analyst with Mu Sigma’s Innovation & Development Team. His responsibilities include researching emerging tools in the Hadoop ecosystem and how they can be leveraged in an analytics context.
Vikas Raguttahalli is a Research Lead with Mu Sigma’s Innovation & Development Team. He is responsible for working with client delivery teams and helping clients institutionalize Big Data within their organizations, as well as researching new and upcoming Big Data tools. His expertise includes R, MapReduce, Hive, Pig, Mahout and the wider Hadoop ecosystem.