HAWQ Study Notes --- How to access HDFS data via GPDB external table with gphdfs protocol
2016-09-28 19:36
Environment
Greenplum database 4.2.x, 4.3.x
Pivotal HD 1.x, 2.x
Prerequisites
Download the JDK (1.7 is recommended) package and install it on all servers of the GPDB cluster (see the sketch after this list)
Download the Pivotal HD installation package (same version as that of the PHD cluster to be accessed) and put it on the GPDB master host
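As a hedged sketch of the JDK step (assuming a host file named "hostfile_segments" listing all segment hosts, a JDK package named jdk-7u25-linux-x64.rpm, and that gpadmin has sudo rights on the segment servers), the Greenplum gpscp/gpssh utilities can distribute and install the package in one pass:

[gpadmin@admin ~]$ gpscp -f hostfile_segments jdk-7u25-linux-x64.rpm =:/tmp/
[gpadmin@admin ~]$ gpssh -f hostfile_segments -e 'sudo rpm -ivh /tmp/jdk-7u25-linux-x64.rpm'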
Install required PHD packages on GPDB cluster
Option 1
1. Unpack the PHD installation tarball. Here we take the PHD-2.0.1 release as an example:
[root@admin phd201]# tar xvfz PHD-2.0.1.0-148.tar.gz ... ...
Locate the following RPM packages in the unpacked directory:
utility/rpm/bigtop-jsvc-1.0.15_gphd_2_0_1_0-43.x86_64.rpm
utility/rpm/bigtop-utils-0.4.0_gphd_2_0_1_0-43.noarch.rpm
zookeeper/rpm/zookeeper-3.4.5_gphd_2_0_1_0-43.noarch.rpm
hadoop/rpm/hadoop-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm
hadoop/rpm/hadoop-yarn-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm
hadoop/rpm/hadoop-mapreduce-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm
hadoop/rpm/hadoop-hdfs-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm
2. Install the RPM packages (follow the order listed above to avoid dependency check errors). This needs to be done on all segment servers of the GPDB cluster:
[root@admin phd201]# cd PHD-2.0.1.0-148
[root@admin PHD-2.0.1.0-148]# rpm -ivh \
utility/rpm/bigtop-jsvc-1.0.15_gphd_2_0_1_0-43.x86_64.rpm \
utility/rpm/bigtop-utils-0.4.0_gphd_2_0_1_0-43.noarch.rpm \
zookeeper/rpm/zookeeper-3.4.5_gphd_2_0_1_0-43.noarch.rpm \
hadoop/rpm/hadoop-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm \
hadoop/rpm/hadoop-yarn-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm \
hadoop/rpm/hadoop-mapreduce-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm \
hadoop/rpm/hadoop-hdfs-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm
3. Configure the Hadoop configuration files & ensure that HDFS works; a minimal sketch follows.
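A minimal sketch of this step, assuming the Hadoop client configuration lives under /etc/gphd/hadoop/conf (the default layout for PHD) and that "hdm2" is the NameNode host: copy core-site.xml and hdfs-site.xml from the PHD cluster onto each segment server, then confirm HDFS is reachable.

[root@sdw1 ~]# scp hdm2:/etc/gphd/hadoop/conf/core-site.xml /etc/gphd/hadoop/conf/
[root@sdw1 ~]# scp hdm2:/etc/gphd/hadoop/conf/hdfs-site.xml /etc/gphd/hadoop/conf/
[root@sdw1 ~]# hdfs dfs -ls hdfs://hdm2:8020/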
Option 2
1. If an admin node (where Pivotal Command Center is running) is available on the target PHD cluster, the required RPM packages can be found under /usr/lib/gphd/rpms on the admin node.
2. Install the RPM packages on all segment hosts of the GPDB cluster in either of the following ways:
a) Copy the RPM packages to each segment server and install them manually with the "rpm -ivh" command.
b) Add a repo file (e.g. gphd.repo) under /etc/yum.repos.d on every segment server with the content below. Note that "admin.hadoop.local" is the hostname of the admin node on your site and needs to be modified accordingly.
[gphd]
name=PHD Admin Node Repo
baseurl=http://admin.hadoop.local/gphd_yum_repo
enabled=1
gpgcheck=0
metadata_expire=0
Then run "yum install <rpm package name>" to complete the installation, as sketched after this list.
3. Configure the Hadoop configuration files & ensure that HDFS works (see Option 1, step 3).
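A minimal sketch of the yum-based installation, assuming the repo file above is in place (the package names below are inferred from the RPM list in Option 1 and should be checked against your repo):

[root@sdw1 ~]# yum clean all
[root@sdw1 ~]# yum install bigtop-jsvc bigtop-utils zookeeper hadoop hadoop-yarn hadoop-mapreduce hadoop-hdfs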
GPDB Configuration
1. Set the environment variable JAVA_HOME correctly for the gpadmin user on all segment servers, as illustrated below. It is best to set it in .bashrc or .bash_profile. A cluster-wide check follows the example.
[gpadmin@admin ~]$ echo $JAVA_HOME
/usr/java/default
[gpadmin@admin ~]$ ls -l /usr/java/default
lrwxrwxrwx 1 root root 16 Jul 18 2013 /usr/java/default -> /usr/java/latest
[gpadmin@admin ~]$ ls -l /usr/java/latest
lrwxrwxrwx 1 root root 21 Dec 15 2013 /usr/java/latest -> /usr/java/jdk1.7.0_25
[gpadmin@admin ~]$ cat .bash_profile | grep JAVA_HOME
export JAVA_HOME=/usr/java/default
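To confirm the setting on every segment host in one pass, gpssh can run the check cluster-wide (assuming a host file named "hostfile_segments" listing all segment hosts):

[gpadmin@admin ~]$ gpssh -f hostfile_segments -e 'source ~/.bash_profile; echo $JAVA_HOME'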
2. Set the gphdfs-related parameters for GPDB:
[gpadmin@admin ~]$ gpconfig -c gp_hadoop_home -v "'/usr/lib/gphd'"
[gpadmin@admin ~]$ gpconfig -c gp_hadoop_target_version -v "'gphd-2.0'"
Run "gpstop -u" to take the change into effect.
Test
1. Check that HDFS is accessible from any of the segment servers:
[gpadmin@sdw1 ~]$ hdfs dfs -ls hdfs://hdm2:8020/
Found 7 items
drwxr-xr-x   - hdfs     hadoop  0 2014-06-14 21:20 hdfs://hdm2:8020/apps
drwxr-xr-x   - postgres gpadmin 0 2014-06-16 04:55 hdfs://hdm2:8020/hawq_data
drwxr-xr-x   - hdfs     hadoop  0 2014-06-14 21:21 hdfs://hdm2:8020/hive
drwxr-xr-x   - mapred   hadoop  0 2014-06-14 21:20 hdfs://hdm2:8020/mapred
drwxrwxrwx   - hdfs     hadoop  0 2014-07-10 22:29 hdfs://hdm2:8020/tmp
drwxrwxrwx   - hdfs     hadoop  0 2014-06-16 18:11 hdfs://hdm2:8020/user
drwxr-xr-x   - hdfs     hadoop  0 2014-06-14 21:21 hdfs://hdm2:8020/yarn
2. Create a temporary text file and put it into HDFS:
[gpadmin@admin ~]$ cat test1.txt
15,west
25,east
[gpadmin@admin ~]$ hdfs dfs -put test1.txt hdfs://hdm2:8020/tmp/
[gpadmin@admin ~]$ hdfs dfs -ls hdfs://hdm2:8020/tmp
Found 1 items
-rw-r--r--   3 gpadmin hadoop 16 2014-07-06 22:38 hdfs://hdm2:8020/tmp/test1.txt
3. Create a readable external table in GPDB pointing to the sample file (test1.txt) in HDFS.
NOTE: The example location field below is for a single-NameNode deployment. In the case of two NameNodes (High Availability), the location field would be as follows: "gphdfs:///tmp/test1.txt". We do not include a port, and the hostname is replaced with the name configured in core-site.xml for the parameter "fs.defaultFS". A sketch of the HA form appears after the example below.
initdb=# create external table test_hdfs (age int, name text) location('gphdfs://hdm2:8020/tmp/test1.txt') format 'text' (delimiter ',');
CREATE EXTERNAL TABLE
initdb=# \d test_hdfs
External table "public.test_hdfs"
 Column |  Type   | Modifiers
--------+---------+-----------
 age    | integer |
 name   | text    |
Type: readable
Encoding: UTF8
Format type: text
Format options: delimiter ',' null '\N' escape '\'
External location: gphdfs://hdm2:8020/tmp/test1.txt
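For reference, a sketch of the HA form described in the NOTE above (the table name test_hdfs_ha is hypothetical; it assumes "fs.defaultFS" in core-site.xml points at the HA nameservice):

initdb=# create external table test_hdfs_ha (age int, name text) location('gphdfs:///tmp/test1.txt') format 'text' (delimiter ',');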
4. Query data from the external table:
initdb=# select * from test_hdfs;
 age | name
-----+------
  15 | west
  25 | east
(2 rows)
5. Create a writable external table in GPDB pointing to a file in HDFS
initdb=# select * from myt1;
  id  | name
------+-------
 1000 | Jason
(1 row)
initdb=# create writable external table test_hdfs2 (like myt1) location('gphdfs://hdm2:8020/tmp/test2.txt') format 'text' (delimiter ',');
NOTICE: Table doesn't have 'distributed by' clause, defaulting to distribution columns from LIKE table
CREATE EXTERNAL TABLE
6. Insert data into the writable external table:
initdb=# insert into test_hdfs2 select * from myt1;
INSERT 0 1
7. Check the existence and content of the file in HDFS:
[gpadmin@admin ~]$ hdfs dfs -ls hdfs://hdm2:8020/tmp/test2.txt
Found 1 items
-rw-r--r--   3 gpadmin hadoop 11 2014-07-13 23:37 hdfs://hdm2:8020/tmp/test2.txt/0_1402800555-0000000098
[gpadmin@admin ~]$ hdfs dfs -cat hdfs://hdm2:8020/tmp/test2.txt/0_1402800555-0000000098
1000,Jason
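Note that a writable external table writes one data file per GPDB segment under the target path, so larger clusters will typically produce several files there. To view the combined content without listing each file, a wildcard can be used (quoted so the local shell does not expand it):

[gpadmin@admin ~]$ hdfs dfs -cat 'hdfs://hdm2:8020/tmp/test2.txt/*'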