HAWQ Study Notes --- How to access HDFS data via GPDB external table with gphdfs protocol

Greenplum database 4.2.x, 4.3.x
Pivotal HD 1.x, 2.x

Download the JDK package (1.7 is recommended) and install it on all servers of the GPDB cluster.
Download the Pivotal HD installation package (the same version as the PHD cluster to be accessed) and put it on the GPDB master host.

Install required PHD packages on GPDB cluster
Option 1

1. Unpack the PHD installation tarball. The PHD-2.0.1 release is used here as an example
[root@admin phd201]# tar xvfz PHD-

... ...

Then locate the following rpm packages in the unpacked directory

2. Install the rpm packages in the order shown to avoid dependency check errors. This must be done on all segment servers of the GPDB cluster
[root@admin phd201]# cd PHD-

[root@admin PHD-]# rpm -ivh \
utility/rpm/bigtop-jsvc-1.0.15_gphd_2_0_1_0-43.x86_64.rpm \
utility/rpm/bigtop-utils-0.4.0_gphd_2_0_1_0-43.noarch.rpm \
zookeeper/rpm/zookeeper-3.4.5_gphd_2_0_1_0-43.noarch.rpm \
hadoop/rpm/hadoop-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm \
hadoop/rpm/hadoop-yarn-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm \
hadoop/rpm/hadoop-mapreduce-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm

3. Configure the Hadoop configuration files and ensure that HDFS works.
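
Because step 2 has to be repeated on every segment server, it may help to generate the install command once and reuse it. The following is a minimal sketch, not part of the original procedure: the function name and the directory argument (where the tarball was unpacked) are hypothetical, and it only prints the command rather than running it.

```shell
# Package paths from step 2, in the order that satisfies the dependency checks.
PHD_RPMS="utility/rpm/bigtop-jsvc-1.0.15_gphd_2_0_1_0-43.x86_64.rpm
utility/rpm/bigtop-utils-0.4.0_gphd_2_0_1_0-43.noarch.rpm
zookeeper/rpm/zookeeper-3.4.5_gphd_2_0_1_0-43.noarch.rpm
hadoop/rpm/hadoop-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm
hadoop/rpm/hadoop-yarn-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm
hadoop/rpm/hadoop-mapreduce-2.0.2_alpha_gphd_2_0_1_0-43.x86_64.rpm"

# Print one "rpm -ivh" command covering all packages, keeping the order above.
# $1 is the directory the tarball was unpacked into (hypothetical argument).
build_rpm_install_cmd() {
  base_dir="$1"
  cmd="rpm -ivh"
  for pkg in $PHD_RPMS; do
    cmd="$cmd $base_dir/$pkg"
  done
  printf '%s\n' "$cmd"
}
```

One possible use, assuming the packages were first copied to the same path on every host and that seg_hosts is a host file you maintain, is to run the printed command as root through gpssh: gpssh -f seg_hosts -e "$(build_rpm_install_cmd /tmp/phd)".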

Option 2

1. If an admin node (where Pivotal Command Center is running) is available on the target PHD cluster, the required rpm packages can be found under /usr/lib/gphd/rpms on the admin node.

2. Install the rpm packages on all segment hosts of GPDB cluster through either of the following ways.

    a)  Copy those rpm packages to each segment server and install them with "rpm -ivh" command manually

    b) Add a repo file (such as gphd.repo) under /etc/yum.repos.d on every segment server with the content below. Note that "admin.hadoop.local" is the hostname of the admin node and must be changed to match your site.
name=PHD Admin Node Repo

Then run "yum install <rpm package name>" to complete the installation.

3. Configure the Hadoop configuration files and ensure that HDFS works.

GPDB Configuration

1. Set the JAVA_HOME environment variable correctly for the gpadmin user on all segment servers, as illustrated below. It is best to set it in .bashrc or .bash_profile
[gpadmin@admin ~]$echo $JAVA_HOME

[gpadmin@admin ~]$ls -l /usr/java/default
lrwxrwxrwx 1 root root 16 Jul 18 2013 /usr/java/default -> /usr/java/latest
[gpadmin@admin ~]$ls -l /usr/java/latest
lrwxrwxrwx 1 root root 21 Dec 15 2013 /usr/java/latest -> /usr/java/jdk1.7.0_25

[gpadmin@admin ~]$cat .bash_profile | grep JAVA_HOME
export JAVA_HOME=/usr/java/default

2. Set the parameters for GPDB
[gpadmin@admin ~]$ gpconfig -c gp_hadoop_home -v "'/usr/lib/gphd'"
[gpadmin@admin ~]$ gpconfig -c gp_hadoop_target_version -v "'gphd-2.0'"

Run "gpstop -u" to make the changes take effect.
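
The two configuration steps above could be scripted roughly as follows. This is only a sketch: both function names are hypothetical, the first adds the JAVA_HOME export to a profile file only if none is present yet, and the second simply displays the two parameters with "gpconfig -s" so the values can be checked by eye.

```shell
# Append "export JAVA_HOME=..." to a profile file unless an export of
# JAVA_HOME is already there, so running it twice does not duplicate the line.
# $1 = profile file, $2 = JDK location (for example /usr/java/default).
ensure_java_home() {
  profile="$1"
  java_home="$2"
  touch "$profile"
  if ! grep -q '^export JAVA_HOME=' "$profile"; then
    printf 'export JAVA_HOME=%s\n' "$java_home" >> "$profile"
  fi
}

# Show the values GPDB holds for the two gphdfs parameters.
# Needs a running GPDB cluster, so it is only defined here, not called.
show_gphdfs_settings() {
  gpconfig -s gp_hadoop_home
  gpconfig -s gp_hadoop_target_version
}
```

For example, ensure_java_home "$HOME/.bash_profile" /usr/java/default would be run as gpadmin on each segment server, and show_gphdfs_settings on the master after "gpstop -u".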


Verification

1. Check that HDFS is accessible from any of the segment servers
[gpadmin@sdw1 ~]$hdfs dfs -ls hdfs://hdm2:8020/

Found 7 items
drwxr-xr-x - hdfs hadoop 0 2014-06-14 21:20 hdfs://hdm2:8020/apps
drwxr-xr-x - postgres gpadmin 0 2014-06-16 04:55 hdfs://hdm2:8020/hawq_data
drwxr-xr-x - hdfs hadoop 0 2014-06-14 21:21 hdfs://hdm2:8020/hive
drwxr-xr-x - mapred hadoop 0 2014-06-14 21:20 hdfs://hdm2:8020/mapred
drwxrwxrwx - hdfs hadoop 0 2014-07-10 22:29 hdfs://hdm2:8020/tmp
drwxrwxrwx - hdfs hadoop 0 2014-06-16 18:11 hdfs://hdm2:8020/user
drwxr-xr-x - hdfs hadoop 0 2014-06-14 21:21 hdfs://hdm2:8020/yarn

2. Create a temporary text file and upload it to HDFS
[gpadmin@admin ~]$cat test1.txt
15,west
25,east

[gpadmin@admin ~]$ hdfs dfs -put test1.txt hdfs://hdm2:8020/tmp/

[gpadmin@admin ~]$hdfs dfs -ls hdfs://hdm2:8020/tmp
Found 1 items
-rw-r--r-- 3 gpadmin hadoop 16 2014-07-06 22:38 hdfs://hdm2:8020/tmp/test1.txt
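
A sample file for this step could be generated and sanity-checked as in the sketch below. The file content is an assumption inferred from the rest of this walkthrough (two comma-delimited rows, 16 bytes in total, matching the size in the listing above), and both function names are hypothetical.

```shell
# Write two comma-delimited sample rows to the file named in $1.
# The rows are an assumption based on the query result shown later.
make_sample_file() {
  printf '15,west\n25,east\n' > "$1"
}

# Succeed only if every line of the file in $1 has exactly two
# comma-separated fields, as the (age int, name text) table expects.
check_two_fields() {
  awk -F, 'NF != 2 { bad = 1 } END { exit bad + 0 }' "$1"
}
```

A typical sequence would be make_sample_file test1.txt, then check_two_fields test1.txt, then the "hdfs dfs -put" command from this step.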

3. Create a readable external table in GPDB that points to the sample file (test1.txt) in HDFS

NOTE: The location in the example below is for a single-NameNode deployment. With two NameNodes (High Availability), the location would instead take the form "gphdfs://<nameservice>/tmp/test1.txt": the port is omitted and the hostname is replaced with the name configured for the "fs.defaultFS" parameter in core-site.xml.
initdb=# create external table test_hdfs (age int, name text) location('gphdfs://hdm2:8020/tmp/test1.txt') format 'text' (delimiter ',');


initdb=# \d test_hdfs
External table "public.test_hdfs"
 Column |  Type   | Modifiers
--------+---------+-----------
 age    | integer |
 name   | text    |
Type: readable
Encoding: UTF8
Format type: text
Format options: delimiter ',' null '\N' escape '\'
External location: gphdfs://hdm2:8020/tmp/test1.txt
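
The difference between the single-NameNode and High Availability forms of the location in step 3 can be illustrated with a small helper. This is a sketch under assumptions: the function name is hypothetical, and "phdcluster" in the usage note stands in for whatever name fs.defaultFS holds in your core-site.xml.

```shell
# Build a gphdfs:// location string.
#   gphdfs_location HOST PORT PATH    single NameNode, explicit port
#   gphdfs_location NAMESERVICE PATH  HA deployment, no port
gphdfs_location() {
  if [ "$#" -eq 3 ]; then
    printf 'gphdfs://%s:%s%s\n' "$1" "$2" "$3"
  else
    printf 'gphdfs://%s%s\n' "$1" "$2"
  fi
}
```

Calling gphdfs_location hdm2 8020 /tmp/test1.txt yields the location used in the CREATE EXTERNAL TABLE statement above, while gphdfs_location phdcluster /tmp/test1.txt yields the HA form without a port.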

4. Query data from the external table
initdb=# select * from test_hdfs;
 age | name
-----+------
  15 | west
  25 | east
(2 rows)

5. Create a writable external table in GPDB that points to a file in HDFS. Here it takes its column definitions from an existing table, myt1
initdb=# select * from myt1;

  id  | name
------+-------
 1000 | Jason
(1 row)

initdb=# create writable external table test_hdfs2 (like myt1) location('gphdfs://hdm2:8020/tmp/test2.txt') format 'text' (delimiter ',');

NOTICE: Table doesn't have 'distributed by' clause, defaulting to distribution columns from LIKE table

6. Insert data into the writable external table
initdb=# insert into test_hdfs2 select * from myt1;


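7. Check the existence and content of the file in HDFS. Note that the location test2.txt is created as a directory, and the data is written to a file inside it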
[gpadmin@admin ~]$hdfs dfs -ls hdfs://hdm2:8020/tmp/test2.txt

Found 1 items
-rw-r--r-- 3 gpadmin hadoop 11 2014-07-13 23:37 hdfs://hdm2:8020/tmp/test2.txt/0_1402800555-0000000098
[gpadmin@admin ~]$hdfs dfs -cat hdfs://hdm2:8020/tmp/test2.txt/0_1402800555-0000000098
1000,Jason
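
Since the data file name under test2.txt is generated, it can be convenient to pull the file paths out of the listing instead of copying them by hand. The helper below is a hypothetical sketch that only parses "hdfs dfs -ls" output; it assumes the listing format shown above, where regular files start with "-" and the path is the last field.

```shell
# Read "hdfs dfs -ls" output on stdin and print the path (last field)
# of every regular file. Directory lines start with "d" and the
# "Found N items" header does not start with "-", so both are skipped.
list_data_files() {
  awk '/^-/ { print $NF }'
}
```

It could be combined with the commands from this step, for example: hdfs dfs -ls hdfs://hdm2:8020/tmp/test2.txt | list_data_files | while read -r f; do hdfs dfs -cat "$f"; done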
