A Practical Guide to Loading HDFS Files into Greenplum via External Tables


Environment Overview

(1) Greenplum database version

Greenplum Database 4.3.0.0 POC3 build 45206 (based on PostgreSQL 8.2.15)

GP package: greenplum-db-4.3.0.0-SuSE10-x86_64.zip

 

(2) Huawei Hadoop version:

root@hadoop1:~> hadoop version
Hadoop 2.2

 

Greenplum Node Configuration

(1) The Hadoop client must be deployed on every Greenplum DB node

Method 1: use the IDeploy tool to install Hadoop on all GPDB nodes.

Method 2: copy the existing Hadoop directory from a Hadoop client node to the GPDB nodes.

Only Method 2 is illustrated here (recommended, as it is the simpler of the two):

A. Check the Hadoop installation path

root@hadoopcli1:~/hadoop> pwd
/opt/hadoop/     ----- Hadoop installation path

 

B. Run the remote copy command to replicate the entire Hadoop directory (here /home/hadoop/) to the designated path on all GPDB nodes (the Hadoop directory must be the same on every GP node):

scp -r /home/hadoop/ gpadmin@xxx.xxx.xxx.xxx:/home/gpadmin/
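
To push the directory to every GP node in one pass, a minimal sketch looping over a host list (the gpdb_hosts file name is an assumption, not part of the original setup):

# Assumption: gpdb_hosts lists one GPDB node hostname/IP per line.
for host in $(cat gpdb_hosts); do
    scp -r /home/hadoop/ gpadmin@${host}:/home/gpadmin/
done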

 

       Note: when GPDB loads data files from the Hadoop file system via the gphdfs protocol, it relies on a number of Hadoop jars, essentially all of which live under /opt/hadoop/share/hadoop, including the HDFS, mapreduce, and yarn jars.

 

(2) The JDK must be deployed on every Greenplum DB node (the JDK version must match the one Hadoop uses)

The JDK was already deployed along with Hadoop.

        For example, the path might be:
    root@hadoop1:~> echo $JAVA_HOME
    /opt/java

 

Copy the JDK installation that already exists in the Hadoop environment to the GPDB nodes.

As with the Hadoop client above, copying the existing installation is the simplest approach:

scp -r /opt/java gpadmin@xxx.xxx.xxx.xxx:/opt/gpadmin/

 

(3) Configure the environment file through which GPDB loads the Hadoop jars (edit on all GPDB nodes)

Edit the configuration file /usr/local/greenplum-db/lib/hadoop/hadoop_env.sh

Add the following jar-loading lines to hadoop_env.sh:

 

for d in bdi mapreduce mapreduce/lib yarn yarn/lib hdfs hdfs/lib tools/lib common common/lib; do
  for f in $HADOOP_HOME/share/hadoop/$d/*.jar; do
    CLASSPATH=${CLASSPATH}:$f
  done
done

 

Also change the following line in the file:

JAVA_PLATFORM=`CLASSPATH=${CLASSPATH} ${JAVA} -Xmx32m ${HADOOP_JAVA_PLATFORM_OPTS} org.apache.hadoop.util.PlatformName | sed -e "s/ /_/g"`

to:

JAVA_PLATFORM=`CLASSPATH=${CLASSPATH} ${JAVA} -Xmx32m -classpath /usr/local/greenplum-db/lib/hadoop/hadoop-auth-*.jar org.apache.hadoop.util.PlatformName | sed -e "s/ /_/g"`
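
As a quick sanity check (a sketch, assuming HADOOP_HOME and JAVA_HOME are already exported as in step (4) below), source the modified file and confirm the jars were picked up and the PlatformName class referenced above resolves:

source /usr/local/greenplum-db/lib/hadoop/hadoop_env.sh
echo $CLASSPATH | tr ':' '\n' | grep -c '\.jar$'      # number of jars on the classpath
$JAVA_HOME/bin/java -classpath "$CLASSPATH" org.apache.hadoop.util.PlatformName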

 

 

(4) Configure environment variables on all GPDB nodes

Edit .bashrc in the home directory of the user that runs Greenplum DB and add the following:

export JAVA_HOME=/home/gpadmin/jdk*
export CLASSPATH=$JAVA_HOME/lib/tools.jar
export HADOOP_HOME=/opt/hadoop
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
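
To verify the variables took effect on every node at once, a sketch using Greenplum's gpssh utility (the host file name all_hosts is an assumption):

gpssh -f all_hosts -e 'source ~/.bashrc; java -version; hadoop version'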

 

(5) Edit the /data/master/gpseg-1/postgresql.conf file on the Master node and add the following (gp_hadoop_home must point to the Hadoop installation directory on the GPDB nodes):

gp_hadoop_home='/opt/dbadmin/hadoop/'
gp_hadoop_target_version='gphd-2.0'
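
Alternatively, Greenplum's gpconfig utility can write these parameters into postgresql.conf on the master and all segments in one step; a sketch with the same values (quoting conventions may vary by gpconfig version):

gpconfig -c gp_hadoop_home -v "'/opt/dbadmin/hadoop/'"
gpconfig -c gp_hadoop_target_version -v "'gphd-2.0'"

The restart in step (6) still applies afterwards.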

 

(6) Restart the GPDB database

gpstop -r

 

Testing External Table Creation

(1) Prepare test data in Hadoop HDFS

List all the prepared data files with their sizes (32 files):

hadoop fs -ls -R -h /test_file/test_file_gb_1.dat-*

Count the total number of lines across the files:

hadoop fs -cat /test_file/test_file_gb_1.dat-* | wc -l
4174417
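
For reference, a minimal sketch of producing a pipe-delimited file in the same five-column layout and uploading it (sample.dat and its contents are illustrative, not the original data set):

# Assumption: sample.dat is an illustrative stand-in for the real test files.
printf 'cellA|prov1|regionX|cityY|1001\ncellB|prov2|regionZ|cityW|1002\n' > sample.dat
hadoop fs -mkdir -p /test_file
hadoop fs -put sample.dat /test_file/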

 

(2) Connect to the GPDB database and create the external table

CREATE EXTERNAL TABLE test
(
    cell_name       text,
    pro_name        text,
    region_name     text,
    city_name       text,
    city_id         text
)
LOCATION ('gphdfs://[ip]:[port]/test_file/test_file_gb_1.dat-*')
FORMAT 'text' (delimiter '|')
ENCODING 'UTF8'
LOG ERRORS INTO wide_table_error SEGMENT REJECT LIMIT 1000000 ROWS;
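
Because of LOG ERRORS INTO ... SEGMENT REJECT LIMIT, malformed rows are diverted into wide_table_error instead of aborting the load. A sketch of inspecting the rejected rows afterwards (column names follow the standard GP 4.3 error-table layout; adjust if your version differs):

psql -c "SELECT cmdtime, relname, linenum, errmsg FROM wide_table_error ORDER BY cmdtime DESC LIMIT 10;"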

 

(3) Check the row count visible through the external table

select count(1) from test;
NOTICE:  Found 4 data formatting errors (4 or more input rows). Rejected related input data.
  count
---------
 4174417           ---- the external table sees the contents of all the files; no data was dropped
(1 row)

 

Loading External Table Data into a Fact Table

(1) Create the table

 

CREATE TABLE fact_test
(
    cell_name       text,
    pro_name        text,
    region_name     text,
    city_name       text,
    city_id         text
) distributed by (city_name, city_id);

 

(2) Insert the data via the external table

noas=# insert into fact_test select * from test;
NOTICE:  Found 4 data formatting errors (4 or more input rows). Rejected related input data.
INSERT 0 4174417       ------ all data loaded into the fact table

(3) Query

Query the first 10 rows of the fact table to verify the data exists:

select * from fact_test limit 10;

 

Using GP External Tables to Access HDFS Files on FusionInsight (FI):

1. Install GP (master and segments) as gpadmin/gpadmin, home directory: /opt/gpadmin

2. Install the FI client on the GP master and segment nodes, e.g. under /opt/gpadmin/hadoopcli

     3. On the GP master node, configure the following:
            Under /opt/greenplum/master/gpseg-1, open postgresql.conf and uncomment:

            gp_external_enable_exec = on   # enable external tables with EXECUTE.

            gp_external_grant_privileges = on #enable create http/gpfdist for non su's

             

            Add:

            gp_hadoop_home='/opt/gpadmin/hadoopcli/HDFS/hadoop'

            gp_hadoop_target_version='gphd-2.0'

 

     4. On the GP master and segment nodes, edit .bashrc in gpadmin's home directory, then source it to take effect:

 

            export LC_ALL=en_US.utf-8
            export LANG=en_US.utf-8
            source /usr/local/greenplum-db/greenplum_path.sh
            export MASTER_DATA_DIRECTORY=/opt/greenplum/master/gpseg-1
            export PGPORT=5432
            export PGUSER=gpadmin

            export JAVA_HOME=/opt/java
            export CLASSPATH=$JAVA_HOME/lib/tools.jar
            export HADOOP_HOME=/opt/gpadmin/hadoopcli/HDFS/hadoop
            export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
            # The following is only needed for a secured (Kerberos) cluster
            export LD_LIBRARY_PATH=/opt/kerberos/lib:${LD_LIBRARY_PATH}

             

     5. On the FI client installed on the GP master and segment nodes, edit /opt/gpadmin/hadoopcli/HDFS/hadoop/etc/hadoop/hdfs-site.xml and add the following properties. The hdfs.keytab file is obtained from the Hadoop server and placed on the client at the path configured below:

 

            <property>
              <name>com.emc.greenplum.gpdb.hdfsconnector.security.user.name</name>
              <value>hdfs/hadoop@HADOOP.COM</value>
            </property>
            <property>
              <name>com.emc.greenplum.gpdb.hdfsconnector.security.user.keytab.file</name>
              <value>/opt/gpadmin/hdfs.keytab</value>
            </property>
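
            Before testing through GP, it is worth validating the keytab and principal directly from a GP node; a sketch, assuming the values configured above and the FI client on the PATH:

            kinit -kt /opt/gpadmin/hdfs.keytab hdfs/hadoop@HADOOP.COM
            hadoop fs -ls /          # should list HDFS root with no authentication errors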

 

      6. On the GP master and segment nodes, copy /etc/krb5.conf from a Hadoop node over the operating system's /etc/krb5.conf.
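
            A sketch of fetching and distributing the file with Greenplum's gpscp (the host file all_hosts and the source node hadoop1 are assumptions; '=' expands to each target hostname; writing to /etc requires suitable privileges):

            scp root@hadoop1:/etc/krb5.conf /tmp/krb5.conf
            gpscp -f all_hosts /tmp/krb5.conf =:/etc/krb5.conf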

 

            Note: restart GP after all of the above configuration is in place.

 

            After the six steps above, an external table can be created in GP and queried successfully.

 


 

    FAQ 1: Incorrect environment variables, or the FI client not properly installed (see steps 1-4 above):

        14:02:16  [SELECT - 0 row(s), 0.000 secs]  [Error Code: 0, SQL State: 38000]  ERROR: external table gphdfs protocol command ended with error. 15/01/14 14:02:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable  (seg5 slice1 linux4390:40005 pid=23929)

          Detail:

        

        Exception in thread "main" org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]

               at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

               at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)

               at sun.reflect.DelegatingConstructorAcc

        Command: 'gphdfs://xxx.xxx.xxx.xxx:port/gp/ok.txt'

                External table ok, file gphdfs://xxx.xxx.xxx.xxx:port/gp/ok.txt

        ... 1 statement(s) executed, 0 row(s) affected, exec/fetch time: 0.000/0.000 sec [0 successful, 0 warnings, 1 errors]

 

 

 

        FAQ2GP侧的FI客户端要配置读取keytab的用户和对应文件目录(参考上面第5步),否则:

 

        ... Physical database connection acquired for: gp

         15:17:56  [SELECT - 0 row(s), 0.000 secs]  [Error Code: 0, SQL State: 38000]  ERROR: external table gphdfs protocol command ended with error. 15/01/14 15:17:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable  (seg0 slice1 linux4390:40000 pid=8466)

          Detail:

        

        Exception in thread "main" java.io.IOException: Running in secure mode, but config doesn't have a keytab

               at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:235)

               at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:206)

               at com.emc.greenplum.gpdb.hdfsconnector.ConnectorUtil.loginSecureHadoop(ConnectorUtil.java:84)

               at com.e

        Command: 'gphdfs://xxx.xxx.xxx.xxx:port/gp/ok.txt'

                External table ok, file gphdfs://xxx.xxx.xxx.xxx:port/gp/ok.txt

        ... 1 statement(s) executed, 0 row(s) affected, exec/fetch time: 0.000/0.000 sec [0 successful, 0 warnings, 1 errors]

         

        FAQ 3: krb5.conf from a Hadoop node must be copied over /etc/krb5.conf on the Greenplum OS (see step 6); otherwise:

 

        ... Physical database connection acquired for: gp

         15:38:15  [SELECT - 0 row(s), 0.000 secs]  [Error Code: 0, SQL State: 38000]  ERROR: external table gphdfs protocol command ended with error. Exception in thread "main" java.lang.IllegalArgumentException: Can't get Kerberos realm  (seg1 slice1 linux4390:40001 pid=40524)

          Detail:

                

               at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:65)

               at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:248)

               at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:233)

               at org.apache.hadoop.security.UserGroupInformation.isAuthenticationMethodEnabled(UserGroupInformation.java:310)

               at org.apach

        Command: 'gphdfs://xxx.xxx.xxx.xxx:port/gp/ok.txt'

                External table ok, file gphdfs://xxx.xxx.xxx.xxx:port/gp/ok.txt

        ... 1 statement(s) executed, 0 row(s) affected, exec/fetch time: 0.000/0.000 sec [0 successful, 0 warnings, 1 errors]

