Greenplum数据库建立外部表加载HDFS文件实践指导

南墨2年前技术文章1001

环境概述

(1)     Greenplum数据库版本号

Greenplum Database 4.3.0.0POC3 build 45206(基于PostgreSQL 8.2.15

GP软件包为:greenplum-db-4.3.0.0-SuSE10-x86_64.zip

 

(2)     Huawei Hadoop版本号为:

root@hadoop1:~> hadoop version

Hadoop 2.2

 

Greenplum节点配置

(1)     要求所有Greenplum DB数据库节点都部署Hadoop客户端

方法一:可以使用IDeploy工具在GPDB节点都安装Hadoop

方法二:可以将hadoop客户端节点上面已经存在的Hadoop路径拷贝到GPDB节点上面

这里,我只举例说明方法二(建议这种方法,比较省事方便):

A.     查看Hadoop的安装路径

root@hadoopcli1:~/hadoop> pwd

/opt/hadoop/     -----Hadoop安装路径

 

B.     执行远程拷贝命令,将/home/hadoop/hadoop整个目录拷贝到GPDB所有节点指定目录下面(所有GP节点Hadoop目录相同):

scp -r /home/hadoop/ gpadmin@xxx.xxx.xxx.xxx:/home/gpadmin/

 

       注:实际上GPDB使用GPHDFS协议加载Hadoop文件系统中数据文件时会用到Hadoop的一些jar包,而这些jar包基本上都位于/opt/hadoop/share/hadoop目录下面,包括HDFSmapreduceyarnjar包。

 

(2)     要求所有Greenplum DB数据库节点都部署JDKJDK版本号与Hadoop使用的一致)

部署Hadoop时,已经部署了JDK

        比如路径为:

    root@hadoop1:~> echo $JAVA_HOME

/opt/java

 

可以将Hadoop环境上面已经存在的JDK安装路径拷贝到GPDB节点上面

这里,我只举例说明方法二(建议这种方法,比较省事方便):

scp -r /opt/java gpadmin@xxx.xxx.xxx.xxx:/opt/gpadmin/

 

(3)     配置GPDB加载Hadoopjar包的环境变量文件(所有GPDB节点都修改)

编辑配置文件/usr/local/greenplum-db/lib/hadoop/hadoop_env.sh

将如下引入jar包脚本内容加入到hadoop_env.sh文件中:

 

for f in $HADOOP_HOME/share/hadoop/bdi/*.jar; do

  CLASSPATH=${CLASSPATH}:$f;

done

 

for f in $HADOOP_HOME/share/hadoop/mapreduce/*.jar; do

  CLASSPATH=${CLASSPATH}:$f;

done

 

for f in $HADOOP_HOME/share/hadoop/mapreduce/lib/*.jar; do

  CLASSPATH=${CLASSPATH}:$f;

done

 

for f in $HADOOP_HOME/share/hadoop/yarn/*.jar; do

  CLASSPATH=${CLASSPATH}:$f;

done

 

for f in $HADOOP_HOME/share/hadoop/yarn/lib/*.jar; do

  CLASSPATH=${CLASSPATH}:$f;

done

 

for f in $HADOOP_HOME/share/hadoop/hdfs/*.jar; do

  CLASSPATH=${CLASSPATH}:$f;

done

 

for f in $HADOOP_HOME/share/hadoop/hdfs/lib/*.jar; do

  CLASSPATH=${CLASSPATH}:$f;

done

 

for f in $HADOOP_HOME/share/hadoop/tools/lib/*.jar; do

  CLASSPATH=${CLASSPATH}:$f;

done

 

for f in $HADOOP_HOME/share/hadoop/common/*.jar; do

  CLASSPATH=${CLASSPATH}:$f;

done

 

for f in $HADOOP_HOME/share/hadoop/common/lib/*.jar; do

  CLASSPATH=${CLASSPATH}:$f;

done

 

for f in $HADOOP_HOME/hadoop/common/*.jar; do

  CLASSPATH=${CLASSPATH}:$f;

done

 

另外将文件中的如下内容:

JAVA_PLATFORM=`CLASSPATH=${CLASSPATH} ${JAVA} -Xmx32m ${HADOOP_JAVA_PLATFORM_OPTS} org.apache.hadoop.util.PlatformName | sed -e "s/ /_/g"`

修改为:

JAVA_PLATFORM=`CLASSPATH=${CLASSPATH} ${JAVA} -Xmx32m -classpath /usr/local/greenplum-db/lib/hadoop/hadoop-auth-*.jar org.apache.hadoop.util.PlatformName | sed -e "s/ /_/g"`

 

 

(4)     添加GPDB所有节点配置环境变量

修改安装Greenplum DB数据库用户家目录下面的环境变量文件.bashrc,添加如下内容:

export JAVA_HOME=/home/gpadmin/jdk*

export CLASSPATH=$JAVA_HOME/lib/tools.jar

export HADOOP_HOME=/opt/hadoop

export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

 

(5)     修改Master节点的/data/master/gpseg-1/postgresql.conf配置文件,添加如下内容:

gp_hadoop_home='/opt/dbadmin/hadoop/'

gp_hadoop_target_version='gphd-2.0'

 

(6)     重启GPDB数据库

gpstop -r

 

测试创建外部表

(1)     HadoopHDFS里面准备测试数据

查看准备的全部数据文件,以及每个文件大小(32个文件)

hadoop fs -ls -R -h /test_file/test_file_gb_1.dat-*

查看文件内容总行数:

hadoop fs -cat /test_file/test_file_gb_1.dat-* | wc -l

4174417

 

(2)     连接GPDB数据库,创建外部表

CREATE EXTERNAL TABLE test

(

    cell_name       text,                                     

    pro_name        text,

    region_name     text,

    city_name       text,

    city_id text

)

LOCATION ( 'gphdfs://[ip]:[port]/test_file/test_file_gb_1.dat-*')

FORMAT 'text' (delimiter '|')

ENCODING 'UTF8'

LOG ERRORS INTO wide_table_error SEGMENT REJECT LIMIT 1000000 ROWS ;

 

(3)     查看外部表加载的文件内容行数

select count(1) from test;

NOTICE:  Found 4 data formatting errors (4 or more input rows). Rejected related input data.

count 

---------

4174417           ----通过外部表可以查看到所有文件中的内容,没有丢弃数据

(1 row)

 

外部表数据加入到事实表

(1)     创建表

 

CREATE TABLE fact_test

(

    cell_name       text,                                     

    pro_name        text,

    region_name     text,

    city_name       text

    city_id text

) distributed by (city_name,city_id);

 

(2)     通过外部表,插入数据

noas=# insert into fact_wide_table select * from fact_test;

NOTICE:  Found 4 data formatting errors (4 or more input rows). Rejected related input data.

INSERT 0 4174417       ------数据全部加载到事实表

(3)     查询

可以查询事实表的前10行看数据是否存在:

select * from fact_test limit 10;

 

GP外部表访问FI上的HDFS文件:

1、 安装GPmaster,segment,gpadmin/gpadmin,家目录:/opt/gpadmin

2、 GPmaster,segment上安装FIclient,安装目录如:/opt/gpadmin/hadoopcli

     3GPmaster节点,配置:

            /opt/greenplum/master/gpseg-1目录下打开:postgresql.conf,去掉注释:

            gp_external_enable_exec = on   # enable external tables with EXECUTE.

            gp_external_grant_privileges = on #enable create http/gpfdist for non su's

             

            新增:

            gp_hadoop_home='/opt/gpadmin/hadoopcli/HDFS/hadoop'

            gp_hadoop_target_version='gphd-2.0'

 

     4GPMaster,segment节点,gpadmin的家目录下,修改.bashr,并source生效:

 

            export LC_ALL=en_US.utf-8

            export.utf-8

            source /usr/local/greenplum-db/greenplum_path.sh

            export MASTER_DATA_DIRECTORY=/opt/greenplum/master/gpseg-1

            export PGPORT=5432

            export PGUSER=gpadmin

 

            export HADOOP_HOME=/opt/gpadmin/hadoopcli/HDFS/hadoop

            export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

            export JAVA_HOME=/opt/java

            export CLASSPATH=$JAVA_HOME/lib/tools.jar

            #以下配置安全集群填写

            export LD_LIBRARY_PATH=/opt/kerberos/lib:${LD_LIBRARY_PATH}

             

     5GPmaster,segment节点上安装的FI客户端,修改如下目录的文件:/opt/gpadmin/hadoopcli/HDFS/hadoop/etc/hadoop/hdfs-site.xml,新增如下配置,其中hdfs.keytab是从hadoop服务端上获取并放置在该客户端目录/opt/gpadmin/hadoopcli

 

            <property>

            <name>com.emc.greenplum.gpdb.hdfsconnector.security.user.name</name>

            <value>hdfs/hadoop@HADOOP.COM</value>

            </property>

            <property>

            <name>com.emc.greenplum.gpdb.hdfsconnector.security.user.keytab.file</name>

            <value>/opt/gpadmin/hdfs.keytab</value>

            </property>

 

      6、在GPmaster,segment节点上,把hadoop节点下的/etc/krb5.conf,拷贝覆盖操作系统的/etc下的krb5.conf

 

            注意:GP上所有配置生效后,重启GP

 

            经过以上6步骤,可以在GP上建立外部表,并在GP上可以查询到:

 

图片 1.png

 

    FAQ1,未设置正确环境变量或未安装好FI客户端(参考上面1~4配置):

        14:02:16  [SELECT - 0 row(s), 0.000 secs]  [Error Code: 0, SQL State: 38000]  ERROR: external table gphdfs protocol command ended with error. 15/01/14 14:02:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable  (seg5 slice1 linux4390:40005 pid=23929)

          Detail:

        

        Exception in thread "main" org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]

               at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

               at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)

               at sun.reflect.DelegatingConstructorAcc

        Command: 'gphdfs://xxx.xxx.xxx.xxx:port/gp/ok.txt'

                External table ok, file gphdfs://xxx.xxx.xxx.xxx:port/gp/ok.txt

        ... 1 statement(s) executed, 0 row(s) affected, exec/fetch time: 0.000/0.000 sec [0 successful, 0 warnings, 1 errors]

 

 

 

        FAQ2GP侧的FI客户端要配置读取keytab的用户和对应文件目录(参考上面第5步),否则:

 

        ... Physical database connection acquired for: gp

         15:17:56  [SELECT - 0 row(s), 0.000 secs]  [Error Code: 0, SQL State: 38000]  ERROR: external table gphdfs protocol command ended with error. 15/01/14 15:17:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable  (seg0 slice1 linux4390:40000 pid=8466)

          Detail:

        

        Exception in thread "main" java.io.IOException: Running in secure mode, but config doesn't have a keytab

               at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:235)

               at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:206)

               at com.emc.greenplum.gpdb.hdfsconnector.ConnectorUtil.loginSecureHadoop(ConnectorUtil.java:84)

               at com.e

        Command: 'gphdfs://xxx.xxx.xxx.xxx:port/gp/ok.txt'

                External table ok, file gphdfs://xxx.xxx.xxx.xxx:port/gp/ok.txt

        ... 1 statement(s) executed, 0 row(s) affected, exec/fetch time: 0.000/0.000 sec [0 successful, 0 warnings, 1 errors]

         

        FAQ3,需要把hadoop节点krb5.conf拷贝覆盖greenplum操作系统/etc目录下的krb5.conf(参考:6),否则:

 

        ... Physical database connection acquired for: gp

         15:38:15  [SELECT - 0 row(s), 0.000 secs]  [Error Code: 0, SQL State: 38000]  ERROR: external table gphdfs protocol command ended with error. Exception in thread "main" java.lang.IllegalArgumentException: Can't get Kerberos realm  (seg1 slice1 linux4390:40001 pid=40524)

          Detail:

                

               at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:65)

               at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:248)

               at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:233)

               at org.apache.hadoop.security.UserGroupInformation.isAuthenticationMethodEnabled(UserGroupInformation.java:310)

               at org.apach

        Command: 'gphdfs://xxx.xxx.xxx.xxx:port/gp/ok.txt'

                External table ok, file gphdfs://xxx.xxx.xxx.xxx:port/gp/ok.txt

        ... 1 statement(s) executed, 0 row(s) affected, exec/fetch time: 0.000/0.000 sec [0 successful, 0 warnings, 1 errors]


相关文章

MinIO文件服务

MinIO文件服务

MinIO官网地址:docs.min.io/cn/一、Minio简介Minio 是一个基于Apache License v2.0开源协议的对象存储服务。它兼容亚马逊S3云存储服务接口,非常适合于存储大...

Hbase部署

安装前准备1.1. 设置环境变量所有hbase节点都要做vi /etc/profile export HBASE_HOME=/opt/hbaseexport PATH=$PATH:$HBASE_HOM...

mysql部署

安装前准备创建用户和目录创建用户:groupadd mysqluseradd -g mysql mysql 创建目录mkdir /data/mysqlmkdir /log/mysql添加环境变量vim...

RMAN-08137处理

现象:删除归档的时候报错:RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture p...

Docker镜像是有仓库

在Docker中,当我们执行 docker pull xxx 的时候 ,它实际上是从 hub.docker.com 这个地址去查找,这就是 Docker 公司为我们提供的公共仓库。在工作中,我们不可能...

离线部署Mysql 8.X

离线部署Mysql 8.X

安装包和环境准备将mysql-8.0.23-el7-x86_64.tar.gz安装包上传至服务器wget https://cdn.mysql.com/archives/mysql-8.0/mysql-...

发表评论    

◎欢迎参与讨论,请在这里发表您的看法、交流您的观点。