A Practical Guide to Creating Greenplum External Tables to Load HDFS Files


Environment Overview

(1) Greenplum database version

Greenplum Database 4.3.0.0 POC3 build 45206 (based on PostgreSQL 8.2.15)

GP package: greenplum-db-4.3.0.0-SuSE10-x86_64.zip

 

(2) Huawei Hadoop version:

root@hadoop1:~> hadoop version

Hadoop 2.2

 

Greenplum Node Configuration

(1) A Hadoop client must be deployed on every Greenplum DB node

Method 1: Use the IDeploy tool to install Hadoop on all GPDB nodes.

Method 2: Copy the existing Hadoop directory from a Hadoop client node to the GPDB nodes.

Only Method 2 is illustrated here (it is the recommended approach, as it is simpler and more convenient):

A. Check the Hadoop installation path

root@hadoopcli1:~/hadoop> pwd

/opt/hadoop/     ----- Hadoop installation path

 

B. Run a remote copy command to copy the entire /home/hadoop/hadoop directory to the designated directory on every GPDB node (the Hadoop directory must be identical on all GP nodes):

scp -r /home/hadoop/ gpadmin@xxx.xxx.xxx.xxx:/home/gpadmin/
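
If there are many GPDB nodes, the copy can be scripted. A minimal sketch, assuming a hypothetical hostfile that lists one GP host per line (the file name is illustrative):

# Copy the Hadoop client directory to every GP host listed in hostfile
while read host; do
  scp -r /home/hadoop/ gpadmin@"$host":/home/gpadmin/
done < hostfile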

 

       Note: when GPDB loads data files from the Hadoop file system via the gphdfs protocol, it relies on a number of Hadoop jars, and these jars are mostly located under /opt/hadoop/share/hadoop, including the HDFS, MapReduce, and YARN jars.

 

(2) A JDK must be deployed on every Greenplum DB node (the JDK version must match the one used by Hadoop)

A JDK was already deployed when Hadoop was installed.

        For example, its path is:

    root@hadoop1:~> echo $JAVA_HOME

/opt/java

 

The existing JDK installation directory from the Hadoop environment can be copied to the GPDB nodes.

As before, copying the existing directory is the recommended approach, as it is simpler and more convenient:

scp -r /opt/java gpadmin@xxx.xxx.xxx.xxx:/opt/gpadmin/
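
To confirm the copied JDK works on every node, Greenplum's gpssh utility can run a check cluster-wide. A minimal sketch, assuming the JDK landed in /opt/gpadmin/java and the same hostfile as above:

# Print the Java version on all GP hosts at once; it must match the version Hadoop uses
gpssh -f hostfile -e '/opt/gpadmin/java/bin/java -version'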

 

(3) Configure the environment file GPDB uses to load the Hadoop jars (modify it on all GPDB nodes)

Edit the configuration file /usr/local/greenplum-db/lib/hadoop/hadoop_env.sh

Add the following jar-loading script to the hadoop_env.sh file:

 

# Append every Hadoop client jar directory to the classpath
for d in bdi mapreduce mapreduce/lib yarn yarn/lib hdfs hdfs/lib tools/lib common common/lib; do
  for f in $HADOOP_HOME/share/hadoop/$d/*.jar; do
    CLASSPATH=${CLASSPATH}:$f
  done
done

for f in $HADOOP_HOME/hadoop/common/*.jar; do
  CLASSPATH=${CLASSPATH}:$f
done

 

In addition, change the following line in the file:

JAVA_PLATFORM=`CLASSPATH=${CLASSPATH} ${JAVA} -Xmx32m ${HADOOP_JAVA_PLATFORM_OPTS} org.apache.hadoop.util.PlatformName | sed -e "s/ /_/g"`

to:

JAVA_PLATFORM=`CLASSPATH=${CLASSPATH} ${JAVA} -Xmx32m -classpath /usr/local/greenplum-db/lib/hadoop/hadoop-auth-*.jar org.apache.hadoop.util.PlatformName | sed -e "s/ /_/g"`
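
Because hadoop_env.sh must be identical on every node, it is easy to miss one. A hedged verification sketch using gpssh (hostfile assumed as before) to confirm the modified line is present everywhere:

# Every host should print the -classpath variant of the JAVA_PLATFORM line
gpssh -f hostfile -e "grep 'hadoop-auth' /usr/local/greenplum-db/lib/hadoop/hadoop_env.sh"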

 

 

(4) Configure environment variables on all GPDB nodes

Edit the .bashrc environment file in the home directory of the Greenplum DB installation user and add the following:

export JAVA_HOME=/home/gpadmin/jdk*

export CLASSPATH=$JAVA_HOME/lib/tools.jar

export HADOOP_HOME=/opt/hadoop

export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
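
After editing .bashrc, re-source it and confirm the variables resolve. A quick check, assuming the paths above:

# Reload the environment and confirm the toolchain is reachable
source ~/.bashrc
echo "JAVA_HOME=$JAVA_HOME HADOOP_HOME=$HADOOP_HOME"
java -version      # should print the JDK copied from the Hadoop environment
hadoop version     # should print Hadoop 2.2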

 

(5) Edit the /data/master/gpseg-1/postgresql.conf file on the Master node and add the following:

gp_hadoop_home='/opt/dbadmin/hadoop/'

gp_hadoop_target_version='gphd-2.0'
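
Instead of editing postgresql.conf by hand, the same parameters can usually be set with Greenplum's gpconfig utility; this is a sketch of that alternative (verify that your GP version accepts these GUCs through gpconfig):

# Set the gphdfs parameters cluster-wide; note the nested quotes for string values
gpconfig -c gp_hadoop_home -v "'/opt/dbadmin/hadoop/'"
gpconfig -c gp_hadoop_target_version -v "'gphd-2.0'"
gpconfig -s gp_hadoop_target_version   # show the current setting to verify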

 

(6) Restart the GPDB database

gpstop -r

 

Testing External Table Creation

(1) Prepare test data in Hadoop HDFS

List all the prepared data files and their sizes (32 files):

hadoop fs -ls -R -h /test_file/test_file_gb_1.dat-*

Count the total number of lines across the files:

hadoop fs -cat /test_file/test_file_gb_1.dat-* | wc -l

4174417
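
If you need to produce similar test data yourself, the sketch below writes a small pipe-delimited file and uploads it to HDFS (the file name and field values are made up for illustration):

# Generate 1000 rows of '|'-delimited sample data matching the 5-column layout used below
for i in $(seq 1 1000); do
  echo "cell_$i|prov_$i|region_$i|city_$i|$i"
done > /tmp/sample.dat

# Upload the file to the HDFS test directory
hadoop fs -mkdir -p /test_file
hadoop fs -put /tmp/sample.dat /test_file/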

 

(2) Connect to the GPDB database and create the external table

CREATE EXTERNAL TABLE test

(

    cell_name       text,                                     

    pro_name        text,

    region_name     text,

    city_name       text,

    city_id         text

)

LOCATION ( 'gphdfs://[ip]:[port]/test_file/test_file_gb_1.dat-*')

FORMAT 'text' (delimiter '|')

ENCODING 'UTF8'

LOG ERRORS INTO wide_table_error SEGMENT REJECT LIMIT 1000000 ROWS ;

 

(3) Check the row count visible through the external table

select count(1) from test;

NOTICE:  Found 4 data formatting errors (4 or more input rows). Rejected related input data.

count 

---------

4174417           ---- the content of all files is visible through the external table; no data was lost

(1 row)
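
The NOTICE reports 4 rejected rows; because the table was created with LOG ERRORS INTO wide_table_error, the rejected raw lines can be inspected. A sketch via psql (the column names follow the usual GP error-table layout; confirm them on your version):

# Show the rejected rows recorded by LOG ERRORS INTO
psql -d noas -c "SELECT linenum, errmsg, rawdata FROM wide_table_error LIMIT 5;"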

 

Loading External Table Data into a Fact Table

(1) Create the table

 

CREATE TABLE fact_test

(

    cell_name       text,                                     

    pro_name        text,

    region_name     text,

    city_name       text,

    city_id         text

) distributed by (city_name,city_id);
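
Because the table is distributed by (city_name, city_id), rows are hashed across segments on those two columns. After the insert in step (2) below, the spread can be sanity-checked with the hidden gp_segment_id column:

# Rows per segment; a heavily skewed count suggests a poor distribution key
psql -d noas -c "SELECT gp_segment_id, count(*) FROM fact_test GROUP BY 1 ORDER BY 1;"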

 

(2) Insert data via the external table

noas=# insert into fact_test select * from test;

NOTICE:  Found 4 data formatting errors (4 or more input rows). Rejected related input data.

INSERT 0 4174417       ------ all data loaded into the fact table

(3) Query

Query the first 10 rows of the fact table to confirm the data is present:

select * from fact_test limit 10;

 

GP External Tables Accessing HDFS Files on FI (Huawei FusionInsight):

1. Install GP (master and segments); user/password gpadmin/gpadmin; home directory: /opt/gpadmin

2. Install the FI client on the GP master and segment nodes, e.g., in /opt/gpadmin/hadoopcli

     3. On the GP master node, configure:

            In the /opt/greenplum/master/gpseg-1 directory, open postgresql.conf and uncomment:

            gp_external_enable_exec = on   # enable external tables with EXECUTE.

            gp_external_grant_privileges = on #enable create http/gpfdist for non su's

             

            Add:

            gp_hadoop_home='/opt/gpadmin/hadoopcli/HDFS/hadoop'

            gp_hadoop_target_version='gphd-2.0'

 

     4. On the GP master and segment nodes, edit .bashrc in gpadmin's home directory and source it to take effect:

 

            export LC_ALL=en_US.utf-8

            export LANG=en_US.utf-8

            source /usr/local/greenplum-db/greenplum_path.sh

            export MASTER_DATA_DIRECTORY=/opt/greenplum/master/gpseg-1

            export PGPORT=5432

            export PGUSER=gpadmin

 

            export JAVA_HOME=/opt/java

            export CLASSPATH=$JAVA_HOME/lib/tools.jar

            export HADOOP_HOME=/opt/gpadmin/hadoopcli/HDFS/hadoop

            export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

            # Set the following only for a secure (Kerberos-enabled) cluster

            export LD_LIBRARY_PATH=/opt/kerberos/lib:${LD_LIBRARY_PATH}

             

     5. On the FI client installed on the GP master and segment nodes, edit the file /opt/gpadmin/hadoopcli/HDFS/hadoop/etc/hadoop/hdfs-site.xml and add the following configuration, where hdfs.keytab is obtained from the Hadoop server side and placed on the client at the path referenced below:

 

            <property>
              <name>com.emc.greenplum.gpdb.hdfsconnector.security.user.name</name>
              <value>hdfs/hadoop@HADOOP.COM</value>
            </property>
            <property>
              <name>com.emc.greenplum.gpdb.hdfsconnector.security.user.keytab.file</name>
              <value>/opt/gpadmin/hdfs.keytab</value>
            </property>
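
hdfs-site.xml and the keytab must be identical on every GP host; Greenplum's gpscp utility can push them out (hostfile is assumed as in the earlier sketches):

# Push the edited hdfs-site.xml and the keytab to the same paths on all GP hosts
gpscp -f hostfile /opt/gpadmin/hadoopcli/HDFS/hadoop/etc/hadoop/hdfs-site.xml =:/opt/gpadmin/hadoopcli/HDFS/hadoop/etc/hadoop/
gpscp -f hostfile /opt/gpadmin/hdfs.keytab =:/opt/gpadmin/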

 

      6. On the GP master and segment nodes, copy /etc/krb5.conf from a Hadoop node and overwrite krb5.conf under /etc on the operating system.
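
Before creating the external table, it is worth verifying the Kerberos pieces end-to-end from a GP node. A hedged sketch, using the keytab and principal configured in step 5:

# Obtain a ticket with the configured keytab, then list HDFS over the secure cluster
kinit -kt /opt/gpadmin/hdfs.keytab hdfs/hadoop@HADOOP.COM
klist                 # confirm a valid ticket for hdfs/hadoop@HADOOP.COM
hadoop fs -ls /       # should now succeed against the secure HDFS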

 

            Note: restart GP for all of the above configuration to take effect.

 

            After the above 6 steps, an external table can be created in GP and queried:

 

[Figure 1: result of querying the external table in GP]

 

    FAQ 1: The environment variables are not set correctly, or the FI client is not installed properly (see steps 1-4 above):

        14:02:16  [SELECT - 0 row(s), 0.000 secs]  [Error Code: 0, SQL State: 38000]  ERROR: external table gphdfs protocol command ended with error. 15/01/14 14:02:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable  (seg5 slice1 linux4390:40005 pid=23929)

          Detail:

        

        Exception in thread "main" org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]

               at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

               at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)

               at sun.reflect.DelegatingConstructorAcc

        Command: 'gphdfs://xxx.xxx.xxx.xxx:port/gp/ok.txt'

                External table ok, file gphdfs://xxx.xxx.xxx.xxx:port/gp/ok.txt

        ... 1 statement(s) executed, 0 row(s) affected, exec/fetch time: 0.000/0.000 sec [0 successful, 0 warnings, 1 errors]

 

 

 

        FAQ 2: The FI client on the GP side must be configured with the keytab user and the corresponding file path (see step 5 above); otherwise:

 

        ... Physical database connection acquired for: gp

         15:17:56  [SELECT - 0 row(s), 0.000 secs]  [Error Code: 0, SQL State: 38000]  ERROR: external table gphdfs protocol command ended with error. 15/01/14 15:17:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable  (seg0 slice1 linux4390:40000 pid=8466)

          Detail:

        

        Exception in thread "main" java.io.IOException: Running in secure mode, but config doesn't have a keytab

               at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:235)

               at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:206)

               at com.emc.greenplum.gpdb.hdfsconnector.ConnectorUtil.loginSecureHadoop(ConnectorUtil.java:84)

               at com.e

        Command: 'gphdfs://xxx.xxx.xxx.xxx:port/gp/ok.txt'

                External table ok, file gphdfs://xxx.xxx.xxx.xxx:port/gp/ok.txt

        ... 1 statement(s) executed, 0 row(s) affected, exec/fetch time: 0.000/0.000 sec [0 successful, 0 warnings, 1 errors]

         

        FAQ 3: krb5.conf from the Hadoop node must be copied over krb5.conf in the /etc directory of the Greenplum OS (see step 6); otherwise:

 

        ... Physical database connection acquired for: gp

         15:38:15  [SELECT - 0 row(s), 0.000 secs]  [Error Code: 0, SQL State: 38000]  ERROR: external table gphdfs protocol command ended with error. Exception in thread "main" java.lang.IllegalArgumentException: Can't get Kerberos realm  (seg1 slice1 linux4390:40001 pid=40524)

          Detail:

                

               at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:65)

               at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:248)

               at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:233)

               at org.apache.hadoop.security.UserGroupInformation.isAuthenticationMethodEnabled(UserGroupInformation.java:310)

               at org.apach

        Command: 'gphdfs://xxx.xxx.xxx.xxx:port/gp/ok.txt'

                External table ok, file gphdfs://xxx.xxx.xxx.xxx:port/gp/ok.txt

        ... 1 statement(s) executed, 0 row(s) affected, exec/fetch time: 0.000/0.000 sec [0 successful, 0 warnings, 1 errors]

