A Practical Guide to Loading HDFS Files into Greenplum via External Tables


Environment Overview

(1) Greenplum database version

Greenplum Database 4.3.0.0 POC3 build 45206 (based on PostgreSQL 8.2.15)

GP package: greenplum-db-4.3.0.0-SuSE10-x86_64.zip

 

(2)     Huawei Hadoop版本号为:

root@hadoop1:~> hadoop version

Hadoop 2.2

 

Greenplum Node Configuration

(1) Deploy the Hadoop client on every Greenplum DB node

Method 1: use the IDeploy tool to install Hadoop on every GPDB node.

Method 2: copy the existing Hadoop directory from a Hadoop client node to the GPDB nodes.

Here I only demonstrate Method 2 (recommended, since it is the simpler option):

A. Check the Hadoop installation path

root@hadoopcli1:~/hadoop> pwd

/opt/hadoop/     ----- Hadoop installation path

 

B. Run a remote copy to place the entire /home/hadoop/hadoop directory into the designated directory on every GPDB node (the Hadoop directory must be identical on all GP nodes):

scp -r /home/hadoop/ gpadmin@xxx.xxx.xxx.xxx:/home/gpadmin/
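
Since the directory must be identical on every node, the copy can simply be looped over a host list. A minimal sketch, assuming a hypothetical file /home/gpadmin/gp_hosts that lists one GPDB node per line:

while read -r host; do
  # one recursive scp per GPDB node from the hypothetical host list
  scp -r /home/hadoop/ gpadmin@"$host":/home/gpadmin/
done < /home/gpadmin/gp_hosts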

 

       Note: when GPDB loads data files from the Hadoop file system over the gphdfs protocol, it relies on a number of Hadoop jars. These jars mostly live under /opt/hadoop/share/hadoop, including the HDFS, mapreduce, and yarn jars.

 

(2) Deploy the JDK on every Greenplum DB node (the JDK version must match the one used by Hadoop)

The JDK was already deployed when Hadoop was set up.

        For example, the path:

    root@hadoop1:~> echo $JAVA_HOME

/opt/java

 

You can copy the existing JDK installation directory from the Hadoop environment to the GPDB nodes.

As with the Hadoop client above, copying the existing directory is the simpler option:

scp -r /opt/java gpadmin@xxx.xxx.xxx.xxx:/opt/gpadmin/
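
After copying, it is worth confirming the JDK runs on every node. A hedged check over the same hypothetical host list, using the path the scp above produces (/opt/gpadmin/java):

while read -r host; do
  # verify the copied JDK starts on each GPDB node
  ssh gpadmin@"$host" /opt/gpadmin/java/bin/java -version
done < /home/gpadmin/gp_hosts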

 

(3) Configure the environment file through which GPDB loads the Hadoop jars (modify this on every GPDB node)

Edit the configuration file /usr/local/greenplum-db/lib/hadoop/hadoop_env.sh

Add the following jar-loading snippet to hadoop_env.sh:

 

for f in $HADOOP_HOME/share/hadoop/bdi/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done

for f in $HADOOP_HOME/share/hadoop/mapreduce/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done

for f in $HADOOP_HOME/share/hadoop/mapreduce/lib/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done

for f in $HADOOP_HOME/share/hadoop/yarn/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done

for f in $HADOOP_HOME/share/hadoop/yarn/lib/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done

for f in $HADOOP_HOME/share/hadoop/hdfs/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done

for f in $HADOOP_HOME/share/hadoop/hdfs/lib/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done

for f in $HADOOP_HOME/share/hadoop/tools/lib/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done

for f in $HADOOP_HOME/share/hadoop/common/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done

for f in $HADOOP_HOME/share/hadoop/common/lib/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done

for f in $HADOOP_HOME/hadoop/common/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done

 

In addition, change the following line in the file:

JAVA_PLATFORM=`CLASSPATH=${CLASSPATH} ${JAVA} -Xmx32m ${HADOOP_JAVA_PLATFORM_OPTS} org.apache.hadoop.util.PlatformName | sed -e "s/ /_/g"`

to:

JAVA_PLATFORM=`CLASSPATH=${CLASSPATH} ${JAVA} -Xmx32m -classpath /usr/local/greenplum-db/lib/hadoop/hadoop-auth-*.jar org.apache.hadoop.util.PlatformName | sed -e "s/ /_/g"`
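
To sanity-check the edit, you can source the file and confirm the Hadoop jars actually land on the classpath; a rough sketch, assuming HADOOP_HOME is already exported (see step (4) below):

# source the edited env file, then count the .jar entries on CLASSPATH
source /usr/local/greenplum-db/lib/hadoop/hadoop_env.sh
echo "$CLASSPATH" | tr ':' '\n' | grep -c '\.jar$'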

 

 

(4) Add environment variables on all GPDB nodes

Edit the .bashrc environment file in the home directory of the user that runs Greenplum DB, adding the following:

export JAVA_HOME=/home/gpadmin/jdk*

export CLASSPATH=$JAVA_HOME/lib/tools.jar

export HADOOP_HOME=/opt/hadoop

export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
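
Apply the changes and verify; if both commands resolve, the PATH edits took effect:

source ~/.bashrc
java -version
hadoop version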

 

(5) Edit the /data/master/gpseg-1/postgresql.conf file on the Master node, adding the following:

gp_hadoop_home='/opt/dbadmin/hadoop/'

gp_hadoop_target_version='gphd-2.0'
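
Alternatively, the same two parameters can be pushed out with the gpconfig utility instead of hand-editing postgresql.conf; a sketch with the values from above (the nested quotes are how string-valued settings are passed):

gpconfig -c gp_hadoop_home -v "'/opt/dbadmin/hadoop/'"
gpconfig -c gp_hadoop_target_version -v "'gphd-2.0'"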

 

(6) Restart the GPDB database

gpstop -r
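
Once the database is back up, a quick check that the new settings are live:

psql -c "show gp_hadoop_home;"
psql -c "show gp_hadoop_target_version;"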

 

Testing External Table Creation

(1) Prepare test data in Hadoop HDFS

List all the prepared data files and their sizes (32 files):

hadoop fs -ls -R -h /test_file/test_file_gb_1.dat-*

Count the total number of lines in the files:

hadoop fs -cat /test_file/test_file_gb_1.dat-* | wc -l

4174417

 

(2) Connect to the GPDB database and create the external table

CREATE EXTERNAL TABLE test
(
    cell_name       text,
    pro_name        text,
    region_name     text,
    city_name       text,
    city_id         text
)
LOCATION ('gphdfs://[ip]:[port]/test_file/test_file_gb_1.dat-*')
FORMAT 'text' (delimiter '|')
ENCODING 'UTF8'
LOG ERRORS INTO wide_table_error SEGMENT REJECT LIMIT 1000000 ROWS;
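
With LOG ERRORS INTO, malformed rows are diverted into the wide_table_error table (created automatically if it does not exist) instead of failing the query. To see why rows were rejected, a query along these lines should work; the column names follow the standard GPDB 4.x error-table layout:

psql -c "select filename, linenum, errmsg, rawdata from wide_table_error limit 5;"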

 

(3) Check the row count visible through the external table

select count(1) from test;

NOTICE:  Found 4 data formatting errors (4 or more input rows). Rejected related input data.

count 

---------

4174417           ---- the contents of all the files are visible through the external table; no data was lost

(1 row)

 

Loading the External Table Data into a Fact Table

(1) Create the table

 

CREATE TABLE fact_test
(
    cell_name       text,
    pro_name        text,
    region_name     text,
    city_name       text,
    city_id         text
) distributed by (city_name, city_id);

 

(2) Insert the data via the external table

noas=# insert into fact_test select * from test;

NOTICE:  Found 4 data formatting errors (4 or more input rows). Rejected related input data.

INSERT 0 4174417       ------ all the data is loaded into the fact table

(3) Query

Query the first 10 rows of the fact table to confirm the data is present:

select * from fact_test limit 10;
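
Since fact_test is hash-distributed on (city_name, city_id), it is also worth checking that the rows spread evenly across segments; a quick skew check using the built-in gp_segment_id column:

psql -c "select gp_segment_id, count(*) from fact_test group by 1 order by 1;"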

 

GP External Tables Accessing HDFS Files on FI:

1. Install the GP master and segments; user/password gpadmin/gpadmin; home directory: /opt/gpadmin

2. Install the FI client on the GP master and segment nodes, e.g. under /opt/gpadmin/hadoopcli

3. On the GP master node, configure:

            In /opt/greenplum/master/gpseg-1, open postgresql.conf and uncomment:

            gp_external_enable_exec = on   # enable external tables with EXECUTE.

            gp_external_grant_privileges = on #enable create http/gpfdist for non su's

             

            Add:

            gp_hadoop_home='/opt/gpadmin/hadoopcli/HDFS/hadoop'

            gp_hadoop_target_version='gphd-2.0'

 

4. On the GP master and segment nodes, edit .bashrc in gpadmin's home directory and source it to take effect:

 

            export LC_ALL=en_US.utf-8
            export LANG=en_US.utf-8
            source /usr/local/greenplum-db/greenplum_path.sh
            export MASTER_DATA_DIRECTORY=/opt/greenplum/master/gpseg-1
            export PGPORT=5432
            export PGUSER=gpadmin

            export JAVA_HOME=/opt/java
            export CLASSPATH=$JAVA_HOME/lib/tools.jar
            export HADOOP_HOME=/opt/gpadmin/hadoopcli/HDFS/hadoop
            export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
            # the following is needed only on a secure (Kerberos) cluster
            export LD_LIBRARY_PATH=/opt/kerberos/lib:${LD_LIBRARY_PATH}

             

5. In the FI client installed on the GP master and segment nodes, edit /opt/gpadmin/hadoopcli/HDFS/hadoop/etc/hadoop/hdfs-site.xml and add the configuration below. The hdfs.keytab file is obtained from the hadoop server side and placed on the client at the path given in the keytab property:

 

            <property>
              <name>com.emc.greenplum.gpdb.hdfsconnector.security.user.name</name>
              <value>hdfs/hadoop@HADOOP.COM</value>
            </property>
            <property>
              <name>com.emc.greenplum.gpdb.hdfsconnector.security.user.keytab.file</name>
              <value>/opt/gpadmin/hdfs.keytab</value>
            </property>
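
Before involving GP at all, it is worth confirming that the keytab and principal from the properties above actually authenticate against the cluster:

kinit -kt /opt/gpadmin/hdfs.keytab hdfs/hadoop@HADOOP.COM
hadoop fs -ls /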

 

6. On the GP master and segment nodes, copy /etc/krb5.conf from a hadoop node over the operating system's /etc/krb5.conf.
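
For example (assuming root access to the hadoop1 node seen earlier):

scp root@hadoop1:/etc/krb5.conf /etc/krb5.conf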

 

            Note: after all the configuration on GP is in place, restart GP.

 

            After the six steps above, you can create an external table in GP and query it from GP:

 

(Figure 1: screenshot of the external table query result in GP)

 

    FAQ 1: Correct environment variables not set, or the FI client not installed properly (see steps 1-4 above):

        14:02:16  [SELECT - 0 row(s), 0.000 secs]  [Error Code: 0, SQL State: 38000]  ERROR: external table gphdfs protocol command ended with error. 15/01/14 14:02:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable  (seg5 slice1 linux4390:40005 pid=23929)

          Detail:

        

        Exception in thread "main" org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]

               at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

               at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)

               at sun.reflect.DelegatingConstructorAcc

        Command: 'gphdfs://xxx.xxx.xxx.xxx:port/gp/ok.txt'

                External table ok, file gphdfs://xxx.xxx.xxx.xxx:port/gp/ok.txt

        ... 1 statement(s) executed, 0 row(s) affected, exec/fetch time: 0.000/0.000 sec [0 successful, 0 warnings, 1 errors]

 

 

 

    FAQ 2: The FI client on the GP side must be configured with the user and keytab file path for reading the keytab (see step 5 above); otherwise:

 

        ... Physical database connection acquired for: gp

         15:17:56  [SELECT - 0 row(s), 0.000 secs]  [Error Code: 0, SQL State: 38000]  ERROR: external table gphdfs protocol command ended with error. 15/01/14 15:17:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable  (seg0 slice1 linux4390:40000 pid=8466)

          Detail:

        

        Exception in thread "main" java.io.IOException: Running in secure mode, but config doesn't have a keytab

               at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:235)

               at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:206)

               at com.emc.greenplum.gpdb.hdfsconnector.ConnectorUtil.loginSecureHadoop(ConnectorUtil.java:84)

               at com.e

        Command: 'gphdfs://xxx.xxx.xxx.xxx:port/gp/ok.txt'

                External table ok, file gphdfs://xxx.xxx.xxx.xxx:port/gp/ok.txt

        ... 1 statement(s) executed, 0 row(s) affected, exec/fetch time: 0.000/0.000 sec [0 successful, 0 warnings, 1 errors]

         

    FAQ 3: The krb5.conf from a hadoop node must be copied over krb5.conf in /etc on the greenplum OS (see step 6); otherwise:

 

        ... Physical database connection acquired for: gp

         15:38:15  [SELECT - 0 row(s), 0.000 secs]  [Error Code: 0, SQL State: 38000]  ERROR: external table gphdfs protocol command ended with error. Exception in thread "main" java.lang.IllegalArgumentException: Can't get Kerberos realm  (seg1 slice1 linux4390:40001 pid=40524)

          Detail:

                

               at org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:65)

               at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:248)

               at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:233)

               at org.apache.hadoop.security.UserGroupInformation.isAuthenticationMethodEnabled(UserGroupInformation.java:310)

               at org.apach

        Command: 'gphdfs://xxx.xxx.xxx.xxx:port/gp/ok.txt'

                External table ok, file gphdfs://xxx.xxx.xxx.xxx:port/gp/ok.txt

        ... 1 statement(s) executed, 0 row(s) affected, exec/fetch time: 0.000/0.000 sec [0 successful, 0 warnings, 1 errors]

