EMR 配置 Hive on Spark

芒果1年前技术文章462

Hive3 on spark 集成
前置条件
hadoop yarn环境正常
oracle jdk 1.8版本
1、spark2 下载准备
https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-without-hadoop.tgz
解压到opt目录
hive3环境配置
export SPARK_HOME=/opt/spark-2.4.5-bin-without-hadoop
export SPARK_CONF_DIR=/opt/spark-2.4.5-bin-without-hadoop/conf
export SPARK_DIST_CLASSPATH=$HADOOP_HOME/etc/hadoop/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/lib/*
hive-site.xml
  <!-- Spark2 依赖库位置,在YARN 上运行的任务需要从HDFS 中查找依赖jar 文件 -->
  <property>
    <name>spark.yarn.jars</name>
    <value>${fs.defaultFS}/spark-jars/*</value>
  </property>
  
  <!-- Hive3 和Spark2 连接超时时间 -->
  <property>
    <name>hive.spark.client.connect.timeout</name>
    <value>30000ms</value>
  </property>
<property>
        <name>spark.executor.cores</name>
        <value>1</value>
    </property>
    <property>
        <name>spark.executor.memory</name>
        <value>1g</value>
    </property>
    <property>
        <name>spark.driver.memory</name>
        <value>1g</value>
    </property>
    <property>
        <name>spark.yarn.driver.memoryOverhead</name>
        <value>102</value>
    </property>
<property>
    <name>spark.shuffle.service.enabled</name>
    <value>true</value>
</property>
<property>
    <name>spark.eventLog.enabled</name>
    <value>true</value>
</property
spark-defaults.conf
spark.master=yarn
spark.eventLog.dir=hdfs:///user/spark/applicationHistory
spark.eventLog.enabled=true
spark.executor.memory=1g
spark.driver.memory=1g
spark 依赖库配置
cd /opt/spark-2.4.5-bin-without-hadoop/jars
mv orc-core-1.5.5-nohive.jar orc-core-1.5.5-nohive.jar.bak
//上传jar包到hdfs
hdfs dfs -rm -r -f /spark-jars
hdfs dfs -mkdir /spark-jars
cd /opt/spark-2.4.5-bin-without-hadoop/jars
hdfs dfs -put * /spark-jars
hdfs dfs -ls /spark-jars
hdfs dfs -mkdir /user/spark
hdfs dfs -mkdir /user/spark/applicationHistory
//拷贝jar包到hive
cp scala-compiler-2.11.12.jar scala-library-2.11.12.jar scala-reflect-2.11.12.jar spark-core_2.11-2.4.5.jar spark-network-common_2.11-2.4.5.jar spark-unsafe_2.11-2.4.5.jar spark-yarn_2.11-2.4.5.jar /opt/dtstack/Hive/hive_pkg/lib/
连接beeline进行测试
cd $HIVE_HOME
./bin/beeline -u 'jdbc:hive2://emr1:10000/default'
set hive.execution.engine=spark;
insert into  test1 values(1); 

EB4FC879-09D5-4FEF-B490-6D1291452B59.png



Ps:
问题1:
java.lang.IllegalArgumentException: Required executor memory (1024), overhead (384 MB), and PySpark memory (0 MB) is above the max threshold (1024 MB) of this cluster
解决问题1:
Hive-site:
<property>
        <name>spark.yarn.executor.memoryOverhead</name>
        <value>4096</value>
    </property>


相关文章

em升级&添加节点实践

em升级&添加节点实践

一、扩容前准备 1.格式化磁盘分区并挂载(1)设置gpt分区表          &nbs...

scylladb通过扩缩容节点迁移数据

环境: Scyllsdb版本:4.2一、上线新节点1、确认集群状态和检查配置· 首先确认集群各节点状态是Up Normal (UN),[root@172-16-121-153 scylla]# nod...

mysql高可用半同步配置(二)

一、配置半同步1.1、部署半同步:#首先判断MySQL服务器是否支持动态增加插件mysql> select @@have_dynamic_loading#确认支持动态增加插件后,检查MySQL的...

Hive优化之Spark执行引擎的参数优化(二)

Hive优化之Spark执行引擎的参数优化(二)

        Hive是大数据领域常用的组件之一,主要是大数据离线数仓的运算,关于Hive的性能调优在日常工作和面试中...

Flume使用案例之实时读取本地文件到HDFS

Flume实时读取本地文件到HDFS1.  创建flume-hdfs.conf文件# 1 agenta2.sources = r2a2.sinks = k2a2.channels = c2 # 2 s...

CDP实操--Ranger Tag-based策略验证(四)

CDP实操--Ranger Tag-based策略验证(四)

1.1Ranger Tag-based策略验证在Ranger webui里给allan_admin和sam_sec用户赋权,给予添加classification的权限使用allan_admin或者sa...

发表评论    

◎欢迎参与讨论,请在这里发表您的看法、交流您的观点。