使用clickhouse-copier迁移数据

俊达3年前 (2022-12-29)技术文章5464

说明

clickhouse-copier是clickhouse官方提供的一个数据迁移工具。支持将clickhouse表从一个集群迁移到另外一个集群。

使用clickhouse-copier有一些限制条件：

迁移过程中，源端表不能进行数据写入。如果迁移的过程中源表有数据写入，则迁移的数据可能会不一致。

目标库需要先建好。目标表可以自动创建

如果目标表是复制表（replicated)，建议提前在所有的副本上创建表结构

clickhouse-copier有一个优点：源集群和目标集群在架构上可以有差异。比如目标集群的分片数可以和源端不一样。可以通过clickhouse-copier来实现数据的重新平衡。

使用说明

官方说明文档比较精简：https://clickhouse.com/docs/en/operations/utilities/clickhouse-copier

使用clickhouse-copier需要准备两个配置文件：

keeper.xml: zookeeper连接信息。clickhouse-copier通过zookeeper来协调数据迁移任务，更新状态。

task.xml: 迁移任务。用于配置迁移任务的具体信息，包括

源集群信息
目标集群信息
源表信息
目标表信息
任务并发度
其他参数

keeper.xml

clickhouse-copier连接的zookeeper信息。这个zookeeper和源端和目标端的clickhouse-server连接的zookeeper没有关系。

<clickhouse>
    <logger>
        <level>trace</level>
        <size>100M</size>
        <count>3</count>
    </logger>

    <zookeeper>
        <node index="1">
            <host>ck03</host>
            <port>2181</port>
        </node>
    </zookeeper>
</clickhouse>

task.xml

迁移任务信息

<clickhouse>
<!-- Configuration of clusters as in an ordinary server config -->
<remote_servers>
<source_cluster>
	<secret>foo</secret>
<shard>
<internal_replication>false</internal_replication>
<replica>
<host>172.16.121.248</host>
<port>9000</port>
</replica>
</shard>
<shard>
<internal_replication>false</internal_replication>
<replica>
<host>172.16.121.48</host>
<port>9000</port>
</replica>
</shard>
</source_cluster>
<dest_cluster>
	<secret>foo</secret>
<shard>
<internal_replication>false</internal_replication>
<replica>
<host>172.16.121.67</host>
<port>9000</port>
</replica>
</shard>
<shard>
<internal_replication>false</internal_replication>
<replica>
<host>172.16.121.243</host>
<port>9000</port>
</replica>
</shard>
</dest_cluster>
</remote_servers>
<!-- How many simultaneously active workers are possible. If you run more workers superfluous workers will sleep. -->
<max_workers>1</max_workers>
<!-- Setting used to fetch (pull) data from source cluster tables -->
<settings_pull>
<readonly>1</readonly>
</settings_pull>
<!-- Setting used to insert (push) data to destination cluster tables -->
<settings_push>
<readonly>0</readonly>
</settings_push>
<!-- Common setting for fetch (pull) and insert (push) operations. Also, copier process context uses it.
They are overlaid by <settings_pull/> and <settings_push/> respectively. -->
<settings>
<connect_timeout>3</connect_timeout>
<!-- Sync insert is set forcibly, leave it here just in case. -->
<insert_distributed_sync>1</insert_distributed_sync>
</settings>
<tables>
<!-- A table task, copies one table. -->
<table_1>
<!-- Source cluster name (from <remote_servers/> section) and tables in it that should be copied -->
<cluster_pull>source_cluster</cluster_pull>
<database_pull>src_db</database_pull>
<table_pull>oracle_sql_stat</table_pull>
<!-- Destination cluster name and tables in which the data should be inserted -->
<cluster_push>dest_cluster</cluster_push>
<database_push>dest_db</database_push>
<table_push>oracle_sql_stat</table_push>
	<engine>
ENGINE=ReplicatedMergeTree('/clickhouse/tables/{cluster}/{shard}/oracle_sql_stat_new_123', '{replica}')
PARTITION BY toString(_at_date_)
	ORDER BY (_at_date_, _at_timestamp_, intHash64(toInt64(toDateTime(_at_timestamp_))))
</engine>
	<number_of_splits>1</number_of_splits>
<!-- Sharding key used to insert data to destination cluster -->
<sharding_key>jumpConsistentHash(intHash64(toInt64(timestamp)), 2)</sharding_key>
	<enabled_partitions>
<partition>'2022-09-23'</partition>
</enabled_partitions>
</table_1>
</tables>
</clickhouse>

这个配置文件的内容非常重要，配置不当会导致迁移任务失败、甚至引起数据误删除。

关键配置：

remote_servers。remote_servers下定义集群信息。配置格式和clickhouse配置文件中的集群配置一样。

tables：配置具体需要迁移的表。每个待迁移的表都需要配置相关信息。

cluster_pull: 源端集群。copier会从源端集群的每一个shard中获取需要迁移的数据。
database_pull：源数据库
table_pull: 源表
cluster_push：目标集群。copier会根据目标集群的shard数，拆分数据。
database_push: 目标库
table_push: 目标表。目标表的表名、排序字段等可以和源表不一致。但是字段必须一样。
engine：目标表的存储引擎。目标表如果不存在，copier会自动创建表。如果目标表存在，则表的实际engine定义必须和配置文件中的保持一致。
sharding_key：目标数据拆分规则。
enabled_partitions：需要同步的分区列表。如果不填，则默认同步所有分区的数据。

clickhouse-copier命令行

使用clickhouse-copier来启动数据迁移任务：

clickhouse-copier --config keeper.xml --task-path /clickhouse-copier/tasks_1001 --task-file task.xml

copier命令行关键参数：

config: zookeeper配置文件

task-path: 存储迁移任务的zookeeper路径。需要为每个迁移任务指定单独的zookeeper路径。

task-file: 如果命令行传入该参数，则会自动将任务写入到对应的路径(task-path/description)。如果不指定task-file，可以使用zkCli.sh将task.xml写入到zookeeper指定路径。

可以在zkCli.sh中通过 get /task-path/description 查看任务信息

使用clickhouse-copier迁移增量数据

clickhouse-copier可以通过指定partiton的方式来迁移数据。

建议的迁移步骤：

1、清空目标表指定分区的数据。（使用alter table tabname drop partiton命令)

2、创建任务，迁移指定分区的数据。（task.xml中配置enabled_partitions)

3、验证数据。校验表的记录数(select count(*) from tabname group by partition_columns)。

注意事项

1、建议将copier部署在源表所在服务器，减少从源表获取数据的网络开销

2、如果源表在多个shard上，存在数据完全一样的part，迁移任务可能会报错

如果源表在多个shard上，存在数据完全一样的part，迁移过程中可能会报如下错误：

Application: Code: 999. DB::Exception: Received from 172.16.121.67:9000.
 Coordination::Exception. Coordination::Exception: 
 Transaction failed (Node exists):
  Op #4, path: /clickhouse/tables/dest-cluster/3/oracle_sql_stat_new_123/blocks/16c630bb756b97c9b78756f85b8c026a_replace_from_7D81C37BD16D50E6781B37A0187E5C35. Stack trace:

这个报错由迁移数据的过程中执行的下列SQL引起：

ALTER TABLE dest_db.oracle_sql_stat ATTACH PARTITION '2022-09-23' FROM dest_db.oracle_sql_stat_piece_0

3、迁移过程中，如果源表发生数据变化，则迁移的数据可能会和源表不一致

4、如果目标表在迁移前已经有部分数据，则迁移后可能会发生几种情况：

原有数据被清空，被新的数据替换
原有数据不变，新的数据也会迁移进来。

5、源表数据不变的情况下，同一个表迁移多次，数据可能和迁移一次效果一样。

返回列表

上一篇：如何重塑IT运维核心竞争力？可观测运维这么做！

下一篇：docker日志管理

Kubernetes 调度策略

1、资源调度创建一个新的 Pod 资源对象时，该 Pod 应该运行在哪个工作节点，由 Scheduler 调度器进行调度。Scheduler 是一个守护进程，其内部有很多调度算法。Pod 调度流程：当...

Doris部署介绍

标准部署该文档主要介绍了部署 Doris 所需软硬件环境、建议的部署方式、集群扩容缩容，以及集群搭建到运行过程中的常见问题。在阅读本文档前，请先根据编译文档编译 Doris。软硬件需求概述Doris...

开源大数据集群部署（五）Freeipa客户端部署及配置（kerberos+ldap）

1、安装IPA-Client安装（每个节点均需安装）yum install ipa-client -y2、执行安装命令ipa-client-install3、执行完成后进入交互阶段4、安装完成后查看页...

Redis 运维规范_运维管理规范

三、运维管理规范1、密码认证云上 Redis 的权限控制：账号管理+白名单设置+阿里云子账号权限。对于线下 Redis 可以通过设置密码和 bind 参数文件控制访问。2、合理设置备份策略 Redi...

服务器cpu异常抖动排查

问题背景当天下午某某客户服务器cpu使用率80%报警持续10分钟左右后又恢复了，联系客户排查，有出现不同的应用服务器都有收到cpu报警，而且出现这样的抖动不止一次，每隔一段时间就会出现类似...

Hive与HBase整合文档

1.Hive整合HBase配置1.拷贝hbase 相关jar包将hbase 相关jar包拷贝到hive lib目录下hbase-client-0.98.13-hadoop2.jar hbase-co...

使用clickhouse-copier迁移数据

说明

使用说明

keeper.xml

task.xml

使用clickhouse-copier迁移增量数据

注意事项

相关文章

Kubernetes 调度策略

Doris部署介绍

开源大数据集群部署（五）Freeipa客户端部署及配置（kerberos+ldap）

Redis 运维规范_运维管理规范

服务器cpu异常抖动排查

Hive与HBase整合文档

发表评论

©Copyrights 2016-2022 YUNCHE 浙ICP备2021017017号

使用clickhouse-copier迁移数据

说明

使用说明

keeper.xml

task.xml

使用clickhouse-copier迁移增量数据

注意事项

相关文章

Kubernetes 调度策略

Doris部署介绍

开源大数据集群部署（五）Freeipa客户端部署及配置（kerberos+ldap）

Redis 运维规范_运维管理规范

服务器cpu异常抖动排查

Hive与HBase整合文档

发表评论 取消回复

©Copyrights 2016-2022 YUNCHE 浙ICP备2021017017号var _hmt = _hmt || [];(function() { var hm = document.createElement("script"); hm.src = "https://hm.baidu.com/hm.js?dcf8139ce75b768b71dccc5e589b983c"; var s = document.getElementsByTagName("script")[0]; s.parentNode.insertBefore(hm, s);})();

发表评论

©Copyrights 2016-2022 YUNCHE 浙ICP备2021017017号