Clickhouse MergeTree异常数据处理
说明
clickhouse mergetree的数据文件如果遇到数据损坏,可能会导致clickhouse无法启动。
本文章说明如何处理这类问题。
测试
我们先人为模拟破坏mergetree数据文件:
detach table:
ck01 :) detach table metrics; DETACH TABLE metrics Query id: bb7f334b-5203-4040-8282-eb45b01b1b72 Ok. 0 rows in set. Elapsed: 0.001 sec.
清空data.mrk文件
root@ck01:/data/clickhouse/clickhouse/data/local/metrics/20221129_12_12_0# ls -l total 36 -r--r----- 1 root root 251 Dec 6 05:37 checksums.txt -r--r----- 1 root root 129 Dec 6 05:37 columns.txt -r--r----- 1 root root 1 Dec 6 05:37 count.txt -r--r----- 1 root root 164 Dec 6 05:37 data.bin -r--r----- 1 root root 176 Dec 6 05:37 data.mrk3 -r--r----- 1 root root 10 Dec 6 05:37 default_compression_codec.txt -r--r----- 1 root root 8 Dec 6 05:37 minmax_tt.idx -r--r----- 1 root root 4 Dec 6 05:37 partition.dat -r--r----- 1 root root 24 Dec 6 05:37 primary.idx root@ck01:/data/clickhouse/clickhouse/data/local/metrics/20221129_12_12_0# echo > data.mrk3
配置参数
max_suspicious_broken_parts默认为10,表示可以容忍10个part数据异常。这里我们将max_suspicious_broken_parts设置为0。
root@ck01:/data/log/clickhouse-server# cd /data/etc/clickhouse-server/config.d/ root@ck01:/data/etc/clickhouse-server/config.d# cat max_suspicious_broken_parts.xml <?xml version="1.0"?> <yandex> <merge_tree> <max_suspicious_broken_parts>0</max_suspicious_broken_parts> </merge_tree> </yandex>
启动实例
2022.12.06 06:50:39.749105 [ 64066 ] {} <Error> Application: DB::Exception: Suspiciously many (1 parts, 592.00 B in total) broken parts to remove while maximum allowed broken parts count is 0. You can change the maximum value with merge tree setting 'max_suspicious_broken_parts' in <merge_tree> configuration section or in table settings in .sql file (don't forget to return setting back to default value): Cannot attach table `local`.`metrics` from metadata file /data/clickhouse/clickhouse/store/8a5/8a567911-82c1-402d-8e61-76dd938e89ef/metrics.sql from query ATTACH TABLE local.metrics UUID 'def88518-fd7b-418d-a7dd-6564e38bba39' (`tt` DateTime, `tags` Map(String, String), `metric` String, `value` Float64, `str_value` String) ENGINE = MergeTree PARTITION BY toYYYYMMDD(tt) ORDER BY (metric, tt) SETTINGS index_granularity = 8192, max_suspicious_broken_parts = 0 2022.12.06 06:50:39.749145 [ 64066 ] {} <Information> Application: shutting down 2022.12.06 06:50:39.749151 [ 64066 ] {} <Debug> Application: Uninitializing subsystem: Logging Subsystem 2022.12.06 06:50:39.749253 [ 64067 ] {} <Information> BaseDaemon: Stop SignalListener thread
这时候,我们发现,实例已经无法正常启动了。
解决方法
方法1:
如果可以接受数据丢失,可以将参数max_suspicious_broken_parts设置得大一些
<?xml version="1.0"?> <yandex> <merge_tree> <max_suspicious_broken_parts>1000</max_suspicious_broken_parts> </merge_tree> </yandex>
方法2:
使用force_restore_data标记文件
touch /data/clickhouse/clickhouse/flags/force_restore_data clickhouse-server --config-file /data/etc/clickhouse-server/config.xml --daemon
启动后,可以从detached_parts表中查看detach的part信息:
ck01 :) select * from detached_parts where table='metrics'; SELECT * FROM detached_parts WHERE table = 'metrics' Query id: c6f61369-259d-4c07-b730-7572c3095bad ┌─database─┬─table───┬─partition_id─┬─name─────────────────────────────┬─disk────┬─reason──────────┬─min_block_number─┬─max_block_number─┬─level─┐ │ local │ metrics │ 20221129 │ broken-on-start_20221129_12_12_0 │ default │ broken-on-start │ 12 │ 12 │ 0 │ └──────────┴─────────┴──────────────┴──────────────────────────────────┴─────────┴─────────────────┴──────────────────┴──────────────────┴───────┘ 1 row in set. Elapsed: 0.002 sec.
异常数据被移到了detached目录
ls detached/ broken-on-start_20221129_12_12_0
如果不允许丢失数据,需要从备份中恢复数据。
可以使用alter table attach part命令从备份中将part恢复出来。