Online Apache HBase Backups with CopyTable

Source: http://blog.cloudera.com/blog/2012/06/online-hbase-backups-with-copytable-2/

CopyTable is a simple Apache HBase utility that, unsurprisingly, can be used for copying individual tables within an HBase cluster or from one HBase cluster to another. In this blog post, we’ll talk about what this tool is, why you would want to use it, how to use it, and some common configuration caveats.

Use cases:

CopyTable is at its core an Apache Hadoop MapReduce job that uses the standard HBase Scan read-path interface to read records from an individual table and writes them to another table (possibly on a separate cluster) using the standard HBase Put write-path interface. It can be used for many purposes:

  • Internal copy of a table (Poor man’s snapshot)

  • Remote HBase instance backup

  • Incremental HBase table copies

  • Partial HBase table copies and HBase table schema changes

Assumptions and limitations:

The CopyTable tool has some basic assumptions and limitations. First, if being used in the multi-cluster situation, both clusters must be online and the target instance needs to have the target table present with the same column families defined as the source table.

Since the tool uses standard scans and puts, the target cluster doesn’t have to have the same number of nodes or regions. In fact, it can have different numbers of tables, different numbers of region servers, and completely different region split boundaries. Since we are copying entire tables, you can use performance optimization settings such as larger scanner caching values for more efficiency. Using the put interface also means that copies can be made between clusters of different minor versions (0.90.4 -> 0.90.6, CDH3u3 -> CDH3u4) or versions that are wire compatible (0.92.1 -> 0.94.0).
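As a rough sketch of the scanner caching tip, the helper below only prints a CopyTable command line with tuning options so it can be reviewed before running. The function name copytable_tuned_cmd is hypothetical, and the sketch assumes your CopyTable version accepts Hadoop generic -D options placed right after the class name, which may not hold for older releases. Check the tool's usage output first.

```shell
# Print (do not run) a CopyTable command with scan tuning options.
# Assumption: this CopyTable build parses Hadoop generic -D options
# given directly after the class name.
copytable_tuned_cmd() {
  local caching="${1:?scanner caching value required}"
  shift
  local base="hbase org.apache.hadoop.hbase.mapreduce.CopyTable"
  # Larger scanner caching fetches more rows per scan RPC.
  local opts="-Dhbase.client.scanner.caching=${caching}"
  # Speculative map tasks would only duplicate the copy work.
  opts="${opts} -Dmapred.map.tasks.speculative.execution=false"
  printf '%s\n' "${base} ${opts} $*"
}

# Example: copytable_tuned_cmd 100 --new.name=tableCopy tableOrig
```

The value 100 is only an illustration. A suitable caching value depends on row size and available client memory.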

Finally, HBase only provides row-level ACID guarantees. This means that while a CopyTable job is running, rows may be newly inserted or updated, and these concurrent edits will either be completely included or completely excluded. While each row will be consistent, there are no guarantees about the consistency, causality, or order of puts across different rows.

Internal copy of a table (Poor man’s snapshot)

Versions of HBase up to and including the most recent 0.94.x versions do not support table snapshotting. Despite HBase’s ACID limitations, CopyTable can be used as a naive snapshotting mechanism that makes a physical copy of a particular table.

Let’s say that we have a table, tableOrig with column-families cf1 and cf2. We want to copy all its data to tableCopy. We need to first create tableCopy with the same column families:

srcCluster$ echo "create 'tableCopy', 'cf1', 'cf2'" | hbase shell

We can then copy the table under a new name on the same HBase instance:

srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=tableCopy tableOrig

This starts an MR job that will copy the data.

Remote HBase instance backup

Let’s say we want to copy data to another cluster. This could be a one-off backup, a periodic job or could be for bootstrapping for cross-cluster replication. In this example, we’ll have two separate clusters: srcCluster and dstCluster.

In this multi-cluster case, CopyTable is a push process — your source will be the HBase instance your current hbase-site.xml refers to and the added arguments point to the destination cluster and table. This also assumes that all of the MR TaskTrackers can access all the HBase and ZK nodes in the destination cluster. This mechanism for configuration also means that you could run this as a job on a remote cluster by overriding the hbase/mr configs to use settings from any accessible remote cluster and specify the ZK nodes in the destination cluster. This could be useful if you wanted to copy data from an HBase cluster with lower SLAs and didn’t want to run MR jobs on them directly.

You will use the --peer.adr setting to specify the destination cluster’s ZK ensemble (i.e. the cluster you are copying to). For this we need the ZK quorum’s hostname or IP and port, as well as the HBase root ZK node for the destination HBase instance. Let’s say one of these machines is dstClusterZK (listed in hbase.zookeeper.quorum) and that we are using the default ZK client port 2181 (hbase.zookeeper.property.clientPort) and the default ZK znode parent /hbase (zookeeper.znode.parent). (Note: If you had two HBase instances using the same ZK, you’d need a different zookeeper.znode.parent for each cluster.)

# create new tableOrig on destination cluster

dstCluster$ echo "create 'tableOrig', 'cf1', 'cf2'" | hbase shell

# on source cluster run copy table with destination ZK quorum specified using --peer.adr

# WARNING: In older versions, you are not alerted about any typo in these arguments!

srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=dstClusterZK:2181:/hbase tableOrig
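The --peer.adr value used above has three colon-separated parts: the ZK quorum, the ZK client port, and the znode parent. Since a typo here may go unreported in older versions, one option is to build the value from its parts. The helper name peer_adr below is hypothetical, and the defaults are the ones named in the text (2181 and /hbase).

```shell
# Build a --peer.adr value: <zk quorum>:<zk client port>:<znode parent>.
# Port and znode parent fall back to the defaults mentioned above.
peer_adr() {
  local quorum="${1:?ZK quorum host(s) required}"
  local port="${2:-2181}"
  local parent="${3:-/hbase}"
  printf '%s:%s:%s\n' "$quorum" "$port" "$parent"
}

# Example:
#   hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
#     --peer.adr="$(peer_adr dstClusterZK)" tableOrig
```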

Note that you can use the --new.name argument with --peer.adr to copy to a differently named table on the dstCluster.

# create new tableCopy on destination cluster

dstCluster$ echo "create 'tableCopy', 'cf1', 'cf2'" | hbase shell

# on source cluster run copy table with destination --peer.adr and --new.name arguments.

srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=dstClusterZK:2181:/hbase --new.name=tableCopy tableOrig

This will copy data from tableOrig on the srcCluster to the dstCluster’s tableCopy table.

Incremental HBase table copies

Once you have a copy of a table on a destination cluster, how do you copy new data that is later written to the source cluster? Naively, you could run the CopyTable job again and copy over the entire table. However, CopyTable provides a more efficient incremental copy mechanism that copies only the rows updated within a specified window of time from the srcCluster to the backup dstCluster. Thus, after the initial copy, you could have a periodic cron job that copies only the previous hour’s data from srcCluster to dstCluster.

This is done by specifying the --starttime and --endtime arguments. Times are specified as decimal milliseconds since the Unix epoch.

# WARNING: In older versions, you are not alerted about any typo in these arguments!

# copy from beginning of time until timeEnd 

# NOTE: Must include start time for end time to be respected. start time cannot be 0.

srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable ... --starttime=1 --endtime=timeEnd ...

# Copy starting from and including timeStart until the end of time.

srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable ... --starttime=timeStart ...

# Copy entries starting at and including timeStart and ending at, but excluding, timeEnd.

srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable ... --starttime=timeStart --endtime=timeEnd
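For the hourly cron job mentioned earlier, the window boundaries have to be computed in epoch milliseconds. The following is a minimal sketch of one way to do that. The function name previous_hour_window_ms is hypothetical, and aligning the window to whole hours is an assumption made here so that consecutive runs neither overlap nor leave gaps.

```shell
# Print "<start_ms> <end_ms>" for the last fully elapsed hour before
# the given time (seconds since the Unix epoch).
# start is inclusive and end is exclusive, matching --starttime/--endtime.
previous_hour_window_ms() {
  local now_s="${1:?current time in epoch seconds required}"
  local end_s=$(( now_s / 3600 * 3600 ))   # round down to the hour
  local start_s=$(( end_s - 3600 ))
  printf '%s %s\n' "$(( start_s * 1000 ))" "$(( end_s * 1000 ))"
}

# Example:
#   window=$(previous_hour_window_ms "$(date +%s)")
#   hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
#     --starttime="${window% *}" --endtime="${window#* }" \
#     --peer.adr=dstClusterZK:2181:/hbase tableOrig
```

Because each window ends exactly where the next one starts and the end is exclusive, back-to-back hourly runs cover every timestamp once, provided no run is skipped.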

Partial HBase table copies and HBase table schema changes

By default, CopyTable will copy all column families from matching rows. CopyTable provides options for only copying data from specific column families. This could be useful for copying original source data and excluding derived-data column families that are added by follow-on processing.

By adding these arguments we only copy data from the specified column families.

  • --families=srcCf1

  • --families=srcCf1,srcCf2

Starting from 0.92.0 you can copy while changing the column family name:

  • --families=srcCf1:dstCf1

    • copy from srcCf1 to dstCf1

  • --families=srcCf1:dstCf1,dstCf2,srcCf3:dstCf3

    • copy from srcCf1 to dstCf1, copy dstCf2 to dstCf2 (no rename), and srcCf3 to dstCf3

Please note that dstCf* must be present in the dstCluster table!
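Because the destination families must already exist, it can help to list them from a --families value before creating the destination table. This is a small sketch with a hypothetical helper name, dst_families. It only parses the string and does not contact HBase.

```shell
# List the destination column families named in a --families value,
# one per line. "src:dst" entries yield dst; plain entries are unchanged.
dst_families() (
  set -f        # no glob expansion while splitting on commas
  IFS=','
  for entry in $1; do
    # Drop everything up to and including the last ':' if present.
    printf '%s\n' "${entry##*:}"
  done
)

# Example: dst_families 'srcCf1:dstCf1,dstCf2,srcCf3:dstCf3'
```

The function body runs in a subshell, so the changes to IFS and globbing do not leak into the calling shell.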

Starting from 0.94.0, new options are offered to copy delete markers and to include a limited number of overwritten versions. Previously, if a row was deleted in the source cluster, the delete would not be copied; instead, a stale version of that row would remain in the destination cluster. These options take advantage of some of the 0.94.0 release’s advanced features.

  • --versions=vers

    • where vers is the number of cell versions to copy (default is 1, i.e. the latest only)

  • --all.cells

    • also copy delete markers and deleted cells

Common Pitfalls

The HBase client in the 0.90.x, 0.92.x, and 0.94.x versions always uses zoo.cfg if it is in the classpath, even if an hbase-site.xml file specifies other ZooKeeper quorum configuration settings. This “feature” causes a problem common in CDH3 HBase because its packages default to including a directory where zoo.cfg lives in HBase’s classpath. This can and has led to frustration when trying to use CopyTable (HBASE-4614). The workaround is to exclude the zoo.cfg file from your HBase classpath and to specify ZooKeeper configuration properties in your hbase-site.xml file. See http://hbase.apache.org/book.html#zookeeper
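To check whether a stray zoo.cfg is visible on the classpath, a sketch like the following can scan the directory entries. The helper name find_zoo_cfg is hypothetical. It only looks at directories (and at the directory part of wildcard entries such as lib/*), so a zoo.cfg packed inside a jar would not be found.

```shell
# Print the path of every zoo.cfg found in a directory entry of the
# given colon-separated classpath string.
find_zoo_cfg() (
  set -f        # keep entries like "lib/*" from being glob-expanded
  IFS=':'
  for entry in $1; do
    [ -n "$entry" ] || continue
    entry="${entry%/\*}"            # "dir/*" -> "dir"
    if [ -f "$entry/zoo.cfg" ]; then
      printf '%s\n' "$entry/zoo.cfg"
    fi
  done
)

# Example: find_zoo_cfg "$(hbase classpath)"
```

No output means no zoo.cfg was found in the scanned directories.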

Conclusion

CopyTable provides simple but effective disaster recovery insurance for HBase 0.90.x (CDH3) deployments. In conjunction with the replication feature found and supported in CDH4’s HBase 0.92.x-based release, CopyTable’s incremental features become less valuable, but its core functionality is important for bootstrapping a replicated table. While more advanced features such as HBase snapshots (HBASE-50) may aid with disaster recovery once they are implemented, CopyTable will still be a useful tool for the HBase administrator.

