Hadoop 3.2.0 Rack Awareness

This article explains how Hadoop uses rack awareness to improve data availability: how rack IDs are resolved via the configuration files, how topology mapping is done with either a Java class or an external script, the differences between the two approaches, and the problems that can arise when neither is configured.


Hadoop components are rack-aware. For example, HDFS block placement uses rack awareness for fault tolerance by placing one block replica on a different rack. This provides data availability in the event of a network switch failure or partition within the cluster.

The Hadoop master daemons obtain the rack IDs of the cluster workers by invoking either an external script or a Java class, as specified by the configuration files. Whether a Java class or an external script is used for topology, the output must adhere to the org.apache.hadoop.net.DNSToSwitchMapping interface. The interface expects a one-to-one correspondence to be maintained, with topology information in the format '/myrack/myhost', where '/' is the topology delimiter, 'myrack' is the rack identifier, and 'myhost' is the individual host. Assuming a single /24 subnet per rack, one could use the format '/192.168.100.0/192.168.100.5' as a unique rack-host topology mapping.
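The '/rack/host' format described above can be sketched in a few lines. This is a minimal illustration, not Hadoop code: rack_host_mapping is a hypothetical helper, and it assumes the one-rack-per-/24-subnet convention from the paragraph above.

```python
import ipaddress

def rack_host_mapping(ip, prefix=24):
    # Assumption from the text: each rack is its own /24 subnet, so the
    # network address serves as the rack identifier.
    network = ipaddress.ip_network("{0}/{1}".format(ip, prefix), strict=False).network_address
    # '/' is the topology delimiter: '/<rack>/<host>'
    return "/{0}/{1}".format(network, ip)

print(rack_host_mapping("192.168.100.5"))  # → /192.168.100.0/192.168.100.5
```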

To use the Java class for topology mapping, the class name is specified by the net.topology.node.switch.mapping.impl parameter in the configuration file. An example, NetworkTopology.java, is included with the Hadoop distribution and can be customized by the Hadoop administrator. Using a Java class instead of an external script has a performance benefit in that Hadoop doesn't need to fork an external process when a new worker node registers itself.
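A sketch of the corresponding core-site.xml entry. The property name comes from the text above; the class name is a placeholder for an administrator-provided implementation.

```xml
<!-- core-site.xml: resolve racks with a Java class.
     com.example.MyRackResolver is hypothetical; it would implement
     org.apache.hadoop.net.DNSToSwitchMapping. -->
<property>
  <name>net.topology.node.switch.mapping.impl</name>
  <value>com.example.MyRackResolver</value>
</property>
```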

If implementing an external script, it is specified with the net.topology.script.file.name parameter in the configuration files. Unlike the Java class, the external topology script is not included with the Hadoop distribution and must be provided by the administrator. Hadoop sends multiple IP addresses to ARGV when forking the topology script. The number of IP addresses sent per invocation is controlled with net.topology.script.number.args and defaults to 100. If net.topology.script.number.args were changed to 1, a topology script would be forked for each IP submitted by DataNodes and/or NodeManagers.
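The script-based setup can be sketched in core-site.xml as follows. The property names come from the text above; the script path is an example and should be adjusted for your environment.

```xml
<!-- core-site.xml: resolve racks with an admin-provided external script.
     The path below is an example. -->
<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/topology.py</value>
</property>
<property>
  <name>net.topology.script.number.args</name>
  <!-- default: up to 100 IPs are passed per script invocation -->
  <value>100</value>
</property>
```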

If neither net.topology.script.file.name nor net.topology.node.switch.mapping.impl is set, the rack ID '/default-rack' is returned for any passed IP address. While this behavior may appear desirable, it can cause issues with HDFS block replication: the default behavior is to write one replicated block off rack, and this cannot be done when there is only a single rack named '/default-rack'.

python Example

#!/usr/bin/env python3
# this script makes assumptions about the physical environment.
#  1) each rack is its own layer 3 network with a /24 subnet, which
# could be typical where each rack has its own
#     switch with uplinks to a central core router.
#
#             +-----------+
#             |core router|
#             +-----------+
#            /             \
#   +-----------+        +-----------+
#   |rack switch|        |rack switch|
#   +-----------+        +-----------+
#   | data node |        | data node |
#   +-----------+        +-----------+
#   | data node |        | data node |
#   +-----------+        +-----------+
#
# 2) topology script gets list of IP's as input, calculates network address, and prints '/network_address/ip'.

import netaddr                                                   # third-party library: pip install netaddr
import sys
sys.argv.pop(0)                                                  # discard name of topology script from argv list as we just want IP addresses

netmask = '255.255.255.0'                                        # set netmask to what's being used in your environment.  The example uses a /24

for ip in sys.argv:                                              # loop over list of datanode IP's
    address = '{0}/{1}'.format(ip, netmask)                      # format address string so it looks like 'ip/netmask' to make netaddr work
    try:
        network_address = netaddr.IPNetwork(address).network     # calculate and print network address
        print("/{0}".format(network_address))
    except Exception:
        print("/rack-unknown")                                   # print catch-all value if unable to calculate network address

bash Example

#!/usr/bin/env bash
# Here's a bash example to show just how simple these scripts can be
# Assuming we have flat network with everything on a single switch, we can fake a rack topology.
# This could occur in a lab environment where we have limited nodes, like 2-8 physical machines on an unmanaged switch.
# This may also apply to multiple virtual machines running on the same physical hardware.
# The number of machines isn't important, but that we are trying to fake a network topology when there isn't one.
#
#       +----------+    +--------+
#       |jobtracker|    |datanode|
#       +----------+    +--------+
#              \        /
#  +--------+  +--------+  +--------+
#  |datanode|--| switch |--|datanode|
#  +--------+  +--------+  +--------+
#              /        \
#       +--------+    +--------+
#       |datanode|    |namenode|
#       +--------+    +--------+
#
# With this network topology, we are treating each host as a rack.  This is being done by taking the last octet
# in the datanode's IP and prepending it with the word '/rack-'.  The advantage for doing this is so HDFS
# can create its 'off-rack' block copy.
# 1) 'echo $@' will echo all ARGV values to xargs.
# 2) 'xargs' will enforce that we print a single argv value per line
# 3) 'awk' will split fields on dots and append the last field to the string '/rack-'. If awk
#    fails to split on four dots, it will still print '/rack-' plus the last field value

echo $@ | xargs -n 1 | awk -F '.' '{print "/rack-"$NF}'
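Hadoop forks the script with worker IP addresses as ARGV, so the pipeline above can be checked by hand. The addresses below are made-up placeholders:

```shell
# Simulate what Hadoop would pass as ARGV (IPs are made up).
echo 10.1.1.15 10.1.1.16 | xargs -n 1 | awk -F '.' '{print "/rack-"$NF}'
# prints:
# /rack-15
# /rack-16
```

Each host lands in its own "rack" named after the last octet, which is enough to let HDFS place its off-rack replica on a flat network.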

Original source: https://hadoop.apache.org/docs/r3.2.0/
