Troubleshooting Network Teaming Problems in ESX/ESXi

This article walks through a concrete case of troubleshooting a virtual machine network connectivity problem: starting with gathering basic information, working down through the physical network adapters and switch configuration, and ultimately identifying and resolving an issue caused by a misconfigured physical switch port.

Perhaps one of the most common types of problems we encounter here at VMware Technical Support relates to loss of network connectivity to one or more virtual machines on a host. As vague and simple as that description is, it may not always be clear where to even start looking for a solution.

Networking in vSphere is a very far-reaching topic with many layers. You’ve got your virtual machines, their respective network stacks, Standard vSwitches, Distributed Switches, different load balancing types, physical network adapters, their respective drivers, the VMkernel itself – and we haven’t even begun to mention what lies outside of the host yet.

It may be tempting to quickly draw conclusions and dive right into packet captures and even guest operating system troubleshooting, but knowing what questions to ask can make all the difference when trying to narrow down the problem. More often than not, it winds up being something quite simple.

When it comes to troubleshooting virtual machine network connectivity, the best place to start is to simply gather information relating to the problem – what works and what doesn’t. Just because you can’t ping something does not mean the virtual machine is completely isolated. It’s always best not to draw conclusions and to be as logical and methodical as possible. Get a better view of the whole picture, and then narrow down in areas where it makes sense.

Some initial discovery troubleshooting I’d recommend:

  • Is this impacting all virtual machines, or a subset of virtual machines?
  • If you move this VM to another host using vMotion or Cold Migration, does the issue persist?
  • Is there anything in common between the virtual machines having a problem? For example, are they all in a specific VLAN?
  • Are the virtual machines able to communicate with each other on the same vSwitch and Port Group? Or is this a complete loss of network connectivity to the troubled VMs?
  • Have you been able to regain connectivity by doing anything? Or is it persistent?

Let me walk you through an example of a common problem in a simple vSphere 5.0 environment utilizing vNetwork Standard Switches. In this example, we have four Ubuntu Server virtual machines on a single ESXi 5.0 host. Three of them are not exhibiting any problems, but for some unknown reason, one of them, ubuntu1, appears to have no network connectivity.

[Screenshot: the four Ubuntu virtual machines, with ubuntu1 showing no network connectivity]

First, we’ll want to have a quick look at the host’s network configuration from the vSphere Client. As you can see below, we have four virtual machines spread across two VLANs – half of them in VLAN 5 and the other half in VLAN 6. Two gigabit network adapters are being used as uplinks in this vSwitch.

Based on this simple vSwitch depiction, we can draw several conclusions that may come in handy during our troubleshooting. First, we can see that VLAN IDs are specified for the two port groups. This indicates that we are doing VST or Virtual Switch Tagging. For this to work correctly, vmnic1 and vmnic2 in the network team must be configured for VLAN trunking using the 802.1q protocol on the physical switch. We’ll keep this information in the back of our heads for now and move on.
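
If you want to confirm the port group VLAN assignments from the command line rather than the vSphere Client, esxcli can list them. A minimal sketch – the vSwitch name (vSwitch1) and the VLAN 6 port group name are assumptions, and the output is abridged and illustrative:

    # List the port groups on the host's standard vSwitches,
    # including the VLAN ID tagged on each one.
    ~ # esxcli network vswitch standard portgroup list
    Name       Virtual Switch  Active Clients  VLAN ID
    ---------  --------------  --------------  -------
    VM VLAN 5  vSwitch1                     2        5
    VM VLAN 6  vSwitch1                     2        6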

Next, since we have a pair of network adapters in the team, we need to determine the load balancing type employed (this could change the troubleshooting approach quite a bit, so you’ll want to know this right up front). As seen below, the vSwitch properties show that the default load balancing method – Route based on the originating virtual port ID – is being used.
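
The same check can be made from the command line. A quick sketch, again assuming the vSwitch is named vSwitch1 – the ‘srcport’ value corresponds to ‘Route based on the originating virtual port ID’:

    # Show the teaming and failover policy configured on the vSwitch.
    ~ # esxcli network vswitch standard policy failover get -v vSwitch1
       Load Balancing: srcport
       Network Failure Detection: link
       Notify Switches: true
       Failback: true
       Active Adapters: vmnic1, vmnic2
       Standby Adapters:
       Unused Adapters: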

With this, we know several things about the way the network should be configured.
From a virtual switch perspective, we know that we’ll have a spread of virtual machines across both network adapters as both of the uplinks are in an Active state. We’ll have a one-to-one mapping of virtual machine network adapters to physical adapters. Assuming each virtual machine has only a single virtual network card, each virtual machine will be bound to a single physical adapter on the ESXi host. In theory, we should have a rough 50/50 split of virtual machines across both physical adapters in the team. It may not be a perfect 50/50 split across all VMs, but any single VM on this vSwitch has an equal chance of being on vmnic1 or vmnic2.

From the physical switch, each of these vmnics should connect to independent and identically configured 802.1q VLAN trunk ports – this is important. With the default ‘Route based on originating virtual port ID’ load balancing type, the physical switch should not be configured for link aggregation (802.3ad, also known as EtherChannel in the Cisco world, or trunks in the HP world). Link aggregation or bonding should only be used with the ‘Route based on IP hash’ load balancing type.

With this small amount of information, we are able to determine quite a bit about the way the environment should be configured in both ESXi and on the physical switch, and how the load balancing should be behaving.

To recap:

  1. Virtual machine vNICs will be bound in a one-to-one mapping with physical network adapters on the host.
  2. We are doing virtual switch tagging, so the virtual machines should not have any VLANs configured within the guest operating system.
  3. On the physical switch, vmnic1 and vmnic2 should connect to two independent switch ports configured identically as 802.1q VLAN trunks.
  4. Each physical switch port must be configured to allow both VLAN 5 and VLAN 6 as we are tagging both of these VLANs within the vSwitch.
  5. An etherchannel or port-channel should not be configured for these two switch ports.

From here, we should now try to determine more about the virtual machine’s problem. We’ll start by doing some basic ICMP ping testing between various devices we know to be online:

From ubuntu1:

  • Pinging the default gateway: Fails.
  • Pinging another virtual machine in VLAN5 on another ESXi host: Fails.
  • Pinging Ubuntu2 on the same host: Fails.
  • Pinging Ubuntu4 on the same host: Succeeds.

This very simple test provides us with a wealth of additional information that will help us narrow things down. From the four pings we just ran, we can determine the following:

Pinging the default gateway fails

Because other VMs, including Ubuntu4 can ping the default gateway, we know that it is responding to ICMP requests. A VM’s default gateway should always be in the same subnet/VLAN, so in this case, the core switch’s interface being pinged is also in VLAN 5, so no routing is being done to access it. If the VM can’t communicate with its gateway, we know that anything outside of VLAN5 will not be able to communicate with it due to loss of routing capability.

Pinging another virtual machine in VLAN5 on another ESXi host

This confirms that the VM is unable to communicate with anything on the physical network and that it is not just a VM to gateway communication problem. Since this target VM being pinged is also in VLAN 5, no routing is required and this appears to be a lower-level problem.

Pinging Ubuntu2 on the same host

Even though these two VMs are on the same host, they are in different VLANs and different IP subnets. vSwitches are layer-2 only and do not perform any routing. Communication between VLANs on the same vSwitch requires routing, and all traffic would have to go out to a router on the physical network for this to work. Since the ubuntu1 VM can’t reach its gateway on the physical network, this will obviously not work. Be careful when pinging between VMs on the same host – if they are not in the same VLAN, routing will be required and all traffic between the VMs will go out and back in via the physical adapters.

Pinging Ubuntu4 on the same host

Pinging ubuntu4 in the same VLAN, in the same IP subnet and in the same vSwitch works correctly. This piece of information is very useful because it confirms that the VM’s network stack in the guest operating system is indeed working to some extent. Because ubuntu1 and ubuntu4 share the same portgroup, all communication remains within the vSwitch and does not need to traverse the physical network.

But why is ubuntu1 not working correctly, and ubuntu4 is? This is a rhetorical question by the way. Let’s think back to what we saw earlier – we have two physical NICs, and four virtual machines employing ‘Route based on the originating virtual port ID’ load balancing. The ubuntu1 VM cannot communicate on the physical network, but can communicate on the vSwitch. Clearly something is preventing this virtual machine from communicating out to the physical network.

Let’s recap what we’ve learned to this point:

  • We know the VM’s guest operating system networking stack is working correctly.
  • We know that the vSwitch is configured correctly, as ubuntu4 works fine and is in the same VLAN and Port Group.
  • The problem lies in ubuntu1’s inability to communicate out to the physical network.

When thinking about this problem from an end-to-end perspective, the next logical place to look would be the physical network adapters. We know that we have two adapters in a NIC team, but which NIC is ubuntu1 actually using? Which NIC is ubuntu4 using? To determine this, we’ll need to connect to the host using SSH or via the Local Tech Support Mode console and use a tool called esxtop. From within esxtop, we simply hit ‘n’ for the networking view and immediately obtain some very interesting information:
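
The original screenshot is not reproduced here, but an abridged, illustrative mock-up of the esxtop network view looks roughly like this (port IDs and names are made up; the column that matters is TEAM-PNIC, which shows the physical uplink each port is currently bound to):

    # From the ESXi shell, start esxtop and press 'n' for the network view.
    ~ # esxtop

       PORT-ID  USED-BY        TEAM-PNIC  DNAME
       50331650 ubuntu1.eth0   vmnic1     vSwitch1
       50331651 ubuntu2.eth0   vmnic2     vSwitch1
       50331652 ubuntu3.eth0   vmnic2     vSwitch1
       50331653 ubuntu4.eth0   vmnic2     vSwitch1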

As you can see above, this view tells us exactly which physical network adapter each virtual machine is currently bound to. There is a very clear pattern that we can see almost immediately – ubuntu1 is the only virtual machine currently utilizing vmnic1. It is unlikely that this is a mere coincidence, but most vSphere administrators may not be able to approach their network teams without more information.
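
One more quick piece of evidence worth collecting is whether vmnic1 even reports a physical link. A simple check from the ESXi shell (columns trimmed and values illustrative):

    # List the physical adapters with their link state, speed and duplex.
    ~ # esxcli network nic list
    Name    Driver  Link  Speed  Duplex  MTU
    vmnic1  e1000e  Up     1000  Full    1500
    vmnic2  e1000e  Up     1000  Full    1500

If both adapters report link at the expected speed and duplex, a simple cable or link failure is unlikely, and the suspicion shifts toward how the switch ports themselves are configured.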

Assuming there really is a problem with vmnic1 or its associated physical switch port, we would expect the VM to regain connectivity if we forced it to use vmnic2 instead. Let’s confirm this theory.

The easiest way to accomplish this would be to create a temporary port group identical to VM VLAN 5, but configured to use only vmnic2. That way when we put ubuntu1’s virtual NIC into that port group, it will have no choice but to use vmnic2. In our example, I created a new port group called ‘vmnic2 forced’:

After creating this new temporary port group, you simply need to check the ‘Override switch failover order’ checkbox in the NIC Teaming tab and ensure that only the desired adapter is listed as Active. All other adapters should be moved down to Unused. In our case, vmnic2 is now the only active adapter associated with this port group called ‘vmnic2 forced’. We can now edit the settings of ubuntu1 and configure its virtual network card to connect to this new port group:
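
For those who prefer the command line, the same temporary port group can be created and pinned to vmnic2 with esxcli. A rough sketch, assuming the vSwitch is named vSwitch1 (the port group name matches the one used above):

    # Create the temporary port group on the same vSwitch and tag it for VLAN 5.
    ~ # esxcli network vswitch standard portgroup add -p "vmnic2 forced" -v vSwitch1
    ~ # esxcli network vswitch standard portgroup set -p "vmnic2 forced" --vlan-id 5

    # Override the vSwitch failover order for this port group only,
    # leaving vmnic2 as the sole active uplink.
    ~ # esxcli network vswitch standard portgroup policy failover set \
          -p "vmnic2 forced" --active-uplinks vmnic2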

Immediately after making this change – even if the VM is up and running – we should see esxtop reflect the new configuration:

As you can see above, ubuntu1 is now forced on vmnic2, consistent with the other VMs that are not having a problem.

We can then repeat our ping tests:

From ubuntu1:

  • Pinging the default gateway: Success.
  • Pinging another virtual machine in VLAN5 on another ESXi host: Success.
  • Pinging Ubuntu2 on the same host: Success.
  • Pinging Ubuntu4 on the same host: Success.

And there you have it. We’ve essentially proven that there is some kind of a problem communicating out of vmnic1. Moving further along the communication path, it would now be a good idea to examine the physical switch configuration for these two ports associated with vmnic1 and vmnic2. For those of you who work in a large corporation, I’d say you would have enough proof to go to your network administration team to present a case for investigation on the physical side at this point.

In this example, our ESXi 5.0 host is connected to a single Cisco 2960G gigabit switch and we’ll assume that we’ve been able to physically confirm that vmnic1 plugs into port g1/0/10 and that vmnic2 plugs into port g1/0/11. Finding and confirming the correct upstream physical switch ports is outside of the scope of this example today, but we’ll just assume we’re 100% certain.
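
The switch configuration screenshots are not reproduced here, but the relevant portion of the running configuration would look something like the illustrative snippet below (exact syntax varies by platform and IOS version; the detail to focus on is the allowed VLAN list on each port):

    ! Port connected to vmnic1 - VLAN 5 is missing from the allowed list
    interface GigabitEthernet1/0/10
     switchport mode trunk
     switchport trunk allowed vlan 6
    !
    ! Port connected to vmnic2 - both VLANs are allowed
    interface GigabitEthernet1/0/11
     switchport mode trunk
     switchport trunk allowed vlan 5,6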

Even if you are not familiar with Cisco IOS commands and switch port configuration, it isn’t difficult to see that there is indeed a difference between g1/0/10 and g1/0/11. It appears that only g1/0/11 is configured to allow VLAN 5. So this problem may have gone unnoticed for some time as it would only impact virtual machines in VLAN 5, and there is about a 50/50 chance they would fall on vmnic1. To make matters worse, certain operations may actually cause a troubled VM to switch from vmnic1 to vmnic2, which can sometimes add to the confusion.
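
Assuming the intent is for both uplinks to trunk VLANs 5 and 6, the remediation on the switch side is typically a one-line change – an illustrative IOS sketch:

    switch# configure terminal
    switch(config)# interface GigabitEthernet1/0/10
    switch(config-if)# switchport trunk allowed vlan add 5
    switch(config-if)# end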

It is not uncommon for vSphere administrators to try several things to get the VM to come online again. Some of the actions that may cause the VM to become bound to a different vmnic include:

  • vMotion to another host, and then back again.
  • Powering off, and then powering on the VM again.
  • Removing the virtual network card and then adding it back – whether it’s the same or a different adapter type.

Any of the above tasks gives you about a 50% chance that connectivity will be restored when dealing with two physical network cards. You may try one of the above actions and the problem persists – you may try it again and things start to work. This often gives administrators the false impression that it was just something quirky with the VM’s virtual network adapter, or a problem with ESXi itself. This is why taking a methodical approach is ideal, as it will eventually lead you to the true cause. Remember, each physical switch port associated with a network team should always be configured identically.

Well, that’s it for today. I’m hoping to provide some other example scenarios that lead to other root problems, or ones that deal with different load balancing types. Thanks for reading.

This entry was posted in Datacenter, From the Trenches, How-to by Mike Da Costa.