Section #7. Debugging Network Throughput
To obtain optimal network throughput, you need to design your network interface card (NIC) driver for high performance. Additionally, you need an in-depth understanding of the network protocol that your driver ferries. To get a flavor of how to debug network throughput problems, let's examine some device driver design issues and protocol implementation characteristics that can affect the horsepower of your NIC.
Driver Performance
Let’s first take a look at some driver design issues that can affect your NIC’s performance:
• Minimizing the number of instructions in the main data path is a key criterion while designing drivers for fast NICs. Consider a 1 Gbps Ethernet adapter with 1 MB of onboard memory. At line rate, the card memory can hold up to about 8 milliseconds of received data (1 Gbps is roughly 125 MB per second, so a 1 MB buffer absorbs about 8 ms of traffic). This directly translates to the maximum allowable instruction path length. Within this path length, incoming packets have to be reassembled, DMA-ed to memory, processed by the driver, protected from concurrent access, and delivered to higher-layer protocols.
• During programmed I/O (PIO), data travels all the way from the device to the CPU before it gets written to memory. Moreover, the CPU is interrupted whenever the device needs to transfer data, and this contributes to latencies and context switch delays. DMA does not suffer from these bottlenecks but can turn out to be more expensive than PIO if the data to be transferred is smaller than a threshold, because small DMAs carry a high relative overhead for building descriptors and flushing the corresponding processor cache lines for data coherency. A performance-sensitive device driver might therefore use PIO for small packets and DMA for larger ones, after experimentally determining the threshold; the PIO-versus-DMA sketch following this list illustrates the idea.
• For PCI network cards having DMA mastering capability, you need to determine the optimal DMA burst size, which is the time for which the card controls the bus at one stretch. If the card bursts for a long duration, it might hog the bus and prevent the processor from keeping up with data DMA-ed previously. PCI drivers program the burst size via a register in the PCI configuration space. Normally the NIC's burst size is programmed to be the same as the cache line size of the processor, which is the number of bytes that the processor reads from system memory each time there is a cache miss (see the burst-size sketch following this list). In practice, however, you might need to connect a bus analyzer to determine the beneficial burst duration, because factors such as the presence of a split bus (multiple bus types such as ISA and PCI) on your system can influence the optimal value.
• Many high-speed NICs offer the capability to offload the CPU-intensive computation of TCP checksums from the protocol stack, and some also support DMA scatter-gather. The driver needs to leverage these capabilities to achieve the maximum practical bandwidth that the underlying network yields; the feature-flag sketch following this list shows how a driver advertises them to the stack.
• Sometimes, a driver optimization might create unexpected speed bumps if it's not sensitive to the implementation details of higher protocols. Consider an NFS-mounted filesystem on a computer equipped with a high-speed NIC. Assume that the NIC driver takes only occasional transmit-complete interrupts to minimize latencies, but that the NFS server implementation uses the freeing of its transmit buffers as a flow-control mechanism. Because the driver frees NFS transmit buffers only during the sparsely generated transmit-complete interrupts, file copies over NFS crawl, even as Internet downloads zip along yielding maximum throughput.
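The following is a minimal sketch of the PIO-versus-DMA decision described above, written against the Linux netdevice and DMA-mapping interfaces. The mycard_* structure and helpers are hypothetical stand-ins for device-specific register programming and descriptor setup, and PIO_DMA_THRESHOLD is a value you would arrive at only by measurement on your hardware:

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>
    #include <linux/dma-mapping.h>
    #include <linux/pci.h>

    #define PIO_DMA_THRESHOLD 256        /* Bytes; tune experimentally */

    struct mycard_priv {                 /* Hypothetical per-device state */
            struct pci_dev *pdev;
            /* ... rings, registers, locks ... */
    };

    /* Hypothetical device-specific helpers */
    void mycard_pio_tx(struct mycard_priv *priv, void *data, unsigned int len);
    void mycard_dma_tx(struct mycard_priv *priv, dma_addr_t addr, unsigned int len);

    static netdev_tx_t mycard_xmit_frame(struct sk_buff *skb, struct net_device *dev)
    {
            struct mycard_priv *priv = netdev_priv(dev);

            if (skb->len < PIO_DMA_THRESHOLD) {
                    /* Small packet: push it out via PIO and skip the
                       descriptor-build and cache-flush cost of a tiny DMA */
                    mycard_pio_tx(priv, skb->data, skb->len);
            } else {
                    /* Large packet: map the buffer and let the card master a DMA */
                    dma_addr_t mapping = dma_map_single(&priv->pdev->dev, skb->data,
                                                        skb->len, DMA_TO_DEVICE);
                    mycard_dma_tx(priv, mapping, skb->len);
            }
            /* ... account statistics and free the skb on transmit completion ... */
            return NETDEV_TX_OK;
    }

The threshold is a compile-time constant here only for brevity; in a real driver it could just as well be a module parameter so that it can be retuned without rebuilding.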
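The burst-size sketch below shows how a PCI driver's probe path might align the card's burst-related configuration with the processor's cache line size using the standard PCI configuration-space accessors. The PCI_CACHE_LINE_SIZE register is specified in units of 32-bit dwords, and Memory-Write-Invalidate (MWI) lets a bus-mastering card burst whole cache lines; mycard_tune_burst() is a hypothetical helper name:

    #include <linux/pci.h>
    #include <linux/cache.h>

    static int mycard_tune_burst(struct pci_dev *pdev)
    {
            u8 cls;

            /* Program the card's cache line size register to match the CPU,
               expressed in dwords (e.g., a 64-byte line is written as 16) */
            pci_write_config_byte(pdev, PCI_CACHE_LINE_SIZE,
                                  L1_CACHE_BYTES / sizeof(u32));

            /* Enable Memory-Write-Invalidate so the card bursts full cache lines */
            if (pci_set_mwi(pdev))
                    dev_warn(&pdev->dev, "MWI could not be enabled\n");

            pci_read_config_byte(pdev, PCI_CACHE_LINE_SIZE, &cls);
            dev_info(&pdev->dev, "cache line size programmed to %u bytes\n",
                     cls * (unsigned int)sizeof(u32));
            return 0;
    }

Treat this as a starting point only; as noted above, a bus analyzer remains the authoritative way to find the burst duration that works best on a given system.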
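Finally for this list, the feature-flag sketch: a brief illustration of how a Linux NIC driver advertises checksum offload and scatter-gather support to the protocol stack through the net_device feature flags (mycard_advertise_features() is a hypothetical name). Set only the flags that your silicon genuinely implements:

    #include <linux/netdevice.h>

    static void mycard_advertise_features(struct net_device *dev)
    {
            /* Let the stack skip software checksumming of outgoing IPv4
               TCP/UDP packets; the card fills in the checksum instead */
            dev->features |= NETIF_F_IP_CSUM;

            /* Accept scatter-gather transmit buffers so the stack need not
               linearize socket buffers before handing them to the driver */
            dev->features |= NETIF_F_SG;
    }

The transmit routine then has to live up to what it advertised: walk the fragment list described by skb_shinfo(skb)->nr_frags and arrange for the hardware to insert checksums into packets marked CHECKSUM_PARTIAL.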
Protocol Performance
Let's now dig into some protocol-specific characteristics that can boost or hurt network throughput:
• TCP window size can impact throughput. The window size provides a measure of the amount of data that can be transmitted before receiving an acknowledgment. For fast NICs, a small window size might result in TCP sitting idle, waiting for acknowledgments of packets already transmitted. Even with a large window size, a small number of lost TCP packets can affect performance, because lost frames can use up the window at line speeds. In the case of UDP, the window size is not relevant because UDP does not support acknowledgments; however, even a small packet loss can spiral into a big rate drop due to the absence of flow-control mechanisms. The socket-buffer sketch following this list shows one way an application can influence TCP window sizing.
• As the block size of application data written to TCP sockets increases, the number of buffers copied from user space to kernel space decreases. This lowers the demand on processor utilization and is good for performance. If the block size crosses the MTU corresponding to the network protocol, however, processor cycles get wasted on fragmentation. The desirable block size is thus the outgoing interface MTU, or, if Path MTU discovery mechanisms are in operation, the largest packet that can be sent without fragmentation through an IP path (the MTU-query sketch following this list shows how an application can discover this value). While running IP over ATM, for example, because the ATM adaptation layer has a 64K MTU, there is virtually no upper bound on block size. (RFC 1626 defaults this to 9180.) If you run IP over ATM LANE, however, the block size should mirror the MTU size of the respective LAN technology being emulated. It should thus be 1500 for standard Ethernet, 9000 for jumbo-frame Gigabit Ethernet, and 18 K for 16 Mbps Token Ring.
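As an illustration of window sizing from the application side, the socket-buffer sketch below enlarges a TCP socket's send and receive buffers with setsockopt(); the receive buffer bounds the window the kernel advertises, and the send buffer bounds how much unacknowledged data can stay in flight. The 512 KB figure is purely illustrative and should really be sized to the bandwidth-delay product of the path:

    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int open_bulk_socket(void)
    {
            int sock = socket(AF_INET, SOCK_STREAM, 0);
            int bufsize = 512 * 1024;    /* Illustrative; aim for ~bandwidth x RTT */

            if (sock < 0) {
                    perror("socket");
                    return -1;
            }
            /* Larger buffers allow more unacknowledged data in flight, so a
               fast NIC is not left idle waiting for ACKs */
            if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize)) < 0)
                    perror("SO_SNDBUF");
            if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize)) < 0)
                    perror("SO_RCVBUF");
            return sock;
    }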
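And the MTU-query sketch: the SIOCGIFMTU ioctl returns the configured MTU of a named interface ("eth0" below is just a placeholder), which an application can then use to size its socket writes and avoid fragmentation:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>

    /* Return the MTU of the named interface, or -1 on error */
    int interface_mtu(const char *ifname)
    {
            struct ifreq ifr;
            int mtu = -1;
            int sock = socket(AF_INET, SOCK_DGRAM, 0);

            if (sock < 0)
                    return -1;
            memset(&ifr, 0, sizeof(ifr));
            strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
            if (ioctl(sock, SIOCGIFMTU, &ifr) == 0)
                    mtu = ifr.ifr_mtu;   /* e.g., 1500 for standard Ethernet */
            close(sock);
            return mtu;
    }

    /* A bulk-transfer application would then write its data in blocks
       sized from interface_mtu("eth0") */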
Several tools are available to benchmark network performance. Netperf, available for free from www.netperf.org/, can set up complex TCP/UDP connection scenarios. You can use scripts to control characteristics such as protocol parameters, number of simultaneous sessions, and size of data blocks. Benchmarking is accomplished by comparing the resulting throughput with the maximum practical bandwidth that the networking technology yields. For example, a 155 Mbps ATM adapter produces a maximum IP throughput of 135 Mbps, taking into account the ATM cell header size, overheads due to the ATM Adaptation Layer (AAL), and the occasional maintenance cells sent by the physical Synchronous Optical Networking (SONET) layer.