The Secret to 10 Million Concurrent Connections: The Kernel Is the Root of the Problem


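This article explores how rethinking network programming and system architecture makes it possible to handle ten million concurrent connections. It argues that the traditional Unix kernel is no longer the best choice, and that custom drivers, multi-core optimization, and careful memory management should take over its work.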

Reposted from http://www.oschina.net/translate/the-secret-to-10-million-concurrent-connections-the-kernel

Translators (4):

DYOS, 裴宝亮, dexterman, LinuxQueen

Now that we have the C10K concurrent connection problem licked, how do we level up and support 10 million concurrent connections? Impossible, you say. Nope, systems right now are delivering 10 million concurrent connections using techniques that are as radical as they may be unfamiliar.

To learn how it’s done we turn to Robert Graham, CEO of Errata Security, and his absolutely fantastic talk at Shmoocon 2013 called C10M Defending The Internet At Scale.

Robert has a brilliant way of framing the problem that I’ve never heard of before. He starts with a little bit of history, relating how Unix wasn’t originally designed to be a general server OS; it was designed to be a control system for a telephone network. It was the telephone network that actually transported the data, so there was a clean separation between the control plane and the data plane. The problem is we now use Unix servers as part of the data plane, which we shouldn’t do at all. If we were designing a kernel for handling one application per server we would design it very differently than for a multi-user kernel.

我们现在已经搞定了C10K并发连接问题，升级一下，如何支持千万级的并发连接？你可能说，这不可能。你说错了，现在的系统可以支持千万级的并发连接，只不过所使用的那些激进的技术，并不为人所熟悉。

要了解这是如何做到的，我们得求助于Errata Security的CEO Robert Graham，看一下他在Shmoocon 2013上的精彩演讲，题目是C10M Defending The Internet At Scale。

Robert用一种我以前从未听过的绝妙方式来阐述这个问题。他的开场是一些历史：Unix最初并不是被设计成通用的服务器操作系统，而是被设计成电话网络的控制系统。真正传输数据的是电话网络，因而控制层和数据层有非常清晰的区分。问题是，我们现在把Unix服务器用作数据层的一部分，而这根本不应该。如果为每台服务器只运行一个应用程序的场景设计内核，它会与多用户系统的内核大不相同。

Which is why he says the key is to understand:

· The kernel isn’t the solution. The kernel is the problem.

Which means:

· Don’t let the kernel do all the heavy lifting. Take packet handling, memory management, and processor scheduling out of the kernel and put them into the application, where they can be done efficiently. Let Linux handle the control plane and let the application handle the data plane.

The result will be a system that can handle 10 million concurrent connections with 200 clock cycles for packet handling and 1400 clock cycles for application logic. As a main memory access costs 300 clock cycles, it’s key to design in a way that minimizes code and cache misses.

With a data plane oriented system you can process 10 million packets per second. With a control plane oriented system you only get 1 million packets per second.

If this seems extreme keep in mind the old saying: scalability is specialization. To do something great you can’t outsource performance to the OS. You have to do it yourself.

Now, let’s learn how Robert creates a system capable of handling 10 million concurrent connections...

这也是为什么他说重要的是要理解：

内核不是解决方案，内核是问题所在。

这意味着：

不要让内核去做所有繁重的工作。把数据包处理、内存管理以及处理器调度从内核移到应用程序中，在那里可以更高效地完成。让Linux去处理控制层，数据层由应用程序来处理。

结果就是一个可以处理1000万并发连接的系统：用200个时钟周期处理数据包，用1400个时钟周期处理应用程序逻辑。由于一次主存访问要花费300个时钟周期，设计的关键是尽量减少代码量和缓存未命中。

用一个面向数据层的系统，你可以每秒处理1000万个数据包。用一个面向控制层的系统，你每秒只能处理100万个数据包。

如果这貌似有点极端，记住一句老话：可扩展性就是专门化。要做出了不起的东西，你不能把性能外包给操作系统。你必须自己去做。

现在，让我们学习Robert是怎样创作一个能处理1000万并发连接的系统……

C10K Problem - So Last Decade

A decade ago engineers tackled the C10K scalability problems that prevented servers from handling more than 10,000 concurrent connections. This problem was solved by fixing OS kernels and moving away from threaded servers like Apache to event-driven servers like Nginx and Node. This process has taken a decade as people have been moving away from Apache to scalable servers. In the last few years we’ve seen faster adoption of scalable servers.

The Apache Problem

· The Apache problem is the more connections the worse the performance.

· Key insight: performance and scalability are orthogonal concepts. They don’t mean the same thing. When people talk about scale they often are talking about performance, but there’s a difference between scale and performance, as we’ll see with Apache.

· With short term connections that last a few seconds, say a quick transaction, if you are executing a 1000 TPS then you’ll only have about a 1000 concurrent connections to the server.

· Change the length of the transactions to 10 seconds, and now at 1000 TPS you’ll have 10K connections open. Apache’s performance drops off a cliff, though, which opens you to DoS attacks. Just do a lot of downloads and Apache falls over.

· If you are handling 5,000 connections per second and you want to handle 10K, what do you do? Let’s say you upgrade hardware and double the processor speed. What happens? You get double the performance but you don’t get double the scale. The scale may only go to 6K connections per second. The same thing happens if you keep on doubling. 16x the performance is great but you still haven’t got to 10K connections. Performance is not the same as scalability.

· The problem was Apache would fork a CGI process and then kill it. This didn’t scale.

· Why? Servers could not handle 10K concurrent connections because of O(n^2) algorithms used in the kernel.

o Two basic problems in the kernel:

§ Connection = thread/process. As a packet came in it would walk down all 10K processes in the kernel to figure out which thread should handle the packet

§ Connections = select/poll (single thread). Same scalability problem. Each packet had to walk a list of sockets.

o Solution: fix the kernel to make lookups in constant time

§ Thread context switches are now constant time regardless of the number of threads.

§ A new scalable epoll()/IOCompletionPort mechanism provides constant-time socket lookup.

· Thread scheduling still didn’t scale so servers scaled using epoll with sockets which led to the asynchronous programming model embodied in Node and Nginx. This shifted software to a different performance graph. Even with a slower server when you add more connections the performance doesn’t drop off a cliff. At 10K connections a laptop is even faster than a 16 core server.
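
The relationship between transaction rate, transaction length, and open connections in the bullets above can be checked with a few lines of arithmetic. This is a minimal sketch: the function name is hypothetical, and the one-second duration for the "quick transaction" case is an assumption chosen to match the text's figure of about 1000 concurrent connections.

```python
def concurrent_connections(transactions_per_second: int, seconds_per_transaction: int) -> int:
    """Steady-state open connections = arrival rate x time each connection stays open."""
    return transactions_per_second * seconds_per_transaction


# Assumed one-second transactions at 1000 TPS: roughly 1000 connections open at once.
short_lived = concurrent_connections(1_000, 1)

# The same 1000 TPS with 10-second transactions: ten times as many open connections.
long_lived = concurrent_connections(1_000, 10)

print(short_lived, long_lived)
```

The request rate is identical in both cases; only the number of simultaneously open connections changes, which is the sense in which scale is a separate axis from performance.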

C10K的问题——过去十年

十年前，工程师们着手解决C10K可扩展性问题，这个问题使服务器无法处理超过10,000个并发连接。通过修正操作系统内核，以及从线程式的服务器（如Apache）转向事件驱动型服务器（如Nginx和Node），这个问题得到了解决。人们从Apache迁移到可扩展的服务器，这个过程用了十年的时间。在过去的几年中，可扩展服务器的采用速度加快了。

Apache的问题

Apache的问题是，并发连接数越多，性能越差。
关键洞见：性能和可扩展性是相互正交的概念，它们指的不是同一件事。人们谈论规模的时候往往是在谈性能，但规模和性能是有区别的，Apache就是一个例子。
对于只持续几秒的短连接，比如快速事务，如果每秒处理1,000个事务，那么服务器上大约只有1,000个并发连接。
如果把事务的时长变为10秒，那么在每秒1,000个事务时就会有10K（10,000个）连接处于打开状态。这时Apache的性能会陡降，这也使你容易受到DoS攻击。只要发起大量下载，Apache就会宕掉。
如果你每秒处理5,000个连接，而想处理10,000个，你会怎么做？假设你升级硬件，把处理器速度提升为原来的两倍。会怎样？你得到了两倍的性能，却没有得到两倍的规模。规模或许只能达到每秒6,000个连接。继续翻倍也是同样的结果。即使性能提升到16倍，仍然达不到10,000个连接。性能和可扩展性不是一回事。
问题在于Apache会fork一个CGI进程，然后再把它杀掉，这种做法无法扩展。
为什么？因为内核中使用了O(n^2)算法，服务器无法处理10,000个并发连接。
- 内核中的两个基本问题：
  - 连接 = 线程/进程。当一个数据包到来时，内核会遍历全部10,000个进程，以决定由哪个线程处理这个包。
  - 连接 = select/poll（单线程）。同样的可扩展性问题。每个包都要遍历一遍socket列表。
- 解决方法：修正内核，使查找在常数时间内完成
  - 不管有多少线程，线程上下文切换的时间都是恒定的
  - 新的可扩展的epoll()/IOCompletionPort提供常数时间的socket查找
线程调度依然无法扩展，因此服务器改用epoll配合socket来扩展，这催生了Node和Nginx所体现的异步编程模型。这使软件进入了另一条性能曲线：即使服务器较慢，增加连接数时性能也不会陡降。在10K连接时，一台笔记本电脑甚至比一台16核的服务器还快。

The C10M Problem - The Next Decade

In the very near future servers will need to handle millions of concurrent connections. With IPv6 the number of potential connections from each server is in the millions so we need to go to the next level of scalability.

· Examples of applications that will need this sort of scalability: IDS/IPS, because they connect to a server backbone. Other examples: DNS root server, Tor node, Nmap of the Internet, video streaming, banking, carrier NAT, VoIP PBX, load balancer, web cache, firewall, email receive, spam filtering.

· Often the people who see Internet-scale problems build appliances rather than servers, because they are selling hardware + software. You buy the device and insert it into your datacenter. These devices may contain an Intel motherboard or network processors and specialized chips for encryption, packet inspection, etc.

· x86 prices on Newegg as of Feb 2013: $5K for 40 Gbps, 32 cores, 256 GB RAM. The servers can do more than 10K connections. If they can’t, it’s because you’ve made bad choices with software. It’s not the underlying hardware that’s the issue. This hardware can easily scale to 10 million concurrent connections.

C10M问题 ——下一个十年

在不久的将来，服务器将需要处理数百万的并发连接。有了IPv6，每一台服务器的潜在连接数将达到数百万，所以我们需要进入下一个可扩展性阶段。

需要这类可扩展性的应用程序示例：IDS/IPS，因为它们连接到服务器主干。其他例子：DNS根服务器、Tor节点、对整个互联网做Nmap扫描、视频流、银行业务、运营商级NAT、VoIP PBX、负载均衡器、web缓存、防火墙、邮件接收、垃圾邮件过滤。

遇到互联网规模问题的往往是做专用设备（appliance）而不是服务器的人，因为他们销售的是硬件+软件。你买来设备，把它装进你的数据中心。这些设备可能包含英特尔主板或网络处理器，以及用于加密、数据包检测等的专用芯片。

2013年2月新蛋上的x86价格：40 Gbps、32核、256 GB RAM的配置售价为$5000。这种服务器能够处理10K以上的连接。如果不能，那是因为你选错了软件，而不是底层硬件的问题。这样的硬件能够轻而易举地扩展到千万并发连接。

What the 10M Concurrent Connection Challenge means:

1. 10 million concurrent connections

2. 1 million connections/second - sustained rate at about 10 seconds per connection

3. 10 gigabits/second connection - fast connections to the Internet.

4. 10 million packets/second - expect current servers to handle 50K packets per second; this is going to a higher level. Servers used to be able to handle 100K interrupts per second and every packet caused an interrupt.

5. 10 microsecond latency - scalable servers might handle the scale but latency would spike.

6. 10 microsecond jitter - limit the maximum latency

7. 10 coherent CPU cores - software should scale to larger numbers of cores. Typically software only scales easily to four cores. Servers can scale to many more cores so software needs to be rewritten to support larger core machines.

10,000,000个并发连接挑战意味着什么

1. 10,000,000个并发连接

2. 每秒1,000,000个连接：每个连接大约持续10秒时的持续速率

3. 10千兆比特/秒的连接：快速连接到互联网。

4. 10,000,000包/秒：目前的服务器预计能处理50,000包/秒，这是要提升到更高的级别。服务器过去每秒能处理100,000次中断，而每个包都会引发中断。

5. 10微秒延迟：可扩展的服务器也许能够应付这样的规模，但延迟会飙升。

6. 10微秒抖动：限制最大延迟

7. 10个一致的CPU内核：软件应该能扩展到更多的内核。通常软件只能轻松地扩展到四个内核。服务器可以扩展到多得多的内核，所以软件需要重写以支持拥有更多内核的机器。

We’ve Learned Unix Not Network Programming

· A generation of programmers has learned network programming by reading Unix Network Programming by W. Richard Stevens. The problem is the book is about Unix, not just network programming. It tells you to let Unix do all the heavy lifting and you just write a small little server on top of Unix. But the kernel doesn’t scale. The solution is to move outside the kernel and do all the heavy lifting yourself.

· An example of the impact of this is to consider Apache’s thread per connection model. What this means is the thread scheduler determines which read() to call next depending on which data arrives. You are using the thread scheduling system as the packet scheduling system. (I really like this, never thought of it that way before).

· What Nginx says is: don’t use thread scheduling as the packet scheduler. Do the packet scheduling yourself. Use select to find the socket; we know it has data, so we can read immediately without blocking, and then process the data.

· Lesson: Let Unix handle the network stack, but you handle everything from that point on.
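
The "do the packet scheduling yourself" idea in the bullets above can be sketched with Python's standard `selectors` module, which picks the platform's readiness mechanism (epoll on Linux). This is only an illustration of the pattern, not how Nginx is implemented: the helper `drain_ready`, the connection labels, and the use of `socket.socketpair()` as stand-in connections are all assumptions made for the example.

```python
import selectors
import socket


def drain_ready(selector: selectors.BaseSelector, timeout: float = 1.0) -> dict:
    """Wait once for readiness, then read only the sockets reported as readable."""
    received = {}
    for key, _events in selector.select(timeout):
        # The selector already told us this socket has data, so recv() will not block.
        received[key.data] = key.fileobj.recv(4096)
    return received


selector = selectors.DefaultSelector()
client_a, server_a = socket.socketpair()  # stand-ins for two accepted connections
client_b, server_b = socket.socketpair()
for label, sock in (("a", server_a), ("b", server_b)):
    sock.setblocking(False)
    selector.register(sock, selectors.EVENT_READ, data=label)

client_a.sendall(b"hello")  # only connection "a" has data waiting
ready = drain_ready(selector)
print(ready)

selector.close()
for sock in (client_a, server_a, client_b, server_b):
    sock.close()
```

One thread decides which connection to service next based on which sockets have data, instead of leaving that decision to the thread scheduler.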

我们学的是Unix而不是网络编程（Network Programming）

一代程序员通过阅读W. Richard Stevens所著的《Unix网络编程》（Unix Network Programming）学习网络编程。问题是，这本书讲的是Unix，而不只是网络编程。它告诉你让Unix去做所有繁重的工作，你只需要在Unix之上写一个很小的服务器。然而内核无法扩展。解决方案是，把这些繁重的工作移到内核之外，自己来做。
这种影响的一个例子是Apache的每连接一个线程的模型。这意味着由线程调度器根据到达的数据来决定接下来调用哪一个read()。你是在把线程调度系统当做数据包调度系统来使用（我非常喜欢这个说法，之前从来没有这样想过）。
Nginx的做法是，不要把线程调度当作数据包调度来使用，而是自己做包调度。用select找到socket，我们知道它有数据，所以可以立即读取而不会阻塞，然后处理数据。
经验：让Unix处理网络协议栈，之后的一切由你自己来处理。

How do you write software that scales?

How do you change your software to make it scale? A lot of our rules of thumb about how much hardware can handle are false. We need to know what the performance capabilities actually are.

To go to the next level the problems we need to solve are:

1. packet scalability

2. multi-core scalability

3. memory scalability

你怎么编写软件使其可伸缩？

你怎么改变你的软件使其可伸缩？很多关于硬件处理能力的经验法则都是错的。我们需要知道实际的性能究竟如何。

要进入下一个等级，我们需要解决的问题是：

1. 包的可扩展性
2. 多核的可扩展性
3. 内存的可扩展性

Packet Scaling - Write Your Own Custom Driver to Bypass the Stack

· The problem with packets is they go through the Unix kernel. The network stack is complicated and slow. The path of packets to your application needs to be more direct. Don’t let the OS handle the packets.

· The way to do this is to write your own driver. All the driver does is send the packet to your application instead of through the stack. You can find drivers: PF_RING, Netmap, Intel DPDK (Data Plane Development Kit). At the time of the talk the Intel kit was closed source, but there was a lot of support around it.

· How fast? Intel has a benchmark where they process 80 million packets per second (200 clock cycles per packet) on a fairly lightweight server. This is through user mode too. The packet makes its way up through to user mode and then down again to go out. Linux doesn’t do more than a million packets per second when getting UDP packets up to user mode and out again. The performance ratio of a custom driver to Linux is 80 to 1.

· For the 10 million packets per second goal, if 200 clock cycles are used in getting the packet, that leaves 1400 clock cycles to implement functionality like a DNS/IDS.

· With PF_RING you get raw packets so you have to do your TCP stack. People are doing user mode stacks. For Intel there is an available TCP stack that offers really scalable performance.
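
The cycle figures in the bullets above are consistent with each other, which a short calculation makes visible. This is a sketch under one assumption that the text does not state outright: that the 10 million packets per second target runs on the same cycle budget as the 80 million packets per second benchmark. The constant names are made up for the example.

```python
BENCHMARK_PACKETS_PER_SECOND = 80_000_000  # benchmark rate quoted in the text
DRIVER_CYCLES_PER_PACKET = 200             # cost of getting one packet to user mode
TARGET_PACKETS_PER_SECOND = 10_000_000     # the C10M packet-rate goal

# Cycles per second implied by the benchmark (inferred, not stated in the talk).
cycles_per_second = BENCHMARK_PACKETS_PER_SECOND * DRIVER_CYCLES_PER_PACKET

# At the lower target rate, each packet can spend this many cycles in total.
total_cycles_per_packet = cycles_per_second // TARGET_PACKETS_PER_SECOND

# What remains for application logic after the driver's share.
app_cycles_per_packet = total_cycles_per_packet - DRIVER_CYCLES_PER_PACKET

print(cycles_per_second, total_cycles_per_packet, app_cycles_per_packet)
```

Under that assumption the 1400-cycle application budget falls out directly: 1600 cycles per packet in total, minus the 200 spent receiving it.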

包的可扩展性-编写自己的定制驱动来绕过协议栈

数据包的问题在于它们要经过Unix内核。网络协议栈复杂又慢。数据包到达应用程序的路径需要更加直接。不要让操作系统来处理数据包。
做到这一点的方法是编写自己的驱动程序。驱动程序要做的只是把数据包发送给你的应用程序，而不经过协议栈。现成的驱动有：PF_RING、Netmap、Intel DPDK（数据层开发套件）。在演讲当时英特尔的套件是闭源的，但围绕它有许多支持。
有多快呢？Intel有一个基准测试，在一台相当轻量的服务器上每秒可以处理8000万个数据包（每个数据包200个时钟周期）。这也是经过用户模式的：数据包向上到达用户模式，然后再向下发送出去。Linux在把UDP数据包送到用户模式再发出去时，每秒处理不超过100万个数据包。自定义驱动与Linux的性能比是80:1。
对于每秒1000万个数据包的目标，如果获取数据包用掉200个时钟周期，那么还剩下1400个时钟周期来实现类似DNS/IDS的功能。
用PF_RING得到的是原始数据包，所以你必须自己实现TCP协议栈。人们正在做用户模式的协议栈。对于Intel，已有一个提供真正可扩展性能的可用TCP协议栈。

Multi-Core Scalability

Multi-core scalability is not the same thing as multi-threading scalability. We’re all familiar with the idea that processors are not getting faster, but we are getting more of them.

Most code doesn’t scale past 4 cores. As we add more cores it’s not just that performance levels off; we can get slower and slower as we add more cores. That’s because software is written badly. We want software to scale nearly linearly as we add more cores, getting faster with each one.

Multi-threading coding is not multi-core coding

· Multi-threading:

o More than one thread per CPU core

o Locks to coordinate threads (done via system calls)

o Each thread a different task

· Multi-core:

o One thread per CPU core

o When two threads/cores access the same data they can’t stop and wait for each other

o All threads part of the same task

· Our problem is how to spread an application across many cores.

· Locks in Unix are implemented in the kernel. What happens at 4 cores using locks is that most software starts waiting for other threads to give up a lock. So the kernel starts eating up more performance than you gain from having more CPUs.

· What we need is an architecture that is more like a freeway than an intersection controlled by a stop light. We want no waiting where everyone continues at their own pace with as little overhead as possible.

· Solutions:

- Keep data structures per core. Then on aggregation read all the counters.
- Atomics. Instructions supported by the CPU that can be called from C. Guaranteed to be atomic, never conflict. Expensive, so you don’t want to use them for everything.
- Lock-free data structures. Accessible by threads that never stop and wait for each other. Don’t do it yourself. It’s very complex to work across different architectures.
- Threading model. Pipelined vs worker thread model. It’s not just synchronization that’s the problem, but how your threads are architected.
- Processor affinity. Tell the OS to use the first two cores. Then set which cores your threads run on. You can also do the same thing with interrupts. So you own these CPUs and Linux doesn’t.
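
The first solution in the list, per-core data structures that are only combined on aggregation, can be sketched as follows. This shows the data layout only: the names are hypothetical, and because CPython threads do not run bytecode in parallel, it does not demonstrate a real speed-up. The point is that each worker writes only its own slot, so no lock is needed on the hot path.

```python
import threading

NUM_CORES = 4              # assumed number of worker threads, one per core
PACKETS_PER_CORE = 10_000  # made-up workload for the example

# One counter slot per worker; no slot is ever written by two threads.
per_core_packets = [0] * NUM_CORES


def handle_packets(core_index: int) -> None:
    """Count packets in this worker's own slot, without taking any lock."""
    for _ in range(PACKETS_PER_CORE):
        per_core_packets[core_index] += 1


workers = [threading.Thread(target=handle_packets, args=(i,)) for i in range(NUM_CORES)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

# Aggregation reads every per-core counter; it happens rarely, off the hot path.
total_packets = sum(per_core_packets)
print(total_packets)
```

A single shared counter would force every worker to coordinate on each packet; splitting it moves the coordination cost to the occasional read.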

多核的可扩展性

多核的可扩展性和多线程可扩展性是不一样的。我们都熟悉这样一个观点：处理器不再变得更快，但我们拥有的处理器越来越多。

大多数代码无法扩展到4核以上。当我们添加更多的核心时，不只是性能趋于平稳，而是可能越来越慢。这是因为软件写得不好。我们希望软件随着核心数的增加接近线性地扩展，核心越多就越快。

多线程编程不是多核编程

· 多线程：

- 每个CPU核心有多个线程
- 用锁来协调线程（通过系统调用实现）
- 每个线程执行不同的任务

· 多核：

- 每个CPU核心一个线程
- 当两个线程/核心访问同一数据时，它们不能停下来相互等待
- 所有线程是同一任务的一部分
我们的问题是如何让一个程序能扩展到多个核心。
Unix中的锁是在内核中实现的。在4核上使用锁时，大多数软件会开始等待其他线程释放锁。于是内核消耗掉的性能，比你从更多CPU中获得的还要多。
我们需要的是一个更像高速公路的架构，而不是一个靠红绿灯控制的十字路口。我们希望没有等待，每个人都以自己的节奏前进，开销尽可能小。
解决方案:
- 为每个核心保留独立的数据结构，聚合时再读取所有的计数器。
- 原子操作。CPU支持的、可以从C调用的指令。保证是原子的，绝不冲突。开销很大，所以不要什么都用它。
- 无锁的数据结构。线程访问时从不停下来相互等待。不要自己来做，要在不同架构上都能工作是非常复杂的。
- 线程模型。流水线模型与工作线程模型。问题不仅仅是同步，还在于你的线程是如何组织的。
- 处理器亲和性。告诉操作系统使用前两个核心，然后设置你的线程运行在哪些核心上。对中断也可以做同样的事。这样这些CPU就归你所有，而不归Linux。

Memory Scalability

· The problem is if you have 20 GB of RAM and let’s say you use 2 KB per connection, then if you only have a 20 MB L3 cache, almost none of that data will be in cache. It costs 300 clock cycles to go out to main memory, during which time the CPU isn’t doing anything.

· Think about this with our 1400 clock cycle budget per packet. Remember the 200 clocks/pkt overhead. We can only afford 4 cache misses per packet and that's a problem.

· Co-locate Data

o Don’t scribble data all over memory via pointers. Each time you follow a pointer it will be a cache miss: [hash pointer] -> [Task Control Block] -> [Socket] -> [App]. That’s four cache misses.

o Keep all the data together in one chunk of memory: [TCB | Socket | App]. Prereserve memory by preallocating all the blocks. This reduces cache misses from 4 to 1.

· Paging

o Mapping 32 GB of memory requires 64 MB of page tables, which doesn’t fit in cache. So you have two cache misses, one for the page table and one for what it points to. This is a detail we can’t ignore for scalable software.

o Solutions: compress data; use cache-efficient structures instead of a binary search tree, which has a lot of memory accesses

o NUMA architectures double the main memory access time. Memory may not be on a local socket but is on another socket.

· Memory pools

o Preallocate all memory all at once on startup.

o Allocate on a per object, per thread, and per socket basis.

· Hyper-threading

o Network processors can run up to 4 threads per processor, Intel only has 2.

o This masks the latency, for example, from memory accesses because when one thread waits the other goes at full speed.

· Hugepages

o Reduces page table size. Reserve memory from the start and then your application manages the memory.
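
The co-location and memory pool points above can be sketched in a few lines. This is an illustration only: the record layout (TCP state, socket id, and a byte counter standing in for application state) and the `ConnectionPool` class are hypothetical, and a Python `bytearray` merely mimics the contiguous, preallocated block that a C implementation would place in hugepages.

```python
import struct

# Hypothetical fixed-size record holding [TCB | Socket | App] state side by side.
RECORD = struct.Struct("<IIQ")  # tcp_state, socket_id, bytes_seen: 16 bytes total


class ConnectionPool:
    """All connection records live in one block allocated once at startup."""

    def __init__(self, capacity: int) -> None:
        self.buffer = bytearray(RECORD.size * capacity)  # single up-front allocation
        self.free_slots = list(range(capacity - 1, -1, -1))

    def allocate(self, tcp_state: int, socket_id: int) -> int:
        slot = self.free_slots.pop()  # reuse a preallocated slot, no new allocation
        RECORD.pack_into(self.buffer, slot * RECORD.size, tcp_state, socket_id, 0)
        return slot

    def add_bytes(self, slot: int, count: int) -> None:
        offset = slot * RECORD.size
        tcp_state, socket_id, seen = RECORD.unpack_from(self.buffer, offset)
        RECORD.pack_into(self.buffer, offset, tcp_state, socket_id, seen + count)

    def read(self, slot: int) -> tuple:
        return RECORD.unpack_from(self.buffer, slot * RECORD.size)

    def release(self, slot: int) -> None:
        self.free_slots.append(slot)


pool = ConnectionPool(capacity=1024)
slot = pool.allocate(tcp_state=1, socket_id=42)
pool.add_bytes(slot, 1500)
record = pool.read(slot)
print(slot, record)
```

Because everything about a connection sits in one record, handling a packet touches one region of memory instead of chasing a chain of pointers, which is the reduction from four cache misses to one that the text describes.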

内存的可扩展性

问题：假设你有20G内存（RAM），每个连接占用2K，而你只有20M三级缓存（L3 cache），那么这些数据几乎都不在缓存中。访问一次主存要花费300个时钟周期，此时CPU处于空闲状态。
结合每个包1400个时钟周期的预算来想一想。切记还有每包200个时钟周期的开销。每个包只能承受4次高速缓存缺失，这是个问题。
协同定位数据
- 不要使用指针在整个内存中随便乱放数据。每次你跟踪一个指针都会造成一次高速缓存缺失：[hash pointer] -> [Task Control Block] -> [Socket] -> [App]。这造成了4次高速缓存缺失。
- 将所有的数据保持在一个内存块中：[TCB | Socket | App].为每个内存块预分配内存。这样会将高速缓存缺失从4降低到1。
分页
- 32G的数据需要占用64M的分页表，不适合都放在高速缓存上。所以造成2个高速缓存缺失，一个是分页表另一个是它指向的数据。这些细节在开发可扩展软件时是不可忽略的。
- 解决：压缩数据；使用对缓存友好的数据结构，而不是需要大量内存访问的二叉搜索树。
- NUMA加倍了主内存的访问时间。内存有可能不在本地，而在其它地方。
内存池
- 在启动时立即分配所有的内存。
- 在对象（object）、线程（thread）和socket的基础上分配（内存）。
超线程
- 网络处理器每个处理器最多能运行4个线程，Intel只能运行2个。
- 这可以掩盖延迟，比如内存访问的延迟，因为当一个线程等待时，另一个线程可以全速运行。
大内存页
- 减小页表的大小。从一开始就预留内存，并且让应用程序管理（内存）。
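
The numbers in the memory section can be cross-checked with simple arithmetic. This is a sketch under stated assumptions: 4 KiB base pages, 2 MiB hugepages, and 8 bytes per page table entry are typical x86-64 values that the text does not spell out, and only the last level of the page table is counted.

```python
KIB, MIB, GIB = 1024, 1024**2, 1024**3

# Connection state: 10 million connections at 2 KiB each is roughly 20 GB.
connection_state_bytes = 10_000_000 * 2 * KIB

# Cache misses that fit in the per-packet application budget.
cache_misses_per_packet = 1400 // 300  # 1400-cycle budget, 300 cycles per miss

# Page table size for 32 GiB (assumed: 8-byte entries, last level only).
ENTRY_BYTES = 8
page_table_4k = (32 * GIB // (4 * KIB)) * ENTRY_BYTES
page_table_2m = (32 * GIB // (2 * MIB)) * ENTRY_BYTES

print(connection_state_bytes / GIB, cache_misses_per_packet)
print(page_table_4k // MIB, "MiB with 4 KiB pages")
print(page_table_2m // KIB, "KiB with 2 MiB hugepages")
```

With 4 KiB pages the result matches the 64 MB figure quoted in the text, and the hugepage line shows why reserving hugepages shrinks the page table enough to stay cache-resident.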

Summary

NIC
- Problem: going through the kernel doesn’t work well.
- Solution: take the adapter away from the OS by using your own driver and manage it yourself.
CPU
- Problem: if you use traditional kernel methods to coordinate your application it doesn’t work well.
- Solution: give Linux the first two CPUs and let your application manage the remaining CPUs. No interrupts will happen on those CPUs that you don’t allow.
Memory
- Problem: memory takes special care to make it work well.
- Solution: at system startup allocate most of the memory in hugepages that you manage.
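
The CPU item in the summary, leaving the first two CPUs to Linux and taking the rest for the application, can be sketched with `os.sched_setaffinity`, which Python exposes on Linux. The helper names are hypothetical. Pinning a process this way only restricts where that process runs; keeping the kernel's own work and interrupts off the remaining cores needs additional system configuration that is not shown here.

```python
import os


def split_cpus(all_cpus, reserved_for_os: int = 2):
    """Return (CPUs left to the OS, CPUs claimed by the application)."""
    ordered = sorted(all_cpus)
    return set(ordered[:reserved_for_os]), set(ordered[reserved_for_os:])


def pin_current_process(cpus) -> bool:
    """Pin the calling process to the given CPUs; False where unsupported or empty."""
    if not cpus or not hasattr(os, "sched_setaffinity"):
        return False
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return True


# Pure calculation on an assumed 8-CPU machine.
os_cpus, app_cpus = split_cpus(range(8))
print(os_cpus, app_cpus)

if __name__ == "__main__" and hasattr(os, "sched_getaffinity"):
    # On Linux, apply the same split to the CPUs this process may actually use.
    _, mine = split_cpus(os.sched_getaffinity(0))
    print("pinned:", pin_current_process(mine))
```

On a machine with two or fewer usable CPUs the application set is empty and nothing is pinned, which is why the helper reports a boolean instead of raising.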

The control plane is left to Linux, for the data plane, nothing. The data plane runs in application code. It never interacts with the kernel. There’s no thread scheduling, no system calls, no interrupts, nothing.

Yet, what you have is code running on Linux that you can debug normally; it’s not some sort of weird hardware system that you need a custom engineer for. You get the performance of custom hardware that you would expect for your data plane, but with your familiar programming and development environment.

总结

网卡（NIC,Network Interface Card）
- 问题：通过内核驱动并不完美。
- 解决：使用你自已的驱动程序管理它（网卡），使适配器（网卡）远离操作系统（内核）。
CPU
- 问题：采用传统的内核方法来协调应用程序，效果并不是很好。
- 解决：让Linux管理前两个CPU，您的应用程序管理其余的CPU。这样中断就不会发生在不允许的CPU上。
内存
- 问题：为了使其更好的工作，（内存）需要特别的关照。
- 解决：在系统启动时就给你管理的大页面（hugepages ）分配大多数的内存.