2011-8-1 20:42:02

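By comparing the SLUB memory allocator's behavior at different thread counts, this article analyzes in detail the performance difference between its fast path and slow path, and discusses how tuning the configuration could improve overall performance.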

 

On Mon, 1 Aug 2011, Pekka Enberg wrote:

> > More interesting than the perf report (which just shows kfree,
> > kmem_cache_free, kmem_cache_alloc dominating) is the statistics that
> > are exported by slub itself, it shows the "slab thrashing" issue
> > that I described several times over the past few years.  It's
> > difficult to address because it's a result of slub's design.  From
> > the client side of 160 netperf TCP_RR threads for 60 seconds:
> >
> >  cache           alloc_fastpath       alloc_slowpath
> >  kmalloc-256     10937512 (62.8%)     6490753
> >  kmalloc-1024    17121172 (98.3%)     303547
> >  kmalloc-4096    5526281              11910454 (68.3%)
> >
> >  cache           free_fastpath        free_slowpath
> >  kmalloc-256     15469                17412798 (99.9%)
> >  kmalloc-1024    11604742 (66.6%)     5819973
> >  kmalloc-4096    14848                17421902 (99.9%)
> >
> > With those stats, there's no way that slub will even be able to
> > compete with slab because it's not optimized for the slowpath.
>
> Is the slowpath being hit more often with 160 vs 16 threads?

Here's the same testing environment with CONFIG_SLUB_STATS for 16 threads instead of 160:

 cache           alloc_fastpath       alloc_slowpath
 kmalloc-256     4263275 (91.1%)      417445
 kmalloc-1024    4636360 (99.1%)      42091
 kmalloc-4096    2570312 (54.4%)      2155946

 cache           free_fastpath        free_slowpath
 kmalloc-256     210115               4470604 (95.5%)
 kmalloc-1024    3579699 (76.5%)      1098764
 kmalloc-4096    67616                4658678 (98.6%)
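
To put the two runs side by side, here is a small sketch that recomputes the slowpath share for each cache from the counters quoted in this thread. The numbers are copied from the tables above; the helper names (`slowpath_share`, `compare`) are made up for illustration and are not part of any kernel tooling.

```python
# (fastpath, slowpath) counters copied from the tables in this thread.
STATS = {
    160: {
        "alloc": {"kmalloc-256": (10937512, 6490753),
                  "kmalloc-1024": (17121172, 303547),
                  "kmalloc-4096": (5526281, 11910454)},
        "free": {"kmalloc-256": (15469, 17412798),
                 "kmalloc-1024": (11604742, 5819973),
                 "kmalloc-4096": (14848, 17421902)},
    },
    16: {
        "alloc": {"kmalloc-256": (4263275, 417445),
                  "kmalloc-1024": (4636360, 42091),
                  "kmalloc-4096": (2570312, 2155946)},
        "free": {"kmalloc-256": (210115, 4470604),
                 "kmalloc-1024": (3579699, 1098764),
                 "kmalloc-4096": (67616, 4658678)},
    },
}


def slowpath_share(fast, slow):
    """Fraction of operations that went through the slowpath."""
    return slow / (fast + slow)


def compare(op, cache):
    """Slowpath percentage for one cache at 16 and at 160 threads."""
    return {threads: 100.0 * slowpath_share(*STATS[threads][op][cache])
            for threads in (16, 160)}


if __name__ == "__main__":
    for op in ("alloc", "free"):
        for cache in STATS[16][op]:
            shares = compare(op, cache)
            print(f"{op:5s} {cache:13s} "
                  f"16 threads: {shares[16]:5.1f}%  "
                  f"160 threads: {shares[160]:5.1f}%")
```

Run as a script, this prints one line per cache and operation, which makes it easy to see where the slowpath share grows between 16 and 160 threads.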

Keep in mind that this is a default slub configuration, so kmalloc-256 has order-1 slabs and both kmalloc-1k and kmalloc-4k have order-3 slabs.  If those were decreased, the free slowpath would become even worse, and if those were increased, the alloc slowpath would become even worse.
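
For a sense of scale, the sketch below works out how many objects fit in one slab at the orders mentioned above. It assumes 4 KiB pages and ignores any per-object metadata or debug padding, so treat the results as rough upper bounds rather than what a particular kernel reports.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def slab_bytes(order):
    """An order-N slab spans 2**N contiguous pages."""
    return PAGE_SIZE << order


def objects_per_slab(object_size, order):
    """Upper bound on objects per slab, ignoring metadata and padding."""
    return slab_bytes(order) // object_size


if __name__ == "__main__":
    # Orders as described in the email for a default configuration.
    for name, size, order in (("kmalloc-256", 256, 1),
                              ("kmalloc-1k", 1024, 3),
                              ("kmalloc-4k", 4096, 3)):
        print(name, "order", order, "->",
              objects_per_slab(size, order), "objects per slab")
```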

I could probably get better numbers for 160 threads here if I let the free slowpath fall off the charts for kmalloc-256 and kmalloc-4k (which wouldn't be that bad, they're used 99.9% of the time) and make it much easier for the alloc slowpath to allocate order-0 slabs.  It depends on how often we free to a partial slab, but it's a pointless exercise since users won't tune their slab allocator settings for specific caches or each workload.

With regard to kmalloc-256 and kmalloc-4k on the 16 thread experiment, the lion's share of the allocations and free fastpath usage comes on the cpu taking the networking irq, whereas for kmalloc-1k, the lion's share of free slowpath usage comes from that cpu.
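
A per-cpu breakdown like this can be read out of the slub statistics files. The sketch below assumes that, with CONFIG_SLUB_STATS enabled, a file such as /sys/kernel/slab/kmalloc-256/free_fastpath holds a total followed by per-cpu fields of the form `C<cpu>=<count>`; that layout is an assumption here and should be checked against the running kernel. The function names are made up for illustration.

```python
import re


def parse_stat(text):
    """Parse 'TOTAL C0=n C1=n ...' into (total, {cpu: count}).

    The field layout is an assumption about the slub stat files,
    not something taken from the email.
    """
    fields = text.split()
    total = int(fields[0])
    per_cpu = {}
    for field in fields[1:]:
        match = re.fullmatch(r"C(\d+)=(\d+)", field)
        if match:
            per_cpu[int(match.group(1))] = int(match.group(2))
    return total, per_cpu


def dominant_cpu(per_cpu):
    """Return (cpu, share) for the cpu with the largest count."""
    cpu = max(per_cpu, key=per_cpu.get)
    return cpu, per_cpu[cpu] / sum(per_cpu.values())


if __name__ == "__main__":
    # Made-up sample line, not data from the thread.
    total, per_cpu = parse_stat("100 C0=70 C1=20 C3=10")
    print(total, per_cpu, dominant_cpu(per_cpu))
```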

> As I said,
> the problem you mentioned looks like a *scaling issue* to me which is
> actually somewhat surprising. I knew that the slowpaths were slow but
> I haven't seen this sort of data before.
>

Well, shoot, I wrote a patchset for it and presented similar data two years ago: https://lkml.org/lkml/2009/3/30/14 (back then, kmalloc-2k was part of the culprit and now it's kmalloc-4k).

Although I agree that we don't want to rely on the heuristics that I created in that patchset for things like partial list ordering, and it's probably not great to have an increment on a kmem_cache_cpu variable in the allocation fastpath, I still strongly advocate for some logic that only picks off a partial slab, while holding the per-node list_lock, when it has a certain threshold of free objects; otherwise we keep pulling a partial slab that may have one object free and performance suffers.  That logic is part of the patchset that I proposed back then and it helped performance, but that still comes at the cost of increased memory because we'd be allocating new slabs (and potentially order-3 as seen above) instead of utilizing sufficient partial slabs when the number of object allocations is low.
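
The threshold idea can be illustrated with a toy model. This is not the code from the 2009 patchset and not kernel code: slabs are plain dicts, locking is omitted, and the names `pick_partial`, `refill` and `min_free` are hypothetical. It only shows the trade-off described above: skipping nearly-full partial slabs avoids repeated slowpath trips, at the price of allocating a fresh slab.

```python
def pick_partial(partial_list, min_free):
    """Remove and return the first partial slab with at least
    min_free free objects, or None if no slab qualifies."""
    for index, slab in enumerate(partial_list):
        if slab["free"] >= min_free:
            return partial_list.pop(index)
    return None


def refill(partial_list, min_free, objects_per_slab):
    """Toy slowpath refill: prefer a sufficiently empty partial slab,
    otherwise fall back to a brand-new slab (which costs memory)."""
    slab = pick_partial(partial_list, min_free)
    if slab is None:
        slab = {"free": objects_per_slab, "new": True}
    return slab


if __name__ == "__main__":
    partial = [{"free": 1}, {"free": 12}]
    print(refill(partial, min_free=8, objects_per_slab=32))  # the 12-free slab
    print(refill(partial, min_free=8, objects_per_slab=32))  # a new slab
    print(partial)  # the 1-free slab is left on the list
```

In this model the slab with a single free object stays on the partial list untouched, which is where the extra memory use comes from.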

I'm thinking this is part of the reason that Nick really advocated for optimizing for frees on remote cpus in slqb as a fundamental principle of the allocator's design.

> I snipped the 'SLUB can never compete with SLAB' part because I'm
> frankly more interested in raw data I can analyse myself. I'm hoping
> to get the per-CPU partial list patch queued for v3.2 soon and I'd be
> interested to know how much I can expect that to help.
>


See my comment about having no doubt that you can improve performance of slub by throwing more memory in its direction; that is part of what the per-cpu partial list patchset does.

[Note: giving slub more memory can improve its performance.]

 

Christoph posted it as an RFC and listed a few significant disadvantages to that approach, but I'm still happy to review it and see what can come of it.

From what I remember, though, each per-cpu partial list had a min_partial of half of what it currently is per-node.  The testing environment I've been using here was stated to be two 16-core, 4-node systems for the netperf client and server.  kmalloc-256 currently has a min_partial of 8, and both kmalloc-1k and kmalloc-4k have a min_partial of 10 under the current design of per-node partial lists, so that means we keep at minimum (absent kmem_cache_shrink() or reclaim) 8*4 kmalloc-256, 10*4 kmalloc-1k, and 10*4 kmalloc-4k empty slabs on the partial lists for later use on each of these systems.  With the per-cpu partial lists the way I remember it, that would become 4*16 kmalloc-256, 5*16 kmalloc-1k, and 5*16 kmalloc-4k empty slabs on the partial lists.  So now we've doubled the amount of memory we've reserved for the partial lists, so yeah, I'd expect better performance as a result of using (4*16 - 8*4) more order-1 slabs and 2 * (5*16 - 10*4) more order-3 slabs, about 700 pages for just those two caches systemwide.
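
The arithmetic behind the "about 700 pages" figure can be checked with a few lines. The node and cpu counts, slab orders, and min_partial values are the ones given in this email (the per-cpu values being the author's recollection of the RFC); the dictionary layout and function names are just for illustration. Summed over kmalloc-256, kmalloc-1k and kmalloc-4k, the result is 704 pages.

```python
NODES = 4   # per system, as stated in the email
CPUS = 16   # per system, as stated in the email

# order, per-node min_partial (current), per-cpu min_partial (as recalled)
CACHES = {
    "kmalloc-256": {"order": 1, "per_node": 8, "per_cpu": 4},
    "kmalloc-1k": {"order": 3, "per_node": 10, "per_cpu": 5},
    "kmalloc-4k": {"order": 3, "per_node": 10, "per_cpu": 5},
}


def extra_slabs(cache):
    """Additional empty slabs kept with per-cpu instead of per-node lists."""
    return cache["per_cpu"] * CPUS - cache["per_node"] * NODES


def extra_pages(cache):
    """Same difference expressed in pages (an order-N slab is 2**N pages)."""
    return extra_slabs(cache) << cache["order"]


if __name__ == "__main__":
    for name, cache in CACHES.items():
        print(name, extra_slabs(cache), "slabs,", extra_pages(cache), "pages")
    print("total pages:", sum(extra_pages(c) for c in CACHES.values()))
```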
