Python list de-duplication methods

This article presents several ways to remove duplicate elements from a list in Python and compares their performance in a quick benchmark, covering both order-preserving and non-order-preserving approaches as well as methods that work on lists of objects.

Original article: http://www.peterbe.com/plog/uniqifiers-benchmark

Suppose you have a list in Python that looks like this:

['a','b','a']
# or like this:
[1,2,2,2,3,4,5,6,6,6,6]

and you want to remove all duplicates so you get this result:

['a','b']
# or
[1,2,3,4,5,6]

How do you do that? ...the fastest way? I wrote a couple of alternative implementations and ran a quick benchmark loop over them to find out which way was the fastest. (I haven't looked at memory usage.) The slowest function was 78 times slower than the fastest one.

However, there's one very important difference between the various functions: some are order preserving and some are not. In an order-preserving function, apart from the removed duplicates, the order is guaranteed to be the same as in the input. E.g., uniqify([1,2,2,3]) == [1,2,3]
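To see the difference in action, a plain set-based approach may scramble the input. The exact output below depends on hash order, so it's only illustrative:

```python
>>> list(set(['b', 'a', 'b', 'c']))   # order depends on hashing
['a', 'c', 'b']
>>> # an order-preserving uniqify would return ['b', 'a', 'c']
```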

Here are the functions:

def f1(seq):
   # not order preserving
   # (the Python 2 original was map(d.__setitem__, seq, []), which
   # relied on map() padding the shorter argument with None; in
   # Python 3 the lazy map must be consumed and the lengths must match)
   d = {}
   list(map(d.__setitem__, seq, [None] * len(seq)))
   return list(d.keys())

def f2(seq): 
   # order preserving
   checked = []
   for e in seq:
       if e not in checked:
           checked.append(e)
   return checked

def f3(seq):
   # Not order preserving
   keys = {}
   for e in seq:
       keys[e] = 1
   return list(keys.keys())

def f4(seq): 
   # order preserving
   noDupes = []
   [noDupes.append(i) for i in seq if not noDupes.count(i)]
   return noDupes

def f5(seq, idfun=None): 
   # order preserving
   if idfun is None:
       def idfun(x): return x
   seen = {}
   result = []
   for item in seq:
       marker = idfun(item)
       # in old Python versions:
       # if seen.has_key(marker)
       # but in new ones:
       if marker in seen: continue
       seen[marker] = 1
       result.append(item)
   return result

def f6(seq):
   # Not order preserving
   # (the original used Set from the old "sets" module; the built-in
   # set type replaces it in modern Python)
   return list(set(seq))

And what you've all been waiting for (if you're still reading). Here are the results:

* f2 13.24
* f4 11.73
* f5 0.37
f1 0.18
f3 0.17
f6 0.19

(* order preserving)

Clearly f5 is the "best" solution. Not only is it really really fast, it's also order preserving and supports an optional transform function, which makes it possible to do this:

>>> a=list('ABeeE')
>>> f5(a)
['A','B','e','E']
>>> f5(a, lambda x: x.lower())
['A','B','e'] 
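The transform function also makes it possible to uniqify a list of objects by some key. Here is a sketch of that idea; the `Person` class and its `name` attribute are invented for illustration and are not part of the original post:

```python
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

people = [Person('Ann', 30), Person('Bob', 25), Person('Ann', 41)]

# dedupe by name: only the first Person seen per name survives
unique_people = f5(people, lambda p: p.name)
print([p.name for p in unique_people])  # ['Ann', 'Bob']
```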

Download the benchmark script here
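The script itself isn't reproduced here, but a loop along these lines (a sketch using the standard timeit module; the test data and repeat count are my own guesses, not the original's) produces comparable relative numbers:

```python
import timeit

testdata = [1, 2, 2, 2, 3, 4, 5, 6, 6, 6, 6] * 50

# time each candidate function over the same input
for fn in (f1, f2, f3, f4, f5, f6):
    t = timeit.timeit(lambda: fn(testdata), number=1000)
    print(f"{fn.__name__} {t:.2f}")
```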

UPDATE

From the comments I've now added a couple more functions to the benchmark. Some of them can't uniqify a list of objects that aren't hashable unless they're passed a special hashing method. To see all the functions, download the file.

Here are the new results:

* f5 10.1
* f5b 9.99
* f8 6.49
* f10 6.57
* f11 6.6
f1 4.28
f3 3.55
f6 4.03
f7 2.59
f9 2.58

(f2 and f4 were too slow for this test data.)
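On the hashing-method point: items that aren't hashable, such as lists, can still be fed to f5 as long as the transform function maps each item to something hashable. A minimal sketch (the tuple conversion is my own choice, not taken from the benchmark file):

```python
data = [[1, 2], [3, 4], [1, 2]]

# lists can't be dict keys, so hash a tuple copy of each one instead
unique = f5(data, lambda x: tuple(x))
print(unique)  # [[1, 2], [3, 4]]
```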

### Removing duplicates from a Python list

Given a list that may contain duplicate items, there are several ways to remove them. A common approach is to use a set (`set`), since sets cannot contain duplicate members.

```python
def remove_duplicates_with_set(lst):
    return list(set(lst))
```

The function above removes all duplicate entries by converting the list into a set. However, it does not preserve the order of the elements in the original list. If keeping the original order matters, consider the following instead:

```python
def remove_duplicates_preserve_order(lst):
    unique_lst = []
    for item in lst:
        if item not in unique_lst:
            unique_lst.append(item)
    return unique_lst
```

This code iterates over every item in the input list and appends it to the newly created result list only if it is not already present, thereby removing redundancy while preserving the original sequence.

Additionally, in Python 3.7 and later, a dictionary (`dict`) can do this job as well, thanks to its order-preserving internal implementation:

```python
def remove_duplicates_use_dict(lst):
    return list(dict.fromkeys(lst))
```

This builds a temporary dictionary via `dict.fromkeys()`, which automatically ignores repeated keys, and then converts that dictionary back into a list to get the desired result.

### Suggested exercises

To get a better grasp of these data structures and the differences between the algorithms, here are a few exercise ideas (a sketch of the first one follows the list):

- Write a function that accepts an integer array of arbitrary length and returns a new array, sorted in ascending order, with duplicates removed;
- Write logic that checks whether two strings consist of exactly the same characters, ignoring case (the anagram problem);
- Implement a simple text analyzer that counts the most frequent words in a document while excluding stop words.
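One possible solution to the first exercise combines the set-based approach with `sorted()`; the function name `unique_sorted` is invented for illustration:

```python
def unique_sorted(numbers):
    # set() drops duplicates; sorted() returns a new ascending list
    return sorted(set(numbers))

print(unique_sorted([5, 3, 5, 1, 3, 2]))  # [1, 2, 3, 5]
```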