List集合去重

最新推荐文章于 2024-12-26 17:44:18 发布

weixin_30404405

最新推荐文章于 2024-12-26 17:44:18 发布

阅读量74

点赞数

CC 4.0 BY-SA版权

文章标签：数据结构与算法

原文链接：http://www.cnblogs.com/zrhai/p/3783796.html

本文详细解析了使用Linq的Distinct方法进行集合去重时的注意事项，包括自定义比较器的实现、匿名类型的应用以及基于哈希表的去重优化。通过对比不同方法的性能表现，阐述了如何有效提升去重效率。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

今天看到别人的代码使用 Linq 的Distinct进行去重。发现有很多需要注意的地方。

实现方式如下：

Linq 的Distinct方法需要传递一个自己实现IEqualityComparer<TSource>的类来作为比较器。

 
    public static class EnumerableExtension
    {
        public static IEnumerable<T> Distinct<T, V>(this IEnumerable<T> source, Func<T, V> predicate)
        {
            return source.Distinct(new CommonEqualityComparer<T, V>(predicate));
        }

        public static IEnumerable<T> Distinct<T, V>(this IEnumerable<T> source, Func<T, V> predicate, IEqualityComparer<V> comparer)
        {
            return source.Distinct(new CommonEqualityComparer<T, V>(predicate, comparer));
        }
    }

    public class CommonEqualityComparer<T, V> : IEqualityComparer<T>
    {
        private Func<T, V> keySelector;
        private IEqualityComparer<V> comparer;

        public CommonEqualityComparer(Func<T, V> keySelector, IEqualityComparer<V> comparer)
        {
            this.keySelector = keySelector;
            this.comparer = comparer;
        }

        public CommonEqualityComparer(Func<T, V> keySelector)
            : this(keySelector, EqualityComparer<V>.Default)
        { }

        public bool Equals(T x, T y)
        {
            return comparer.Equals(keySelector(x), keySelector(y));
        }

        public int GetHashCode(T obj)
        {
            return comparer.GetHashCode(keySelector(obj));
        }
    }

这段代码对IEnumerable<T>进行了扩展，然后实现一个IEqualityComparer<T>接口的比较器，并增加了一个Func<T, V> keySelector 可以用来选择需要去重的字段。

当使用如下方法时，如果不实现 IEqualityComparer<V> comparer

比较器内部会使用EqualityComparer<V>.Default 提供的默认Equals()，GetHashCode()方法。

根据MSDN的解释，如果没有重写这两个方法，则将基于对象的引用来判断是否相等和取哈希值。

对于两个具有相同内容，但是不同地址引用的对象，Equals()将返回false，GetHashCode() 也不相等。这样就不能达到去重的目的。

public static IEnumerable<T> Distinct<T, V>(this IEnumerable<T> source, Func<T, V> predicate)
        {
            return source.Distinct(new CommonEqualityComparer<T, V>(predicate));
        }

下面是测试例子

定义两个实体

    class RoomEntity
    {
        public int Hotel { get; set; }
        public string Room { get; set; }
        public DateTime EffectDate { get; set; }
        public DateTime ChangeDate { get; set; }
        public object obj { get; set; }
    }

    class RoomEntity2
    {
        public int Hotel { get; set; }
        public string Room { get; set; }
        public DateTime EffectDate { get; set; }
    }

造三十万数据

        IList<RoomEntity> sourceList = new List<RoomEntity>();
            for (int i = 0; i < 300000; i++)
            {
                RoomEntity en = new RoomEntity();
                en.Hotel = i % 150000;
                en.Room = (i % 150000).ToString();
                en.EffectDate = DateTime.Today;
                en.ChangeDate = DateTime.Now;
                en.obj = new object();
                sourceList.Add(en);
            }

　　使用扩展方法去重

        Stopwatch watch = new System.Diagnostics.Stopwatch();
            List<RoomEntity2> result = new List<RoomEntity2>();
            //linq distinct
            watch.Start();
            var temp = sourceList.Distinct(p => new RoomEntity2{ Hotel = p.Hotel, Room = p.Room, EffectDate = p.EffectDate });
            foreach (var a in temp)
            {
                result.Add(new RoomEntity2
                {
                    Hotel = a.Hotel,
                    Room = a.Room,
                    EffectDate = a.EffectDate
                });
            }
            watch.Stop();
            Console.WriteLine("数量：" + result.Count + "耗时：" + watch.ElapsedMilliseconds);

　　返回结果数量：300000 耗时：193 并没有达到去重目的。

换一种写法，使用匿名类型

Stopwatch watch = new System.Diagnostics.Stopwatch();
            List<RoomEntity2> result = new List<RoomEntity2>();
            //linq distinct
            watch.Start();
            var temp = sourceList.Distinct(p => new { Hotel = p.Hotel, Room = p.Room, EffectDate = p.EffectDate });
            foreach (var a in temp)
            {
                result.Add(new RoomEntity2
                {
                    Hotel = a.Hotel,
                    Room = a.Room,
                    EffectDate = a.EffectDate
                });
            }
            watch.Stop();
            Console.WriteLine("数量：" + result.Count + "耗时：" + watch.ElapsedMilliseconds);

　　返回结果数量：150000 耗时：192 可以达到去重目的。

因为匿名类型上的 Equals 和 GetHashCode 方法是根据各属性的 Equals 和 GetHashcode 定义的，因此仅当同一匿名类型的两个实例的所有属性都相等时，这两个实例才相等。

int,string datetime 也都重写了Equals 和 GetHashCode 方法，所以能达到去重的目的。

但如果匿名类型又有一般引用类型则还是无法重。

  Stopwatch watch = new System.Diagnostics.Stopwatch();
            List<RoomEntity2> result = new List<RoomEntity2>();
            //linq distinct
            watch.Start();
            var temp = sourceList.Distinct(p => new { Hotel = p.Hotel, Room = p.Room, EffectDate = p.EffectDate,obj=p.obj });
            foreach (var a in temp)
            {
                result.Add(new RoomEntity2
                {
                    Hotel = a.Hotel,
                    Room = a.Room,
                    EffectDate = a.EffectDate
                });
            }
            watch.Stop();
            Console.WriteLine("数量：" + result.Count + "耗时：" + watch.ElapsedMilliseconds);

　　返回结果数量：300000 耗时：311 没有达到去重目的。

个人比较偏好，自己实现一个去重方法——基于哈希表来判断是否有重复数据。

    List<RoomEntity2> result2 = new List<RoomEntity2>();
            HashSet<string> hashSet = new HashSet<string>();
            foreach (var a in sourceList)
            {
                //StringBuilder sb = new StringBuilder();
                //sb.Append(a.Hotel);
                //sb.Append("_");
                //sb.Append(a.Room);
                //sb.Append("_");
                //sb.Append(a.EffectDate.ToString("yyyy-MM-dd"));//耗性能

                string key = a.Hotel + "^" + a.Room + "^" + a.EffectDate.ToString("yyyy-MM-dd");  //拼接耗性能
                if (hashSet.Add(key))
                {
                    result2.Add(new RoomEntity2
                    {
                        Hotel = a.Hotel,
                        Room = a.Room,
                        EffectDate = a.EffectDate
                    });
                }
            }
            watch.Stop();

            Console.WriteLine("数量：" + result2.Count + "耗时：" + watch.ElapsedMilliseconds);

　　返回结果数量：150000 耗时：525 性能会比LINQ 的差。经过对比，时间消耗主要是字符拼接上面。换成StringBuilder 也是一样。

于是试着改良一下，也使用匿名对象来进行哈希。

 List<RoomEntity2> result2 = new List<RoomEntity2>();
            HashSet<object> hashSet = new HashSet<object>();
            foreach (var a in sourceList)
            {
                var obj = new { a.Hotel, a.Room, a.EffectDate };//性能快
                //由于匿名类型上的 Equals 和 GetHashCode 方法是根据方法属性的 Equals 和 GetHashcode 定义的，
                //因此仅当同一匿名类型的两个实例的所有属性都相等时，这两个实例才相等。

                if (hashSet.Add(obj))
                {
                    result2.Add(new RoomEntity2
                    {
                        Hotel = a.Hotel,
                        Room = a.Room,
                        EffectDate = a.EffectDate
                    });
                }
            }
            watch.Stop();

            Console.WriteLine("数量：" + result2.Count + "耗时：" + watch.ElapsedMilliseconds);

　　返回结果数量：150000 耗时：140 性能会比LINQ 好。这种写法还是要注意的地方：匿名类型的字段中不能有引用类型。除非该引用类型实现了正确的Equals 和 GetHashCode 方法。

里面用到了哈希表，检索速度取决于为 TKey 指定的类型的哈希算法的质量。需要一个相等实现来确定键是否相等，可以使用一个接受 comparer 参数的构造函数来指定 IEqualityComparer 泛型接口的实现；

如果不指定实现，则使用默认的泛型相等比较器 EqualityComparer.Default。一般我们去重的字段都基于数值，字符和时间类型，默认的相等比较器应该能满足需求。

参考：

http://msdn.microsoft.com/zh-cn/library/xfhwa508%28v=vs.80%29.aspx

转载于:https://www.cnblogs.com/zrhai/p/3783796.html