哈希算法实现在海量数据里提取重复值

本文分享了一次使用哈希算法优化Excel数据处理流程的经历,从普通的查找算法耗时3小时到改进后的算法仅用3秒处理75420条数据中的重复值,效率显著提升。通过对比V4.0与V5.0版本的代码,展示了哈希算法在查找重复数据时的高效应用,以及如何在实际工作中优化数据处理过程。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

 

那天给兰州石化矿区服务事业部的领导处理Excel表格!发现在35400行数据中提取重复值时,竟然耗时3小时,当时使用的是普通算法进行查找的!严重的时间消耗让我不得不修改提取重复值的程序了!之后第二天,我详细的研究了一下。突然,想起在大学时,主修计算机科学与技术时的数据结构算法-哈希算法!然后,重新开始复习功课......不久,我搞出了这个算法,真感叹这优秀的算法的效率!在P4CPU+1GB的内存的硬件支持下,查找并提取75420条数据中的重复值,耗时才3秒!各位请注意了,是3秒呀!哈哈!速度是超级超级的快!眨眼间,将所有的重复值给提取出来了!v5.0与v4.0相比,耗时几乎可以忽略不计!下面将该段VBA代码公布出来,与大家一起研究!
'*************************************************************
'             哈希算法查找资产号重复值程序 v5.0
'使用普通的算法,对每个值进行向下遍历,虽然效率上已经提高了一倍,
'但是在运算时间上任然耗时非常严重,19840条数据竟然耗时2小时40分
'钟,效率严重低下,使用哈希算法后耗时骤然下降,效率非常的高!查找
'65535行数据,耗时不到5秒钟,对于在海量的数据里提取重复值非常有用
'*************************************************************
Option Explicit
Private Type HashItem
     Value As String               '记录重复的值
     Count As Long                 '记录重复的次数
     Content(1 To 10) As String     '记录重复值出现的位置
End Type
Const Table_Len = 1000000         '哈希表的长度
Public Sub FindRepeate()             '查找资产号重复值
Dim HashTable(1 To Table_Len) As HashItem
Dim i As Long, j As Long, k As Long, Key As Long, oi As Long
Dim maintemp As String, wi As Long, ri As Long
Dim AssetColName As String, ws As Worksheet, arrinfo() As String, fieldname() As String
'****************************************************
'初始化系统参数
oi = 1                         '标记结果工作表中初始的写入行号
AssetColName = "E"     '标识资产号所在的列的列名,如果档案的格式不同,可以修改该值
'****************************************************
'初始化重复值结果工作表开始
Debug.Print "开始时间:" & Now
For i = 1 To Worksheets.Count   '先判断"重复值结果"工作表是否存在,如果存在则将句柄传给ws,并初始化
     If Worksheets(i).Name = "重复值结果" Then
         Set ws = Worksheets(i)
         ws.Move After:=Worksheets(Worksheets.Count)
         ws.Cells.Select
         Selection.Delete Shift:=xlUp
         GoTo toBegin
     End If
Next i
Set ws = Worksheets.Add(After:=Worksheets(Worksheets.Count))
ws.Name = "重复值结果"
toBegin:
ws.Activate
ws.Range("A" & oi & ":O" & oi).Select
ws.Range("A" & oi & ":O" & oi).Merge
ws.Range("A" & oi & ":O" & oi).Value = "表计资产号重复数据列表"
oi = oi + 2
ws.Cells.Select
Selection.NumberFormatLocal = "@"
fieldname = Split("序号,用户编号,用户名,门牌号码,表计资产号,计量点编号,表计类型,集中器号,采集器号,总配置,分配置,通讯相位,AIMS源,描述,备注", ",")
For i = 1 To UBound(fieldname) + 1
     ws.Cells(2, i) = fieldname(i - 1)
Next i
ws.Cells(1, 1).Select
'初始化重复值结果工作表结束
'****************************************************
'开始查找重复值
For i = 1 To Worksheets.Count - 1
     For j = 2 To Worksheets(i).UsedRange.Rows.Count
         maintemp = Replace(Worksheets(i).Range(AssetColName & j).Value, " ", "")
         If IsIllegal(maintemp) Then GoTo toNextRow
         Key = GetHashKey(maintemp)
         If HashTable(Key).Value = "" Then
             HashTable(Key).Value = maintemp
             HashTable(Key).Count = 1
             HashTable(Key).Content(1) = i & "," & j
         ElseIf HashTable(Key).Value = maintemp Then
             HashTable(Key).Count = HashTable(Key).Count + 1
             HashTable(Key).Content(HashTable(Key).Count) = i & "," & j
         Else
             For k = Key + 1 To Table_Len
                 If HashTable(k).Value = "" Then
                     HashTable(k).Value = maintemp
                     HashTable(k).Count = 1
                     HashTable(k).Content(1) = i & "," & j
                     Exit For
                 ElseIf HashTable(k).Value = maintemp Then
                     HashTable(k).Count = HashTable(k).Count + 1
                     HashTable(k).Content(HashTable(k).Count) = i & "," & j
                     Exit For
                 End If
             Next k
         End If
toNextRow:
     Next j
Next i
'查找正式结束
'开始统计重复值,并进行列表显示
ws.Activate
For i = 1 To Table_Len
     If HashTable(i).Count >= 2 Then
         For j = 1 To HashTable(i).Count
             Erase arrinfo
             arrinfo = Split(HashTable(i).Content(j), ",")
             wi = CLng(arrinfo(0)): ri = CLng(arrinfo(1))
             ws.Range("A" & oi & ":L" & oi).Value = Worksheets(wi).Range("A" & ri & ":L" & ri).Value
             ws.Range("M" & oi).Value = Worksheets(wi).Name & "表中第" & ri & "行"
             ws.Range(AssetColName & oi).Select
             Selection.Interior.ColorIndex = 3
             oi = oi + 1
         Next j
         ws.Range("N" & oi - HashTable(i).Count & ":N" & oi - 1).Select
         Selection.Merge
         Selection.Value = HashTable(i).Count & "行数据资产号重复"
         oi = oi + 1
     End If
Next i
'对显示出来的信息进行格式处理
ws.Activate
If oi = 3 Then
     ws.Range("A" & oi & ":O" & oi).Select
     ws.Range("A" & oi & ":O" & oi).Merge
     ws.Range("A" & oi & ":O" & oi).Value = "目前档案中暂未发现表计资产号重复数据-运气不错"
     ws.Range("A1:O" & oi).Select
     GoTo toEnd
End If
ws.Range("A1:O" & oi - 2).Select
toEnd:
With Selection
     .HorizontalAlignment = xlCenter
     .VerticalAlignment = xlCenter
     .Borders(xlEdgeLeft).LineStyle = xlContinuous
     .Borders(xlEdgeTop).LineStyle = xlContinuous
     .Borders(xlEdgeRight).LineStyle = xlContinuous
     .Borders(xlEdgeBottom).LineStyle = xlContinuous
     .Borders(xlInsideVertical).LineStyle = xlContinuous
     .Borders(xlInsideHorizontal).LineStyle = xlContinuous
End With
ws.Columns("A:O").Select
ws.Columns("A:O").EntireColumn.AutoFit
ws.Range("A1:O1").Select
Set ws = Nothing   '释放ws句柄
Debug.Print "结束时间:" & Now
'统计正式结束,任务完成
End Sub
Private Function GetHashKey(ByVal s As String) As Long
Dim i As Byte, d As Double
For i = 1 To Len(s)
     d = d - (Asc(Right(s, i)) + i) ^ 2
Next
GetHashKey = Int(Table_Len * Rnd(d) + 1)
End Function
Private Function IsIllegal(strIllegal As String) As Boolean     '检验字符串是否为非法列表中的一项
If strIllegal = "" Then IsIllegal = True: Exit Function
If Asc(Left(strIllegal, 1)) < 0 Or Asc(Left(strIllegal, 1)) > 255 Then
IsIllegal = True
Exit Function
End If
End Function

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值