Removing rows that have duplicates

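This article describes how to process a CSV file with Python: rows that repeat on specific columns are removed so that only unique data remains, producing the expected output CSV with no repeated title1, title2 and title3 combinations. Two approaches are shown: one reads and writes the file directly, and the other uses sets for efficient duplicate detection.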

【Question】

I have a CSV file to which both duplicate and unique data are added on a daily basis, so it accumulates too many duplicates. I have to remove the duplicates based on specific columns. For example:

csvfile1:

title1	title2	title3	title4	title5
abcdef 	12	13	14	15
jklmn 	12	13	56	76
abcdef 	12	13	98	89
bvnjkl 	56	76	86	96

Now, based on title1, title2 and title3, I have to remove the duplicates and write the unique entries to a new CSV file. As you can see, the abcdef row is not unique: it repeats on title1, title2 and title3, so it should be removed, and the output should look like:

Expected Output CSV File:

title1 title2 title3 title4 title5
jklmn  12     13     56     76
bvnjkl 56     76     86     96

My attempted code is below.

CSV input file:

import csv

f = open("1.csv", 'a+')
writer = csv.writer(f)
writer.writerow(("t1", "t2", "t3"))
a =[["a", 'b', 'c'], ["g", "h", "i"],['a','b','c']] #This list is changed daily so new and duplicates data get added daily

for row in a:  # write every row, including the duplicate
    writer.writerow(row)
f.close()

Duplicate removal script:

import csv
with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen: continue # skip duplicate

        seen.add(line)
        out_file.write(line)

My Output: 2.csv:

t1 t2 t3
a  b  c
g  h  i

Now, I do not want a b c in 2.csv at all: based on t1 and t2, only the unique row g h i should remain.

Someone posted the following solution, but the original poster said they could not understand it:

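import csv
with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set()       # (t1, t2) keys that have appeared at least once
    seentwice = set()  # (t1, t2) keys that have appeared more than once
    reader = csv.reader(in_file)
    writer = csv.writer(out_file)
    rows = []
    # First pass: record every key and note which ones repeat.
    for row in reader:
        if (row[0],row[1]) in seen:
            seentwice.add((row[0],row[1]))
        seen.add((row[0],row[1]))
        rows.append(row)
    # Second pass: write only the rows whose key never repeated.
    for row in rows:
        if (row[0],row[1]) not in seentwice:
            writer.writerow(row)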
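
To make the two-set logic easier to follow, here is a minimal in-memory sketch of the same idea applied to the sample rows from the question. The function name `rows_with_unique_key` and the `key_len` parameter are illustrative assumptions, not part of the posted solution.

```python
def rows_with_unique_key(rows, key_len=2):
    """Return only the rows whose first `key_len` fields occur exactly once.

    Hypothetical helper for illustration; it mirrors the seen/seentwice idea.
    """
    seen = set()        # keys met at least once
    seen_twice = set()  # keys met more than once
    for row in rows:
        key = tuple(row[:key_len])
        if key in seen:
            seen_twice.add(key)
        seen.add(key)
    # Second pass: keep a row only if its key never repeated.
    return [row for row in rows if tuple(row[:key_len]) not in seen_twice]


sample = [["a", "b", "c"], ["g", "h", "i"], ["a", "b", "c"]]
print(rows_with_unique_key(sample))  # [['g', 'h', 'i']]
```

The first loop cannot decide on its own which rows to keep, because a key only turns out to be a duplicate when its second occurrence arrives. That is why the rows are buffered and filtered in a second pass.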

【Answer】

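Group the rows by the first three fields, select the groups that contain exactly one member, then concatenate the records of those groups. Unless there are special requirements, this kind of structured computation is much simpler and easier to follow when written in SPL: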

	A
1	=file("d:\\source.csv").import@t()
2	=A1.group(title1,title2,title3).select(~.len()==1).conj()
3	=file("d:\\result.csv").export@c(A2)

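A1: Read the contents of the file source.csv.

A2: Group by the first three fields, select the groups whose member count equals 1, then concatenate the records of those groups.

A3: Write the result of A2 to the file result.csv.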
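
For readers who want to stay in Python, the same group, select and concatenate steps can be sketched with the standard library. This is only an illustration under stated assumptions: the helper name `keep_unique_groups` and its parameters are hypothetical, and the input file is assumed to have a header row and to fit in memory.

```python
import csv
from collections import Counter


def keep_unique_groups(src_path, dst_path, key_columns, delimiter=","):
    """Copy the rows whose key_columns combination occurs exactly once.

    Hypothetical helper; assumes the source file starts with a header row.
    """
    with open(src_path, newline="") as src:
        reader = csv.DictReader(src, delimiter=delimiter)
        fieldnames = reader.fieldnames
        rows = list(reader)

    # "group" step: count how many rows share each key combination.
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)

    with open(dst_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames, delimiter=delimiter)
        writer.writeheader()
        for row in rows:
            # "select" step: keep only members of groups of size 1.
            if counts[tuple(row[c] for c in key_columns)] == 1:
                writer.writerow(row)
```

A call such as `keep_unique_groups("source.csv", "result.csv", ["title1", "title2", "title3"])` would correspond to cells A1 to A3 above. For a tab-separated file like the sample in the question, pass `delimiter="\t"`.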
