Removing Duplicate Values from Data

This article shows how to handle duplicate values in a dataset with Python's pandas library, covering how to find, flag, and delete duplicate records. The techniques apply to the data cleaning and preprocessing stage.

Checking for duplicate values

dataframe.duplicated()
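duplicated() returns a boolean Series: a row is marked True when it is identical to a row that appeared earlier. A minimal sketch on made-up data (keep is a standard pandas argument controlling which occurrence counts as the original):

import pandas as pd

df = pd.DataFrame({'start_id': [55, 55, 55], 'end_id': [55, 55, 60]})
print(df.duplicated())
# 0    False
# 1     True
# 2    False
print(df.duplicated(keep='last'))   # mark earlier repeats instead of later ones
# 0     True
# 1    False
# 2    False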

Removing duplicate values

dataframe.drop_duplicates()
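drop_duplicates() removes the rows that duplicated() would mark as True and returns the de-duplicated DataFrame. A minimal sketch, continuing with the made-up df from the sketch above (subset and keep are standard pandas arguments):

print(df.drop_duplicates())
#    start_id  end_id
# 0        55      55
# 2        55      60
print(df.drop_duplicates(subset=['start_id']))   # judge duplicates by start_id only
#    start_id  end_id
# 0        55      55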

trips1.csv

start_id,end_id,start_date
55,55,'8/29/2013 14:13'
55,55,'8/29/2013 14:13'
55,55,'8/29/2013 14:13'
55,55,'8/29/2013 14:13'
55,55,'8/29/2013 14:13'
55,55,'8/29/2013 14:13'

trips2.csv

start_id,end_id,start_date
55,55,'8/29/2013 15:13'
55,55,'8/29/2013 16:13'
55,55,'8/29/2013 17:13'
55,55,'8/29/2013 18:13'
55,55,'8/29/2013 19:13'
55,55,'8/29/2013 20:13'

trips3.csv

start_id,end_id,start_date
55,55,'8/29/2013 14:13'
55,55,'8/29/2013 14:14'
55,55,'8/29/2013 14:15'
55,55,'8/29/2013 14:16'
55,55,'8/29/2013 14:17'
55,55,'8/29/2013 14:18'

The code is as follows:

import pandas as pd
stations = pd.read_csv('stations.csv',encoding='utf-8')
print(stations.head())
#    id        name     adderss
# 0  55  '#GuoMao#'   'Beijing'
# 1  55     '#SUN#'  'ShangHai'
# 2  55    '#Park#'   'Beijing'
# 3  55   '#Light#'  'Shanghai'
# 4  55    '#Dark#'  'ShanDong'

trips1 = pd.read_csv('trips1.csv',encoding='utf-8')
print(trips1.shape)
# (6, 3)

trips2 = pd.read_csv('trips2.csv',encoding='utf-8')
print(trips2.shape)
# (6, 3)

trips3 = pd.read_csv('trips3.csv',encoding='utf-8')
print(trips3.shape)
# (6, 3)

# Combine trips1, trips2 and trips3 into a single DataFrame named trips
trips = pd.concat([trips1,trips2,trips3])
print(trips.shape)
# (18, 3)
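pd.concat keeps each frame's original index, which is why 0 to 5 appears three times in trips below; if a continuous 0 to 17 index is preferred, the standard ignore_index=True option resets it:

# Optional: rebuild a continuous RangeIndex while concatenating
# trips = pd.concat([trips1, trips2, trips3], ignore_index=True)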

# Prefix every column name in stations with 'start_', and set start_id as the column index
stations.columns = stations.columns.map(lambda x:'start_'+x)
# With inplace=True, the original DataFrame is modified in place
# stations.set_index(['start_id'],inplace=True)
print('stations----')
print(stations)
#    start_id  start_name start_adderss
# 0        55  '#GuoMao#'     'Beijing'
# 1        55     '#SUN#'    'ShangHai'
# 2        55    '#Park#'     'Beijing'
# 3        55   '#Light#'    'Shanghai'
# 4        55    '#Dark#'    'ShanDong'
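As a side note, pandas also provides DataFrame.add_prefix, which produces the same renaming without a lambda; a minimal equivalent sketch (kept commented out so the script above is unchanged):

# Equivalent to the columns.map(lambda x: 'start_' + x) call above
# stations = stations.add_prefix('start_')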

print('-----trips-----')
print(trips)
#    start_id  end_id         start_date
# 0        55      55  '8/29/2013 14:13'
# 1        55      55  '8/29/2013 14:13'
# 2        55      55  '8/29/2013 14:13'
# 3        55      55  '8/29/2013 14:13'
# 4        55      55  '8/29/2013 14:13'
# 5        55      55  '8/29/2013 14:13'
# 0        55      55  '8/29/2013 15:13'
# 1        55      55  '8/29/2013 16:13'
# 2        55      55  '8/29/2013 17:13'
# 3        55      55  '8/29/2013 18:13'
# 4        55      55  '8/29/2013 19:13'
# 5        55      55  '8/29/2013 20:13'
# 0        55      55  '8/29/2013 14:13'
# 1        55      55  '8/29/2013 14:14'
# 2        55      55  '8/29/2013 14:15'
# 3        55      55  '8/29/2013 14:16'
# 4        55      55  '8/29/2013 14:17'
# 5        55      55  '8/29/2013 14:18'

# Match trips with stations on the start station id (start_id) and merge,
# keeping all successfully matched records
print('----after merge----')
trips_stations = trips.merge(stations, on='start_id')
print(trips_stations)
# In addition to the trip columns shown above, the merged DataFrame carries the
# start_name and start_adderss columns pulled in from stations
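merge defaults to an inner join, so only start_id values present in both frames survive; the standard how argument selects the join type. A sketch (trips_left is just an illustrative name):

# Keep every trip, even one whose start_id has no match in stations
trips_left = trips.merge(stations, on='start_id', how='left')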


# Export trips_stations to a file named 'trips_stations.csv'
trips_stations.to_csv('trips_stations.csv')
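By default to_csv also writes the DataFrame index as an unnamed first column; the standard index=False option leaves it out. A sketch (the file name here is just for illustration):

# Export without the index column
trips_stations.to_csv('trips_stations_noindex.csv', index=False)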

# duplicated() checks the entire row; a row is True only when all of its values match an earlier row
print(trips_stations.duplicated())

# Check for duplicates in a single column
print(trips_stations.duplicated('start_id'))
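duplicated also accepts a list of column names, so duplicates can be judged on a combination of fields; a sketch:

# Rows count as duplicates when both start_id and end_id repeat
print(trips_stations.duplicated(['start_id', 'end_id']))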

# Check whether trips_stations contains duplicate rows, then drop them
trips_stations_dup = trips_stations.duplicated()
print(trips_stations[trips_stations_dup])
print('-'*50)
trips_stations = trips_stations.drop_duplicates()
print(trips_stations)
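To verify the result, the boolean Series from duplicated() can be summed, or the shape compared before and after; a sketch:

print(trips_stations.duplicated().sum())   # remaining duplicate rows, expected 0
print(trips_stations.shape)                # row count after dropping duplicates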

Exercise

Given a DataFrame that contains duplicate rows, the code that successfully removes the duplicates is:

dataframe.drop_duplicates(inplace = True)
or
dataframe = dataframe.drop_duplicates()
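Both lines work: drop_duplicates(inplace=True) modifies the DataFrame in place, while the plain call returns a new DataFrame that must be assigned back; calling it without either has no lasting effect. A small sketch on made-up data:

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2]})
df.drop_duplicates()               # return value discarded; df still has 3 rows
df.drop_duplicates(inplace=True)   # df itself is modified; now 2 rows

df2 = pd.DataFrame({'a': [1, 1, 2]})
df2 = df2.drop_duplicates()        # reassignment; df2 also ends up with 2 rows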

 
