pandas.DataFrame.duplicated
DataFrame.duplicated(subset=None, keep='first')
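Returns a boolean Series marking duplicate rows; True means the row is a duplicate.
Parameters
subset: column label or sequence of labels, optional
Restricts the duplicate check to the given column(s). By default all columns are used, so only fully identical rows count as duplicates.
keep: {'first', 'last', False}, default 'first'
'first': mark duplicates as True except for the first occurrence (the default).
'last': mark duplicates as True except for the last occurrence.
False: mark all duplicates as True.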
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})
df
     brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0
df.duplicated()
0 False
1 True
2 False
3 False
4 False
dtype: bool
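The default call above only exercises keep='first' on all columns. The following is a minimal sketch of the other options described in the parameter list, reusing the same sample data; the variable names (dup_last, dup_all, dup_brand) are illustrative and not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# keep='last': every duplicate except the last occurrence is flagged
dup_last = df.duplicated(keep='last')

# keep=False: every row that has at least one duplicate is flagged
dup_all = df.duplicated(keep=False)

# subset: only the 'brand' column is compared
dup_brand = df.duplicated(subset=['brand'])

print(dup_last.tolist())   # [True, False, False, False, False]
print(dup_all.tolist())    # [True, True, False, False, False]
print(dup_brand.tolist())  # [False, True, False, True, True]
```

Because the result is a boolean Series, it can be used directly as a mask, for example df[dup_all] to inspect every row involved in a duplicate.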
pandas.DataFrame.drop_duplicates
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
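Returns a DataFrame with duplicate rows removed. By default it keeps the first occurrence of each row, does not modify the original in place, and leaves the surviving rows with their original index labels rather than renumbering them.
Parameters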
- subset: column label or sequence of labels, optional
  Only consider certain columns for identifying duplicates; by default all columns are used.
- keep: {'first', 'last', False}, default 'first'
  Same as for pandas.DataFrame.duplicated().
- inplace: bool, default False
  Whether to drop duplicates in place or to return a copy.
- ignore_index: bool, default False
  If True, the resulting axis will be labeled 0, 1, …, n - 1. New in version 1.0.0.
Returns
- DataFrame or None
  DataFrame with duplicates removed, or None if inplace=True.
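As a rough illustration of how these parameters interact, the sketch below applies drop_duplicates to the same sample data used for duplicated() and looks at the resulting index labels. The variable names are illustrative, and ignore_index assumes pandas 1.0.0 or later, as noted above.

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

# Defaults: keep the first occurrence, return a copy, keep original labels
deduped = df.drop_duplicates()

# ignore_index=True relabels the result 0, 1, ..., n - 1
reindexed = df.drop_duplicates(ignore_index=True)

# Compare only brand and style, keeping the last row of each pair
last_per_pair = df.drop_duplicates(subset=['brand', 'style'], keep='last')

print(deduped.index.tolist())        # [0, 2, 3, 4]
print(reindexed.index.tolist())      # [0, 1, 2, 3]
print(last_per_pair.index.tolist())  # [1, 2, 4]
```

The original df is left untouched in all three calls, since inplace defaults to False.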
pandas.Series.value_counts
Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)
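Counts how many times each distinct value occurs. By default the result is sorted in descending order, so the most frequent value comes first, and NA values are excluded.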
import numpy as np
import pandas as pd
index = pd.Index([3, 1, 2, 3, 4, np.nan])
index.value_counts()
3.0 2
2.0 1
4.0 1
1.0 1
dtype: int64
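The example above uses only the defaults. Here is a small sketch of the normalize and dropna parameters from the signature, using a Series with the same values; the variable names (counts, shares, with_na) are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 1, 2, 3, 4, np.nan])

# Default: absolute counts, NaN excluded
counts = s.value_counts()

# normalize=True: relative frequencies over the non-NaN values
shares = s.value_counts(normalize=True)

# dropna=False: NaN gets its own entry in the result
with_na = s.value_counts(dropna=False)

print(counts.loc[3.0])     # 2
print(shares.loc[3.0])     # 0.4
print(int(with_na.sum()))  # 6
```

The order of values that share the same count may differ between pandas versions, so it is safer to look entries up by label than by position.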