Datawhale Team Learning (Pandas) Task 9: Categorical Data

Chapter 9: Categorical Data

import numpy as np
import pandas as pd

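The cat object

Attributes of the cat object

The category dtype is designed for categorical variables. An ordinary Series can be converted to a categorical one with the astype method.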

df = pd.read_csv('data/learn_pandas.csv',usecols=['Grade', 'Name', 'Gender', 'Height', 'Weight'])
df.head()
       Grade            Name  Gender  Height  Weight
0   Freshman    Gaopeng Yang  Female   158.9    46.0
1   Freshman  Changqiang You    Male   166.5    70.0
2     Senior         Mei Sun    Male   188.9    89.0
3  Sophomore    Xiaojuan Sun  Female     NaN    41.0
4  Sophomore     Gaojuan You    Male   174.0    74.0
s = df.Grade.astype('category')
s.head()
0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman', 'Junior', 'Senior', 'Sophomore']
# Component 1: the categories themselves, stored as an Index
s.cat.categories
Index(['Freshman', 'Junior', 'Senior', 'Sophomore'], dtype='object')
# Component 2: whether the categories are ordered
s.cat.ordered
False
# Each category is assigned a unique integer code, determined by its position in cat.categories
# The codes are accessed through the codes attribute
s.cat.codes.head()
0    0
1    0
2    2
3    3
4    3
dtype: int8
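
To make the three components concrete, here is a minimal sketch on a made-up Series (not the learn_pandas.csv data). It assumes only the standard cat accessor plus pd.Categorical.from_codes, and shows that a missing value gets the code -1 and that the values can be rebuilt from the codes and the categories.

```python
import pandas as pd

# Toy data, not from learn_pandas.csv; None becomes a missing value
toy = pd.Series(['b', 'a', None, 'b'], dtype='category')

categories = toy.cat.categories   # sorted unique values
ordered = toy.cat.ordered         # False, because no order was declared
codes = toy.cat.codes             # position of each value in categories

print(list(categories))  # ['a', 'b']
print(codes.tolist())    # [1, 0, -1, 1]; -1 marks the missing value

# Rebuild the values from the two components
rebuilt = pd.Series(pd.Categorical.from_codes(codes.to_numpy(), categories=categories))
print(rebuilt.tolist())  # ['b', 'a', nan, 'b']
```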

Adding, removing, and modifying categories

# Add a category
s = s.cat.add_categories('Graduate')
s.cat.categories
Index(['Freshman', 'Junior', 'Senior', 'Sophomore', 'Graduate'], dtype='object')
# Remove a category
s = s.cat.remove_categories('Freshman')
s.cat.categories
Index(['Junior', 'Senior', 'Sophomore', 'Graduate'], dtype='object')
# Set the categories of the Series directly; values that do not belong to the new categories become missing
s = s.cat.set_categories(['Sophomore','PhD'])
s.cat.categories
s.head()
0          NaN
1          NaN
2          NaN
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (2, object): ['Sophomore', 'PhD']
# Remove categories that do not appear in the Series
s = s.cat.remove_unused_categories()  # removes the unused 'PhD' category
s.cat.categories
Index(['Sophomore'], dtype='object')
# Rename a category (the new label means 'second-year undergraduate')
s = s.cat.rename_categories({'Sophomore':'本科二年级学生'})
s.head()
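
The same operations can be followed more easily on a tiny made-up Series. The sketch below is only an illustration under that assumption; it shows which operations leave values intact and which ones turn them into missing values.

```python
import pandas as pd

toy = pd.Series(['a', 'b', 'a'], dtype='category')  # made-up data

# add_categories: the new label exists as a category but has no rows yet
added = toy.cat.add_categories('c')
counts = added.value_counts(sort=False)   # includes the unused 'c' with count 0
print(counts.to_dict())                   # {'a': 2, 'b': 1, 'c': 0}

# remove_unused_categories drops 'c' again
print(list(added.cat.remove_unused_categories().cat.categories))  # ['a', 'b']

# remove_categories / set_categories turn values without a category into NaN
removed = toy.cat.remove_categories('a')
reset = toy.cat.set_categories(['a', 'z'])
print(removed.isna().tolist())  # [True, False, True]
print(reset.isna().tolist())    # [False, True, False]

# rename_categories only relabels, so no value becomes missing
renamed = toy.cat.rename_categories({'a': 'A'})
print(renamed.tolist())         # ['A', 'b', 'A']
```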

Ordered categories

Establishing an order

s = df.Grade.astype('category')
s.head()
0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman', 'Junior', 'Senior', 'Sophomore']
s = s.cat.reorder_categories(['Freshman', 'Sophomore', 'Junior', 'Senior'],ordered=True)
s.head()
0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman' < 'Sophomore' < 'Junior' < 'Senior']
s.cat.as_unordered().head()
0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman', 'Sophomore', 'Junior', 'Senior']
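
One detail worth checking by hand: reorder_categories only rearranges the existing labels, so the list passed in has to contain exactly the current categories. A minimal sketch with made-up labels:

```python
import pandas as pd

toy = pd.Series(['low', 'high', 'mid'], dtype='category')  # made-up data

# Valid: the same three labels, only in a different order
ranked = toy.cat.reorder_categories(['low', 'mid', 'high'], ordered=True)
print(ranked.cat.ordered)            # True
print(list(ranked.cat.categories))   # ['low', 'mid', 'high']

# Invalid: 'mid' is missing from the new list, so pandas raises ValueError
try:
    toy.cat.reorder_categories(['low', 'high'], ordered=True)
    raised = False
except ValueError:
    raised = True
print(raised)                        # True

# as_unordered drops the order but keeps the category sequence
unordered = ranked.cat.as_unordered()
print(unordered.cat.ordered)         # False
```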

Sorting and comparison

df.Grade = df.Grade.astype('category')
df.Grade = df.Grade.cat.reorder_categories(['Freshman', 'Sophomore', 'Junior', 'Senior'],ordered=True)
df.sort_values('Grade').head()  # sort by value
        Grade           Name  Gender  Height  Weight
0    Freshman   Gaopeng Yang  Female   158.9    46.0
105  Freshman      Qiang Shi  Female   164.5    52.0
96   Freshman  Changmei Feng  Female   163.8    56.0
88   Freshman   Xiaopeng Han  Female   164.1    53.0
81   Freshman    Yanli Zhang  Female   165.1    52.0
df.set_index('Grade').sort_index().head()
                   Name  Gender  Height  Weight
Grade
Freshman   Gaopeng Yang  Female   158.9    46.0
Freshman      Qiang Shi  Female   164.5    52.0
Freshman  Changmei Feng  Female   163.8    56.0
Freshman   Xiaopeng Han  Female   164.1    53.0
Freshman    Yanli Zhang  Female   165.1    52.0
df.head()
       Grade            Name  Gender  Height  Weight
0   Freshman    Gaopeng Yang  Female   158.9    46.0
1   Freshman  Changqiang You    Male   166.5    70.0
2     Senior         Mei Sun    Male   188.9    89.0
3  Sophomore    Xiaojuan Sun  Female     NaN    41.0
4  Sophomore     Gaojuan You    Male   174.0    74.0
res1 = df.Grade == 'Sophomore'  # compare with a scalar
res1
0      False
1      False
2      False
3       True
4       True
       ...  
195    False
196    False
197    False
198    False
199     True
Name: Grade, Length: 200, dtype: bool
res2 = df.Grade == ['PhD']*df.shape[0]  # compare with a list of the same length
res2.head()
0    False
1    False
2    False
3    False
4    False
Name: Grade, dtype: bool
# ['Freshman' < 'Sophomore' < 'Junior' < 'Senior']
res3 = df.Grade <= 'Sophomore'
res3.head()
0     True
1     True
2    False
3     True
4     True
Name: Grade, dtype: bool
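# sample() randomly selects rows or columns from a DataFrame
# n: number of items to return, 1 by default
# axis: whether to sample rows or columns; 0 means rows, 1 means columns, 0 by default
# weights: sampling weights, a list whose length must equal the number of rows or columns
# frac: fraction of the items to return
df_demo = df.Grade.sample(frac=1).reset_index(drop=True) # shuffle, then compare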
df_demo.head()
0     Freshman
1    Sophomore
2    Sophomore
3       Junior
4       Senior
Name: Grade, dtype: category
Categories (4, object): ['Freshman' < 'Sophomore' < 'Junior' < 'Senior']
res4 = df.Grade <= df.Grade.sample(frac=1).reset_index(drop=True)
res4.head()
0     True
1     True
2    False
3     True
4     True
Name: Grade, dtype: bool
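
To see why the declared order matters, the sketch below contrasts plain string comparison with ordered-category comparison on a made-up three-row sample. Building the dtype with pd.CategoricalDtype is a choice made for this example; the notes above use reorder_categories instead.

```python
import pandas as pd

order = ['Freshman', 'Sophomore', 'Junior', 'Senior']
raw = pd.Series(['Senior', 'Freshman', 'Junior'])            # made-up sample
grade = raw.astype(pd.CategoricalDtype(order, ordered=True))

# Plain strings compare alphabetically: every value here sorts before 'Sophomore'
print((raw <= 'Sophomore').tolist())     # [True, True, True]

# Ordered categories compare by position in `order`
print((grade <= 'Sophomore').tolist())   # [False, True, False]

# Sorting and max follow the declared order too
print(grade.sort_values().tolist())      # ['Freshman', 'Junior', 'Senior']
print(grade.max())                       # Senior
```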

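Interval categories

Constructing intervals with cut and qcut

An interval is a special kind of category. cut and qcut bin the numeric values of a Series, replacing each value with the interval it falls into.

  • Intervals are left-open and right-closed by default, so when bins is an integer the left endpoint of the lowest interval is lowered by 0.001*(max-min) to make sure the minimum value is included.
  • For left-closed, right-open intervals, pass right=False; the right endpoint of the highest interval is then raised by 0.001*(max-min).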
s = pd.Series([1,2])

# Left-open, right-closed
pd.cut(s, bins=2)
0    (0.999, 1.5]
1      (1.5, 2.0]
dtype: category
Categories (2, interval[float64]): [(0.999, 1.5] < (1.5, 2.0]]
# Left-closed, right-open
pd.cut(s, bins=2, right=False)
0      [1.0, 1.5)
1    [1.5, 2.001)
dtype: category
Categories (2, interval[float64]): [[1.0, 1.5) < [1.5, 2.001)]
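
The 0.999 and 2.001 endpoints in the two outputs above follow from the 0.001*(max-min) adjustment. As a quick numerical check, this sketch asks cut for the bin edges with retbins=True; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

toy = pd.Series([1, 2])
span = toy.max() - toy.min()                  # max - min = 1

# retbins=True also returns the array of bin edges
_, edges_right = pd.cut(toy, bins=2, retbins=True)               # left-open, right-closed
_, edges_left = pd.cut(toy, bins=2, right=False, retbins=True)   # left-closed, right-open

print(edges_right)   # lowest edge is min - 0.001 * span = 0.999
print(edges_left)    # highest edge is max + 0.001 * span = 2.001
```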
# Specify the list of bin edges explicitly
# np.inf represents infinity (the older alias np.infty was removed in NumPy 2.0)
pd.cut(s, bins=[-np.inf, 1.2, 1.8, 2.2, np.inf])
0    (-inf, 1.2]
1     (1.8, 2.2]
dtype: category
Categories (4, interval[float64]): [(-inf, 1.2] < (1.2, 1.8] < (1.8, 2.2] < (2.2, inf]]
# labels: names for the intervals
# retbins: whether to also return the bin edges (not returned by default)
s = pd.Series([1,2])
res = pd.cut(s, bins=2, labels=['small', 'big'], retbins=True)
res
(0    small
 1      big
 dtype: category
 Categories (2, object): ['small' < 'big'],
 array([0.999, 1.5  , 2.   ]))
# In qcut, an integer q (quantile) of n splits the data into n quantile-based bins; a list of floats can be passed instead to give the quantile cut points
s = df.Weight
pd.qcut(s, q=3).head()
0    (33.999, 48.0]
1      (55.0, 89.0]
2      (55.0, 89.0]
3    (33.999, 48.0]
4      (55.0, 89.0]
Name: Weight, dtype: category
Categories (3, interval[float64]): [(33.999, 48.0] < (48.0, 55.0] < (55.0, 89.0]]
pd.qcut(s, q=[0,0.2,0.8,1]).head()
0      (44.0, 69.4]
1      (69.4, 89.0]
2      (69.4, 89.0]
3    (33.999, 44.0]
4      (69.4, 89.0]
Name: Weight, dtype: category
Categories (3, interval[float64]): [(33.999, 44.0] < (44.0, 69.4] < (69.4, 89.0]]
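
The difference between cut (equal-width bins) and qcut (equal-frequency bins) shows up most clearly on skewed data. A minimal sketch with a made-up Series containing one outlier:

```python
import pandas as pd

skewed = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])   # made-up data with one outlier

# cut: three bins of equal width, so the outlier leaves the middle bin empty
width_counts = pd.cut(skewed, bins=3).value_counts(sort=False).tolist()
print(width_counts)      # [8, 0, 1]

# qcut: three bins holding equal numbers of observations
quantile_counts = pd.qcut(skewed, q=3).value_counts(sort=False).tolist()
print(quantile_counts)   # [3, 3, 3]
```

With sort=False, value_counts reports the bins in category order, including the empty one.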

Constructing general intervals

my_interval = pd.Interval(0,1,'right')

# in tests whether a value belongs to the interval
1 in my_interval  # True
0 in my_interval  # False
False
# overlaps tests whether two intervals intersect
my_interval_2 = pd.Interval(0.5, 1.5, 'left')
my_interval_2.overlaps(my_interval)
True
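
Whether two intervals that merely touch count as overlapping depends on which endpoints are closed. The sketch below illustrates this with made-up intervals; the names are hypothetical.

```python
import pandas as pd

base = pd.Interval(0, 1, closed='right')          # (0, 1]
touch_closed = pd.Interval(1, 2, closed='left')   # [1, 2): contains 1
touch_open = pd.Interval(1, 2, closed='right')    # (1, 2]: excludes 1

print(1 in base, 0 in base)            # True False
print(base.overlaps(touch_closed))     # True: both contain the point 1
print(base.overlaps(touch_open))       # False: they only meet at an open endpoint
print(base.closed, base.length)        # right 1
```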
# from_breaks is similar to cut/qcut:
# from_breaks takes user-defined break points directly,
# whereas cut/qcut compute the break points from the data
pd.IntervalIndex.from_breaks([1,3,6,10], closed='both')
IntervalIndex([[1, 3], [3, 6], [6, 10]],
              closed='both',
              dtype='interval[int64]')
# from_arrays takes separate lists of left endpoints and right endpoints
pd.IntervalIndex.from_arrays(left = [1,3,6,10], right = [5,4,9,11], closed = 'neither')
IntervalIndex([(1, 5), (3, 4), (6, 9), (10, 11)],
              closed='neither',
              dtype='interval[int64]')
# from_tuples takes a list of (start, end) tuples
pd.IntervalIndex.from_tuples([(1,5),(3,4),(6,9),(10,11)],closed='neither')
IntervalIndex([(1, 5), (3, 4), (6, 9), (10, 11)],
              closed='neither',
              dtype='interval[int64]')
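# Build intervals from start, end, periods (number of intervals), and freq (interval length)
# Once any three of these are fixed, the fourth is determined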
pd.interval_range(start=1, end=5, periods=8)
IntervalIndex([(1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5], (3.5, 4.0], (4.0, 4.5], (4.5, 5.0]],
              closed='right',
              dtype='interval[float64]')
pd.interval_range(end=5,periods=8,freq=0.5)
IntervalIndex([(1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5], (3.5, 4.0], (4.0, 4.5], (4.5, 5.0]],
              closed='right',
              dtype='interval[float64]')
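
As a check on the "three determine the fourth" rule, this sketch builds the same index from all four combinations of three parameters and compares the endpoints numerically. Comparing with np.allclose rather than exact equality is a cautious choice for this example, since the endpoints are floats.

```python
import numpy as np
import pandas as pd

# Four ways of fixing three of (start, end, periods, freq); each should give
# the same eight intervals of length 0.5 between 1 and 5
variants = [
    pd.interval_range(start=1, end=5, periods=8),
    pd.interval_range(start=1, end=5, freq=0.5),
    pd.interval_range(start=1, periods=8, freq=0.5),
    pd.interval_range(end=5, periods=8, freq=0.5),
]

reference = variants[0]
results = []
for idx in variants[1:]:
    same = np.allclose(idx.left, reference.left) and np.allclose(idx.right, reference.right)
    results.append((len(idx), bool(same)))
print(results)
```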

Attributes and methods of intervals

# s is the Weight column
id_interval = pd.IntervalIndex(pd.cut(s, 3))
# Show the first 5
id_demo = id_interval[:5]
id_demo
IntervalIndex([(33.945, 52.333], (52.333, 70.667], (70.667, 89.0], (33.945, 52.333], (70.667, 89.0]],
              closed='right',
              name='Weight',
              dtype='interval[float64]')
# Left endpoints
id_demo.left
Float64Index([33.945, 52.333, 70.667, 33.945, 70.667], dtype='float64')
# Right endpoints
id_demo.right
Float64Index([52.333, 70.667, 89.0, 52.333, 89.0], dtype='float64')
# Midpoints (mean of the two endpoints)
id_demo.mid
Float64Index([43.138999999999996, 61.5, 79.8335, 43.138999999999996, 79.8335], dtype='float64')
# Interval lengths
id_demo.length
Float64Index([18.387999999999998, 18.334000000000003, 18.333,
              18.387999999999998, 18.333],
             dtype='float64')
# Test whether each interval contains a given value
id_demo.contains(4)
array([False, False, False, False, False])
id_demo.contains(34)
array([ True, False, False,  True, False])
id_demo.overlaps(pd.Interval(40,60))
array([ True,  True, False,  True, False])
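
The same attributes are easier to verify by hand on round numbers. A minimal sketch, assuming a two-interval index built with from_breaks (made-up break points):

```python
import pandas as pd

idx = pd.IntervalIndex.from_breaks([0, 2, 6])   # (0, 2], (2, 6]

print(idx.left.tolist(), idx.right.tolist())    # [0, 2] [2, 6]
print(idx.mid.tolist())                          # [1.0, 4.0]
print(idx.length.tolist())                       # [2, 4]

# contains takes a scalar and tests it against every interval
print(idx.contains(2).tolist())                  # [True, False]

# overlaps takes an Interval; (2, 3] shares no point with (0, 2]
print(idx.overlaps(pd.Interval(2, 3)).tolist())  # [False, True]
```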

Exercises

Ex1: Counting categories that do not appear


Ex2: The diamonds dataset

df = pd.read_csv('data/diamonds.csv')
df.head()
   carat      cut  clarity  price
0   0.23    Ideal      SI2    326
1   0.21  Premium      SI1    326
2   0.23     Good      VS1    327
3   0.29  Premium      VS2    334
4   0.31     Good      SI2    335
# Call nunique on df.cut as object dtype and as category dtype, and compare the performance
# object dtype
%timeit -n 30 df.cut.nunique()
# category dtype
%timeit -n 30 df.cut.astype('category').nunique()

# On reflection, the comparison above is flawed: the dtype should be converted first, and only then should the performance be compared
object_demo = df.cut
category_demo = df.cut.astype('category')
%timeit -n 30 object_demo.nunique()
%timeit -n 30 category_demo.nunique()
3.17 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 30 loops each)
5.39 ms ± 610 µs per loop (mean ± std. dev. of 7 runs, 30 loops each)
3.82 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 30 loops each)
1.09 ms ± 157 µs per loop (mean ± std. dev. of 7 runs, 30 loops each)
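# The cut quality of a diamond has five grades, from worst to best: Fair, Good, Very Good, Premium, Ideal. Clarity has eight grades, from worst to best: I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF. Sort by cut quality from best to worst; among diamonds of the same cut quality, sort by clarity from worst to best.
# This tests sorting on ordered categories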
df.cut = df.cut.astype('category').cat.reorder_categories(['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'], ordered=True)
df.clarity = df.clarity.astype('category').cat.reorder_categories(['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF'],ordered=True)
res = df.sort_values(['cut','clarity'], ascending=[False, True])
res
       carat    cut  clarity  price
315     0.96  Ideal       I1   2801
535     0.96  Ideal       I1   2826
551     0.97  Ideal       I1   2830
653     1.01  Ideal       I1   2844
718     0.97  Ideal       I1   2856
...      ...    ...      ...    ...
41242   0.30   Fair       IF   1208
43778   0.37   Fair       IF   1440
47407   0.52   Fair       IF   1849
49683   0.52   Fair       IF   2144
50126   0.47   Fair       IF   2211

53940 rows × 4 columns

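# Using two different methods, map the cut and clarity columns, ordered from best to worst, to the integers 0 to n-1, where n is the number of categories.
# I could not work this one out on my own; I had forgotten about cat.codes

# df.cut.cat.categories[::-1]  # reverse the whole order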
# Index(['Ideal', 'Premium', 'Very Good', 'Good', 'Fair'], dtype='object')
df.cut = df.cut.cat.reorder_categories(df.cut.cat.categories[::-1])
df.clarity = df.clarity.cat.reorder_categories(df.clarity.cat.categories[::-1])

df.cut = df.cut.cat.codes  # Method 1: use cat.codes, which gives each category an integer code

# Method 2: map with replace
clarity_cat = df.clarity.cat.categories
df.clarity = df.clarity.replace(dict(zip(clarity_cat, np.arange(len(clarity_cat)))))
df.head(3)
   carat  cut  clarity  price
0   0.23    0        6    326
1   0.21    1        5    326
2   0.23    3        3    327
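
For reference, here is a hedged sketch of the two mapping ideas on a made-up, reduced quality scale. Method 1 is cat.codes as above. For the second method the sketch swaps in Series.map with an explicit dict on the plain string column, because recent pandas releases may warn when replace is applied to a categorical column; this is an alternative, not the approach used in the notes.

```python
import pandas as pd

best_to_worst = ['Ideal', 'Good', 'Fair']                  # reduced, made-up scale
quality = pd.Series(['Good', 'Ideal', 'Fair', 'Ideal'])    # made-up sample

# Method 1: the code of each value is its position in the category list
as_cat = quality.astype(pd.CategoricalDtype(best_to_worst, ordered=True))
via_codes = as_cat.cat.codes

# Method 2 (swapped in for replace): look each label up in an explicit dict
lookup = {label: rank for rank, label in enumerate(best_to_worst)}
via_map = quality.map(lookup)

print(via_codes.tolist())   # [1, 0, 2, 0]
print(via_map.tolist())     # [1, 0, 2, 0]
```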
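# Bin the price per carat into five categories (Very Low, Low, Mid, High, Very High), once by quantiles (q=[0.2, 0.4, 0.6, 0.8]) and once by the cut points [1000, 3500, 5500, 18000], and add the category Series produced by these two binning methods to the original table in turn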
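
No solution is recorded for this last exercise. The following is only a sketch of one possible approach on made-up rows, not on diamonds.csv. It assumes that the quantile list is completed with 0 and 1, and that the fixed cut points are completed with -inf and inf, so that four cut points yield five bins. The column names avg_qcut and avg_cut are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows, chosen so that price / carat takes ten distinct values
toy = pd.DataFrame({
    'carat': [0.5, 1.0, 0.5, 1.0, 0.5, 1.0, 2.0, 1.0, 0.5, 2.0],
    'price': [250, 900, 750, 2500, 1600, 4000, 10000, 7000, 6000, 40000],
})
labels = ['Very Low', 'Low', 'Mid', 'High', 'Very High']
avg = toy['price'] / toy['carat']          # price per carat

# Quantile bins: 0 and 1 are added so that the four cut points give five bins
toy['avg_qcut'] = pd.qcut(avg, q=[0, 0.2, 0.4, 0.6, 0.8, 1], labels=labels)

# Fixed cut points: -inf and inf are assumed as the outer edges
toy['avg_cut'] = pd.cut(avg, bins=[-np.inf, 1000, 3500, 5500, 18000, np.inf], labels=labels)

qcut_counts = toy['avg_qcut'].value_counts(sort=False).tolist()
cut_counts = toy['avg_cut'].value_counts(sort=False).tolist()
print(qcut_counts)   # [2, 2, 2, 2, 2]
print(cut_counts)    # [2, 3, 2, 2, 1]
```

On the toy data the quantile bins each hold two rows, while the fixed cut points give uneven counts, which is the expected contrast between the two methods.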
