pandas Study Notes (4): Data Aggregation!!! An Extremely Useful Part of pandas

Latest recommended article published on 2024-01-05 16:15:23

火树阑珊

Article link: https://blog.youkuaiyun.com/weixin_43990070/article/details/111770522


#### An extremely useful part!!!
# The core of grouped data processing: the groupby() function

import numpy as np
import pandas as pd
from pandas import DataFrame, Series

df = DataFrame({'item':np.random.randint(0,10,size = 100),
               'seller':np.random.randint(0,10,size = 100),
               'weight':np.random.randint(30,300,size = 100),
               'price':np.random.randint(1,10,size = 100)})
df
Out[17]: 
    item  seller  weight  price
0      6       9     114      9
1      4       3      95      8
2      1       0     193      4
3      4       5     198      8
4      1       6     130      6
..   ...     ...     ...    ...
95     6       8     235      3
96     8       9     101      2
97     5       4      66      1
98     8       8      51      8
99     7       8     134      2
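
Before the data is cleaned up, it may help to see what groupby() does conceptually: split the rows by key, apply a function to each piece, and combine the results. The sketch below uses a small hand-made frame named `toy` with made-up values (not the random data above) and compares a manual split-apply-combine with a single groupby call.

```python
import pandas as pd

# Hypothetical toy data, chosen so the sums are easy to check by hand
toy = pd.DataFrame({
    'seller': ['A', 'B', 'A', 'B', 'A'],
    'weight': [10, 20, 30, 40, 50],
})

# Split: iterating over a groupby yields (key, sub-frame) pairs
pieces = {name: part for name, part in toy.groupby('seller')}

# Apply and combine by hand: sum each piece, then collect into one Series
manual = pd.Series({name: part['weight'].sum() for name, part in pieces.items()})

# The same computation as one groupby expression
auto = toy.groupby('seller')['weight'].sum()

print(manual)
print(auto)
```

Both results hold one total per seller; groupby just does the bookkeeping in one step.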

.map() is a method called on a Series.
Parameters:
arg : function, dict, or Series
   Mapping correspondence.

na_action : {None, 'ignore'}, default None
   If 'ignore', propagate NaN values, without passing them to the mapping
   correspondence.

Example code:

s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])

0 cat
1 dog
2 NaN
3 rabbit
dtype: object

Here the argument is a dict. Values with no matching key become NaN:
s.map({'cat': 'kitten', 'dog': 'puppy'})

0 kitten
1 puppy
2 NaN
3 NaN
dtype: object

Here the argument is a function:
s.map('I am a {}'.format)  # the bound method is passed without parentheses; map calls it once per element
0 I am a cat
1 I am a dog
2 I am a nan
3 I am a rabbit
dtype: object

To leave NaN values untouched instead of passing them to the function, set na_action='ignore':
s.map('I am a {}'.format, na_action='ignore')
0 I am a cat
1 I am a dog
2 NaN
3 I am a rabbit
dtype: object
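
The parameter list also allows a Series as `arg`, which the examples above do not show. A minimal sketch, assuming a lookup Series named `lookup` (a name made up here) whose index holds the old values and whose values hold the replacements:

```python
import numpy as np
import pandas as pd

s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])

# Lookup table as a Series: index = value to match, values = replacement
lookup = pd.Series({'cat': 'kitten', 'dog': 'puppy'})

# Each element of s is looked up in lookup's index;
# elements without a match (NaN, 'rabbit') become NaN
mapped = s.map(lookup)
print(mapped)
```

This behaves like the dict case and can be handy when the lookup table already lives in another column or frame.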

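# Now build a dataset: map the numeric item codes to three vegetable names
def convert_item(x):
    if x < 2:
        return '白菜'  # Chinese cabbage
    elif x < 7:
        return '萝卜'  # radish
    else:
        return '黄瓜'  # cucumber
df['item'] = df['item'].map(convert_item)  # for an element-wise function like this, .transform can be used the same way as .map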
Out[24]: 
   item  seller  weight  price
0    萝卜       9     114      9
1    萝卜       3      95      8
2    白菜       0     193      4
3    萝卜       5     198      8
4    白菜       6     130      6
..  ...     ...     ...    ...
95   萝卜       8     235      3
96   黄瓜       9     101      2
97   萝卜       4      66      1
98   黄瓜       8      51      8
99   黄瓜       8     134      2
# Map the numeric seller codes to three seller names
def convert_seller(x):
    if x < 2:
        return '张大爷'
    elif x < 7:
        return '赵大妈'
    else:
        return '王姨'
df['seller'] = df['seller'].map(convert_seller)

# Round each weight to the nearest ten
def convert_weight(x):
    return round(x, -1)
df['weight'] = df['weight'].map(convert_weight)

Out[30]: 
   item seller  weight  price
0    萝卜     王姨     110      9
1    萝卜    赵大妈     100      8
2    白菜    张大爷     190      4
3    萝卜    赵大妈     200      8
4    白菜    赵大妈     130      6
..  ...    ...     ...    ...
95   萝卜     王姨     240      3
96   黄瓜     王姨     100      2
97   萝卜    赵大妈      70      1
98   黄瓜     王姨      50      8
99   黄瓜     王姨     130      2
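
The element-wise functions above can probably also be written with vectorized pandas calls. This is only a sketch of an alternative, not what the post does: `pd.cut` bins the integer codes with the same boundaries as convert_item, and `Series.round(-1)` rounds a whole column at once. The sample values in `codes` and `weights` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample codes covering each branch of convert_item
codes = pd.Series([0, 1, 2, 6, 7, 9])

# Bins are right-closed: (-1, 1], (1, 6], (6, 9]
# For integer codes 0..9 this matches x < 2, x < 7, and everything else
items = pd.cut(codes, bins=[-1, 1, 6, 9], labels=['白菜', '萝卜', '黄瓜'])

# Hypothetical sample weights; a negative decimals value rounds to tens
weights = pd.Series([114, 193, 66])
rounded = weights.round(-1)

print(items.tolist())
print(rounded.tolist())
```

Note that `pd.cut` returns a categorical column rather than plain strings, which groupby handles as well.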

# Total weight of vegetables sold by each seller (weights are in jin)
ret = df.groupby(['seller'])['weight']
ret
Out[33]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x0000021A489FDF48>
# The rows are now grouped, but nothing has been computed yet
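
Since the groupby object only displays as a memory address, it can be useful to look inside it before aggregating. A small sketch on a hand-made frame named `sales` (values are made up), using `ngroups`, `get_group` and `size`:

```python
import pandas as pd

# Hypothetical data: four sales by three sellers
sales = pd.DataFrame({
    'seller': ['张大爷', '王姨', '张大爷', '赵大妈'],
    'weight': [190, 110, 130, 100],
})

grouped = sales.groupby('seller')['weight']

# How many distinct groups there are
print(grouped.ngroups)

# The raw, un-aggregated weights belonging to one seller
print(grouped.get_group('张大爷'))

# Number of rows in each group
group_sizes = grouped.size()
print(group_sizes)
```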

ret.sum()  # min, max, var and many other aggregations are available too
Out[34]: 
seller
张大爷    4950
王姨     4470
赵大妈    5210
Name: weight, dtype: int64

ret.apply(np.sum)  # same result as ret.sum() above
ret.count()
Out[35]: 
seller
张大爷    27
王姨     33
赵大妈    40
Name: weight, dtype: int64

# You can also pass a function you wrote yourself
def count(x):
    return x.size
ret.apply(count)
Out[36]: 
seller
张大爷    27
王姨     33
赵大妈    40
Name: weight, dtype: int64

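# No error is raised here, but the output is not what was intended.
# The cause is not that a DataFrame cannot hold a Series: x.max, x.min and x.mean
# are written without parentheses, so the returned Series stores the bound methods
# themselves rather than their results. Calling them as x.max(), x.min(), x.mean()
# returns the numbers. The code is kept as written to show the symptom.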
def describe(x):
    return Series(data = list([x.max,x.min,x.mean]),
                  index = list(['max','min','mean']))
ret.apply(describe)

Out[38]: 
seller      
张大爷     max     <bound method Series.max of 2     190\n7     1...
        min     <bound method Series.min of 2     190\n7     1...
        mean    <bound method Series.mean of 2     190\n7     ...
王姨      max     <bound method Series.max of 0     110\n11    1...
        min     <bound method Series.min of 0     110\n11    1...
        mean    <bound method Series.mean of 0     110\n11    ...
赵大妈     max     <bound method Series.max of 1     100\n3     2...
        min     <bound method Series.min of 1     100\n3     2...
        mean    <bound method Series.mean of 1     100\n3     ...
Name: weight, dtype: object
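
The output above contains bound methods rather than numbers because `x.max`, `x.min` and `x.mean` are never called. Below is a minimal sketch of a corrected `describe`, run on a small hand-made frame (the values are made up, so the numbers will not match the random data in this post). It also shows `agg` with a list of names, which appears to give the same statistics without a custom function.

```python
import pandas as pd
from pandas import Series

# Hypothetical data: two sellers, two sales each
sales = pd.DataFrame({
    'seller': ['张大爷', '王姨', '张大爷', '王姨'],
    'weight': [190, 110, 130, 150],
})
ret = sales.groupby('seller')['weight']

def describe(x):
    # Call the methods: x.max() is a number, x.max is only a method reference
    return Series(data=[x.max(), x.min(), x.mean()],
                  index=['max', 'min', 'mean'])

# Series indexed by (seller, statistic)
stats = ret.apply(describe)

# One row per seller, one column per statistic
wide = stats.unstack()

# Built-in route to the same three statistics
same = ret.agg(['max', 'min', 'mean'])

print(wide)
print(same)
```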

# groupby on multiple columns: aggregating after a multi-level grouping gives a DataFrame with a multi-level index (MultiIndex)
g = df.groupby(['seller','item'])
g.mean()
                 weight     price
seller item                      
张大爷    白菜    185.000000  4.000000
       萝卜    188.333333  6.166667
       黄瓜    175.555556  5.444444
王姨     白菜    168.000000  4.600000
       萝卜    130.588235  4.294118
       黄瓜    128.181818  3.272727
赵大妈    白菜    162.500000  5.250000
       萝卜    123.600000  5.120000
       黄瓜    133.636364  4.454545
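
A result with a multi-level index can be awkward to work with at first. The sketch below, on a small hand-made frame with made-up values, shows three common ways to handle it: selecting one (seller, item) pair, flattening the index back into columns with `reset_index`, and pivoting one level into columns with `unstack`.

```python
import pandas as pd

# Hypothetical data; 王姨 sells only 白菜 here
sales = pd.DataFrame({
    'seller': ['张大爷', '张大爷', '王姨', '王姨'],
    'item':   ['白菜', '萝卜', '白菜', '白菜'],
    'weight': [100, 200, 300, 500],
})

means = sales.groupby(['seller', 'item'])['weight'].mean()

# Select a single value with a (seller, item) tuple
one_value = means.loc[('王姨', '白菜')]

# Turn the index levels back into ordinary columns
flat = means.reset_index()

# Sellers as rows, items as columns; missing combinations become NaN
table = means.unstack('item')

print(one_value)
print(flat)
print(table)
```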

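g.apply(np.std)
g.agg('std')  # both compute a standard deviation, but the numbers need not match: np.std defaults to ddof=0 (population), while pandas' 'std' uses ddof=1 (sample)
# agg can also compute several statistics at once, as below
g.agg(['sum','std','max'])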
Out[42]: 
            weight                 price              
               sum        std  max   sum       std max
seller item                                           
张大爷    白菜     1110  84.320816  300    24  2.683282   7
       萝卜     2260  80.321324  290    74  3.214550   9
       黄瓜     1580  67.474275  290    49  2.603417   9
王姨     白菜      840  57.619441  260    23  2.880972   8
       萝卜     2220  80.426254  280    73  2.931823   9
       黄瓜     1410  67.500842  270    36  2.327699   8
赵大妈    白菜      650  89.953692  240    21  3.304038   9
       萝卜     3090  72.449983  280   128  2.773686   9
       黄瓜     1470  77.494868  290    49  2.805838   8
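
One detail worth checking when comparing `np.std` with `'std'` is the `ddof` default: NumPy divides by n (ddof=0), pandas by n-1 (ddof=1). A minimal sketch on three made-up weights, where the two results can be verified by hand:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one seller, weights 100, 200, 300 (mean 200)
sales = pd.DataFrame({
    'seller': ['A', 'A', 'A'],
    'weight': [100, 200, 300],
})
grouped = sales.groupby('seller')['weight']

# pandas' built-in 'std': sample standard deviation, ddof=1
by_agg = grouped.agg('std')

# NumPy's std on the raw array: population standard deviation, ddof=0
by_numpy = grouped.agg(lambda x: np.std(x.to_numpy()))

# Asking pandas for ddof=0 explicitly reproduces the NumPy number
matched = grouped.std(ddof=0)

print(by_agg['A'], by_numpy['A'], matched['A'])
```

Here the sample value is sqrt(20000 / 2) = 100, while the population value is sqrt(20000 / 3), roughly 81.65.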
# agg also accepts a dict that maps each column to its own aggregation
g.agg({'weight':'sum','price':'max'})  # sum of weight, maximum of price
Out[43]: 
             weight  price
seller item               
张大爷    白菜      1110      7
       萝卜      2260      9
       黄瓜      1580      9
王姨     白菜       840      8
       萝卜      2220      9
       黄瓜      1410      8
赵大妈    白菜       650      9
       萝卜      3090      9
       黄瓜      1470      8
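
The dict form keeps the original column names. If more descriptive output names are wanted, pandas (0.25 and later) also supports named aggregation with keyword arguments. A sketch on a small hand-made frame; the output names `total_weight` and `max_price` are chosen here for illustration:

```python
import pandas as pd

# Hypothetical data
sales = pd.DataFrame({
    'seller': ['张大爷', '张大爷', '王姨'],
    'item':   ['白菜', '白菜', '萝卜'],
    'weight': [100, 200, 300],
    'price':  [4, 7, 9],
})

# Each keyword is an output column: name=(input column, aggregation)
summary = sales.groupby(['seller', 'item']).agg(
    total_weight=('weight', 'sum'),
    max_price=('price', 'max'),
)

print(summary)
```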