pandas (3): Functions and Maps for Processing Data

liuhehe123

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/liuhehe123/article/details/85786200

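This article shows how to use Python's pandas library to process wine review data: computing a column's median and mean, listing distinct values, counting frequencies, centering prices, finding the best-value wine, counting keyword occurrences, and applying a custom function.

This section covers how to reshape data into the form we want.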

1. Compute the median (or mean) of a column

import pandas as pd
pd.set_option("display.max_rows", 5)
reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
median_points = reviews.points.median()

#  88.0
'''
The mean can be obtained with reviews.points.mean(), or with numpy.mean(reviews.points) after importing numpy.
Result: 88.44713820775404
'''
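
The median and the mean answer slightly different questions, which is why the two results above differ. A minimal sketch on a hypothetical five-row table (the values are made up for illustration and are not taken from the wine dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for the `points` column.
toy = pd.DataFrame({"points": [80, 85, 88, 90, 100]})

median_points = toy.points.median()  # middle value once sorted
mean_points = toy.points.mean()      # arithmetic average

print(median_points)  # 88.0
print(mean_points)    # 88.6
```

The single high score of 100 pulls the mean above the median, while the median stays at the middle value.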

2. Find which countries appear in the data, i.e. list the distinct values (deduplication)

countries = reviews.country.unique()
countries
'''
array(['Italy', 'Portugal', 'US', 'Spain', 'France', 'Germany',
       'Argentina', 'Chile', 'Australia', 'Austria', 'South Africa',
       'New Zealand', 'Israel', 'Hungary', 'Greece', 'Romania', 'Mexico',
       'Canada', nan, 'Turkey', 'Czech Republic', 'Slovenia', 'Luxembourg',
       'Croatia', 'Georgia', 'Uruguay', 'England', 'Lebanon', 'Serbia',
       'Brazil', 'Moldova', 'Morocco', 'Peru', 'India', 'Bulgaria',
       'Cyprus', 'Armenia', 'Switzerland', 'Bosnia and Herzegovina',
       'Ukraine', 'Slovakia', 'Macedonia', 'China', 'Egypt'], dtype=object)
'''
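
Note that the array above contains nan, so its length overstates the number of countries by one. A small sketch on a hypothetical Series (values invented for illustration) shows the difference between `unique()` and `nunique()`:

```python
import pandas as pd

# Hypothetical country column with one missing value.
countries = pd.Series(["Italy", "US", None, "US", "Italy", "France"])

distinct = countries.unique()        # distinct values, missing value included
n_with_missing = len(distinct)       # counts the missing value as an entry
n_countries = countries.nunique()    # missing values are excluded by default

print(n_with_missing)  # 4
print(n_countries)     # 3
```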

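3. How do we count how often each country appears in the data?

How often does each country appear in the dataset? Create a Series `reviews_per_country` that maps each country to the number of wine reviews from that country.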

reviews_per_country = reviews.country.value_counts()
reviews_per_country 
'''
US          54504
France      22093
            ...  
Slovakia        1
China           1
Name: country, Length: 43, dtype: int64
'''
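
`value_counts()` sorts by count in descending order and drops missing values by default, which is consistent with the length of 43 above being one less than the number of entries returned by `unique()`. A hedged sketch on a hypothetical Series, also showing the `dropna` and `normalize` parameters:

```python
import pandas as pd

# Hypothetical country column; values are invented for illustration.
countries = pd.Series(["US", "France", "US", None, "US", "France", "Italy"])

counts = countries.value_counts()                      # missing values dropped
counts_with_missing = countries.value_counts(dropna=False)
shares = countries.value_counts(normalize=True)        # fractions instead of counts

print(counts["US"])               # 3
print(counts_with_missing.sum())  # 7
print(shares["US"])               # 0.5
```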

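4. Subtract a column's central value from every entry (centering), a preprocessing step often used in machine learning. The code below centers on the median; use reviews.price.mean() to center on the mean instead.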

reviews.price - reviews.price.median()
'''
0          NaN
1        -10.0
          ... 
129969     7.0
129970    -4.0
Name: price, Length: 129971, dtype: float64
'''
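
The subtraction is broadcast: the scalar is subtracted from every element, and missing prices stay NaN. A minimal sketch with hypothetical prices, comparing centering on the median with centering on the mean:

```python
import pandas as pd

# Hypothetical prices, including one missing value.
prices = pd.Series([10.0, 15.0, 25.0, None, 50.0])

centered_on_median = prices - prices.median()  # median of the non-missing values is 20.0
centered_on_mean = prices - prices.mean()      # mean of the non-missing values is 25.0

print(centered_on_median.tolist())  # [-10.0, -5.0, 5.0, nan, 30.0]
print(centered_on_mean.tolist())    # [-15.0, -10.0, 0.0, nan, 25.0]
```

After centering on the mean, the mean of the result is zero; centering on the median makes the median zero instead.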

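5. Which wine is the "best bargain"? Create a variable `bargain_wine` holding the title of the wine with the highest points-to-price ratio in the dataset.

points-to-price ratio = points / price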

(reviews.points / reviews.price)
'''
0              NaN
1         5.800000
            ...   
129969    2.812500
129970    4.285714
Length: 129971, dtype: float64
'''

We want the maximum here, which calls for idxmax(). A brief description of the function follows.

Series.idxmax(axis=0, skipna=True, *args, **kwargs)

Return the row label of the maximum value.

If multiple values equal the maximum, the first row label with that value is returned.

Parameters:

skipna : boolean, default True

Exclude NA/null values. If the entire Series is NA, the result will be NA.

axis : int, default 0

For compatibility with DataFrame.idxmax. Redundant for application on Series.

*args, **kwargs

Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

idxmax : Index of maximum of values.

Raises:

ValueError

If the Series is empty.

As described above, idxmax() returns the row label of the maximum value.

(reviews.points / reviews.price).idxmax()

# 64590
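
To make the behavior described in the documentation concrete, here is a small sketch on a hypothetical Series with string labels, one missing value, and a tie for the maximum (all values are invented):

```python
import pandas as pd

# Hypothetical points-to-price ratios indexed by string labels.
ratio = pd.Series([None, 5.8, 21.5, 21.5, 2.8], index=["a", "b", "c", "d", "e"])

best_label = ratio.idxmax()  # NaN is skipped; the first of the tied maxima wins

print(best_label)  # c
```

The result is a label from the index, not the maximum value itself; `ratio.max()` would return the value.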

We are not done yet: we still need to find out which wine offers such good value.

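bargain_wine_index = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_wine_index, 'title']
# Why not iloc[bargain_wine_index, 'title']? That raises an error: iloc is position-based and accepts integer positions only.
# reviews.iloc[bargain_wine_index, 10] works only after counting which position the title column occupies; loc looks rows and columns up by label.
bargain_wine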
'''
'Bandit NV Merlot (California)'
'''
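
The difference between `loc` and `iloc` is easier to see when row labels and row positions do not coincide. A hedged sketch on a hypothetical three-row table (titles, scores, and index labels are invented):

```python
import pandas as pd

# Hypothetical wines; the index labels 10, 20, 30 are deliberately not 0, 1, 2.
wines = pd.DataFrame(
    {
        "title": ["Wine A", "Wine B", "Wine C"],
        "points": [86, 90, 88],
        "price": [20.0, 45.0, 4.0],
    },
    index=[10, 20, 30],
)

best_label = (wines.points / wines.price).idxmax()  # a row label: 30

# Label-based lookup: row label plus column name.
by_label = wines.loc[best_label, "title"]

# Position-based lookup: convert both label and column name to positions first.
row_pos = wines.index.get_loc(best_label)   # 2
col_pos = wines.columns.get_loc("title")    # 0
by_position = wines.iloc[row_pos, col_pos]

print(by_label, by_position)  # Wine C Wine C
```

Because idxmax() returns a label, `loc` is the natural partner for it; `iloc` needs the extra conversion step.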

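6. There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series `descriptor_counts` counting how many times each of these two words appears in the `description` column of the dataset (counted below as the number of descriptions that contain the word).

This calls for map. map() can be applied to a Series or to a single column of a DataFrame; it takes a function or a dict as its argument and returns the values produced by passing each element through that function or dict.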

reviews.description.map(lambda desc: "tropical" in desc)
'''
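Here map is combined with lambda.
The lambda defines an anonymous function that is applied to each value in the description column to check whether it contains "tropical".
Result of the expression above: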
0          True
1         False
          ...  
129969    False
129970    False
Name: description, Length: 129971, dtype: bool 
'''

What we actually want is the count, so we call sum() to count how many values are True.

reviews.description.map(lambda desc: "tropical" in desc).sum()
# 3607

"fruity" is handled the same way.

n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
descriptor_counts 

'''
tropical    3607
fruity      9090
dtype: int64
'''
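
The text above mentions that map() accepts either a function or a dict. The following sketch, on hypothetical descriptions and country codes, shows both forms plus a vectorized alternative using `Series.str.contains`; the alternative is offered as an option, not as the author's method:

```python
import pandas as pd

# Hypothetical descriptions; note the capitalized "Tropical" in the third one.
descriptions = pd.Series([
    "Ripe tropical fruit",
    "Dry and fruity",
    "Tropical and fruity notes",
    "Oak and spice",
])

# Function form: the substring test is case-sensitive, so "Tropical" is not matched.
n_trop = descriptions.map(lambda desc: "tropical" in desc).sum()
n_fruity = descriptions.map(lambda desc: "fruity" in desc).sum()

# Vectorized alternative with the same case-sensitive, literal matching.
n_fruity_vectorized = descriptions.str.contains("fruity", regex=False).sum()

# Dict form: each value is replaced by its dict entry.
codes = pd.Series(["US", "FR", "US"])
names = codes.map({"US": "United States", "FR": "France"})

print(n_trop, n_fruity, n_fruity_vectorized)  # 1 2 2
print(names.tolist())  # ['United States', 'France', 'United States']
```

Both counting approaches count descriptions that contain the word at least once, not the total number of occurrences.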

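7. Write a function and apply it to every row to handle more involved requirements

We would like to host these wine reviews on our website, but a rating scale running from 80 to 100 points is hard to understand, so we want to translate the scores into simple star ratings. A score of 95 or higher counts as 3 stars, and a score of at least 85 but below 95 counts as 2 stars. Any other score is 1 star. In addition, the Canadian Vintners Association has bought a lot of advertising on the site, so any wine from Canada should automatically get 3 stars regardless of points. Create a Series `star_ratings` containing the number of stars corresponding to each review in the dataset.

This requires writing a function to handle the slightly more involved logic above.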

def stars(row):
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1

star_ratings = reviews.apply(stars, axis='columns')
# apply takes the function itself as its first argument (no parentheses); axis='columns' means the function is called once per row.
star_ratings 
'''
0         2
1         2
         ..
129969    2
129970    2
Length: 129971, dtype: int64
'''
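
To check each branch of the rule, the same function can be run on a hypothetical four-row table (countries and scores invented for illustration). The sketch also shows that `axis='columns'` and `axis=1` are equivalent spellings:

```python
import pandas as pd

def stars(row):
    # Canadian wines get 3 stars regardless of points.
    if row.country == "Canada":
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1

# Hypothetical reviews, one per branch of the rule.
toy = pd.DataFrame({
    "country": ["Canada", "US", "Italy", "France"],
    "points": [82, 96, 85, 84],
})

star_ratings = toy.apply(stars, axis="columns")  # each row is passed to stars()
same_ratings = toy.apply(stars, axis=1)          # equivalent spelling

print(star_ratings.tolist())  # [3, 3, 2, 1]
```

The Canada check comes first in the function, so an 82-point Canadian wine still receives 3 stars.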