pandas (1)


优岚岚


Tags: python pandas

Article link: https://blog.youkuaiyun.com/weixin_49270402/article/details/110228018


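This post works through pandas basics using data-processing and analysis exercises. It covers indexing, including the Interval type, and data processing with grouping. It also shows practical strategies such as dropping duplicates, using groupby, and filtering by index, and ends with regression results for each diamond color category.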


Chapter 1: pandas basics

Exercise 1

import pandas as pd
df=pd.read_csv('C:\\Users\\优岚\\Desktop\\data\\Game_of_Thrones_Script.csv')
print('Number of distinct characters in the script: {}'.format(df['Name'].nunique()))
print('Character with the most lines: {}'.format(df['Name'].value_counts().index[0]))
df['count']=df['Sentence'].apply(lambda x:len(x.split(" "))) # number of words in each line
word_totals = df.groupby('Name')['count'].sum()     # total number of words per character
print('Character who spoke the most words: {}'.format(word_totals.sort_values(ascending=False).index[0])) # ascending defaults to True, so pass False to put the maximum first

Number of distinct characters in the script: 564
Character with the most lines: tyrion lannister
Character who spoke the most words: tyrion lannister
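
The difference between "most lines" and "most words" is easier to see on a tiny table. The sketch below uses made-up rows (the names and sentences are hypothetical, not taken from the real CSV) and counts words with `str.split()`, which splits on any run of whitespace rather than on a single space as in the exercise code.

```python
import pandas as pd

# Toy stand-in for the script data (hypothetical rows, not from the real CSV).
toy = pd.DataFrame({
    "Name": ["a", "b", "a", "c"],
    "Sentence": ["hello there", "one two three four", "hi", "yes"],
})

# Count whitespace-separated words per line, then total them per speaker.
toy["count"] = toy["Sentence"].str.split().str.len()
words_per_speaker = toy.groupby("Name")["count"].sum()

most_lines = toy["Name"].value_counts().idxmax()   # speaker with the most rows
most_words = words_per_speaker.idxmax()            # speaker with the most words
print(most_lines, most_words)
```

Here speaker "a" has the most lines while speaker "b" has the most words, so the two questions need separate computations even though they happen to agree on the real data.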

Exercise 2

import pandas as pd
df=pd.read_csv('C:\\Users\\优岚\\Desktop\\data\\Kobe_data.csv')
df['Combination']=df['action_type']+' and '+df['combined_shot_type']
print("The most common combination of action_type and combined_shot_type is: {}".format(df['Combination'].value_counts().index[0]))

The most common combination of action_type and combined_shot_type is: Jump Shot and Jump Shot
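
As an alternative to concatenating the two columns into one string, the pairs can be counted directly by grouping on both columns. This is only a sketch on hypothetical rows that reuse the exercise's column names.

```python
import pandas as pd

# Hypothetical shot rows; only the column names come from the exercise.
shots = pd.DataFrame({
    "action_type": ["Jump Shot", "Jump Shot", "Layup Shot", "Dunk Shot"],
    "combined_shot_type": ["Jump Shot", "Jump Shot", "Layup", "Dunk"],
})

# size() counts rows per (action_type, combined_shot_type) pair;
# idxmax() returns the pair with the largest count as a tuple.
top_pair = shots.groupby(["action_type", "combined_shot_type"]).size().idxmax()
print(top_pair)
```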

For part (b), I looked up an answer on Baidu:

pd.Series(list(list(zip(*(pd.Series(list(zip(df['game_id'],df['opponent']))).unique()).tolist()))[1])).value_counts().index[0]

'SAS'

I then wrote my own version, but it gave a different result…

# Across all recorded game_ids, which opponent was faced most often? (A game contains many shots but only one opponent, so this asks which team Kobe played against the most times.)
grouped_mul = df.groupby(['game_id','opponent'])
value=grouped_mul.size()
value.sort_values(ascending=False).index[0]

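(21501228, 'UTA')

So I asked a senior labmate for help.
==> This finds the maximum within individual game_ids (the single game with the most recorded shots), whereas the question asks for the maximum across all game_ids (the opponent that appears in the most games).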

df.drop_duplicates(['game_id'])["opponent"].value_counts().idxmax()

df.groupby("game_id")['opponent'].first().value_counts().idxmax()

import numpy as np
data,filter_index= np.unique(df["game_id"].values,return_index=True)
df["opponent"].iloc[filter_index].value_counts().idxmax()

The first approach drops duplicates, the second uses groupby, and the third filters by index.
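
To make the distinction concrete, here is a small sketch on a hypothetical shot log (the rows are invented; only the column names follow the exercise). It contrasts counting rows per game with counting games per opponent, and checks that the three approaches agree. The groupby variant here takes the first opponent of each game, which assumes every game has exactly one opponent.

```python
import numpy as np
import pandas as pd

# Hypothetical shot log: several rows per game, one opponent per game.
shots = pd.DataFrame({
    "game_id":  [1, 1, 1, 1, 2, 3, 3],
    "opponent": ["UTA", "UTA", "UTA", "UTA", "SAS", "SAS", "SAS"],
})

# Counting rows per (game_id, opponent) finds the game with the most shots.
busiest_game = shots.groupby(["game_id", "opponent"]).size().idxmax()

# Reducing to one row per game first counts games per opponent instead.
by_dedup = shots.drop_duplicates("game_id")["opponent"].value_counts().idxmax()
by_group = shots.groupby("game_id")["opponent"].first().value_counts().idxmax()
_, first_pos = np.unique(shots["game_id"].to_numpy(), return_index=True)
by_index = shots["opponent"].iloc[first_pos].value_counts().idxmax()

print(busiest_game, by_dedup, by_group, by_index)
```

In this toy log UTA has the single busiest game, but SAS is the opponent met in the most games.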

Chapter 2: Indexing

Exercise 1

import pandas as pd
df=pd.read_csv('C:\\Users\\优岚\\Desktop\\data\\UFO.csv')

# Among all sightings observed for more than 60 seconds, which shape is most common?
df[df['duration (seconds)']>60]['shape'].value_counts().index[0]

# Divide longitude from -180° to 180° into 30° bands and latitude from -90° to 90° into 18° bands. Which cell has the most reported UFO sightings?
# latitude and longitude columns
longitude_bins=pd.interval_range(start=-180,periods=12,freq=30)  # 12 longitude bands of 30° covering -180° to 180°
latitude_bins=pd.interval_range(start=-90,periods=10,freq=18)   # 10 latitude bands of 18° covering -90° to 90°
df['longitude_interval'] = pd.cut(df['longitude'],bins=longitude_bins)
df['latitude_interval'] = pd.cut(df['latitude'],bins=latitude_bins)
A=df.set_index(['longitude_interval','latitude_interval']).index.value_counts().index[0]  # the (longitude, latitude) cell with the most reported sightings
print(A)

'light'
(Interval(-90, -60, closed='right'), Interval(36, 54, closed='right'))
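
The binning step can be tried on a handful of made-up coordinates. This sketch (the points are hypothetical) builds the same `interval_range` bins, assigns each point to a cell with `pd.cut`, and counts points per cell with a two-column groupby instead of a MultiIndex `value_counts`.

```python
import pandas as pd

# Hypothetical sightings; the column names mirror the exercise.
pts = pd.DataFrame({
    "longitude": [-75.0, -80.0, -61.0, 10.0],
    "latitude":  [40.0, 45.0, 50.0, 50.0],
})

lon_bins = pd.interval_range(start=-180, periods=12, freq=30)
lat_bins = pd.interval_range(start=-90, periods=10, freq=18)

cells = pd.DataFrame({
    "lon": pd.cut(pts["longitude"], bins=lon_bins),
    "lat": pd.cut(pts["latitude"], bins=lat_bins),
})

# Count rows per (lon, lat) cell and take the most frequent one.
top_cell = cells.groupby(["lon", "lat"], observed=True).size().idxmax()
print(top_cell)
```

Note that these bins are closed on the right, so a value exactly equal to -180 or -90 would fall outside every interval and become NaN.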

Exercise 2

import pandas as pd
import numpy as np
df=pd.read_csv('C:/Users/优岚/Desktop/data/Pokemon.csv')

# What proportion of all Pokemon have two types?
df_move=df.dropna(subset=['Type 2'])   # rows whose 'Type 2' is not missing
Per=df_move.shape[0]/df.shape[0]
print('{:.2f}%'.format(Per*100))

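# Among all Pokemon whose base stat total (Total) is at least 580, what proportion are non-legendary (Legendary=False)?
df[(df['Total']>=580)]['Legendary'].value_counts(normalize=True)   # shares of True and False; the False entry is the non-legendary proportion

# Among Pokemon whose first type is Fighting, which three have the highest Attack?
df[df['Type 1']=='Fighting'].sort_values(by='Attack',ascending=False).head(3)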

# Which type (first type only) has the largest mean range across the six base stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed)? The mean is taken over each type.
df['range']=df.loc[:,'HP': 'Speed'].max(axis=1)-df.loc[:,'HP':'Speed'].min(axis=1)
df.sort_values(by='range',ascending=False).head(2)
# First types in the data: 'Grass', 'Fire', 'Water', 'Bug', 'Normal', 'Poison', 'Electric',
#        'Ground', 'Fairy', 'Fighting', 'Psychic', 'Rock', 'Ghost', 'Ice', 'Dragon', 'Dark', 'Steel', 'Flying'
range_mean = df.groupby('Type 1')['range'].mean()   # mean range for each first type
result1 = range_mean.idxmax()                       # type with the largest mean range
print(result1)

# Which type (first type only) has the highest share of legendaries among all Pokemon? Do that type's legendaries also have the highest mean Total?  Legendary means Legendary == True
print(df.query('Legendary == True')['Type 1'].value_counts().index[0])
# First types that have legendaries: 'Ice', 'Electric', 'Fire', 'Psychic', 'Water', 'Rock', 'Steel',
#        'Dragon', 'Ground', 'Normal', 'Ghost', 'Dark', 'Grass', 'Flying', 'Fairy'
total_mean = df.query('Legendary == True').groupby('Type 1')['Total'].mean()   # mean Total of legendaries for each first type
result2 = total_mean.idxmax()                                                    # type whose legendaries have the highest mean Total
print(result2)
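
The per-type means in this exercise can also be computed with `groupby(...).mean().idxmax()`, and the dual-type share with `notna().mean()`. The sketch below shows both on a hypothetical four-row table that borrows a few of the exercise's column names; the numbers are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-table with a subset of the exercise's column names.
mon = pd.DataFrame({
    "Type 1": ["Fire", "Fire", "Water", "Water"],
    "Type 2": ["Flying", np.nan, np.nan, np.nan],
    "HP":     [50, 60, 40, 45],
    "Attack": [90, 65, 50, 55],
    "Speed":  [70, 60, 45, 50],
})

# Missing second types are real NaN values, so test them with notna().
dual_share = mon["Type 2"].notna().mean()

# Row-wise range over the stat columns, then one mean per first type.
stats = mon.loc[:, "HP":"Speed"]
mon["range"] = stats.max(axis=1) - stats.min(axis=1)
mean_range = mon.groupby("Type 1")["range"].mean()
print(dual_share, mean_range.idxmax())
```

Grouping avoids looping over `index.unique()` and works the same way for types that have only a single row.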


Chapter 3: Grouping

import pandas as pd
import numpy as np
df=pd.read_csv('C:/Users/优岚/Desktop/data/Diamonds.csv')
df.head()  # the columns record carat, color, depth, and price

# Among all diamonds heavier than 1 carat, what is the range of the price?
df.query('carat>1')['price'].agg(lambda x:x.max()-x.min())

# Group by the 0.2/0.4/0.6/0.8 quantiles of depth. Which color is most common in each group? Is that color also the one with the highest mean price per carat within the group?
depth_bins=df['depth'].quantile([0,0.2,0.4,0.6,0.8,1])
depth_cuts=pd.cut(df['depth'],bins=depth_bins)
df['depth_cuts']=depth_cuts
color_result = df.groupby('depth_cuts')['color'].describe()   # group by the depth bins and summarize color with describe; the 'top' column is the most common color
color_result

df['price_per_carat']=df['price']/df['carat']
color_result['top'] == [i[1] for i in df.groupby(['depth_cuts','color'])['price_per_carat'].mean().groupby(['depth_cuts']).idxmax().values]  # boolean result per group


depth_cuts
(43.0, 60.8] False
(60.8, 61.6] False
(61.6, 62.1] False
(62.1, 62.7] True
(62.7, 79.0] True
Name: top, dtype: bool
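
The same comparison can be sketched on a small invented table. Note the swap: this version uses `pd.qcut` with integer labels rather than `quantile` plus `pd.cut`, and only two bins, to keep the toy data short. All rows are hypothetical.

```python
import pandas as pd

# Hypothetical diamonds; every carat is 1.0 so price per carat equals price.
gems = pd.DataFrame({
    "depth": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "color": ["E", "E", "D", "E", "D", "D", "D", "D", "E", "E"],
    "carat": [1.0] * 10,
    "price": [10, 10, 20, 10, 20, 30, 30, 30, 5, 5],
})
gems["ppc"] = gems["price"] / gems["carat"]
gems["bin"] = pd.qcut(gems["depth"], q=2, labels=False)  # 0 = lower half, 1 = upper half

# Most frequent color in each bin.
modal = gems.groupby("bin")["color"].agg(lambda s: s.value_counts().idxmax())

# Color with the highest mean price per carat in each bin.
mean_ppc = gems.groupby(["bin", "color"])["ppc"].mean()
priciest = mean_ppc.groupby(level="bin").idxmax().map(lambda key: key[1])

print((modal == priciest).tolist())
```

In the lower bin the most common color (E) is not the priciest per carat (D), while in the upper bin the two coincide.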

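# Group by weight (0-0.5, 0.5-1, 1-1.5, 1.5-2, 2+), sort each group by increasing depth, and find the maximum length of a run of consecutive, strictly increasing prices in each group.
carat_bins=[0,0.5,1,1.5,2,np.inf]
carat_cuts=pd.cut(df['carat'],bins=carat_bins)
df['carat_cuts']=carat_cuts
df.sort_values(by=['carat_cuts','depth'])  # sort by weight group, then by increasing depth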

Below is an answer found on Baidu:

def f(nums):
    if not nums:        
        return 0
    result = 1                            
    cur_len = 1                        
    for i in range(1, len(nums)):      
        if nums[i-1] < nums[i]:        
            cur_len += 1                
            result = max(cur_len, result)     
        else:                       
            cur_len = 1                 
    return result
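# Find the maximum length of a run of consecutive, strictly increasing prices in each group.
for name,group in df.groupby('carat_cuts'):  # iterate over the groups
    group = group.sort_values(by='depth')  # sort by increasing depth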
    s = group['price']
    print(name,f(s.tolist()))

(0.0, 0.5] 8
(0.5, 1.0] 8
(1.0, 1.5] 7
(1.5, 2.0] 11
(2.0, inf] 7
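
The loop-based helper can also be written with pandas operations only. The following is a sketch, not the method used above: `longest_increasing_run` is a hypothetical name, and the idea is to mark each rise with `diff() > 0`, start a new run id at every non-rise, and take the largest number of rises in any run plus one.

```python
import pandas as pd

def longest_increasing_run(s: pd.Series) -> int:
    """Length of the longest strictly increasing contiguous run in s."""
    if s.empty:
        return 0
    rising = s.diff() > 0           # True where a value exceeds its predecessor
    run_id = (~rising).cumsum()     # a new id starts at every non-increase
    # Each run contributes (number of rises) + 1 elements.
    return int(rising.groupby(run_id).sum().max()) + 1

prices = pd.Series([3, 4, 5, 5, 1, 2, 3, 4, 2])
print(longest_increasing_run(prices))
```

For the sample series the longest run is 1, 2, 3, 4, so the function returns 4; the repeated 5 breaks the first run because the comparison is strict.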

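# Group by color and compute the regression coefficient of price on carat for each group. (Simple linear regression with one variable, using only Pandas and Numpy.)
for name,group in df[['carat','price','color']].groupby('color'):
    L1 = np.array([np.ones(group.shape[0]),group['carat']])  # 2 x n design matrix: a row of ones for the intercept and a row of carat values
    L2 = group['price']
    result = np.linalg.inv(L1.dot(L1.T)).dot(L1).dot(L2)  # least-squares solution [intercept, slope]
    print('Color %s: intercept = %f, regression coefficient = %f'%(name,result[0],result[1]))

Color D: intercept = -2361.017152, regression coefficient = 8408.353126
Color E: intercept = -2381.049600, regression coefficient = 8296.212783
Color F: intercept = -2665.806191, regression coefficient = 8676.658344
Color G: intercept = -2575.527643, regression coefficient = 8525.345779
Color H: intercept = -2460.418046, regression coefficient = 7619.098320
Color I: intercept = -2878.150356, regression coefficient = 7761.041169
Color J: intercept = -2920.603337, regression coefficient = 7094.192092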
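
The loop above solves the least-squares normal equations: with a 2 x n matrix X whose rows are ones and carat values, the coefficients are (X Xᵀ)⁻¹ X y. A quick sanity check on hypothetical points that lie exactly on a known line, compared against `np.polyfit`, might look like this. It uses `np.linalg.solve` instead of an explicit inverse, which is a substitution on my part.

```python
import numpy as np

# Hypothetical points lying exactly on price = 100 + 50 * carat.
carat = np.array([0.5, 1.0, 1.5, 2.0])
price = 100 + 50 * carat

# 2 x n design matrix: a row of ones (intercept) and a row of carat values.
X = np.vstack([np.ones_like(carat), carat])

# Solve (X X^T) beta = X y for beta = [intercept, slope].
beta = np.linalg.solve(X @ X.T, X @ price)

# np.polyfit returns coefficients from the highest power down: [slope, intercept].
slope, intercept = np.polyfit(carat, price, 1)
print(beta, intercept, slope)
```

Both routes recover an intercept of 100 and a slope of 50 up to floating-point error.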