A Simple Introduction to Using pandas

Latest recommended article published on 2023-11-30 00:10:13

License: CC 4.0 BY-SA

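The two most commonly used objects in pandas are Series and DataFrame. The snippets below assume that pandas has been imported as pd and NumPy as np.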

Series

A Series is a one-dimensional labeled array.

s = pd.Series([1, 3, 5, np.nan, 6, 8])   
--------------
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

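The labels on the left (0, 1, 2, ...) are the index. The index can be set explicitly, or it is taken from the keys when the Series is built from a dict.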

data = {"a":1,"b":2,"c":3}
s = pd.Series(data)   
--------------
a    1
b    2
c    3

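# A custom index can also be given. With a dict, index looks labels up among the keys,
# so labels that are not keys (such as 'aa') would give NaN. Pass the values instead.
data = {"a":1,"b":2,"c":3}
s = pd.Series(list(data.values()), index=['aa','bb','cc'])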
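
As a small sketch of how the index argument interacts with a dict (the variable names picked, missing and relabeled are illustrative and not from the original post): pandas looks the requested labels up among the dict keys, so matching labels keep their values and unknown labels become NaN.

```python
import pandas as pd

data = {"a": 1, "b": 2, "c": 3}

# Labels that exist as dict keys keep their values; the order follows index.
picked = pd.Series(data, index=["c", "a"])

# Labels that are not dict keys are filled with NaN.
missing = pd.Series(data, index=["aa", "bb", "cc"])

# To relabel existing values, pass the values themselves with the new index.
relabeled = pd.Series(list(data.values()), index=["aa", "bb", "cc"])

print(picked)
print(missing)
print(relabeled)
```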

DataFrame

A DataFrame is a two-dimensional table of structured data, similar to an Excel sheet or a database table.

df2 = pd.DataFrame({'A': 1,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
--------------------------------------
   A          B    C  D      E    F
0  1 2013-01-02  1.0  3   test  foo
1  1 2013-01-02  1.0  3  train  foo
2  1 2013-01-02  1.0  3   test  foo
3  1 2013-01-02  1.0  3  train  foo

Specific rows or columns can be selected:

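df2['A']                  # select a column
df2[0:3]                  # select rows by position slice (rows 0, 1, 2)
df2.loc[0:2, ["A", "B"]]  # select by label, rows first, then columns; the end label 2 is included
df2.iloc[0:2, 0:2]        # select by integer position instead of label; the end position 2 is excluded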
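
The difference between label-based and position-based slicing is easier to see when the row labels are not integers. The following is a minimal sketch on a made-up frame (the data and the names by_label and by_position are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame with string row labels so labels and positions differ.
df = pd.DataFrame(
    {"A": [10, 20, 30, 40], "B": [1.5, 2.5, 3.5, 4.5]},
    index=["w", "x", "y", "z"],
)

by_label = df.loc["w":"y", ["A"]]   # label slice: the end label "y" is included
by_position = df.iloc[0:2, 0:1]     # position slice: the end position 2 is excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
```

Both selections return a DataFrame with the single column A, but loc returns three rows and iloc returns two.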

Summary statistics are available through describe():

df2.describe()
--------------------
         A    C    D
count  4.0  4.0  4.0   # number of non-null values
mean   1.0  1.0  3.0   # mean
std    0.0  0.0  0.0   # standard deviation
min    1.0  1.0  3.0   # minimum
25%    1.0  1.0  3.0   # first quartile
50%    1.0  1.0  3.0   # median
75%    1.0  1.0  3.0   # third quartile
max    1.0  1.0  3.0   # maximum
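
To show that count is the number of non-missing values rather than a sum, here is a short sketch on a hypothetical one-column frame with one NaN (the column name v is an assumption):

```python
import numpy as np
import pandas as pd

# One missing value out of four entries.
df = pd.DataFrame({"v": [1.0, 2.0, np.nan, 4.0]})

stats = df.describe()

# Individual statistics can be read out with loc[row_label, column_label].
count = stats.loc["count", "v"]  # NaN is not counted
mean = stats.loc["mean", "v"]    # NaN is skipped when averaging

print(stats)
```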

Transpose

df2.T  # swap rows and columns

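Sorting

print(df2.sort_index(axis=0,ascending=False))  # sort by row labels (the index), descending
print(df2.sort_index(axis=1,ascending=False))  # sort by column labels, descending
print(df2.sort_values(by='C',ascending=False))  # sort rows by the values in column C, descending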
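
Because every value in df2's column C is identical, the effect of sorting is hard to see there. A minimal sketch on a made-up frame (the columns name and score are assumptions) makes the result visible and shows that both methods return a new object:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["b", "a", "c"], "score": [2, 3, 1]},
    index=[2, 0, 1],
)

by_index = df.sort_index()                               # row labels in ascending order
by_score = df.sort_values(by="score", ascending=False)   # highest score first

print(by_index)
print(by_score)
print(df.index.tolist())  # the original frame keeps its order
```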

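Assignment

df2["A"] = [1,2,3,4]  # reassign column A; loc and iloc can also be used for assignment
num = df2[["A","C","D"]].copy()  # numeric columns only, since df2 > 0 raises TypeError on the non-numeric columns
num[num>0] = -num  # replace every value greater than 0 with its negation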
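
Assignment through loc and iloc can be sketched as follows, on a small hypothetical frame (the data is made up for illustration). The last line combines a boolean row mask with a column label, which is a common way to update only the matching cells:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, -2, 3], "B": [-4, 5, -6]})

df.loc[0, "A"] = 100           # set a single cell by row label and column label
df.iloc[1, 1] = 50             # set a single cell by row position and column position
df.loc[df["B"] < 0, "B"] = 0   # set column B to 0 wherever it is negative

print(df)
```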

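Handling NaN values

df1.dropna(how='any')  # drop every row that contains at least one missing value (df1 is a DataFrame with NaN entries)
df1.fillna(value=5)  # replace missing values with 5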
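
Since df1 is not defined in the snippets above, here is a self-contained sketch with a hypothetical df1 (the columns x and y are assumptions). Both methods return a new DataFrame and leave df1 unchanged unless the result is assigned back:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [4.0, 5.0, np.nan]})

dropped = df1.dropna(how="any")   # keep only rows with no missing values
filled = df1.fillna(value=5)      # replace each missing value with 5
mask = df1.isna()                 # boolean frame marking the missing cells

print(dropped)
print(filled)
print(mask)
```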

Example

Split an Excel file into separate sheets according to the value of one column

import pandas as pd

iris = pd.read_excel('./3000条数据最终版.xlsx')  # read the source workbook
class_list = list(iris['定位国家-中文'].drop_duplicates())  # unique values of the grouping column
# Write the rows for each value to its own sheet
with pd.ExcelWriter('./iris_sheets.xlsx') as writer:  # the file is saved and closed when the block exits
    for i in class_list:
        iris1 = iris[iris['定位国家-中文'] == i]
        iris1.to_excel(writer, sheet_name=str(i))  # Excel limits sheet names to 31 characters
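
The same split can also be expressed with groupby, which yields one sub-frame per distinct value without filtering the whole table once per value. This is only a sketch on a hypothetical in-memory frame (the column names country and value are assumptions, and no file is read or written):

```python
import pandas as pd

# Hypothetical stand-in for the workbook; "country" plays the role of the grouping column.
df = pd.DataFrame({"country": ["CN", "US", "CN", "JP"], "value": [1, 2, 3, 4]})

# groupby yields one (key, sub-frame) pair per distinct value of the column.
parts = {key: part for key, part in df.groupby("country")}

for key, part in parts.items():
    print(key, len(part))
```

Inside a pd.ExcelWriter block, each part could then be written with part.to_excel(writer, sheet_name=str(key)).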