[Pandas Data Analysis-02] Pandas Data Structures: DataFrame

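This article introduces the basic concepts and usage of the DataFrame in the Pandas library, including different ways to construct a DataFrame, how to access data through its indexes, how to modify column values, and how to handle missing values.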


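# A DataFrame is a tabular data structure. It contains an ordered collection of columns, and each column can hold a different value type (numeric, string, boolean, etc.).
# A DataFrame has both a row index and a column index; it can be thought of as a dict of Series that all share the same index.
# There are many ways to construct a DataFrame; one of the most common is to pass a dict of equal-length lists or NumPy arrays: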
import pandas as pd
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
       'year': [2000, 2001, 2002, 2001, 2001],
       'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)
frame
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2001
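
The constructor also accepts NumPy arrays as the dict values, as mentioned above. The following is a minimal sketch under that assumption; `data_np` and `frame_np` are names made up for this illustration and hold the same values as `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical variant of `data` that uses NumPy arrays instead of lists.
data_np = {
    'state': np.array(['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada']),
    'year': np.array([2000, 2001, 2002, 2001, 2001]),
    'pop': np.array([1.5, 1.7, 3.6, 2.4, 2.9]),
}
frame_np = pd.DataFrame(data_np)

# One row per array element, one column per dict key.
print(frame_np.shape)
# Each column keeps its own dtype.
print(frame_np.dtypes)
```
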
# If a sequence of columns is specified, the DataFrame's columns are arranged in that order
pd.DataFrame(data, columns=['year', 'state', 'pop'])
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2001  Nevada  2.9
# As with Series, if a requested column is not found in the data, it is filled with NA values
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                 index=['one', 'two', 'three', 'four', 'five'])
frame2
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2001  Nevada  2.9  NaN
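
The missing values in a frame built this way can be checked with `isna()`. This is a small sketch that rebuilds `frame2` so it runs on its own; it is one way to inspect NA values, not the only one.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2001],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])

# 'debt' is not a key of `data`, so every entry in that column is missing.
print(frame2['debt'].isna().all())

# Count the missing values in each column.
na_counts = frame2.isna().sum()
print(na_counts)
```
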
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
# A column of a DataFrame can be retrieved as a Series, either with dict-like notation or as an attribute
frame2['state']
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
frame2.year
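one      2000
two      2001
three    2002
four     2001
five     2001
Name: year, dtype: int64
# The returned Series has the same index as the original DataFrame, and its name attribute is set accordingly. Rows can also be retrieved by position or by label, for example with the label-based loc indexer (the ix indexer used in older pandas was removed in pandas 1.0)
frame2.loc['three']
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object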
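
Selecting a row by label and by position can be sketched with `loc` and `iloc`, the two row indexers in current pandas. The variable names `row_by_label` and `row_by_position` are made up for this example, and `frame2` is rebuilt so the block is self-contained.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2001],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])

# loc selects by index label, iloc by integer position (0-based).
row_by_label = frame2.loc['three']
row_by_position = frame2.iloc[2]

# Both refer to the same row here, so the two Series compare equal.
print(row_by_label.equals(row_by_position))

# loc also takes a (row label, column label) pair to fetch a single value.
print(frame2.loc['three', 'state'])
```
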
# Columns can be modified by assignment. For example, the empty "debt" column can be assigned a scalar value or an array of values
frame2['debt'] = 16.5
frame2
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2001  Nevada  2.9  16.5
import numpy as np
frame2['debt'] = np.arange(5.)
frame2
       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2001  Nevada  2.9   4.0
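# When a list or array is assigned to a column, its length must match the length of the DataFrame.
# If a Series is assigned, its labels are aligned exactly to the DataFrame's index, and any gaps are filled with missing values: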
val = pd.Series([-1.2, -1.5, -1.7], index = ['two', 'four', 'five'])
frame2['debt'] = val
frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2001  Nevada  2.9  -1.7
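
The two assignment rules (length matching for lists, label alignment for Series) can be sketched on a small frame. `frame_demo` and the label `'nine'` are made up for this illustration.

```python
import pandas as pd

# Hypothetical three-row frame used only for this illustration.
frame_demo = pd.DataFrame({'pop': [1.5, 1.7, 3.6]},
                          index=['one', 'two', 'three'])

# A list must have one value per row; a shorter list raises ValueError.
try:
    frame_demo['debt'] = [1.0, 2.0]
except ValueError as exc:
    print('length mismatch:', exc)

# A Series is aligned on index labels: 'two' matches a row, 'nine' matches
# nothing and is dropped, and rows without a matching label get NaN.
frame_demo['debt'] = pd.Series([-1.2, 9.9], index=['two', 'nine'])
print(frame_demo)
```
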
# Assigning to a column that does not exist creates a new column. The del keyword deletes a column:
frame2['eastern'] = frame2.state == 'Ohio'
frame2
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2001  Nevada  2.9  -1.7    False
del frame2['eastern']
frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2001  Nevada  2.9  -1.7
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
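
Two related points about adding and removing columns can be sketched as follows: a new column has to be created with bracket notation (attribute assignment does not add one), and `drop()` is a non-mutating alternative to `del`. The names `frame_demo`, `western`, and `trimmed` are made up for this example.

```python
import warnings
import pandas as pd

# Hypothetical two-row frame used only for this illustration.
frame_demo = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

# Bracket assignment creates the column.
frame_demo['eastern'] = frame_demo['state'] == 'Ohio'

# Attribute assignment only sets a Python attribute; pandas emits a warning
# (silenced here) and no column is added.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    frame_demo.western = frame_demo['state'] == 'Nevada'
print('western' in frame_demo.columns)

# drop() returns a new DataFrame and leaves the original untouched,
# whereas del removes the column in place.
trimmed = frame_demo.drop(columns=['eastern'])
print(list(trimmed.columns))
print(list(frame_demo.columns))
```
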
# Another common form of data is a nested dict (that is, a dict of dicts)
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
frame3
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
# The result can be transposed
frame3.T
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6
# The keys of the inner dicts are merged and sorted to form the final index
# This does not happen if an index is specified explicitly
pd.DataFrame(pop, index=[2001, 2002, 2003])
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN
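
How the outer and inner keys map onto the frame can be checked directly. The sketch below also uses `reindex`, a swapped-in technique rather than the constructor's `index` argument, to select a set of row labels after construction; `subset` is a name made up for this example.

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)

# Outer keys become column labels; the union of the inner keys becomes
# the row index. Nevada has no entry for 2000, so that cell is NaN.
print(sorted(frame3.columns))
print(sorted(frame3.index))
print(pd.isna(frame3.loc[2000, 'Nevada']))

# reindex keeps the requested labels in the given order; 2003 is not in
# the data, so its row is entirely NaN.
subset = frame3.reindex([2001, 2002, 2003])
print(subset)
```
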
# A dict of Series is handled in much the same way:
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
pd.DataFrame(pdata)
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
# If the name attributes of a DataFrame's index and columns are set, they are displayed as well:
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6
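
The axis names can also be set without modifying the original frame by using `rename_axis`, which returns a new DataFrame. This is a small sketch; `frame_demo` and `named` are names made up for this example.

```python
import pandas as pd

# Hypothetical frame used only for this illustration.
frame_demo = pd.DataFrame({'Nevada': [2.4, 2.9], 'Ohio': [1.7, 3.6]},
                          index=[2001, 2002])

# rename_axis returns a new DataFrame with the axis names set;
# frame_demo itself keeps its unnamed axes.
named = frame_demo.rename_axis(index='year', columns='state')
print(named.index.name, named.columns.name)
print(frame_demo.index.name)
```
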
# As with Series, the values attribute returns the data in the DataFrame as a two-dimensional ndarray:
frame3.values
array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])
# If the columns of a DataFrame have different data types, the dtype of the values array is chosen to accommodate all of the columns:
frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2001  Nevada  2.9  -1.7
frame2.values
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2001, 'Nevada', 2.9, -1.7]], dtype=object)
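
The dtype rule can be seen on two small frames. The sketch below uses `to_numpy()`, which returns the same kind of two-dimensional array as `values`; the frames `numeric` and `mixed` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frames used only for this illustration.
numeric = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5]})
mixed = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# Integer and float columns fit into a common float64 array.
print(numeric.to_numpy().dtype)

# Numeric and string columns have no common numeric type,
# so the array falls back to dtype object.
print(mixed.to_numpy().dtype)
```
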