Pandas Basics

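This article introduces the Series and DataFrame objects in Pandas. A Series is a one-dimensional array with an index; it can be treated as a specialized dictionary and supports both dictionary-style access and slicing. A DataFrame is a generalized two-dimensional array that behaves like both a dictionary and a NumPy array, and it can be created in several ways. The article also covers the properties of the Index object and the main ways of selecting data.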


Pandas Objects

import numpy as np
import pandas as pd

1. Series Objects

A Series is a one-dimensional array of indexed data.

data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

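As the output shows, the data values and the index (the first column) are bound together.

# The values attribute returns the underlying data
data.values
array([0.25, 0.5 , 0.75, 1.  ])
# The index attribute returns a pd.Index object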
data.index
RangeIndex(start=0, stop=4, step=1)

Values are retrieved by index label using square brackets.

data[1]
0.5
data[1:3]
1    0.50
2    0.75
dtype: float64

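(1) Defining the index with strings

A NumPy array accesses its values through an implicitly defined integer index.

A Series associates its values with an explicitly defined index, so the labels do not have to be integers.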

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
data['b']
0.5
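
The contrast between the implicit index of a NumPy array and the explicit index of a Series can be sketched as follows. This is a minimal illustration; the variable names (`arr`, `ser`, and so on) are chosen for this example only.

```python
import numpy as np
import pandas as pd

values = [0.25, 0.5, 0.75, 1.0]

arr = np.array(values)                               # positions 0..3 are implicit
ser = pd.Series(values, index=['a', 'b', 'c', 'd'])  # labels are given explicitly

second_from_array = arr[1]      # access by position
second_from_series = ser['b']   # access by label

# The Series holds the same data as the NumPy array underneath.
same_data = np.array_equal(ser.to_numpy(), arr)
```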

The index does not have to be contiguous or sorted.

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
data
2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64
data[5]
0.5

(2) Series as a specialized dictionary

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

The index follows the order of the dictionary keys, and dictionary-style lookup works as well.

population['California']
38332521

Unlike a dictionary, a Series also supports slicing.

population['California':'Illinois']
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
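
A short sketch of the dictionary-like behaviour described above, reusing the population figures from the example. It assumes a pandas version that keeps the insertion order of dictionary keys, as the output above does; the names `has_texas`, `labels`, and `middle` are illustrative.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})

has_texas = 'Texas' in population    # membership is tested against the index labels
labels = list(population.keys())     # keys() returns the index

# Label-based slice: both end labels are included in the result.
middle = population['Texas':'Florida']
```

A plain dictionary has no equivalent of the last line, which is what makes the Series a "specialized" dictionary.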

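(3) Creating Series objects

The general form is shown below; index is optional, and data can take several forms.

>>> pd.Series(data, index=index)
# data can be a list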
pd.Series([2, 4, 6])
0    2
1    4
2    6
dtype: int64
# data can be a scalar, which is repeated to fill the given index
pd.Series(5, index=[100, 200, 300])
100    5
200    5
300    5
dtype: int64
# data can be a dictionary
pd.Series({2:'a', 1:'b', 3:'c'})
2    a
1    b
3    c
dtype: object
# When an index is passed explicitly, only the keys listed in it are kept
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])
3    c
2    a
dtype: object
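
The effect of passing an explicit index together with a dictionary can be checked with a small sketch. The second case goes slightly beyond the example above: a label that is not a key of the dictionary (label 4 here, chosen only for illustration) is filled with NaN.

```python
import pandas as pd

source = {2: 'a', 1: 'b', 3: 'c'}

subset = pd.Series(source, index=[3, 2])         # keeps labels 3 and 2, in that order
with_missing = pd.Series(source, index=[3, 4])   # label 4 is not a key of the dict

# isna() marks the entries that had no matching key.
missing_flags = with_missing.isna().tolist()
```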

2. DataFrame Objects

(1) DataFrame as a generalized array

A DataFrame can be seen as a sequence of Series objects aligned on a shared index.

area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995}
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
area = pd.Series(area_dict)
population = pd.Series(population_dict)

states = pd.DataFrame({'population': population, 'area': area})
states
            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995

Like a Series, a DataFrame has an index attribute that gives access to the row labels.

states.index
Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

The columns attribute of a DataFrame is an Index object holding the column labels.

states.columns
Index(['population', 'area'], dtype='object')

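A DataFrame can therefore be seen as a generalized two-dimensional NumPy array whose rows and columns are both accessed through index labels.

(2) DataFrame as a specialized dictionary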

states['area']
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In a DataFrame, data['col0'] returns the column labeled col0, whereas in a two-dimensional NumPy array data[0] returns the first row.
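
The difference between column access on a DataFrame and row access on a NumPy array can be illustrated with a minimal sketch. The array contents and the column names 'x' and 'y' are made up for this example.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4], [5, 6]])
df = pd.DataFrame(arr, columns=['x', 'y'])

first_row = arr[0]    # NumPy: a single index selects a row
x_column = df['x']    # DataFrame: a single key selects a column
```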

(3) Creating DataFrame objects

(1) From a single Series object
pd.DataFrame(population,columns=['population'])
            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
(2) From a list of dictionaries
data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
pd.DataFrame(data)
   a  b
0  0  0
1  1  2
2  2  4

If some keys are missing from a dictionary, the corresponding entries are filled with NaN ("Not a Number").

pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
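
The output above also shows a side effect of the NaN fill: columns 'a' and 'c' are printed as floats, while 'b' stays an integer. A small sketch that makes this visible (the names `missing` and `dtypes` are illustrative):

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# True wherever a key was absent from the source dictionary.
missing = df.isna()

# Columns that received a NaN are stored as float64; 'b' has no gaps.
dtypes = df.dtypes.astype(str).to_dict()
```
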
(3) From a dictionary of Series objects
pd.DataFrame({'population': population, 'area': area})
            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995
(4) From a two-dimensional NumPy array
pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c'])
        foo       bar
a  0.739644  0.629162
b  0.955598  0.154161
c  0.180660  0.552287
(5) From a NumPy structured array
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A
array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])
pd.DataFrame(A)
   A    B
0  0  0.0
1  0  0.0
2  0  0.0

3. Index Objects

An Index can be viewed either as an immutable array or as an ordered set.

ind = pd.Index([2, 3, 5, 7, 11])
ind
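Int64Index([2, 3, 5, 7, 11], dtype='int64')

(The Int64Index outputs in this section come from pandas 1.x. pandas 2.0 removed that class, so newer versions print Index([2, 3, 5, 7, 11], dtype='int64') instead.)

Index as an immutable array

Indexing and slicing work much as they do on an array.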

ind[1]
3
ind[::2]
Int64Index([2, 5, 11], dtype='int64')

An Index also has many of the attributes familiar from NumPy arrays.

print(ind.size, ind.shape, ind.ndim, ind.dtype)
5 (5,) 1 int64

The difference is that an Index is immutable: its elements cannot be modified in place.
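
The immutability can be checked directly. The sketch below assumes that item assignment on an Index raises TypeError, which is the behaviour of current pandas releases; the delete/insert chain is just one possible way of building a modified copy.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[1] = 0          # item assignment is not supported on an Index
    mutated = True
except TypeError:
    mutated = False

# To "change" an Index, build a new one; the original stays untouched.
replaced = ind.delete(1).insert(1, 0)
```

This immutability is what makes it safe to share one Index between several Series and DataFrames.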

Index as an ordered set

An Index supports the usual set operations.

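indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
indA.intersection(indB)  # intersection
Int64Index([3, 5, 7], dtype='int64')
indA.union(indB)  # union
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
indA.symmetric_difference(indB)  # symmetric difference
Int64Index([1, 2, 9, 11], dtype='int64')

Older pandas versions also accepted the operators &, | and ^ for these set operations. Since pandas 2.0 those operators act element-wise on an Index, so the method forms shown here are the portable choice.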

Data Selection

1. Series Selection

(1) Series as a dictionary

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
data['b']
0.5

Dictionary-style expressions can be used to check keys, index labels, and items.

'a' in data
True
data.keys()
Index(['a', 'b', 'c', 'd'], dtype='object')
list(data.items())
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
# Assigning to a new label extends the Series
data['e'] = 1.25
data
a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

(2) Series as a one-dimensional array

# Slicing by explicit index (labels)
data['a':'c']
a    0.25
b    0.50
c    0.75
dtype: float64
# Slicing by implicit index (integer positions)
data[0:2]
a    0.25
b    0.50
dtype: float64
# Masking
data[(data > 0.3) & (data < 0.8)]
b    0.50
c    0.75
dtype: float64
# Fancy indexing
data[['a', 'e']]
a    0.25
e    1.25
dtype: float64

Explicit (label) slicing includes the final label.

Implicit (positional) slicing excludes the final position.
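
The two slicing conventions can be compared side by side. This sketch uses the loc and iloc indexers so that the choice between labels and positions is spelled out in the code; the names `by_label` and `by_position` are illustrative.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = data.loc['a':'c']     # label slice: the end label 'c' is included
by_position = data.iloc[0:2]     # positional slice: position 2 is excluded
```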

Indexers: loc, iloc, and ix

If a Series has an explicit integer index, data[1] uses the explicit index (labels),

while data[1:3] uses the implicit index (positions).

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
1    a
3    b
5    c
dtype: object
# Explicit index: label 1
data[1]
'a'
# Implicit index: positions 1 and 2
data[1:3]
3    b
5    c
dtype: object

Because this is easy to confuse, Pandas provides indexer attributes that state the scheme explicitly.

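# loc: explicit index (labels)
data.loc[1]
'a'
# loc: label slice, end label included
data.loc[1:3]
1    a
3    b
dtype: object
# iloc: implicit index (positions)
data.iloc[1]
'b'
# iloc: positional slice, end position excluded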
data.iloc[1:3]
3    b
5    c
dtype: object

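ix was a hybrid of the two. It was deprecated in pandas 0.20 and removed in pandas 1.0, so current code should use loc or iloc.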
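
Where code once relied on ix to mix positions and labels, one possible replacement is to translate positions into labels through the index and then stay with loc. The sketch below shows this pattern on the Series from above; it is one option, not the only one.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

label_hit = data.loc[3]        # label 3
position_hit = data.iloc[2]    # position 2

# Mixed case: start at position 0, stop at label 3.
start_label = data.index[0]    # turn the position into its label
mixed = data.loc[start_label:3]
```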

2. DataFrame Selection

(1) DataFrame as a dictionary

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
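# sort_index() gives the alphabetical row order shown below
# (recent pandas versions keep the insertion order of the dictionary keys instead)
data = pd.DataFrame({'area':area, 'pop':pop}).sort_index()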
data
              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135
New York    141297  19651127
Texas       695662  26448193
data['area']
California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64
data.area
California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64
data.area is data['area']
True

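If a column name collides with a DataFrame method, attribute-style access returns the method rather than the column. Attribute-style access should also be avoided when assigning values: use data['pop'] = z rather than data.pop = z.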

data.pop is data['pop']
False
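
The name collision can be confirmed with a short sketch. It uses a two-row subset of the data above to keep the example small; the variable names are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [38332521, 26448193]},
                    index=['California', 'Texas'])

attr_is_method = callable(data.pop)   # data.pop is the DataFrame.pop method
column = data['pop']                  # bracket access always reaches the column

# Bracket assignment is the reliable way to add or replace a column.
data['density'] = data['pop'] / data['area']
```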
# Add a new column
data['density'] = data['pop'] / data['area']
data
              area       pop     density
California  423967  38332521   90.413926
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
New York    141297  19651127  139.076746
Texas       695662  26448193   38.018740

(2) DataFrame as a two-dimensional array

# View the underlying data as an array
data.values
array([[  4.23967000e+05,   3.83325210e+07,   9.04139261e+01],
       [  1.70312000e+05,   1.95528600e+07,   1.14806121e+02],
       [  1.49995000e+05,   1.28821350e+07,   8.58837628e+01],
       [  1.41297000e+05,   1.96511270e+07,   1.39076746e+02],
       [  6.95662000e+05,   2.64481930e+07,   3.80187404e+01]])
# Transpose rows and columns
data.T
           California       Florida      Illinois      New York         Texas
area     4.239670e+05  1.703120e+05  1.499950e+05  1.412970e+05  6.956620e+05
pop      3.833252e+07  1.955286e+07  1.288214e+07  1.965113e+07  2.644819e+07
density  9.041393e+01  1.148061e+02  8.588376e+01  1.390767e+02  3.801874e+01
data.values[0]
array([  4.23967000e+05,   3.83325210e+07,   9.04139261e+01])
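# Implicit index: positions start at 0 and the end position is excluded
data.iloc[:3, :2]
              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135
# Explicit index: labels, with the end label included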
data.loc[:'Illinois', :'pop']
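              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135
# Mixed: positions for rows, labels for columns
# (data.ix was removed in pandas 1.0; the row positions are converted to labels here)
data.loc[data.index[:3], :'pop']
              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135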
# Combine masking with fancy indexing
data.loc[data.density > 100, ['pop', 'density']]
               pop     density
Florida   19552860  114.806121
New York  19651127  139.076746
# Modify a value
data.iloc[0, 2] = 90
data
              area       pop     density
California  423967  38332521   90.000000
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
New York    141297  19651127  139.076746
Texas       695662  26448193   38.018740

(3) Other selection conventions

# Slicing selects rows, not columns
data['Florida':'Illinois']
            area       pop     density
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763
data[1:3]
            area       pop     density
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763
# Masking also selects rows
data[data.density > 100]
              area       pop     density
Florida   170312  19552860  114.806121
New York  19651127  139.076746