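Pandas is a powerful toolkit for analyzing structured data. It is built on NumPy, which provides high-performance array and matrix operations, and is used for data mining and data analysis; it also offers data-cleaning features.
If you have not learned NumPy yet, this post is a good place to start:
https://blog.youkuaiyun.com/splnn/article/details/111596363
For more detail, see the pandas Chinese site: https://www.pypandas.cn/
pandas has two main data structures: Series and DataFrame.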
Series

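A Series is a one-dimensional labeled array that can hold integers, floats, strings, Python objects, and other data types. The axis labels are collectively called the index.
Creating a Series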
import pandas as pd
import numpy as np
# Create a Series; with no index given, pandas uses 0, 1, 2, ...
s = pd.Series(['a','b','c','d','e'])
print(s)
# 0 a
# 1 b
# 2 c
# 3 d
# 4 e
A Series accepts an index argument that sets the list of index labels.
Unlike the keys of a dict, the index labels of a Series may repeat.
import pandas as pd
import numpy as np
# Create a Series with an explicit index
# Unlike a dict, a Series allows duplicate index labels
s = pd.Series(['a','b','c','d','e'],index=[100,200,100,400,500])
print(s)
# 100 a
# 200 b
# 100 c
# 400 d
# 500 e
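Because duplicate labels are allowed, it can help to check for them before relying on label lookups. The following is a minimal sketch using the same Series; the variable names `has_unique_labels` and `repeated_labels` are only illustrative.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd', 'e'], index=[100, 200, 100, 400, 500])

# is_unique is False as soon as any label appears more than once
has_unique_labels = s.index.is_unique

# duplicated() marks the second and later occurrences of each label
repeated_labels = s.index[s.index.duplicated()].tolist()

print(has_unique_labels)  # False
print(repeated_labels)    # [100]
```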
Instantiating from a dict
import pandas as pd
import numpy as np
# Create a Series from a dict: the keys become the index labels
d = {'b': 1, 'a': 0, 'c': 2}
s = pd.Series(d)
print(s)
# b 1
# a 0
# c 2
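When a dict is combined with an explicit index, pandas picks values by label in the order the index gives, and a label that is not a key in the dict becomes NaN. A short sketch, assuming an extra label 'd' that is not in the original dict:

```python
import pandas as pd

d = {'b': 1, 'a': 0, 'c': 2}

# The index controls both the order and which labels appear;
# 'd' is not a key of the dict, so its value is missing (NaN)
s = pd.Series(d, index=['a', 'b', 'c', 'd'])
print(s)
```

Since NaN is a floating-point value, the resulting Series is stored as floats rather than integers.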
Use the values and index attributes of a Series to get its array representation and its index object
import pandas as pd
import numpy as np
# Create a Series from a dict
d = {'b': 1, 'a': 0, 'c': 2}
s = pd.Series(d)
print(s.values)
print(s.index)
# [1 0 2]
# Index(['b', 'a', 'c'], dtype='object')
Selecting a single value or a group of values from a Series by index label
import pandas as pd
import numpy as np
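# Create a Series with a duplicated label (100)
s = pd.Series(['a', 'b', 'c', 'd', 'e'], index=[100, 200, 100, 400, 500])
# Unlike a plain NumPy array, a Series lets you select one value or a group of values by index label;
# because label 100 appears twice, s[100] returns both matching rows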
print(s[100])
print(s[[400, 500]])
# 100 a
# 100 c
# dtype: object
# 400 d
# 500 e
# dtype: object
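With an integer index like this one, square brackets are read as labels, which can be easy to confuse with positions. A minimal sketch of the explicit accessors, `.loc` for labels and `.iloc` for positions, on the same Series (the variable names are illustrative):

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd', 'e'], index=[100, 200, 100, 400, 500])

# .loc selects by label; label 100 appears twice, so a Series comes back
by_label = s.loc[100]

# a label that appears once returns a single value
unique_label = s.loc[200]

# .iloc selects by integer position, regardless of the labels
by_position = s.iloc[0]

print(by_label.tolist())  # ['a', 'c']
print(unique_label)       # b
print(by_position)        # a
```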
Simple arithmetic
import pandas as pd
import numpy as np
s = pd.Series(np.array([1,2,3,4,5]), index=['a', 'b', 'c', 'd', 'e'])
print(s)
# a 1
# b 2
# c 3
# d 4
# e 5
# dtype: int32
# Element-wise addition
print(s+s)
# a 2
# b 4
# c 6
# d 8
# e 10
# dtype: int32
# Multiply every element by a scalar
print(s*3)
# a 3
# b 6
# c 9
# d 12
# e 15
# dtype: int32
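One of the most important features of Series is that arithmetic operations automatically align data that have different indexes; a label present in only one operand yields NaN in the result.
The main difference between a Series and an ndarray is that operations between Series align the data by label automatically, so you can write a computation without checking whether the Series involved share the same labels.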
import pandas as pd
import numpy as np
obj1 = pd.Series({"Ohio": 35000, "Oregon": 16000, "Texas": 71000, "Utah": 5000})
print(obj1)
# Ohio 35000
# Oregon 16000
# Texas 71000
# Utah 5000
# dtype: int64
obj2 = pd.Series({"California": np.nan, "Ohio": 35000, "Oregon": 16000, "Texas": 71000})
print(obj2)
# California NaN
# Ohio 35000.0
# Oregon 16000.0
# Texas 71000.0
# dtype: float64
print(obj1 + obj2)
# California NaN
# Ohio 70000.0
# Oregon 32000.0
# Texas 142000.0
# Utah NaN
# dtype: float64
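If a label missing from one side should count as zero instead of producing NaN, the `add` method takes a `fill_value` argument. A small sketch reusing the two Series above; `total` is an illustrative name:

```python
import numpy as np
import pandas as pd

obj1 = pd.Series({"Ohio": 35000, "Oregon": 16000, "Texas": 71000, "Utah": 5000})
obj2 = pd.Series({"California": np.nan, "Ohio": 35000, "Oregon": 16000, "Texas": 71000})

# fill_value=0 stands in for a value that is missing on one side only
total = obj1.add(obj2, fill_value=0)
print(total)
```

Utah now keeps its value of 5000, while California stays NaN because it is missing on both sides: absent from obj1 and NaN in obj2.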
Indexing and slicing
import pandas as pd
import numpy as np
s = pd.Series(np.array([1,2,3,4,5]), index=['a', 'b', 'c', 'd', 'e'])
print(s[1:])
# b 2
# c 3
# d 4
# e 5
# dtype: int32
print(s[:-1])
# a 1
# b 2
# c 3
# d 4
# dtype: int32
print(s[1:] + s[:-1])
# a NaN
# b 4.0
# c 6.0
# d 8.0
# e NaN
# dtype: float64
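The slices above are positional. Slicing by label behaves a little differently: the end label is included. A minimal sketch with the same data, using `.iloc` and `.loc` to make the two kinds of slice explicit:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# Positional slice: the end position is excluded, as in plain Python
by_position = s.iloc[1:3]

# Label slice: the end label 'd' is included
by_label = s.loc['b':'d']

print(by_position.index.tolist())  # ['b', 'c']
print(by_label.index.tolist())     # ['b', 'c', 'd']
```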
DataFrame

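A DataFrame is a two-dimensional labeled data structure whose columns can hold different types, similar to an Excel spreadsheet, a SQL table, or a dict of Series objects. It is the most commonly used pandas object and, like Series, it accepts many kinds of input data:
A dict of one-dimensional ndarrays, lists, dicts, or Series
A two-dimensional numpy.ndarray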
import pandas as pd
import numpy as np
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002], 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)
print(frame)
# state year pop
# 0 Ohio 2000 1.5
# 1 Ohio 2001 1.7
# 2 Ohio 2002 3.6
# 3 Nevada 2001 2.4
# 4 Nevada 2002 2.9
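The example above covers the dict-of-lists input. For the other input type mentioned, a two-dimensional numpy.ndarray, a short sketch might look like this; the row labels 'r1' to 'r3' and column names 'x' and 'y' are made up for illustration:

```python
import numpy as np
import pandas as pd

# A 3x2 array: [[0, 1], [2, 3], [4, 5]]
arr = np.arange(6).reshape(3, 2)

# Rows of the array become rows of the DataFrame;
# index and columns supply the row and column labels
frame = pd.DataFrame(arr, index=['r1', 'r2', 'r3'], columns=['x', 'y'])
print(frame)
```

If index and columns are left out, pandas numbers both rows and columns from 0.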
Specifying the column order
import pandas as pd
import numpy as np
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002], 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
# If a column order is given, the DataFrame's columns are arranged in that order
frame1 = pd.DataFrame(data, columns=['year', 'state', 'pop'])
print(frame1)
# year state pop
# 0 2000 Ohio 1.5
# 1 2001 Ohio 1.7
# 2 2002 Ohio 3.6
# 3 2001 Nevada 2.4
# 4 2002 Nevada 2.9
If a requested column is not found in the data, it is filled with NaN values
import pandas as pd
import numpy as np
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002], 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
# If a column order is given, the DataFrame's columns are arranged in that order
frame1 = pd.DataFrame(data, columns=['year', 'state', 'pop'])
print(frame1)
# year state pop
# 0 2000 Ohio 1.5
# 1 2001 Ohio 1.7
# 2 2002 Ohio 3.6
# 3 2001 Nevada 2.4
# 4 2002 Nevada 2.9
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['one', 'two', 'three', 'four', 'five'])
print(frame2)
# year state pop debt
# one 2000 Ohio 1.5 NaN
# two 2001 Ohio 1.7 NaN
# three 2002 Ohio 3.6 NaN
# four 2001 Nevada 2.4 NaN
# five 2002 Nevada 2.9 NaN
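Building a DataFrame from a dict of Series; the row index is the union of the Series indexes, and missing entries become NaN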
import pandas as pd
import numpy as np
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
print(pd.DataFrame(d))
# one two
# a 1.0 1.0
# b 2.0 2.0
# c 3.0 3.0
# d NaN 4.0
A DataFrame column can be retrieved as a Series either with dict-like notation or as an attribute
import pandas as pd
import numpy as np
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002], 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
# If a column order is given, the DataFrame's columns are arranged in that order
frame1 = pd.DataFrame(data, columns=['year', 'state', 'pop'])
print(frame1)
# year state pop
# 0 2000 Ohio 1.5
# 1 2001 Ohio 1.7
# 2 2002 Ohio 3.6
# 3 2001 Nevada 2.4
# 4 2002 Nevada 2.9
print(frame1['state'])
# 0 Ohio
# 1 Ohio
# 2 Ohio
# 3 Nevada
# 4 Nevada
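The example above uses the dict-like form. Attribute access works for a column such as state, but with this particular data it has a catch: `pop` is also the name of a DataFrame method, so `frame1.pop` refers to the method and not to the column. A minimal sketch of both cases:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame1 = pd.DataFrame(data, columns=['year', 'state', 'pop'])

# Attribute access and dict-like access return the same column
same_column = frame1.state.equals(frame1['state'])

# 'pop' collides with the DataFrame.pop method, so the attribute is the method
pop_is_method = callable(frame1.pop)

# Bracket notation always reaches the column
pop_column = frame1['pop']

print(same_column)    # True
print(pop_is_method)  # True
```

For that reason bracket notation is the safer default, and it is also the only form that works for column names that are not valid Python identifiers.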
Columns can be modified by assignment; assigning to a column name that does not exist yet creates a new column
import pandas as pd
import numpy as np
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [
2000, 2001, 2002, 2001, 2002], 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
# If a column order is given, the DataFrame's columns are arranged in that order
frame1 = pd.DataFrame(data, columns=['year', 'state', 'pop'])
print(frame1)
# year state pop
# 0 2000 Ohio 1.5
# 1 2001 Ohio 1.7
# 2 2002 Ohio 3.6
# 3 2001 Nevada 2.4
# 4 2002 Nevada 2.9
frame1['pop'] = 5.
print(frame1)
# year state pop
# 0 2000 Ohio 5.0
# 1 2001 Ohio 5.0
# 2 2002 Ohio 5.0
# 3 2001 Nevada 5.0
# 4 2002 Nevada 5.0
import pandas as pd
import numpy as np
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [
2000, 2001, 2002, 2001, 2002], 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
# If a column order is given, the DataFrame's columns are arranged in that order
frame1 = pd.DataFrame(data, columns=['year', 'state', 'pop'])
print(frame1)
# year state pop
# 0 2000 Ohio 1.5
# 1 2001 Ohio 1.7
# 2 2002 Ohio 3.6
# 3 2001 Nevada 2.4
# 4 2002 Nevada 2.9
frame1['new'] = frame1['pop'] + 10
print(frame1)
# year state pop new
# 0 2000 Ohio 1.5 11.5
# 1 2001 Ohio 1.7 11.7
# 2 2002 Ohio 3.6 13.6
# 3 2001 Nevada 2.4 12.4
# 4 2002 Nevada 2.9 12.9
import pandas as pd
import numpy as np
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [
2000, 2001, 2002, 2001, 2002], 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
# If a column order is given, the DataFrame's columns are arranged in that order
frame1 = pd.DataFrame(data, columns=['year', 'state', 'pop'])
print(frame1)
# year state pop
# 0 2000 Ohio 1.5
# 1 2001 Ohio 1.7
# 2 2002 Ohio 3.6
# 3 2001 Nevada 2.4
# 4 2002 Nevada 2.9
frame1['new'] = frame1['pop'] + 10
frame1['newnew'] = np.arange(5.)
print(frame1)
# year state pop new newnew
# 0 2000 Ohio 1.5 11.5 0.0
# 1 2001 Ohio 1.7 11.7 1.0
# 2 2002 Ohio 3.6 13.6 2.0
# 3 2001 Nevada 2.4 12.4 3.0
# 4 2002 Nevada 2.9 12.9 4.0
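A list or array assigned to a column must match the DataFrame's length, but a Series is aligned by index label instead, and rows without a matching label get NaN. The sketch below reuses the frame2 layout from earlier; the Series `val` and its values are made up for illustration:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])

# Only three of the five row labels are present in val
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

# Assignment aligns on the index: 'one' and 'three' receive NaN
frame2['debt'] = val
print(frame2)

# del removes a column, in the same way as deleting a dict key
del frame2['pop']
print(frame2.columns.tolist())  # ['year', 'state', 'debt']
```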
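That is as far as I have gotten with the pandas basics; the course I followed is on aistudio.
For more, see the pandas Chinese site: https://www.pypandas.cn/
If you spot a mistake in this post or have questions about the content, feel free to discuss in the comments.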





