NumPy & Pandas - CSDN Blog

Viewing the help documentation

(1) help

# help(object)
help(sum)

(2) The Shift + Tab key combination

(3) ?function_name

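Keyboard shortcuts (Jupyter Notebook, command mode)

A: insert a new cell above the current cell

B: insert a new cell below the current cell

D (pressed twice): delete the current cell

M: convert the current cell to Markdown

Y: convert the current cell to code

Shift + M: merge the current cell with the cell directly below it

Enter: enter edit mode and start editing the current cell

Shift + Enter: run the current cell and move to the next cell

Ctrl + Enter: run the current cell and stay on it

Alt + Enter: run the current cell and insert a new cell below it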

NumPy

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Read an image file into an array
pic = plt.imread('./b.png')

# Shape of the array
pic.shape

# Display the image
pics = plt.imshow(pic)

# One-dimensional array of zeros
np.zeros(10)
# array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])


# Array of ones
np.ones(5)
# array([1., 1., 1., 1., 1.])


# Two-dimensional array filled with 2
np.full((3, 4), 2)
"""
array([[2, 2, 2, 2],
       [2, 2, 2, 2],
       [2, 2, 2, 2]])
"""

arr = np.array([[0, 1], [2, 3]])
print(arr)
"""
[[0 1]
 [2 3]]
"""
print(arr[0, 1])  # first row, second column: prints 1
arr[0, 1] = 5
print(arr)  # prints the modified array
"""
[[0 5]
 [2 3]]
"""


arr1 = np.array([[0, 1], [2, 3]])
arr2 = np.array([[4, 5], [6, 7]])
display(arr1,arr2)
"""
array([[0, 1],
       [2, 3]])
array([[4, 5],
       [6, 7]])
"""

sum_arr = arr1 + arr2
mul_arr = arr1 * arr2
display(sum_arr,mul_arr)
"""
array([[ 4,  6],
       [ 8, 10]])
array([[ 0,  5],
       [12, 21]])
"""

arr_a = np.array([[1], [2], [3]])
arr_b = np.array([[4, 5, 6, 7]])
result = arr_a + arr_b  # broadcasting: shapes (3, 1) and (1, 4) give (3, 4)
display(arr_a,arr_b,result)
"""
array([[1],
       [2],
       [3]])
array([[4, 5, 6, 7]])
array([[ 5,  6,  7,  8],
       [ 6,  7,  8,  9],
       [ 7,  8,  9, 10]])
"""

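The addition above works because NumPy broadcasting stretches any axis of length 1 to match the other operand. The sketch below makes that stretching visible with np.broadcast_to and shows a pair of shapes that cannot be broadcast. The names col, row, stretched and compatible are illustrative only.

```python
import numpy as np

col = np.array([[1], [2], [3]])   # shape (3, 1)
row = np.array([[4, 5, 6, 7]])    # shape (1, 4)

# The length-1 axes are stretched, so the result has shape (3, 4)
print((col + row).shape)          # (3, 4)

# broadcast_to shows what col looks like once stretched to (3, 4)
stretched = np.broadcast_to(col, (3, 4))
print(stretched.tolist())         # [[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]]

# Shapes (3,) and (4,) differ and neither is 1, so broadcasting fails
try:
    np.arange(3) + np.arange(4)
    compatible = True
except ValueError:
    compatible = False
print(compatible)                 # False
```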
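retstep=True: an optional parameter. When it is set to True, np.linspace returns a tuple: the evenly spaced samples plus a float giving the step, that is, the fixed distance between adjacent samples.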

np.linspace(1,10,4,retstep = True)

# (array([ 1.,  4.,  7., 10.]), 3.0)
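As a quick check of what retstep reports, the sketch below compares the returned step with the spacing computed by hand as (stop - start) / (num - 1), which holds for the default endpoint=True. The variable names are illustrative only.

```python
import numpy as np

start, stop, num = 1, 10, 4

# retstep=True returns a (samples, step) tuple instead of only the samples
samples, step = np.linspace(start, stop, num, retstep=True)

# With the default endpoint=True the spacing is (stop - start) / (num - 1)
expected_step = (stop - start) / (num - 1)

print(samples)                               # [ 1.  4.  7. 10.]
print(step)                                  # 3.0
print(np.allclose(np.diff(samples), step))   # True
```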

Geometric sequences (np.logspace; the default base is 10)

np.logspace(1,10,10)
# array([1.e+01, 1.e+02, 1.e+03, 1.e+04, 1.e+05, 1.e+06, 1.e+07, 1.e+08,1.e+09, 1.e+10])


# base sets the base of the powers: 2^1 through 2^10
np.logspace(1,10,10,base=2)
# array([   2.,    4.,    8.,   16.,   32.,   64.,  128.,  256.,  512.,1024.])

Shape: reshape()

np.arange(1,13).reshape(3,4)
"""
array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])
"""


np.arange(1,13).reshape(2,6)
"""
array([[ 1,  2,  3,  4,  5,  6],
       [ 7,  8,  9, 10, 11, 12]])
"""


np.ones((3,4))
"""
array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])
"""

The _like functions create an array with the same shape and dtype as arr

arr = np.arange(1,13).reshape(3,4)
arr
"""
array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])
"""

# Array of ones with the same shape and dtype as arr
np.ones_like(arr)
"""
array([[1, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1]])
"""

# Array of zeros with the same shape and dtype as arr
np.zeros_like(arr)
"""
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])
"""

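Identity matrices:

identity creates a standard identity matrix and can only produce square matrices. eye creates a generalized identity matrix (it may be non-square): np.eye(m, n, k), where k=0 selects the main diagonal, a positive k selects a diagonal above it, and a negative k selects a diagonal below it.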

np.identity(3)
"""
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
"""

np.eye(3)
"""
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
"""

np.eye(4,5)
"""
array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.]])
"""

np.eye(4,5,0)
"""
array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.]])
"""

np.eye(4,5,-1)
"""
array([[0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.]])
"""

np.eye(4,5,1)
"""
array([[0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])
"""

Reshaping

reshape(), transpose() (transpose), T (transpose), flatten() (flatten to one dimension)

arr = np.arange(1,13).reshape(3,4)
arr
"""
array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])
"""

arr.transpose()
"""
array([[ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11],
       [ 4,  8, 12]])
"""

arr.T
"""
array([[ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11],
       [ 4,  8, 12]])
"""

arr.flatten()
"""
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])
"""

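Logical operations

Comparison operators: >, <, >=, <=, ==, !=

& (and), | (or), ~ (not), or the equivalent functions np.logical_and(), np.logical_or(), np.logical_not(), operate element-wise on boolean arrays.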

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
display(a,b)
"""
array([1, 2, 3])
array([4, 5, 6])
"""

a > b
"""array([False, False, False])"""

a < b
"""array([ True,  True,  True])"""

np.logical_and(a,b)
"""array([ True,  True,  True])"""

np.logical_or(a,b)
"""array([ True,  True,  True])"""

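The calls above pass integer arrays to np.logical_and and np.logical_or, where any non-zero value counts as True. A more common pattern combines comparison results with & and |. The sketch below assumes a small illustrative array named values. Each comparison needs its own parentheses because & and | bind more tightly than the comparison operators.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6])

# Element-wise AND of two boolean masks
in_range = (values > 2) & (values < 6)

# Element-wise OR, and NOT of the first mask
outside = (values <= 2) | (values >= 6)
also_outside = ~in_range

print(in_range)                                # [False False  True  True  True False]
print(values[in_range])                        # [3 4 5]
print(np.array_equal(outside, also_outside))   # True

# The operator form matches the function form
print(np.array_equal(in_range, np.logical_and(values > 2, values < 6)))  # True
```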
Splitting

# Split into three arrays of length 2.
arr = np.array([1, 2, 3, 4, 5, 6])
split_arrays = np.split(arr, 3)
display(arr,split_arrays)
"""
array([1, 2, 3, 4, 5, 6])
[array([1, 2]), array([3, 4]), array([5, 6])]
"""

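np.split with an integer section count requires the array to divide evenly; otherwise it raises ValueError. Two alternatives may help, sketched here on an illustrative seven-element array named arr7: pass explicit split positions, or use np.array_split, which allows pieces of unequal length.

```python
import numpy as np

arr7 = np.arange(1, 8)  # [1 2 3 4 5 6 7]

# Split at explicit positions: arr7[:2], arr7[2:5], arr7[5:]
by_position = np.split(arr7, [2, 5])

# array_split accepts a length that is not divisible by the section count
uneven = np.array_split(arr7, 3)

# np.split with a count that does not divide the length raises ValueError
try:
    np.split(arr7, 3)
    raised = False
except ValueError:
    raised = True

print([p.tolist() for p in by_position])  # [[1, 2], [3, 4, 5], [6, 7]]
print([p.tolist() for p in uneven])       # [[1, 2, 3], [4, 5], [6, 7]]
print(raised)                             # True
```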
Pandas

Series

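In Pandas, a Series is a one-dimensional array object that can hold any data type (integers, strings, floats, Python objects, and so on). A Series resembles a one-dimensional array or list, but its axis carries labels. These labels (the index) are what make label-based selection and automatic alignment possible.

Creating a Series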

import pandas as pd
import numpy as np

# 1. Create a Series from a list
list_series = pd.Series([1, 2, 3, 4, 5])
"""
0    1
1    2
2    3
3    4
4    5
dtype: int64
"""

# 2. Create a Series from a list and a custom index
index = ['a', 'b', 'c', 'd', 'e']
indexed_series = pd.Series([1, 2, 3, 4, 5], index=index)
"""
a    1
b    2
c    3
d    4
e    5
dtype: int64
"""


# 3. Create a Series from a dictionary (the keys become the index)
dict_series = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5})
"""
a    1
b    2
c    3
d    4
e    5
dtype: int64
"""

# 4. Create a Series from a NumPy array
np_array = np.array([1, 2, 3, 4, 5])
np_series = pd.Series(np_array)
"""
0    1
1    2
2    3
3    4
4    5
dtype: int64
"""

# 5. Create a Series from a column of a Pandas DataFrame
# First create a simple DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})
# Selecting a single column returns a Series
df_column_series = df['A']
"""
0    1
1    2
2    3
Name: A, dtype: int64
"""

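Attributes

The examples use this Series:

index = ['a', 'b', 'c', 'd', 'e']
indexed_series = pd.Series([1, 2, 3, 4, 5], index=index)
indexed_series.values
# printed form: [1 2 3 4 5]

values: returns the values of the Series as a NumPy array. This attribute is the bridge between Pandas and NumPy: it converts Pandas data into a NumPy array.

array([1, 2, 3, 4, 5], dtype=int64)

index: returns the index of the Series. The index can be inspected directly and can also be reassigned.

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

size: the number of elements in the Series, as an integer.

Output: 5

dtype: the data type of the values. Pandas supports many data types; dtype reports the current one, and the astype method returns a copy converted to another type.

dtype('int64')

name: the name of the Series, which can be used to identify the Series object.

index.name: the name of the index, which can be used to identify the index.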
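The name, index.name and astype items above have no example in the text, so here is a minimal sketch. The Series scores and the names 'score' and 'label' are made up for illustration.

```python
import pandas as pd

scores = pd.Series([1, 2, 3], index=['a', 'b', 'c'], name='score')
scores.index.name = 'label'

print(scores.name)          # score
print(scores.index.name)    # label

# The Series name becomes the column name when converting to a DataFrame
frame = scores.to_frame()
print(list(frame.columns))  # ['score']
print(frame.index.name)     # label

# astype returns a new Series; the original keeps its dtype
as_float = scores.astype('float64')
print(as_float.dtype)       # float64
print(scores.size)          # 3
```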

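Methods

The examples use this Series:

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

head() and tail(): return the first or last few elements of a Series, which is handy for a quick first look at the data.

# View the first few elements with head()
head_elements = s.head(3)

# View the last few elements with tail()
tail_elements = s.tail(2)

isnull() and notnull(): detect missing data (NaN values) in a Series. isnull() returns a boolean Series indicating whether each element is missing; notnull() returns the opposite.

# Detect missing values with isnull()
isnull_values = s.isnull()

Detecting missing data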

score = {
    '语文':99,
    '数学':np.nan,
    '英语':97
}
s = pd.Series(score)
s
"""
语文    99.0
数学     NaN
英语    97.0
"""

# isnull() checks whether each value is missing
# equivalent: pd.isnull(s)
s.isnull()
"""
语文    False
数学     True
英语    False
"""

# notnull() checks whether each value is present
# equivalent: pd.notnull(s)
s.notnull()
"""
语文     True
数学    False
英语     True
"""




# Filter data with boolean indexing
data = s.isnull()
data
"""
语文    False
数学     True
英语    False
"""
# Negate the mask to keep the non-missing values
s[~data]
"""
语文    99.0
英语    97.0
"""

data = s.notnull()
data
"""
语文     True
数学    False
英语     True
"""
s[data]
"""
语文    99.0
英语    97.0
"""

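mean(), sum(), min(), max(): compute the mean, sum, minimum, and maximum of a Series.

unique() and value_counts(): return the distinct values in a Series and the number of times each distinct value occurs.

sort_values(): sorts a Series by value, in ascending or descending order.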
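The methods listed above have no example in the text, so here is a minimal sketch on a made-up Series named grades. Only the count of the most frequent value is printed from value_counts(), since the order of tied counts is not something to rely on.

```python
import pandas as pd

grades = pd.Series([3, 1, 2, 3, 3, 0])

# Basic statistics
print(grades.mean(), grades.sum(), grades.min(), grades.max())  # 2.0 12 0 3

# Distinct values, in order of first appearance
print(grades.unique().tolist())   # [3, 1, 2, 0]

# Occurrences per distinct value (indexed by the value itself)
counts = grades.value_counts()
print(counts.loc[3])              # 3

# Sorting by value returns a new Series; the original is unchanged
print(grades.sort_values().tolist())                 # [0, 1, 2, 3, 3, 3]
print(grades.sort_values(ascending=False).tolist())  # [3, 3, 3, 2, 1, 0]
```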

apply(): applies a function to every element of a Series and returns a new Series.

# Apply a function to every element with apply() (here: squaring)
squared_series = s.apply(lambda x: x ** 2)

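map(): replaces the elements of a Series according to a supplied mapping. Values that have no entry in the mapping become NaN.

# Replace elements with map(); the keys below assume s holds the integers 1 to 5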
mapping = {1: 'one', 2: 'two', 3: 'three', 4: 'four', 5: 'five'}  
mapped_series = s.map(mapping)

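Indexing

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

1. Label indexing

Use the .loc[] accessor to access data by label (usually a string or a custom object).
Example:

value = s.loc['b'] # value at label 'b': 2

2. Positional indexing

Use the .iloc[] accessor to access data by integer position, like indexing a Python list.
value = s.iloc[2] # value at position 2: 3

3. Slice indexing

Use the colon (:) to slice a Series by integer position, like slicing a Python list.
Example:
sliced_series = s[1:4] # values at positions 1 through 3 (position 4 is excluded)
Note: a plain integer slice works on positions, not labels.

4. Boolean indexing

Use a conditional expression: a boolean sequence selects the matching values from the Series.
Example:
filtered_series = s[s > 2] # all values greater than 2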

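5. Direct integer indexing

Without an accessor: a Series can be indexed with a plain integer, much like a list. This is only unambiguous when the Series has the default integer index, where label and position are the same.
s_default = pd.Series([1, 2, 3, 4, 5])  # default integer index 0..4
value = s_default[2] # label 2, which is also position 2 here: 3

Notes:

When a Series has a custom index, direct integer indexing is easy to misread. With an integer-valued index, s[2] selects the label 2, not the third element. With a non-integer index it has historically fallen back to position, a behavior that recent pandas versions deprecate. Use .iloc[] whenever you mean a position.
The main difference between .loc[] and .iloc[] is that the former indexes by label and the latter by position. If a Series has a non-integer or non-contiguous index, take particular care over which one you use.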
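To make the label/position distinction concrete, here is a small sketch with integer labels that deliberately do not match the positions. The Series t is made up for illustration.

```python
import pandas as pd

# Integer labels in a different order than the positions (hypothetical data)
t = pd.Series([10, 20, 30], index=[2, 0, 1])

print(t.loc[2])    # 10 -> the element whose label is 2
print(t.iloc[2])   # 30 -> the third element
print(t[2])        # 10 -> plain brackets use the label when the index holds integers
```

Because t[2] and t.iloc[2] return different elements here, spelling out .loc[] or .iloc[] keeps the intent visible to the reader.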

.loc

The .loc[] accessor selects data from a Series by label. Labels are usually strings or custom objects, and they correspond to the index of the Series.

s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# Access a value by label with .loc[]
value_at_b = s.loc['b']  # value at label 'b': 20

# Label slices select several values
sliced_by_label = s.loc['b':'d']  # from label 'b' through label 'd' ('d' is included)

.iloc

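The .iloc[] accessor selects data from a Series by integer position, like indexing a Python list.

value_at_position_2 = s.iloc[2]
# value at position 2: 30 (positions are counted from 0)

# Position slices select several values
sliced_by_position = s.iloc[1:4]  # positions 1 through 3 (position 4 is excluded)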

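Summary

.loc[] is label based: it looks up the values matching the given labels and raises a KeyError if a label does not exist.
.iloc[] is position based: it only considers the integer position of the data in the Series. Even when the Series has a custom, non-integer, or non-contiguous index, .iloc[] still accesses data by integer position.
When the Series has the default integer index (starting at 0), direct integer indexing (such as s[2]) and .iloc[2] return the same element, because label and position coincide. With any other index, .iloc[] guarantees that access is by position.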
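The summary can be checked with a short sketch. It assumes the same five-element Series as the .loc section, rebuilt here under the illustrative name labeled, and shows the error each accessor raises for a missing key as well as the inclusive versus exclusive slice ends.

```python
import pandas as pd

labeled = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# A missing label raises KeyError
try:
    labeled.loc['z']
    loc_error = None
except KeyError as exc:
    loc_error = type(exc).__name__

# An out-of-range position raises IndexError
try:
    labeled.iloc[10]
    iloc_error = None
except IndexError as exc:
    iloc_error = type(exc).__name__

print(loc_error)    # KeyError
print(iloc_error)   # IndexError

# Label slices include the end label; position slices exclude the end position
print(labeled.loc['b':'d'].tolist())   # [20, 30, 40]
print(labeled.iloc[1:4].tolist())      # [20, 30, 40]
```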

Series arithmetic

Element-wise operations

s = pd.Series(np.random.randint(0,100,size=10))
s + 100
s - 100
s * 2
s /3
s//5
s**2
s % 5

s1 = pd.Series(np.random.randint(0,100,size=3))
s2 = pd.Series(np.random.randint(0,100,size=3))
s1 + s2
s1 - s2
s1 * s2
s1 / s2
s1 // s2
s1 % s2

Statistical operations

s1 = pd.Series([1, 2, 3, 4, 5])

# Mean
mean_value = s1.mean()
print(mean_value)
# Output: 3.0

# Standard deviation (sample standard deviation, ddof=1 by default)
std_value = s1.std()
print(std_value)
# Output: 1.5811388300841898

# Sum
sum_value = s1.sum()
print(sum_value)
# Output: 15

# Minimum
min_value = s1.min()
print(min_value)
# Output: 1

# Maximum
max_value = s1.max()
print(max_value)
# Output: 5

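When two Series are combined in an operation, their indexes are aligned first. If the indexes do not match completely, any label that is missing from one of the operands yields NaN (not a number) in the result.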

s1 = pd.Series(np.random.randint(0,100,size=3))
s2 = pd.Series(np.random.randint(0,100,size=4))

"""
s1
0    43
1     8
2    17

s2
0    11
1    83
2     7
3    41
"""

s1 + s2
"""
0    54.0
1    91.0
2    24.0
3     NaN
"""

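After the index of s2 is changed, label 3 still yields NaN, because values are paired by index label rather than by position.

There is no broadcasting between Series of different lengths.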

s2.index = [3,1,2,0]
"""
3    11
1    83
2     7
0    41
"""

s1 + s2
"""
0    84.0
1    91.0
2    24.0
3     NaN
"""

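To compute a result that ignores the missing entries, use the fill_value parameter, which supplies a substitute for a value missing from one of the operands.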

s1.add(s2,fill_value=0)
"""
0    84.0
1    91.0
2    24.0
3    11.0
"""

DataFrame

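In Pandas, a DataFrame is a two-dimensional, size-mutable, tabular data structure whose columns can hold different data types.

Creating a DataFrame

Creating from a dictionary

Use a Python dictionary to create a DataFrame: the keys become the column names, and each value is a list or array holding the data for that column.

import pandas as pd

# A dictionary whose keys are column names and whose values are lists of data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 32, 18, 47],
    'City': ['New York', 'Paris', 'London', 'Tokyo']
}

# Create the DataFrame from the dictionary
df = pd.DataFrame(data)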

# A DataFrame can also be built from a 2-D NumPy array with explicit row and column labels
df = pd.DataFrame(
    data = np.random.randint(10,100,size=(4,6)),
    index=['A','B','C','D'],
    columns=['语文','数学','英语','物理','化学','生物']
)

Creating from Series objects

import pandas as pd

# Create two Series objects
name_series = pd.Series(['Alice', 'Bob', 'Charlie', 'David'])
age_series = pd.Series([25, 32, 18, 47])

# Build the DataFrame from a dictionary of Series
df = pd.DataFrame({'Name': name_series, 'Age': age_series})

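Creating from an Excel file

# Create DataFrames from an Excel file
# data.xlsx contains two sheets: school information (position 0) and student information (position 1)
# Student information, selected by sheet position
Stu_df = pd.read_excel('data.xlsx', sheet_name=1)

# School information, selected by sheet name
Sc_df = pd.read_excel('data.xlsx', sheet_name='学校信息')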

Creating a DataFrame from a CSV file follows the same steps, using pd.read_csv in place of pd.read_excel.
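A minimal sketch of the CSV case follows. To keep it runnable without a file on disk, it reads from an in-memory string through io.StringIO; the CSV contents and the name csv_df are made up for illustration. With a real file you would pass its path to pd.read_csv instead.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file (hypothetical contents)
csv_text = "Name,Age,City\nAlice,25,New York\nBob,32,Paris\n"

# The first line is used as the column names by default
csv_df = pd.read_csv(io.StringIO(csv_text))

print(csv_df.shape)            # (2, 3)
print(list(csv_df.columns))    # ['Name', 'Age', 'City']
print(csv_df.loc[1, 'City'])   # Paris
```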

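Indexes

A DataFrame has two main indexes: the row index (also called the index) and the column index (the column names).

Row index

The row index identifies each row of the DataFrame. By default, a newly created DataFrame gets a row index of integers starting at 0, but you can also specify a custom index for the rows.

# Create a simple DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)
df

In this example, the row index is the default integer sequence (0, 1, 2).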

Custom row index

# Create a DataFrame with a custom index
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
index = ['x', 'y', 'z']
df = pd.DataFrame(data, index=index)
df

In this example, the DataFrame is given the custom row index ['x', 'y', 'z'].

Column index

# Create a simple DataFrame and inspect its column index
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 32, 18]}
df = pd.DataFrame(data)

# Inspect the column index
print(df.columns)
# Index(['Name', 'Age'], dtype='object')

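Accessing a single element

Use the .loc[] or .iloc[] accessor with a combination of row and column index to access a single element. .loc[] indexes by label, while .iloc[] indexes by integer position.

# Create a simple DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Access an element by label with .loc[]
element = df.loc[0, 'A']  # row label 0 (the first row), column 'A'
print(element)  # Output: 1

# Access an element by integer position with .iloc[]
element_position = df.iloc[0, 0]  # first row, first column
print(element_position)  # Output: 1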

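Accessing a whole column or row

Use a column name or a row label to access an entire column or row.

# Access a whole column
column_data = df['A']  # the column named 'A'
print(column_data)

# Access a whole row (with .loc[] and the row label)
row_data = df.loc[0]  # the row labeled 0, which is the first row here
print(row_data)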

Filtering data with a condition

# Keep the rows where column 'A' is greater than 1
filtered_data = df[df['A'] > 1]
print(filtered_data)

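Accessing multiple rows or columns

# Access multiple rows with a slice
rows_slice = df[0:2]  # the first and second rows (the third row is excluded)
print(rows_slice)

# Access multiple columns with a list of names
columns_slice = df[['A', 'B']]  # the columns named 'A' and 'B'
print(columns_slice)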

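`.loc[]`: label-based indexing

The .loc[] accessor selects data by row label and column label. It takes the row labels and column labels as arguments and returns the corresponding data.

Accessing a single element

# Create a simple DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data, index=['x', 'y', 'z'])
display(df)

# Access a single element with .loc[]
element = df.loc['x', 'A']  # row label 'x', column label 'A'
print(element)  # Output: 1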

Accessing multiple rows or columns

# Access multiple rows
rows = df.loc[['x', 'z']]
print(rows)

# Access multiple columns
cols = df.loc[:, ['A', 'B']]
print(cols)

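Using a conditional expression

.loc[] can also be combined with a conditional expression to filter data.

# Keep the rows where column 'A' is greater than 1
filtered_data = df.loc[df['A'] > 1]
print(filtered_data)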
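Building on the filter above, .loc[] also accepts a column selector next to the boolean mask, and the same expression can be used on the left side of an assignment. The sketch below rebuilds the small DataFrame under the illustrative name frame so that it runs on its own; the conditions are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])

# Rows where A > 1 and B < 6, keeping only column 'B'
subset = frame.loc[(frame['A'] > 1) & (frame['B'] < 6), 'B']
print(subset.to_dict())       # {'y': 5}

# Assigning through .loc[] updates the matching rows of frame itself
frame.loc[frame['A'] > 1, 'B'] = 0
print(frame['B'].tolist())    # [4, 0, 0]
```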

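`.iloc[]`: integer-position-based indexing

The .iloc[] accessor selects data by the integer positions of rows and columns. It takes integer row and column positions as arguments.

Accessing a single element

# Access a single element with .iloc[]
element_position = df.iloc[0, 0]  # first row, first column (positions start at 0)
print(element_position)  # Output: 1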

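Accessing multiple rows or columns

Use a slice or a list of integers to access multiple rows or columns.

# Access the first and second rows
rows_position = df.iloc[0:2]  # note: the slice includes the start and excludes the end
print(rows_position)

# Access the first and second columns
cols_position = df.iloc[:, [0, 1]]
print(cols_position)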