Pandas Study Notes: Task 02

晃晃我的半瓶水

Article link: https://blog.youkuaiyun.com/qq_41834327/article/details/111411662

import numpy as np
import pandas as pd

pd.__version__

'1.1.5'

path = r"C:\Users\yongx\Desktop\data"

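'''
pandas reads files with read_csv, read_table and read_excel.
Commonly used shared parameters:
1. header : row number to use as the column names (header=None means the file has no header row).
2. index_col : column(s) to use as the index.
3. usecols : subset of columns to read.
4. parse_dates : columns to parse as datetimes.
5. nrows : number of data rows to read.

When reading a txt file, sep sets the delimiter. A multi-character sep is interpreted as a regular expression, so special characters such as | must be escaped with a backslash, and engine="python" is required.
'''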
df_csv = pd.read_csv(path + '\\my_csv.csv')
df_txt = pd.read_table(path + '\\my_table.txt')
df_excel = pd.read_excel(path + '\\my_excel.xlsx')
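The shared parameters listed above can be tried without any file on disk. The following is a minimal sketch: the CSV text and the `|||`-separated text are made-up stand-ins for real files, and `io.StringIO` is used only so the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk
csv_text = "col1,col2,col3\n2,a,2020/1/1\n3,b,2020/1/2\n6,c,2020/1/5\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    header=0,                  # row 0 holds the column names
    index_col="col1",          # use col1 as the row index
    usecols=["col1", "col3"],  # read only these two columns
    parse_dates=["col3"],      # parse col3 as datetimes
    nrows=2,                   # read only the first two data rows
)
print(df)

# A multi-character separator is treated as a regular expression,
# so each pipe is escaped; this needs the Python parsing engine.
txt = "a|||b\n1|||2\n"
df_sep = pd.read_csv(io.StringIO(txt), sep=r"\|\|\|", engine="python")
print(df_sep)
```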

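'''
pandas writes data with to_csv and to_excel; to_csv can also save txt files (pass sep to choose the delimiter).

Note: to_markdown and to_latex quickly convert a table to Markdown and LaTeX. to_markdown requires the tabulate package to be installed.

'''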

df_csv.to_csv(path + '\\my_to_csv.csv')
df_txt.to_csv(path + '\\my_to_table.txt', sep = '\t')  # tab-separated, the default delimiter of read_table
df_excel.to_excel(path + '\\my_to_excel.xlsx')
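A quick way to check an export is to read it straight back. This is a small sketch under the assumption that a tab-separated txt file is wanted; the frame is invented, and a `StringIO` buffer replaces the file path so nothing is written to disk.

```python
import io
import pandas as pd

df = pd.DataFrame({"col1": [2, 3], "col2": ["a", "b"]})

# Write tab-separated text without the row index
buf = io.StringIO()
df.to_csv(buf, sep="\t", index=False)

# Read it back with the same separator
restored = pd.read_csv(io.StringIO(buf.getvalue()), sep="\t")
print(restored)
```

Dropping `index=False` would add the row index as an extra unnamed column on the way back in.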

print(df_csv.to_markdown())

|    |   col1 | col2   |   col3 | col4   | col5     |
|---:|-------:|:-------|-------:|:-------|:---------|
|  0 |      2 | a      |    1.4 | apple  | 2020/1/1 |
|  1 |      3 | b      |    3.4 | banana | 2020/1/2 |
|  2 |      6 | c      |    2.5 | orange | 2020/1/5 |
|  3 |      5 | d      |    3.2 | lemon  | 2020/1/7 |

print(df_csv.to_latex())

\begin{tabular}{lrlrll}
\toprule
{} &  col1 & col2 &  col3 &    col4 &      col5 \\
\midrule
0 &     2 &    a &   1.4 &   apple &  2020/1/1 \\
1 &     3 &    b &   3.4 &  banana &  2020/1/2 \\
2 &     6 &    c &   2.5 &  orange &  2020/1/5 \\
3 &     5 &    d &   3.2 &   lemon &  2020/1/7 \\
\bottomrule
\end{tabular}

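'''
pandas has two basic data structures: Series, a one-dimensional container, and DataFrame, a two-dimensional table.

Series: made up of the values (data), the index, the storage type (dtype) and the name. Each part is accessed as an attribute with the '.' operator (for example s.values, s.index, s.dtype, s.name).

DataFrame: adds a column index on top of what a Series has, so it is built from two-dimensional data plus row and column labels.
df[col_name] selects a single column and returns a Series; df[col_list] selects several columns and returns a DataFrame.
Rows are selected in a similar way with '.iloc' (by integer position) and '.loc' (by index label).

'''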
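The pieces described above can be seen on two small objects. The values, labels and names here are invented for illustration.

```python
import pandas as pd

# A Series with its four parts: values, index, dtype and name
s = pd.Series([100, 200, 300], index=["a", "b", "c"], dtype="int64", name="my_series")
print(s.values, s.index, s.dtype, s.name)

# A DataFrame built from 2-D data plus row and column labels
df = pd.DataFrame(
    [[1, "x", 1.5], [2, "y", 2.5], [3, "z", 3.5]],
    index=["r1", "r2", "r3"],
    columns=["c1", "c2", "c3"],
)

one_col = df["c1"]            # single name -> Series
two_cols = df[["c1", "c3"]]   # list of names -> DataFrame
by_label = df.loc["r2"]       # row selected by index label
by_pos = df.iloc[1]           # row selected by integer position
```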

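'''
pandas functions for inspecting data
Overview:
head : first n rows (n defaults to 5).
tail : last n rows (n defaults to 5).
info : summary of the columns, non-null counts and dtypes.
describe : descriptive statistics of the numeric columns.

Statistics:
sum, mean, median, var (variance), std (standard deviation),
max, min, quantile, count (number of non-missing values),
idxmax (index of the maximum), idxmin (index of the minimum)

Unique values:
unique : array of the unique values in a Series
nunique : number of unique values
value_counts : unique values with their frequencies
drop_duplicates : unique combinations across several columns (returns the de-duplicated rows)
duplicated : flags repeated combinations (returns a boolean Series, True for a repeat, False otherwise)
'''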
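A short sketch of these functions on a made-up table (the column names and values are assumptions, not data from the exercises):

```python
import pandas as pd

df = pd.DataFrame({
    "kind": ["apple", "banana", "apple", "lemon", "banana", "apple"],
    "size": ["S", "M", "S", "L", "M", "L"],
    "weight": [1.4, 3.4, 1.6, 3.2, 3.0, None],
})

print(df.head(2))
print(df["weight"].describe())

n_valid = df["weight"].count()        # missing values are not counted
heaviest = df["weight"].idxmax()      # index label of the maximum
median = df["weight"].quantile(0.5)

kinds = df["kind"].unique()           # array of distinct values
n_kinds = df["kind"].nunique()
freq = df["kind"].value_counts()

# Unique (kind, size) combinations, and a flag for each repeated row
combos = df.drop_duplicates(["kind", "size"])
is_dup = df.duplicated(["kind", "size"])
```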

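'''
pandas replacement functions
Replacement falls into three kinds: mapping replacement, logical replacement and numeric replacement.

Mapping replacement covers replace, str.replace and cat.codes.
replace accepts a pair of lists (or a dict); in pandas 1.x it also accepts method='ffill' or method='bfill' to replace a value with the previous or the next one.

Logical replacement covers where and mask, which differ only in which side of the condition is replaced.
where replaces the values where the condition is False.
mask replaces the values where the condition is True.

Numeric replacement covers round (round to a given precision), abs (absolute value) and clip (truncate to bounds).

'''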
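The three kinds of replacement can be compared side by side. This is an illustrative sketch with invented values; the directional form of `replace` is left out because it depends on the pandas version.

```python
import pandas as pd

s = pd.Series([-1, 1.2345, 100, -50])

# Mapping replacement
mapped = pd.Series(["F", "M", "F"]).replace({"F": "female", "M": "male"})
text = pd.Series(["a-b", "c-d"]).str.replace("-", "_", regex=False)
codes = pd.Series(["low", "high", "low"]).astype("category").cat.codes

# Logical replacement
kept = s.where(s > 0)        # values failing the condition become NaN
masked = s.mask(s > 0, 0)    # values meeting the condition become 0

# Numeric replacement
rounded = s.round(2)
absolute = s.abs()
clipped = s.clip(0, 2)       # below 0 becomes 0, above 2 becomes 2
```

Here `cat.codes` numbers the categories in sorted order, so "high" is 0 and "low" is 1.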

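# Exercise 1: replace values outside the clip bounds with a custom value
A = pd.Series([-1, 1.2345, 100, -50])
lower = 1
upper = 100
num = 1234
# Keep values inside [lower, upper] (bounds included, as clip does) and replace the rest with num
A.where((A >= lower) & (A <= upper), num)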

'''
pandas sorting functions
Data can be sorted by value or by index:
sort_values : sort by values
sort_index : sort by index
'''
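Both sorting functions on a small, invented frame with a deliberately shuffled index:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["c", "a", "b", "a"], "score": [3, 1, 2, 5]},
    index=[10, 30, 20, 40],
)

by_value = df.sort_values("score", ascending=False)
# Several keys: name ascending, then score descending within each name
by_multi = df.sort_values(["name", "score"], ascending=[True, False])
by_index = df.sort_index()
```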

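'''
apply is commonly used to iterate over the rows or columns of a DataFrame. Its argument is a function that takes a Series as input, and a lambda is a convenient way to write it.

apply is very flexible but comparatively slow: for the same task, a pandas built-in method is usually much faster than the same computation routed through apply.
Reserve it for cases that genuinely need custom logic.
'''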
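To make the comparison concrete, here is the same computation written with apply and with the built-in methods. The frame is a toy example; on data this small no speed difference is visible, the point is only that the results match.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# apply passes each column (default) or each row (axis=1) to the function as a Series
col_means_apply = df.apply(lambda col: col.mean())
row_sums_apply = df.apply(lambda row: row.sum(), axis=1)

# Built-in equivalents, preferred whenever they exist
col_means_builtin = df.mean()
row_sums_builtin = df.sum(axis=1)
```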

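'''
pandas has three kinds of window objects: the rolling window (rolling), the expanding window (expanding) and the exponentially weighted window (ewm).

Rolling window: rolling()
Window functions such as mean or sum only work on a window object, which is obtained with the .rolling method.
Three rolling-like functions are called directly on the Series instead:
shift : value of the n-th element before the current one
diff : difference from the n-th element before the current one (unlike numpy's diff, where n is the order of the difference)
pct_change : growth rate relative to the n-th element before the current one


Expanding window (cumulative window): expanding()
A window of dynamic length that runs from the start of the sequence to the current position; the aggregation function is applied to this gradually growing window.

'''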
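A small sketch of the functions described above, on an invented series:

```python
import pandas as pd

s = pd.Series([1, 3, 6, 10, 15])

# Rolling-like functions, called directly on the Series
shifted = s.shift(2)        # value two positions earlier
diffed = s.diff(2)          # s - s.shift(2)
growth = s.pct_change(1)    # s / s.shift(1) - 1

# Window objects: build the window first, then aggregate
roll_mean = s.rolling(window=3).mean()   # NaN until three values are available
exp_mean = s.expanding().mean()          # mean of everything seen so far
```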

# Exercise 2: a rolling window that looks towards later elements instead of earlier ones
A = pd.Series([1,2,3,4,5])
print("Exercise 2:\n", A + A.shift(-1))
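The shift-based answer above is tied to a window of two. One possible generalisation, offered as a sketch and not as the exercise's official solution, is to reverse the series, apply an ordinary rolling window, and reverse the result back:

```python
import pandas as pd

A = pd.Series([1, 2, 3, 4, 5])

# Reverse, roll over the reversed order, then restore the original order
backward_sum = A[::-1].rolling(window=2).sum()[::-1]
print(backward_sum)
```

For a window of 2 this gives the same values as `A + A.shift(-1)`, and changing `window` is enough to widen it.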

# Exercise 3: reproduce cummax, cumsum and cumprod with an expanding object
A = pd.Series([1,2,3,4,5])
A.expanding().max()                       # cummax
A.expanding().sum()                       # cumsum
A.expanding().apply(lambda x: x.prod())   # cumprod: product of every element in the window


df = pd.read_csv(path + '\\pokemon.csv')
df.head(3)

	#	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed
0	1	Bulbasaur	Grass	Poison	318	45	49	49	65	65	45
1	2	Ivysaur	Grass	Poison	405	60	62	63	80	80	60
2	3	Venusaur	Grass	Poison	525	80	82	83	100	100	80

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 12 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   #        800 non-null    int64 
 1   Name     800 non-null    object
 2   Type 1   800 non-null    object
 3   Type 2   414 non-null    object
 4   Total    800 non-null    int64 
 5   HP       800 non-null    int64 
 6   Attack   800 non-null    int64 
 7   Defense  800 non-null    int64 
 8   Sp. Atk  800 non-null    int64 
 9   Sp. Def  800 non-null    int64 
 10  Speed    800 non-null    int64 
 11  valize   800 non-null    bool  
dtypes: bool(1), int64(8), object(3)
memory usage: 69.7+ KB

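# 1  Check that Total equals the sum of the six stats
# Name the columns explicitly: df.columns[-7:] also picks up 'Total' itself, which makes every check False
stat_cols = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']
df['valize'] = df[stat_cols].sum(axis = 1) == df['Total']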

# 2  Keep one record per '#'
df = df[~df['#'].duplicated()]
# Reference answer:
dp_dup = df.drop_duplicates('#', keep = 'first')

# 2.a 
df['Type 1'].nunique()

# 2.b
df['Type 1'].value_counts()[:3]
# Reference answer:
dp_dup['Type 1'].value_counts().index[:3]

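# 2.c  Type combinations in l_full that never occur in the data
# Reference answer:
type2 = df['Type 2'].dropna().unique().tolist() + ['']  # drop NaN (it cannot be joined to a string); '' stands for "no second type"
l_full = [i + ' ' + j for i in df['Type 1'].unique() for j in type2]
l_part = [i + ' ' + j for i, j in zip(df['Type 1'], df['Type 2'].fillna(''))]
res = set(l_full).difference(set(l_part))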


# 3.a  Bucket Attack into 'low' (< 50), 'high' (> 120) and 'mid' (everything else)
attack = df['Attack']
# Every condition is tested on the original values, so the codes 0/1/2 cannot collide with real Attack values
codes = attack.mask(attack < 50, 0).mask(attack > 120, 2).mask((attack >= 50) & (attack <= 120), 1)
df['Attack_str'] = codes.replace({0 : 'low', 1 : 'mid', 2 : 'high'})
# Reference answer:
df['Attack'].mask(df['Attack']>120, 'high').mask(df['Attack']<50, 'low').mask((df['Attack'] >= 50)&(df['Attack']<=120),  'mid').head()

# 3.b  Upper-case Type 1 with apply (the result has to be assigned back)
df['Type 1_str1'] = df['Type 1'].apply(lambda x : x.upper())


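'''
ord : convert a character to its ASCII code
chr : convert an ASCII code to a character
'''
low = [chr(ord('a')+ i) for i in range(26)]
up = [chr(ord('A')+ i) for i in range(26)]
# regex=True makes replace substitute letters inside each string rather than match whole values
df['Type 1_str2'] = df['Type 1'].replace(low, up, regex = True)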


# Reference answer
df['Type 1'].replace({i : str.upper(i) for i in df['Type 1'].unique()}).head()
df['Type 1'].apply(lambda x : str.upper(x)).head()

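# 3.c  Largest absolute deviation of the six stats from their per-row median

stat_cols = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']
stats = df[stat_cols]
df['deviation'] = stats.sub(stats.median(axis = 1), axis = 0).abs().max(axis = 1)
# Reference answer
df['Deviation'] = df[stat_cols].apply(lambda x : np.max((x - x.median()).abs()), 1)
df.sort_values('Deviation', ascending = False).head()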

	#	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	valize
0	1	Bulbasaur	Grass	Poison	318	45	49	49	65	65	45	False
1	2	Ivysaur	Grass	Poison	405	60	62	63	80	80	60	False
2	3	Venusaur	Grass	Poison	525	80	82	83	100	100	80	False


np.random.seed(0)
s = pd.Series(np.random.randint(-1, 2, 30).cumsum())
s.head()

0   -1
1   -1
2   -2
3   -2
4   -2
dtype: int32

Answer

def ewm_alpha(x, alpha = 0.2):
    win = (1 - alpha) ** np.arange(x.shape[0])[::-1]
    res = (win * x).sum() / win.sum()
    return res
s.expanding().apply(ewm_alpha).head()

s.rolling(window = 4).apply(ewm_alpha)
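One way to sanity-check `ewm_alpha` is to compare the expanding version with pandas' built-in `ewm`. The sketch below assumes the default `adjust=True`, under which `ewm(...).mean()` uses the same normalised weights `(1 - alpha)**i`; the series is rebuilt with the same seed so the block runs on its own.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
s = pd.Series(np.random.randint(-1, 2, 30).cumsum())

def ewm_alpha(x, alpha=0.2):
    # The newest observation gets weight 1, older ones decay by (1 - alpha)
    win = (1 - alpha) ** np.arange(x.shape[0])[::-1]
    return (win * x).sum() / win.sum()

manual = s.expanding().apply(ewm_alpha)
builtin = s.ewm(alpha=0.2).mean()   # adjust=True is the default

print(np.allclose(manual, builtin))
```

The rolling variant with `window = 4` has no such built-in counterpart here, because it truncates the weights to the last four observations.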

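alpha = 0.2
weight = np.arange(5)

# Weights (1 - alpha)**i for i = 0..4
w = np.power(1 - alpha, weight)
[(1-alpha)**i for i in range(5)]

# Same weighted mean as ewm_alpha, written inline; apply must return a scalar per window
s.expanding().apply(lambda x : (x * np.power(1 - alpha, np.arange(len(x))[::-1])).sum() / np.power(1 - alpha, np.arange(len(x))).sum())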