Pandas String Processing (pandas str.replace)

In the previous sections we processed a string column like this:

df['bWendu'] = df['bWendu'].str.replace('°C', '').astype('int32')  
## Remove the '°C' suffix from the bWendu column, then cast the column to int32 so it can be used in further numeric processing

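Key points about string processing in pandas:

Usage: get the str attribute of a Series first, then call a method on that attribute
It works only on string columns, not on numeric columns
A DataFrame has no str attribute and no string methods
Series.str is not Python's built-in str; it is pandas' own set of methods, most of which closely mirror the str ones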
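The rules above can be checked on a tiny example. This is only a sketch: the DataFrame `demo` and its values are made up for illustration and are not the weather file used in the rest of this post.

```python
import pandas as pd

# Hypothetical toy data, not the weather CSV
demo = pd.DataFrame({"temp": ["3°C", "-5°C"], "aqi": [59, 48]})

# .str is an accessor on a Series; its methods return a new Series
lengths = demo["temp"].str.len()

# A numeric column has no .str accessor: pandas raises AttributeError
try:
    demo["aqi"].str.len()
    numeric_ok = True
except AttributeError:
    numeric_ok = False

# A DataFrame itself has no .str attribute either
frame_has_str = hasattr(demo, "str")

print(lengths.tolist(), numeric_ok, frame_has_str)
```

Here `lengths` holds one length per row, while both the numeric column and the DataFrame refuse the `.str` access.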

0. Read the data

import pandas as pd

file_path = r'C:\TELCEL_MEXICO_BOT\A\Weather.csv'
df = pd.read_csv(file_path,encoding='utf-8')
print(df.head(3))  ## show the first 3 rows

        ymd bWendu yWendu tianqi fengxiang fengji  aqi aqiInfo  aqiLevel
0  1/1/2025  -25°C   -6°C   晴~多云       西北风   1-2级   59       优         2
1  1/2/2025    2°C   -9°C      阴       东南风   3-4级   48       优         1
2  1/3/2025  -11°C   -2°C   晴~多云        西风   4-8级   28       良         1

# Check the data type of each column

print(df.dtypes)
ymd          object
bWendu       object
yWendu       object
tianqi       object
fengxiang    object
fengji       object
aqi           int64   ## numeric column
aqiInfo      object
aqiLevel      int64   ## numeric column
dtype: object

Note: the Series.str accessor cannot be used on numeric columns, only on string columns
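If string methods are really needed on a numeric column, one common workaround is to convert it to strings first with `astype(str)`. A minimal sketch, assuming a made-up `aqi` Series rather than the real column:

```python
import pandas as pd

# Hypothetical numeric column standing in for df['aqi']
aqi = pd.Series([59, 48, 128], name="aqi")

# Convert to strings first; after that the .str accessor is available
aqi_text = aqi.astype(str)

padded = aqi_text.str.zfill(3)             # left-pad with zeros to width 3
is_two_digit = aqi_text.str.len() == 2     # boolean Series

print(padded.tolist())
print(is_two_digit.tolist())
```

The original numeric Series is left unchanged; `aqi_text` is a separate string copy.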

1. Get the str attribute of a Series and call string-processing functions on it

df['bWendu'] = df['bWendu'].str.replace('°C', '').astype('int32')
print(df['bWendu'])
0    -25
1      2
2    -11
3      0
4      3
5     10
6      7
7    -13
8      8
9      3
10     5
11     7
12     2
13   -14
14     7
15     5
16   -19
17     3
18     1
Name: bWendu, dtype: int32

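print(df['yWendu'].str.isnumeric()) ## check, element by element, whether each yWendu string consists only of numeric characters (False here, because the values still contain '-' and '°C')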
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
Name: yWendu, dtype: bool
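The all-False result is easier to understand on a few hand-picked strings. The sketch below uses made-up sample values; it shows that `str.isnumeric()` tests the characters of each string, not the dtype of the column.

```python
import pandas as pd

# Hypothetical sample strings
samples = pd.Series(["-6°C", "-6", "6", "3.5"])

# True only when every character in the string is a numeric character
flags = samples.str.isnumeric()

print(flags.tolist())
```

Only "6" passes: a minus sign, a decimal point, or the '°C' suffix each make the result False.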

2. Use str methods such as startswith and contains to get a boolean Series for conditional queries

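condition = df['ymd'].str.startswith('1/2025')  # boolean Series; all False here, because the dates are M/D/YYYY (e.g. '1/1/2025') and none starts with '1/2025'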
print(condition)
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
Name: ymd, dtype: bool
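To actually select rows with such a condition, the prefix has to match the M/D/YYYY layout of the dates. A minimal sketch on made-up rows (the DataFrame `demo` is hypothetical):

```python
import pandas as pd

# Hypothetical rows in the same M/D/YYYY layout as the weather file
demo = pd.DataFrame({
    "ymd": ["1/1/2025", "1/2/2025", "2/1/2025"],
    "aqi": [59, 48, 28],
})

# January dates start with '1/' ('10/', '11/', '12/' do not match this prefix)
in_january = demo["ymd"].str.startswith("1/")

# Use the boolean Series as a row filter
january_rows = demo[in_january]

# contains() with regex=False looks for a literal substring
day_two = demo["ymd"].str.contains("/2/", regex=False)

print(january_rows)
print(day_two.tolist())
```

The boolean Series from `startswith` or `contains` can be passed straight into `demo[...]` or `demo.loc[...]`.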

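3. Chaining several str operations

How can we extract a numeric year-month such as 202501 from a date?

First convert a date such as 1/1/2025 into the form 20250101
Then take the year-month string 202501
Note: the dates in this file are M/D/YYYY without zero padding, so the code below, which only removes '/', yields 112025 rather than 20250101, and its first 6 characters are not a year-month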

print(df['ymd'].str.replace('/','')) # replace '/' with an empty string
0      112025
1      122025
2      132025
3      142025
4      152025
5      162025
6      172025
7      182025
8      192025
9     1102025
10    1112025
11    1122025
12    1132025
13    1142025
14    1152025
15    1162025
16    1172025
17    1182025
18    1192025
Name: ymd, dtype: object

print(df['ymd'].str.replace('/','').str[0:6]) ## take the first 6 characters of each string
0     112025
1     122025
2     132025
3     142025
4     152025
5     162025
6     172025
7     182025
8     192025
9     110202
10    111202
11    112202
12    113202
13    114202
14    115202
15    116202
16    117202
17    118202
18    119202
Name: ymd, dtype: object
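As the output shows, slicing the first 6 characters of 112025 or 1102025 does not give a year-month. One possible way to get the 20250101 form, sketched here on made-up dates, is to split on '/', zero-pad the month and day, and reassemble the parts in year, month, day order. The names `parts`, `yyyymmdd` and `yyyymm` are illustrative.

```python
import pandas as pd

# Hypothetical dates in M/D/YYYY layout
ymd = pd.Series(["1/1/2025", "1/19/2025", "12/3/2025"])

# expand=True returns a DataFrame with columns 0, 1, 2 = month, day, year
parts = ymd.str.split("/", expand=True)

# Zero-pad month and day to 2 digits, then concatenate as YYYYMMDD
yyyymmdd = parts[2] + parts[0].str.zfill(2) + parts[1].str.zfill(2)

# Now the first 6 characters really are the year-month
yyyymm = yyyymmdd.str[0:6]

print(yyyymmdd.tolist())
print(yyyymm.tolist())
```

This is still a chain of `.str` calls, just with the padding step added before slicing.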

4. Processing with regular expressions

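# Add a new column

# Define a function that converts the date to Chinese format
def get_nianyueri(x):
    month, day, year = x['ymd'].split('/')  # split the M/D/YYYY date
    return f'{year}年{month}月{day}日'  # return the date in Chinese format
# Apply the function row by row
df['中文日期'] = df.apply(get_nianyueri, axis=1)

# Print the result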
print(df['中文日期'])
0      2025年1月1日
1      2025年1月2日
2      2025年1月3日
3      2025年1月4日
4      2025年1月5日
5      2025年1月6日
6      2025年1月7日
7      2025年1月8日
8      2025年1月9日
9     2025年1月10日
10    2025年1月11日
11    2025年1月12日
12    2025年1月13日
13    2025年1月14日
14    2025年1月15日
15    2025年1月16日
16    2025年1月17日
17    2025年1月18日
18    2025年1月19日
Name: 中文日期, dtype: object
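The same column can also be built without a row-wise apply, using a regular expression with capture groups in str.replace. This is an alternative sketch, not the method used above, and it runs on made-up sample dates:

```python
import pandas as pd

# Hypothetical dates in M/D/YYYY layout
ymd = pd.Series(["1/1/2025", "1/19/2025"])

# Groups: \1 = month, \2 = day, \3 = year; reorder them in the replacement
cn_dates = ymd.str.replace(
    r"(\d+)/(\d+)/(\d+)", r"\3年\1月\2日", regex=True
)

print(cn_dates.tolist())
```

The replacement string refers back to the captured groups, so the whole conversion happens in a single vectorized call.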

Question: how do we remove '年', '月' and '日' from a value such as 2025年1月19日?

Method 1: chained replace

print(df['中文日期'].head(3))

0    2025年1月1日
1    2025年1月2日
2    2025年1月3日
Name: 中文日期, dtype: object

a = df['中文日期'].str.replace('年','').str.replace('月','').str.replace('日','')
print(a.head(5))

0    202511
1    202512
2    202513
3    202514
4    202515
Name: 中文日期, dtype: object

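Method 2: replace with a regular expression

In str.replace('[年月日]', ''), [年月日] is a character class that matches any one of the characters 年, 月 or 日
However, since pandas 2.0, str.replace treats the pattern as a literal string by default rather than as a regular expression (earlier versions defaulted to regex=True). To use a regular expression, pass regex=True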

print(df['中文日期'].str.replace('[年月日]','' , regex=True))
0      202511
1      202512
2      202513
3      202514
4      202515
5      202516
6      202517
7      202518
8      202519
9     2025110
10    2025111
11    2025112
12    2025113
13    2025114
14    2025115
15    2025116
16    2025117
17    2025118
18    2025119
Name: 中文日期, dtype: object
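The effect of the regex flag can be seen by running the same pattern both ways. A minimal sketch on made-up sample values:

```python
import pandas as pd

# Hypothetical sample values in the same format as the 中文日期 column
dates = pd.Series(["2025年1月1日", "2025年1月19日"])

# regex=False: the pattern is the literal text "[年月日]", which never occurs
as_literal = dates.str.replace("[年月日]", "", regex=False)

# regex=True: the pattern is a character class matching 年, 月 or 日
as_regex = dates.str.replace("[年月日]", "", regex=True)

print(as_literal.tolist())
print(as_regex.tolist())
```

With regex=False the strings come back unchanged. Note also that the stripped result is ambiguous without zero padding: 2025119 could be read as January 19 or November 9.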