Pandas: Text Processing (Built-in str Methods)

import pandas as pd
import numpy as np
data = {
        'math': np.random.randint(60, 100, size=3),
        'age': [12, 15, 16],
        'name': ['John', 'Smith', 'Jesse']
    }

name = 'age'
df = pd.DataFrame(data)
print(f'Original data:\n{df}')

print(df['name'].str.split('.'))

# Output
Original data:
   math  age   name
0    94   12   John
1    75   15  Smith
2    74   16  Jesse
0     [John]
1    [Smith]
2    [Jesse]
Name: name, dtype: object
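
In the example above none of the names contains a '.', so each string comes back as a one-element list. The sketch below uses hypothetical dotted names (not from the original data) so that the separator actually occurs, and also shows the expand=True option.

```python
import pandas as pd

# Hypothetical dotted names, chosen so that the '.' separator actually occurs.
names = pd.Series(['John.Smith', 'Jesse.Lee'])

# Default: each element becomes a list of parts.
as_lists = names.str.split('.')

# expand=True spreads the parts across DataFrame columns instead.
as_columns = names.str.split('.', expand=True)

print(as_lists.tolist())
print(as_columns)
```

A single-character separator such as '.' is treated as a literal string here, not as a regular expression.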

1.2 slice()

Description: slices a substring from each element of the Series or Index.

Series.str.slice(start=None, stop=None, step=None) # start position, stop position, step

d = pd.Series(['aaa aaa aaa aa', 'bbbb bbbb bbb'])
print(d.str.slice(start=0, stop=10, step=2)) # takes indices [0, 2, 4, 6, 8]: 'aaaaa' and 'bb bb'

# Output
0    aaaaa
1    bb bb
dtype: object

1.3 replace()

Description: replaces occurrences of a substring (or pattern) in each string.

import pandas as pd
import numpy as np
data = {
        'math': np.random.randint(60, 100, size=3),
        'age': [12, 15, 16],
        'name': ['John', 'Smith', 'Jesse']
    }

name = 'age'
df = pd.DataFrame(data)
print(f'Original data:\n{df}')

print(df['name'].str.replace('J', 'H'))

# Output
Original data:
   math  age   name
0    78   12   John
1    87   15  Smith
2    92   16  Jesse
0     Hohn
1    Smith
2    Hesse
Name: name, dtype: object
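
replace() can treat the pattern either as a literal string or as a regular expression. The default for the regex argument has differed between pandas versions, so the sketch below passes it explicitly; the vowel-masking pattern is only an illustration.

```python
import pandas as pd

names = pd.Series(['John', 'Smith', 'Jesse'])

# Literal replacement: every 'J' becomes 'H'.
literal = names.str.replace('J', 'H', regex=False)

# Regex replacement: every lowercase vowel becomes '*'.
masked = names.str.replace(r'[aeiou]', '*', regex=True)

print(literal.tolist())
print(masked.tolist())
```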

wrap(): wraps each string in the Series/Index at the specified line width.
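
A minimal sketch of wrap(), using made-up sentences and an arbitrary width of 12 characters; line breaks are inserted at word boundaries.

```python
import pandas as pd

lines = pd.Series(['line to be wrapped', 'another line to be wrapped'])

# Break each string so that no line is longer than 12 characters.
wrapped = lines.str.wrap(12)

print(wrapped.tolist())
```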

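1.4 slice_replace()

Description: replaces the slice at the given position in each string with the given string.

Series.str.slice_replace(start=None, stop=None, repl=None)
# start and stop positions of the slice to replace, and the replacement string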

d = pd.Series(['aaa aaa aaa aa', 'bbbb bbbb bbb'])
print(d.str.slice_replace(start=4, stop=7, repl='ccc')) 

# Output
0    aaa ccc aaa aa
1     bbbbcccbb bbb
dtype: object

1.5 cat()

Description: concatenates strings.

Series.str.cat(others=None, sep=None, na_rep=None, join='left')  # sep specifies the separator

import pandas as pd
import numpy as np
data = {
        'math': np.random.randint(60, 100, size=3),
        'age': [12, 15, 16],
        'name': ['John', 'Smith', 'Jesse']
    }

name = 'age'
df = pd.DataFrame(data)
print(f'Original data:\n{df}')

print(df['name'].str.cat(['1', '2', '3'], sep=' & '))

# Output
Original data:
   math  age   name
0    78   12   John
1    61   15  Smith
2    95   16  Jesse
0     John & 1
1    Smith & 2
2    Jesse & 3
Name: name, dtype: object
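
Two further uses of cat() are worth sketching: collapsing a whole Series into one string, and joining element-wise with another Series. The ages Series below simply reuses the age values from the example data.

```python
import pandas as pd

names = pd.Series(['John', 'Smith', 'Jesse'])
ages = pd.Series([12, 15, 16])

# Without `others`, the whole Series collapses into one string.
joined = names.str.cat(sep=', ')

# With another Series, concatenation is element-wise; non-string
# values have to be converted to str first.
labelled = names.str.cat(ages.astype(str), sep=': ')

print(joined)
print(labelled.tolist())
```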

1.6 get()

Description: gets the character at the specified position in each string.

import pandas as pd
import numpy as np
data = {
        'math': np.random.randint(60, 100, size=3),
        'age': [12, 15, 16],
        'name': ['John', 'Smith', 'Jesse']
    }

name = 'age'
df = pd.DataFrame(data)
print(f'Original data:\n{df}')

print(df['name'].str.get(1))

# Output

Original data:
   math  age   name
0    96   12   John
1    73   15  Smith
2    98   16  Jesse
0    o
1    m
2    e
Name: name, dtype: object
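
get() also accepts negative positions, and a position past the end of a string yields a missing value instead of an error. A short sketch with the same names:

```python
import pandas as pd

names = pd.Series(['John', 'Smith', 'Jesse'])

last_chars = names.str.get(-1)   # negative positions count from the end
missing = names.str.get(10)      # no string is this long, so every result is missing

print(last_chars.tolist())
print(missing.isna().tolist())
```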

2. String padding operations

2.1 pad()

Description: pads strings on the left or right to the given width.

Series.str.pad(width, side='left', fillchar=' ')

import pandas as pd
import numpy as np
data = {
        'math': np.random.randint(60, 100, size=3),
        'age': [12, 15, 16],
        'name': ['John', 'Smith', 'Jesse']
    }

name = 'age'
df = pd.DataFrame(data)
print(f'Original data:\n{df}')
print(df['name'].str.pad(8, side="left", fillchar="*"))
print(df['name'].str.pad(8, side="right", fillchar="*"))

# Output
Original data:
   math  age   name
0    72   12   John
1    79   15  Smith
2    92   16  Jesse
0    ****John
1    ***Smith
2    ***Jesse
Name: name, dtype: object
0    John****
1    Smith***
2    Jesse***
Name: name, dtype: object

2.2 ljust()

Series.str.ljust(width, fillchar=' ')

2.3 rjust()

Series.str.rjust(width, fillchar=' ')

2.4 zfill()

Series.str.zfill(width)
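
The three signatures above (ljust, rjust, zfill) come without examples, so here is a minimal sketch. The ids Series is a hypothetical set of ID strings added for the zfill() case.

```python
import pandas as pd

names = pd.Series(['John', 'Smith', 'Jesse'])
ids = pd.Series(['7', '42', '123'])   # hypothetical ID strings

left_aligned = names.str.ljust(8, fillchar='*')    # text on the left, padding on the right
right_aligned = names.str.rjust(8, fillchar='*')   # text on the right, padding on the left
zero_padded = ids.str.zfill(5)                     # pad on the left with '0'

print(left_aligned.tolist())
print(right_aligned.tolist())
print(zero_padded.tolist())
```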

2.5 center()

Description: pads both sides of each string so that the text is centered.

import pandas as pd
import numpy as np
data = {
        'math': np.random.randint(60, 100, size=3),
        'age': [12, 15, 16],
        'name': ['John', 'Smith', 'Jesse']
    }

name = 'age'
df = pd.DataFrame(data)
print(f'Original data:\n{df}')
print(df['name'].str.center(8,  fillchar="*"))

# Output
Original data:
   math  age   name
0    99   12   John
1    92   15  Smith
2    96   16  Jesse
0    **John**
1    *Smith**
2    *Jesse**
Name: name, dtype: object

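3. Whitespace-stripping operations

3.1 strip()

Description: removes leading and trailing whitespace.

3.2 rstrip()

Description: removes trailing whitespace.

3.3 lstrip()

Description: removes leading whitespace.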
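
A minimal sketch of the three stripping methods on made-up values with stray whitespace. The last line additionally passes a set of characters to strip, which these methods accept in place of the default whitespace.

```python
import pandas as pd

raw = pd.Series(['  John  ', '\tSmith\n', 'xxJessexx'])

both = raw.str.strip()        # whitespace removed from both ends
left = raw.str.lstrip()       # leading whitespace only
right = raw.str.rstrip()      # trailing whitespace only
custom = raw.str.strip('x')   # strip the character 'x' instead of whitespace

print(both.tolist())
print(custom.tolist())
```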

4. String query operations

4.1 contains()

Description: tests whether each string contains a given substring or regular expression.

import pandas as pd
import numpy as np
data = {
        'math': np.random.randint(60, 100, size=3),
        'age': [12, 15, 16],
        'name': ['John', 'Smith', 'Jesse']
    }

name = 'age'
df = pd.DataFrame(data)
print(f'Original data:\n{df}')
print(f'''Filtered result:\n{df.query("name.str.contains('h')")}''')

# Output
Original data:
   math  age   name
0    64   12   John
1    87   15  Smith
2    64   16  Jesse
Filtered result:
   math  age   name
0    64   12   John
1    87   15  Smith
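
contains() treats its pattern as a regular expression by default and has switches for case sensitivity and missing values. The sketch below calls it directly on a Series; the None entry is an added assumption to show the na argument.

```python
import pandas as pd

names = pd.Series(['John', 'Smith', 'Jesse', None])

has_h = names.str.contains('h', na=False)                      # plain substring
starts_with_j = names.str.contains('^J', na=False)             # regex anchor
any_j = names.str.contains('j', case=False, na=False)          # ignore case
literal_dot = names.str.contains('.', regex=False, na=False)   # '.' as a literal character

print(has_h.tolist())
print(starts_with_j.tolist())
```

With regex=False the '.' is not a wildcard, so none of the names match it.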

4.2 startswith()

Series.str.startswith(pat, na=None)

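Description: tests whether each string element starts with the given pattern.

Parameters:

Parameter	Description
pat	str or tuple of str. The character sequence(s) to test for. Regular expressions are not accepted.
na	The value to show when the tested element is not a string (for example, a missing value).

Returns: Series or Index of bool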

# pat is a tuple of strings
series = pd.Series(['Jesse', 'Bob', 'Jerry'])
result = series.str.startswith(('J', 'T'))
print(result)

# Output
0     True
1    False
2     True
dtype: bool


# pat is a single string
series = pd.Series(['Jesse', 'Bob', 'Jerry'])
result = series.str.startswith('J')
print(result)

# Output
0     True
1    False
2     True
dtype: bool


# Effect of the na parameter on missing values
series = pd.Series(['Jesse', 'Bob', np.nan])
result_false = series.str.startswith('J', na=False)
result_true = series.str.startswith('J', na=True)
result_nan = series.str.startswith('J')
print(f'Result with na=False:\n{result_false}')
print(f'Result with na=True:\n{result_true}')
print(f'Result with na left at its default:\n{result_nan}')

# Output
Result with na=False:
0     True
1    False
2    False
dtype: bool
Result with na=True:
0     True
1    False
2     True
dtype: bool
Result with na left at its default:
0     True
1    False
2      NaN
dtype: object

4.3 endswith()

Series.str.endswith(pat, na=None)

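Description: tests whether each string element ends with the given pattern.

Parameters:

Parameter	Description
pat	str or tuple of str. The character sequence(s) to test for. Regular expressions are not accepted.
na	The value to show when the tested element is not a string (for example, a missing value).

Returns: Series or Index of bool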
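
endswith() mirrors startswith(). A minimal sketch with hypothetical e-mail-like strings and one missing value, using na=False so the result stays boolean:

```python
import numpy as np
import pandas as pd

emails = pd.Series(['Jesse@163.com', 'Bob@112.org', np.nan])

# The missing value is reported as False because of na=False.
is_dot_com = emails.str.endswith('.com', na=False)

print(is_dot_com.tolist())
```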

4.4 findall()

Series.str.findall(pat, flags=0)

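Description: finds all occurrences of pat in each string of the Series/Index. Matching is case-sensitive unless flags say otherwise.

Parameters:

Parameter	Description
pat	str. The pattern or regular expression to search for.
flags	int, default 0. Flags from the re module, e.g. re.IGNORECASE (the default 0 means no flags).

Returns: Series/Index of lists of strings

series = pd.Series(['Jesse', 'Bob', 'JerJy'])
result = series.str.findall('J')
print(f'Result:\n{result}')

# Output
Result: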
0       [J]
1        []
2    [J, J]
dtype: object
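
The flags parameter can be illustrated with the same names: searching for a lowercase 'j' finds nothing by default, but matches every 'J' once re.IGNORECASE is passed.

```python
import re
import pandas as pd

names = pd.Series(['Jesse', 'Bob', 'JerJy'])

exact = names.str.findall('j')                          # case-sensitive: lowercase only
relaxed = names.str.findall('j', flags=re.IGNORECASE)   # matches either case

print(exact.tolist())
print(relaxed.tolist())
```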

4.5 match()

Series.str.match(pat, case=True, flags=0, na=None)

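Description: determines whether each string starts with a match of the regular expression pat.

Parameters:

Parameter	Description
pat	str. The character sequence or regular expression to match.
case	bool, default True. If True, matching is case-sensitive.
flags	int, default 0 (no flags). Flags from the re module, e.g. re.IGNORECASE.
na	scalar, optional. Fill value for missing values. The default depends on the dtype of the array: numpy.nan for object dtype, pandas.NA for StringDtype.

Returns: Series/Index/array of boolean values

series = pd.Series(['Jesse', 'Bob', 'JerJy'])
result = series.str.match('J')
print(f'Result:\n{result}')


Result: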
0     True
1    False
2     True
dtype: bool
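
Because match() is anchored at the start of each string, it can disagree with contains() for the same pattern. A short comparison sketch, which also shows case=False:

```python
import pandas as pd

names = pd.Series(['Jesse', 'Bob', 'JerJy'])

anchored = names.str.match('er')                 # must match at position 0
anywhere = names.str.contains('er')              # may match at any position
ignore_case = names.str.match('j', case=False)   # 'j' matches 'J' as well

print(anchored.tolist())
print(anywhere.tolist())
print(ignore_case.tolist())
```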

4.6 extract()

Series.str.extract(pat, flags=0, expand=True)

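Description: extracts the capture groups of the regular expression pat from each string, that is, the content matched by the groups.

Parameters:

Parameter	Description
pat	str. A regular expression with capture groups.
flags	int, default 0 (no flags). Flags from the re module, e.g. re.IGNORECASE, which modify how the regular expression handles case, whitespace, and so on. See the re module for details.
expand	bool, default True. If True, returns a DataFrame with one column per capture group. If False, returns a Series/Index when there is one capture group, or a DataFrame when there are several.

Returns: DataFrame or Series or Index

series = pd.Series(['Jesse@163.com', 'Bob@112.com', 'JerJy'])
# The dot is escaped so that it matches a literal '.' rather than any character
result = series.str.extract(r'([a-zA-Z]{3}@[0-9]{3}\.com)')
print(f'Result:\n{result}')

Result: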
             0
0  sse@163.com
1  Bob@112.com
2          NaN
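
With several capture groups, extract() returns one column per group, and named groups become the column names. The group names user and domain below are illustrative choices, not part of the original example.

```python
import pandas as pd

emails = pd.Series(['Jesse@163.com', 'Bob@112.com', 'JerJy'])

# Two named capture groups become two columns named after the groups.
parts = emails.str.extract(r'(?P<user>[A-Za-z]+)@(?P<domain>[0-9]+\.com)')

print(parts)
```

Rows that do not match the pattern, such as 'JerJy', are filled with missing values in every column.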

4.7 find()

Series.str.find(sub, start=0, end=None)

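Description: returns, for each string in the Series/Index, the lowest index at which the substring sub is found such that sub is fully contained within [start:end]. Returns -1 on failure. Equivalent to Python's str.find().

Parameters:

Parameter	Description
sub	str. The substring being searched for.
start	int. Left edge index.
end	int. Right edge index.

Returns: Series or Index of int.

series = pd.Series(['Jesse@163.com', 'Bob@112.com', 'JerJy'])
result = series.str.find('@', 0, 10)
print(f'Result:\n{result}')

Result: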
0    5
1    3
2   -1
dtype: int64

4.8 rfind()

Series.str.rfind(sub, start=0, end=None)

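Description: returns, for each string in the Series/Index, the highest index at which the substring sub is found such that sub is fully contained within [start:end]. Returns -1 on failure. Equivalent to Python's str.rfind().

Parameters:

Parameter	Description
sub	str. The substring being searched for.
start	int. Left edge index.
end	int. Right edge index.

Returns: Series or Index of int.

series = pd.Series(['Jesse@163.com@', 'Bob@112.@com', 'JerJy'])
result_r = series.str.rfind('@', 0) # returns the highest index
result = series.str.find('@', 0) # returns the lowest index
print(f'Highest index result:\n{result_r}')
print(f'Lowest index result:\n{result}')

Highest index result:
0    13
1     8
2    -1
dtype: int64
Lowest index result: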
0    5
1    3
2   -1
dtype: int64

Difference between find() and rfind():

find(): returns the lowest index of the searched substring
rfind(): returns the highest index of the searched substring

4.9 index()

Series.str.index(sub, start=0, end=None)

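Description: returns, for each string in the Series/Index, the lowest index at which the substring sub is found such that sub is fully contained within [start:end]. This is the same as str.find, except that it raises ValueError instead of returning -1 when the substring is not found. Equivalent to the standard str.index.

Parameters:

Parameter	Description
sub	str. The substring being searched for.
start	int. Left edge index.
end	int. Right edge index.

Returns: Series or Index of object

series = pd.Series(['Jesse@163.com@', 'Bob@112.@com', 'JerJy@'])
result_r = series.str.rindex('@') # returns the highest index
result = series.str.index('@') # returns the lowest index
print(f'Highest index result:\n{result_r}')
print(f'Lowest index result:\n{result}')

Highest index result:
0    13
1     8
2     5
dtype: int64
Lowest index result: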
0    5
1    3
2    5
dtype: int64

Note: when the substring is not found, index() does not return -1 but raises ValueError.
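
The difference in failure behavior can be sketched with a Series in which one element lacks the substring (the 'JerJy' entry, which has no '@'):

```python
import pandas as pd

series = pd.Series(['Jesse@163.com', 'JerJy'])   # 'JerJy' has no '@'

# find() signals "not found" with -1 ...
positions = series.str.find('@')

# ... while index() raises as soon as one element lacks the substring.
try:
    series.str.index('@')
    raised = False
except ValueError:
    raised = True

print(positions.tolist())
print(raised)
```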

4.10 rindex()

Series.str.rindex(sub, start=0, end=None)

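Description: returns, for each string in the Series/Index, the highest index at which the substring sub is found such that sub is fully contained within [start:end]. This is the same as str.rfind, except that it raises ValueError instead of returning -1 when the substring is not found. Equivalent to the standard str.rindex.

Parameters:

Parameter	Description
sub	str. The substring being searched for.
start	int. Left edge index.
end	int. Right edge index.

Returns: Series or Index of object

For a code example, see 4.9 index().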

4.11 isalpha()

Series.str.isalpha()

Description: checks whether all characters in each string are alphabetic.

Parameters: none

Returns: Series or Index of bool

series = pd.Series(['jesse@163.com@', 'Bob@112.@com', 'JerJy'])
result = series.str.isalpha()
print(f'Result:\n{result}')

Result:
0    False
1    False
2     True
dtype: bool

4.12 Checking whether all characters in a string are numeric

4.12.1 isalnum()

Series.str.isalnum()

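Description: checks whether all characters in each string are alphanumeric. This is equivalent to running the Python string method str.isalnum() on each element of the Series/Index. If a string has zero characters, False is returned for that check.

Parameters: none

Returns: Series or Index of bool

series = pd.Series(['jesse@163.com@', 'Bob@112.@com', 'JerJy'])
result = series.str.isalnum()
print(f'Result:\n{result}')

Result: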
0    False
1    False
2     True
dtype: bool

4.12.2 isdigit()

Series.str.isdigit()

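Description: checks whether all characters in each string are digits. If a string has zero characters, False is returned.

Parameters: none

Returns: Series or Index of bool

series = pd.Series(['2', '111.1', 'JerJy'])
result = series.str.isdigit()
print(f'Result:\n{result}')

Result: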
0     True
1    False
2    False
dtype: bool

4.12.3 isnumeric()

Series.str.isnumeric()

Description: checks whether all characters in each string are numeric. If a string has zero characters, False is returned.

Parameters: none

Returns: Series or Index of bool

4.12.4 isdecimal()

Series.str.isdecimal()

Description: checks whether all characters in each string are decimal characters.

Parameters: none

Returns: Series or Index of bool

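Differences between isalnum, isdigit, isnumeric, and isdecimal

Name	What it checks	Notes (behavior of the corresponding Python built-in methods)
isalnum	Whether all characters are letters or digits	None
isdigit	Whether all characters are digits	True: Unicode digits, byte digits (single-byte), full-width digits (double-byte). False: Chinese numerals, Roman numerals. Error: none
isnumeric	Whether all characters are numeric	True: Unicode digits, full-width digits (double-byte), Chinese numerals, Roman numerals. False: none. Error (AttributeError): byte digits (single-byte), because bytes has no isnumeric method
isdecimal	Whether all characters are decimal digits	True: Unicode digits, full-width digits (double-byte). False: Chinese numerals, Roman numerals. Error (AttributeError): byte digits (single-byte), because bytes has no isdecimal method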

series = pd.Series(['1', b'1', '1', '一', 'Ⅰ'])
result_isalnum = series.str.isalnum()
result_isnumeric = series.str.isnumeric()
result_isdigit = series.str.isdigit()
result_isdecimal = series.str.isdecimal()
print(f'isalnum result:\n{result_isalnum}')
print(f'isnumeric result:\n{result_isnumeric}')
print(f'isdigit result:\n{result_isdigit}')
print(f'isdecimal result:\n{result_isdecimal}')



isalnum result:
0    True
1     NaN
2    True
3    True
4    True
dtype: object
isnumeric result:
0    True
1     NaN
2    True
3    True
4    True
dtype: object
isdigit result:
0     True
1      NaN  # pandas returns NaN for non-str elements such as bytes, although Python's b'1'.isdigit() is True
2     True
3    False
4    False
dtype: object
isdecimal result:
0     True
1      NaN
2     True
3    False
4    False
dtype: object

4.13 isspace()

Series.str.isspace()

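Description: checks whether all characters in each string are whitespace.

Parameters: none

Returns: Series or Index of bool

4.14 islower()

Series.str.islower()

Description: checks whether all cased characters in each string are lowercase.

Parameters: none

Returns: Series or Index of bool

4.15 isupper()

Series.str.isupper()

Description: checks whether all cased characters in each string are uppercase.

Parameters: none

Returns: Series or Index of bool

4.16 istitle()

Series.str.istitle()

Description: checks whether each string is title-cased, that is, each word starts with an uppercase letter and the remaining letters are lowercase.

Parameters: none

Returns: Series or Index of bool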
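
The four checks above (isspace, islower, isupper, istitle) can be compared side by side. The sample strings are made up for illustration; note that the empty string fails every check.

```python
import pandas as pd

samples = pd.Series(['   ', 'hello', 'HELLO', 'Hello World', ''])

is_space = samples.str.isspace()
is_lower = samples.str.islower()
is_upper = samples.str.isupper()
is_title = samples.str.istitle()

print(pd.DataFrame({
    'value': samples,
    'isspace': is_space,
    'islower': is_lower,
    'isupper': is_upper,
    'istitle': is_title,
}))
```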

5. String computation operations

5.1 count()

Series.str.count(pat, flags=0)

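Description: counts occurrences of a pattern in each string of the Series/Index. pat is treated as a regular expression, and the result is the number of times it matches in each string element.

Parameters:

Parameter	Description
pat	A valid regular expression.
flags	Flags from the re module.

Returns: Series or Index

series = pd.Series(['Jesse@163.com@', 'Bob@112.@com', 'JerJy@'])
result = series.str.count('@') 
print(f'Result:\n{result}')

Result: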
0    2
1    2
2    1
dtype: int64
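
Because pat is a regular expression, special characters need escaping when they are meant literally. A small sketch with hypothetical e-mail-like strings shows the difference for '.':

```python
import pandas as pd

emails = pd.Series(['Jesse@163.com', 'Bob@112.com'])

any_char = emails.str.count('.')     # '.' is a regex wildcard: counts every character
real_dots = emails.str.count(r'\.')  # escaped: counts literal dots only

print(any_char.tolist())
print(real_dots.tolist())
```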

5.2 len()

Series.str.len()

Description: computes the length of each element in the Series/Index.

Parameters: none

Returns: Series or Index of int

series = pd.Series(['Jesse@163.com@', 'Bob@112.@com', 'JerJy@'])
result = series.str.len() 
print(f'Result:\n{result}')

Result:
0    14
1    12
2     6
dtype: int64

5.3 lower()

Series.str.lower()

Description: converts the strings in the Series/Index to lowercase.

Parameters: none

Returns: Series or Index of object

series = pd.Series(['Jesse@163.com@', 'Bob@112.@com', 'JerJy@'])
result = series.str.lower()
print(f'Result:\n{result}')

Result:
0    jesse@163.com@
1      bob@112.@com
2            jerjy@
dtype: object

5.4 upper()

Series.str.upper()

Description: converts the strings in the Series/Index to uppercase.

Parameters: none

Returns: Series or Index of object

series = pd.Series(['Jesse@163.com@', 'Bob@112.@com', 'JerJy@'])
result = series.str.upper()
print(f'Result:\n{result}')

Result:
0    JESSE@163.COM@
1      BOB@112.@COM
2            JERJY@
dtype: object

5.5 capitalize()

Series.str.capitalize()

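Description: capitalizes the first character of each string in the Series/Index and lowercases the rest.

Parameters: none

Returns: Series or Index of object

series = pd.Series(['jesse@163.com@', 'Bob@112.@com', 'JerJy@'])
result = series.str.capitalize()
print(f'Result:\n{result}')

Result: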
0    Jesse@163.com@
1      Bob@112.@com
2            Jerjy@
dtype: object

5.6 swapcase()

Series.str.swapcase()

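Description: swaps the case of the strings in the Series/Index (uppercase becomes lowercase and vice versa).

Parameters: none

Returns: Series or Index of object

series = pd.Series(['jesse@163.com@', 'Bob@112.@com', 'JerJy@'])
result = series.str.swapcase()
print(f'Result:\n{result}')

Result: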
0    JESSE@163.COM@
1      bOB@112.@COM
2            jERjY@
dtype: object
