Python Study Notes - Data Analysis - Pandas 05: String Data

License: CC 4.0 BY-SA

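This article walks through the methods pandas provides for string data, including counting, case conversion, replacement, splitting, and indexing, and shows how to use them to preprocess text in a DataFrame or Series.

pandas has a set of methods dedicated to string data, exposed through the .str accessor, which make text easy to manipulate.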

- Python's built-in str type has similar methods.

- A string is an immutable sequence.
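
The two points above can be checked directly. The following is a minimal sketch (the variable names are illustrative, not from the original notes): a built-in str method returns a new string, item assignment on a string fails, and the .str accessor applies the same method element-wise.

```python
import numpy as np
import pandas as pd

# A plain Python string: upper() returns a new string, the original is unchanged.
word = 'hello'
shouted = word.upper()

# Strings are immutable, so item assignment raises TypeError.
try:
    word[0] = 'H'
    mutated = True
except TypeError:
    mutated = False

# The pandas counterpart works element-wise and passes NaN through.
names = pd.Series(['hello', np.nan])
upper_names = names.str.upper()

print(word, shouted, mutated)
print(upper_names)
```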

import numpy as np
import pandas as pd
# Use np.nan to represent missing values
ps = pd.Series(['he','b','c','D','Python','666',np.nan,'hello'])
df = pd.DataFrame({'key1':list('abcdef'),
                  'key2':['haaa','pandas','s','numpy','777',np.nan]})
# Note: because the values are text, the dtype is object.
# Missing values are kept as they are; the string methods do not process them.
print(ps)
print('--------------------------')
print(df)
print('--------------------------')

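# Call the string methods through .str. It is available on a Series, on a DataFrame column, and on an Index such as df.columns.
print('Occurrences of he in each element:\n',ps.str.count('he'))
print('--------------------------')
print('Uppercase: \n',df['key2'].str.upper())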

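# The .str methods return new data; they do not modify the original data in place.
# id() gives an informal check: the two ids printed below differ
# (it compares df with the newly created Series, so treat it only as a rough illustration).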
print(id(df))
print('--------------------------')
print(id(df['key2'].str.upper()))
print('--------------------------')

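# Assign the result back to modify the original data
df['key2'] = df['key2'].str.upper()
print(df)
# Note: df = df['key2'].str.upper() would replace the whole DataFrame with a Series, changing the structure and not just the values.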
# print(df)
print('--------------------------')

df1 = pd.DataFrame({'key1':list('abcdef'),
                  'key2':['u','v','w','x','Y','z']})
df1.columns = df1.columns.str.upper()
print('Note that the columns changed from key to KEY: \n',df1)

Output:

0        he
1         b
2         c
3         D
4    Python
5       666
6       NaN
7     hello
dtype: object
--------------------------
  key1    key2
0    a    haaa
1    b  pandas
2    c       s
3    d   numpy
4    e     777
5    f     NaN
--------------------------
Occurrences of he in each element:
 0    1.0
1    0.0
2    0.0
3    0.0
4    0.0
5    0.0
6    NaN
7    1.0
dtype: float64
--------------------------
Uppercase: 
 0      HAAA
1    PANDAS
2         S
3     NUMPY
4       777
5       NaN
Name: key2, dtype: object
2117666888056
--------------------------
2117666889568
--------------------------
  key1    key2
0    a    HAAA
1    b  PANDAS
2    c       S
3    d   NUMPY
4    e     777
5    f     NaN
--------------------------
Note that the columns changed from key to KEY: 
   KEY1 KEY2
0    a    u
1    b    v
2    c    w
3    d    x
4    e    Y
5    f    z
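
Comparing id() values is only an informal check. A more direct way to confirm that a .str method leaves the source untouched is to compare the column with a copy taken beforehand. This is a minimal sketch; df_demo and the other names are made up for illustration.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'key2': ['haaa', 'pandas', np.nan]})
before = df_demo['key2'].copy()

# .str.upper() builds a new Series.
upper = df_demo['key2'].str.upper()

# The original column still equals the copy (equals() treats NaN == NaN as a match).
unchanged = df_demo['key2'].equals(before)
print(unchanged)

# Only an explicit assignment writes the result back.
df_demo['key2'] = upper
print(df_demo)
```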

1. Common methods (1) - lower, len, startswith, endswith

ps = pd.Series(['he6','b','c6','D','Python','666',np.nan,'hello'])

print('Lowercase: \n',ps.str.lower())
print('--------------------------')
print('String length: \n',ps.str.len())
print('--------------------------')

# startswith and endswith return booleans (NaN stays NaN)
print('Starts with he: \n',ps.str.startswith('he'))
print('--------------------------')
print('Ends with 6: \n',ps.str.endswith('6'))
print('--------------------------')

Output:

Lowercase: 
 0       he6
1         b
2        c6
3         d
4    python
5       666
6       NaN
7     hello
dtype: object
--------------------------
String length: 
 0    3.0
1    1.0
2    2.0
3    1.0
4    6.0
5    3.0
6    NaN
7    5.0
dtype: float64
--------------------------
Starts with he: 
 0     True
1    False
2    False
3    False
4    False
5    False
6      NaN
7     True
dtype: object
--------------------------
Ends with 6: 
 0     True
1    False
2     True
3    False
4    False
5     True
6      NaN
7    False
dtype: object
--------------------------
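
Because the NaN entry is carried into the result, the output above cannot be used directly as a boolean mask. One way around this, sketched here with illustrative names, is the na argument of startswith/endswith, which sets the value used for missing entries.

```python
import numpy as np
import pandas as pd

ps_demo = pd.Series(['he6', 'b', 'c6', np.nan, 'hello'])

# na=False fills the missing position with False, giving a pure boolean mask.
mask = ps_demo.str.startswith('he', na=False)

# The mask can then be used for ordinary boolean indexing.
selected = ps_demo[mask]
print(selected)
```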

2. Common methods (2) - strip, lstrip, rstrip

ps = pd.Series([' ab cd', 'efgh ', ' ijkl ', 'mnopq'])
df = pd.DataFrame(np.random.randn(4, 2), columns=['   Column A   ', '  Column B   '],
                  index=range(4))
print(ps)
print(df)
print('--------------------------')

print('Strip leading and trailing spaces: \n',ps.str.strip())  
print('--------------------------')
print('Strip leading spaces: \n',ps.str.lstrip())
print('--------------------------')
print('Strip trailing spaces: \n',ps.str.rstrip())
print('--------------------------')
# Strip the leading and trailing spaces from the column names
df.columns = df.columns.str.strip()
print(df)

Output:

0     ab cd
1     efgh 
2     ijkl 
3     mnopq
dtype: object
      Column A       Column B   
0       -0.429990      -0.706352
1       -1.012684       0.306714
2        2.029951      -0.163117
3       -0.649404      -0.036451
--------------------------
Strip leading and trailing spaces: 
 0    ab cd
1     efgh
2     ijkl
3    mnopq
dtype: object
--------------------------
Strip leading spaces: 
 0    ab cd
1    efgh 
2    ijkl 
3    mnopq
dtype: object
--------------------------
Strip trailing spaces: 
 0     ab cd
1      efgh
2      ijkl
3     mnopq
dtype: object
--------------------------
   Column A  Column B
0 -0.429990 -0.706352
1 -1.012684  0.306714
2  2.029951 -0.163117
3 -0.649404 -0.036451
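
The strip family is not limited to whitespace. As a small sketch with made-up data, passing a string of characters strips any of those characters from the ends of each element, the same way Python's built-in str.strip does.

```python
import pandas as pd

labels = pd.Series(['**total**', '--mean--', '  max  '])

# With no argument only whitespace is removed.
# With an argument, any of the listed characters is removed from both ends.
cleaned = labels.str.strip('*- ')
print(cleaned.tolist())
```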

3. Common methods (3) - replace

df = pd.DataFrame(np.random.randn(4, 2), columns=['     Column A   ', '   Column B     '],
                  index=range(4))
print(df)
print('--------------------------')
df.columns = df.columns.str.replace(' ','-')
print('Replace spaces with "-":\n',df)
print('--------------------------')

df1 = pd.DataFrame(np.random.randn(4, 2), columns=['+++ Column A ++', '+ Column B ++'],
                  index=range(4))
print(df1)
print('--------------------------')
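# Replace "+" in the column names with "python", only the first 2 occurrences in each name.
# regex=False makes sure "+" is treated as a literal character and not as a regex quantifier.
df1.columns = df1.columns.str.replace('+','python',n=2,regex=False)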
print(df1)

Output:

        Column A        Column B     
0         -0.495589         -0.932599
1         -1.714485         -0.658998
2          1.020396          1.301494
3         -0.057810         -0.522236
--------------------------
Replace spaces with "-":
    -----Column-A---  ---Column-B-----
0         -0.495589         -0.932599
1         -1.714485         -0.658998
2          1.020396          1.301494
3         -0.057810         -0.522236
--------------------------
   +++ Column A ++  + Column B ++
0        -0.183334      -1.331228
1        -0.372880      -0.363632
2         0.543497      -0.546076
3         0.873474       0.366067
--------------------------
   pythonpython+ Column A ++  python Column B python+
0                  -0.183334                -1.331228
1                  -0.372880                -0.363632
2                   0.543497                -0.546076
3                   0.873474                 0.366067
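
The replace examples above use literal patterns. str.replace also accepts regular expressions when regex=True is passed. The following is a sketch with hypothetical column names that contrasts the two modes.

```python
import pandas as pd

cols = pd.Index(['  Column   A ', 'Column  B'])

# regex=False: the pattern is a literal string, so every single space is replaced.
literal = cols.str.replace(' ', '-', regex=False)

# regex=True: trim first, then collapse each run of whitespace into one underscore.
collapsed = cols.str.strip().str.replace(r'\s+', '_', regex=True)

print(list(literal))
print(list(collapsed))
```

Passing regex explicitly is a reasonable habit, since the default value of this argument has differed between pandas versions.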

4. Common methods (4) - split, rsplit

ps = pd.Series(['a,b,c','x,y,3',['6,6']])
# Split each element on ",".
# Note: ['6,6'] is a list, not a string, so .str.split cannot process it and returns NaN for that element.
print(ps)
print('--------------------------')
print(ps.str.split(','))
print('--------------------------')
print(type(ps.str.split(',')))
print('--------------------------')

# Plain indexing on the result returns a single element, which is a list
print('Type:\n',type(ps.str.split(',')[0]))
print('--------------------------')
print('First element of the Series (a list):\n',ps.str.split(',')[0])
print('--------------------------')
print('First item of that list:\n',ps.str.split(',')[0][0])
print('--------------------------')

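# .str.split(',') operates on ps and still returns a Series, whose elements are lists.
# Calling .str on that Series again gives element-wise access,
# so [] indexes into each list and not into the Series itself.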
print(type(ps.str.split(',').str[0]))
print('--------------------------')
print(ps.str.split(',').str[0])
print('--------------------------')

# get() does the same thing
print(ps.str.split(',').str.get(1))
print('--------------------------')

# expand=True turns the Series of lists into a DataFrame, one column per piece
print(ps.str.split(',', expand=True))
print('--------------------------')
print(type(ps.str.split(',', expand=True)))
print('--------------------------')

# n limits the number of splits, and so the number of columns produced
print(ps.str.split(',', expand=True, n = 1))
print('--------------------------')

# rsplit works like split but in reverse, from the end of the string towards the start
print(ps.str.rsplit(',', expand=True, n = 1))
print('--------------------------')

# split also works on a DataFrame column
df = pd.DataFrame({'key1':['a,b,c','1,2,3'],
                  'key2':['a-b-c','1-2-3']})
print(df)
print('--------------------------')
print(df['key1'].str.split(','))

Output:

0    a,b,c
1    x,y,3
2    [6,6]
dtype: object
--------------------------
0    [a, b, c]
1    [x, y, 3]
2          NaN
dtype: object
--------------------------
<class 'pandas.core.series.Series'>
--------------------------
Type:
 <class 'list'>
--------------------------
First element of the Series (a list):
 ['a', 'b', 'c']
--------------------------
First item of that list:
 a
--------------------------
<class 'pandas.core.series.Series'>
--------------------------
0      a
1      x
2    NaN
dtype: object
--------------------------
0      b
1      y
2    NaN
dtype: object
--------------------------
     0    1    2
0    a    b    c
1    x    y    3
2  NaN  NaN  NaN
--------------------------
<class 'pandas.core.frame.DataFrame'>
--------------------------
     0    1
0    a  b,c
1    x  y,3
2  NaN  NaN
--------------------------
     0    1
0  a,b    c
1  x,y    3
2  NaN  NaN
--------------------------
    key1   key2
0  a,b,c  a-b-c
1  1,2,3  1-2-3
--------------------------
0    [a, b, c]
1    [1, 2, 3]
Name: key1, dtype: object
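
A common follow-up to split with expand=True is to attach the pieces back to the DataFrame as named columns. The sketch below shows one way to do it; the column names first, second and third are hypothetical and chosen only for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({'key1': ['a,b,c', '1,2,3'],
                        'key2': ['a-b-c', '1-2-3']})

# Split key2 on "-" into a DataFrame with one column per piece.
parts = df_demo['key2'].str.split('-', expand=True)
parts.columns = ['first', 'second', 'third']  # illustrative names

# join() aligns on the index and appends the new columns.
result = df_demo.join(parts)
print(result)
```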

5. String indexing

ps = pd.Series(['he','be','ce','De','Python','666',np.nan,'hello'])
df = pd.DataFrame({'key1':list('abcdef'),
                  'key2':['haaa','pandas','s','numpy','777',np.nan]})

print('First character:\n',ps.str[0])
print('--------------------------')
print('First two characters:\n',ps.str[:2])
print('--------------------------')
print('First character of column key2:\n',df['key2'].str[0])

Output:

First character:
 0      h
1      b
2      c
3      D
4      P
5      6
6    NaN
7      h
dtype: object
--------------------------
First two characters:
 0     he
1     be
2     ce
3     De
4     Py
5     66
6    NaN
7     he
dtype: object
--------------------------
First character of column key2:
 0      h
1      p
2      s
3      n
4      7
5    NaN
Name: key2, dtype: object
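
The bracket syntax shown above has method equivalents, str.get and str.slice, and negative positions work as they do for ordinary Python strings. A minimal sketch with illustrative data:

```python
import numpy as np
import pandas as pd

ps_demo = pd.Series(['he', 'Python', np.nan])

# Negative index: last character of each element.
last_char = ps_demo.str[-1]

# str.slice(0, 2) is the method form of ps_demo.str[:2].
first_two = ps_demo.str.slice(0, 2)

# A position past the end of a string gives NaN instead of raising an error.
third = ps_demo.str.get(2)

print(last_char)
print(first_two)
print(third)
```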