String Processing

import numpy as np
import pandas as pd

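The .str accessor lets you treat the elements of a Series/Index object as strings for further operations. .str is an attribute that only Series and Index objects have.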

s = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])
s

0    a_b_c
1    c_d_e
2      NaN
3    f_g_h
dtype: object

s.str[0] # elements can be indexed directly by position

0      a
1      c
2    NaN
3      f
dtype: object
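To make the point about .str being limited to string-like Series and Index objects concrete, here is a minimal sketch. The data and the names idx, nums and has_str are made up for illustration and are not part of the original notes.

```python
import pandas as pd

# Hypothetical data, for illustration only
idx = pd.Index(['a_b', 'c_d'])
first_chars = list(idx.str[0])   # positional indexing also works on an Index

nums = pd.Series([1, 2, 3])
try:
    nums.str                     # numeric data has no string accessor
    has_str = True
except AttributeError:
    has_str = False

print(first_chars, has_str)
```

If a numeric column needs string methods, converting it first with astype(str) is one common option.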

s.str.split('_')

0    [a, b, c]
1    [c, d, e]
2          NaN
3    [f, g, h]
dtype: object

s.str.split('_', expand=True) # expand=True spreads the pieces into separate columns

	0	1	2
0	a	b	c
1	c	d	e
2	NaN	NaN	NaN
3	f	g	h

s.str.split('_', expand=True, n=1) # n limits the number of splits

	0	1
0	a	b_c
1	c	d_e
2	NaN	NaN
3	f	g_h

s.str.split('_', expand=True).get(1)  # equivalent to s.str.split('_', expand=True)[1]

0      b
1      d
2    NaN
3      g
Name: 1, dtype: object
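As a possible alternative to expanding into a DataFrame first, the lists returned by split() can be indexed through a second .str step. This is only a sketch that reuses the example Series from above; the names second and via_expand are hypothetical.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])

# Without expand, split() returns one list per element; .str.get(1)
# (or .str[1]) then picks the second item, and missing values stay missing
second = s.str.split('_').str.get(1)

# The expand-based form from the text, for comparison
via_expand = s.str.split('_', expand=True)[1]

print(second.fillna('-').tolist())
```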

# rsplit works like split, except that it splits starting from the right end of the string
s.str.rsplit('_', expand=True, n=1)

	0	1
0	a_b	c
1	c_d	e
2	NaN	NaN
3	f_g	h
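The difference between the two directions is easiest to see with n=1 on a single value. The sketch below uses a made-up string (names is hypothetical) and skips expand so that the raw lists are visible.

```python
import pandas as pd

names = pd.Series(['report_2023_final'])   # hypothetical value

# One split from the left keeps the remainder of the string together
left = names.str.split('_', n=1).tolist()

# One split from the right keeps the beginning of the string together
right = names.str.rsplit('_', n=1).tolist()

print(left, right)
```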

The cat() method concatenates the elements of a single Series/Index into one string, or concatenates a Series/Index element-wise with other Series/Index objects.

s = pd.Series(['a', 'b', 'c', 'd', np.nan])
s

0      a
1      b
2      c
3      d
4    NaN
dtype: object

s.str.cat()

'abcd'

s.str.cat(sep=',')

'a,b,c,d'

# By default missing values are simply ignored; na_rep supplies a value to use in their place
s.str.cat(sep=',', na_rep='-')

'a,b,c,d,-'
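For intuition, the single-Series form of cat() behaves much like Python's str.join over the non-missing values. The comparison below is a rough sketch and not a description of how pandas implements it; the names joined, plain and with_na are hypothetical.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd', np.nan])

joined = s.str.cat(sep=',')

# Rough pure-Python equivalents
plain = ','.join(s.dropna())        # missing values skipped, as in the default
with_na = ','.join(s.fillna('-'))   # mirrors na_rep='-'

print(joined, plain, with_na)
```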

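# A Series (or Index) can be concatenated with another list-like object, but both must have the same length, otherwise an error is raised
s.str.cat(('A', 'B', 'C', 'D', 'E')) # na_rep is not passed here, so the row with a missing value stays NaN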

0     aA
1     bB
2     cC
3     dD
4    NaN
dtype: object
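The length requirement mentioned in the comment above can be checked with a short sketch. It assumes that a plain list of the wrong length raises ValueError, which is the behaviour in recent pandas versions as far as I know; the names raised and filled are hypothetical.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd', np.nan])

# A list with 2 items cannot be matched against a Series with 5 elements
try:
    s.str.cat(['A', 'B'])
    raised = False
except ValueError:
    raised = True

# With matching lengths, na_rep fills the missing element instead of giving NaN
filled = s.str.cat(['A', 'B', 'C', 'D', 'E'], na_rep='-').tolist()

print(raised, filled)
```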

# A Series (or Index) can also be concatenated with an array-like object such as a DataFrame, but its length must equal the number of rows of that object
df = pd.DataFrame([['1', '2', '3', '4', '5'],['1', '2', '3', '4', '5']])
df = df.T
df
s.str.cat(df, na_rep='-')

	0	1
0	1	1
1	2	2
2	3	3
3	4	4
4	5	5

0    a11
1    b22
2    c33
3    d44
4    -55
dtype: object
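Building a DataFrame is not the only way to append several columns at once. As a sketch, a list of Series with the same index should give the same result; the names digits and combined are hypothetical.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd', np.nan])
digits = pd.Series(['1', '2', '3', '4', '5'])   # same default index as s

# Each Series in the list is appended in order, element by element
combined = s.str.cat([digits, digits], na_rep='-')

print(combined.tolist())
```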

# Concatenate with alignment on the index
s2 = pd.Series(['b', 'd', 'a', 'c' , 'e'], index=[1, 3, 0, 2, 4])
s
s2

0      a
1      b
2      c
3      d
4    NaN
dtype: object

1    b
3    d
0    a
2    c
4    e
dtype: object

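# Default behaviour: as the first output shows, s2 is aligned to the index of s
s.str.cat(s2)
# Explicit alignment; join accepts 'left', 'outer', 'inner' or 'right'
s.str.cat(s2, join='left', na_rep='-')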

0     aa
1     bb
2     cc
3     dd
4    NaN
dtype: object

0    aa
1    bb
2    cc
3    dd
4    -e
dtype: object
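The four join modes only differ when the two indexes do not contain the same labels. The sketch below uses two small made-up Series (the names s_left, t_right and results are hypothetical) whose indexes overlap only partly.

```python
import pandas as pd

# Hypothetical data: the labels 1 and 2 are shared, 0 and 3 are not
s_left = pd.Series(['a', 'b', 'c'], index=[0, 1, 2])
t_right = pd.Series(['x', 'y', 'z'], index=[1, 2, 3])

# Collect the result of every join mode as {index label: string}
results = {
    how: s_left.str.cat(t_right, join=how, na_rep='-').to_dict()
    for how in ('left', 'inner', 'outer', 'right')
}

for how, value in results.items():
    print(how, value)
```

'left' keeps the labels of the calling Series, 'right' those of the other one, 'inner' only the shared labels, and 'outer' all of them; na_rep fills whichever side is missing.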

Use the extract() method to pull specific content out of each string.

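# Extract with a regular expression
# expand defaults to True and returns a DataFrame; with expand=False the result is a Series/Index/DataFrame
s = pd.Series(['a1', 'b2', 'c3'])
s.str.extract(r'([ab])(\d)', expand=False) # a pattern with several groups returns a DataFrame, one column per group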

	0	1
0	a	1
1	b	2
2	NaN	NaN

# Named groups set the column names of the extracted result
s.str.extract(r'(?P<letter>[ab])(?P<digit>\d)', expand=False)

	letter	digit
0	a	1
1	b	2
2	NaN	NaN
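Both examples above use two groups, so expand=False still yields a DataFrame. The effect of expand becomes visible with a single group, as in this sketch; the names as_series and as_frame are hypothetical.

```python
import pandas as pd

s = pd.Series(['a1', 'b2', 'c3'])

# One capture group with expand=False: the result is a Series
as_series = s.str.extract(r'[ab](\d)', expand=False)

# The same pattern with expand=True: a DataFrame with a single column
as_frame = s.str.extract(r'[ab](\d)', expand=True)

print(type(as_series).__name__, type(as_frame).__name__)
```

Rows that do not match the pattern ('c3' here) come back as missing values in both forms.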

Last updated: 2023/11/01, 03:11:44