Python 中正则表达式的用法

部分正则语法

=> r"\n" 表示两个字符 \ 和 n，而 "\n" 表示一个字符，即换行符。

=> {m,n}? 匹配尽可能少的字符

=> [] 用来表示一组字符，在 [] 中特殊字符失去它们特殊的含义，即 [(+*)] 就匹配 ( + * ) 这四个字符。

=> [^5] 匹配所有不是 5 的字符，[^^] 匹配所有不是 ^ 的字符，在 [] 中，如果 ^ 不是第一个字符，那么它就没有特殊意义。

=> (?aiLmsux) 放在正则表达式前，用来开启正则表达式标志，这个表达式匹配空字符串。其中 ‘a’ 表示 re.A，’i’ 表示 re.I ……

re.match(r'(?i)ab|AB', 'aB')

=> (?:...) 表示不捕获，有时候需要使用括号，但是又不想捕获分组，这个时候可用此法。

=> (?P<name>...) 用来给一个分组进行命名，一个正则表达式中如果有多个分组，尤其是分组之间有嵌套的时候，后向引用分组的时候，使用数字就显得很麻烦且容易出错，而给一个分组命名后，就方便引用了。引用的时候使用 (?P=name) 即可。

re.match(r'(?P<underscore>_{1,2})abc(?P=underscore)', '__abc__')

=> (?#...) 这里表示一个注释，括号中的内容会被忽略。

=> (?=...) 前向断言，如下例子，只匹配后面跟着 def 的 abc：

re.match(r'abc(?=def)', 'abcdef')

=> (?!...) 如下，匹配后面不跟 d 的 abc：

re.match(r'abc(?!d)', 'abce')

=> (?<=...) 和 (?<!...)

>>> m = re.search('(?<=abc)def', 'abcdef')
>>> m.group(0)
'def'

=> (?(id/name)yes-pattern|no-pattern)

这里 id/name 表示一个分组的编号或者名称，即如果此前这个分组已经匹配到了，那么就匹配 yes-pattern 否则匹配（可选）no-pattern

>>> re.match(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)', 'wangyu@163.com').group(0)
<<< 'wangyu@163.com'

>>> re.match(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)', '<wangyu@163.com>').group(0)
<<< '<wangyu@163.com>'

这里如果前面的 < 有匹配上，那么结尾就要匹配一个 > 否则匹配 $。

=> \A，匹配字符的开头，这和 ^ 是有差别的，\A 只会匹配一个位置，那就是字符串开头。而 ^ 在多行模式下，可以匹配每一行的开头。

=> \Z 匹配字符的结尾。

=> \b，匹配 \W 和 \w 的边界，如 re.search(r'ABC\b', 'ABC>')

=> \B，匹配 \w 与 \w 或者 \W 与 \W 的边界。

=> \d 和 \D 分别匹配数字和非数字。

=> \s 匹配空白符，即 [\t\n\r\f\v]。

=> \S 匹配非空白，即 [^\t\n\r\f\v]。

=> \w 匹配字符，ASCII 范围内等价于 [a-zA-Z0-9]

=> \W 匹配非 \w。

re 模块

=> re.compile

返回一个正则表达式对象，上面有很多下面介绍的在 re 模块上包含的方法。

re.compile(pattern, flags=0)

<<< r = re.compile(r'<(\d+)>')
<<< r.match('<323>').groups()
>>> ('323',)

=> re.search

在字符中搜索匹配的 pattern，如果有匹配则返回 match object，否则返回 None。

match = re.search(pattern, string)
if match:
    process(match)

=> re.match

如果字符串的开头与 pattern 匹配，则返回 match object，否则返回 None。

re.match(pattern, string, flags=0)

如果整个字符串都匹配，则返回 match object

re.fullmatch(pattern, string, flags=0)

=> re.split，使用正则表达式对字符串进行切分。

re.split(pattern, string, maxsplit=0, flags=0)

如果用于切分的正则表达式中包含分组，那么分组捕获的内容也会返回。

>>> re.split(r'<->', '123<->456<->789')
<<< ['123', '456', '789']

>>> re.split(r'<(-)>', '123<->456<->789')
<<< ['123', '-', '456', '-', '789']

=> re.findall

返回所有匹配的子字符串。

<<< re.findall(r'\d{3}', '123<->456<->789')
>>> ['123', '456', '789']

# 含有多个分组的时候，返回的是元组的列表
<<< re.findall(r'(\d)\d(\d)', '123<->456<->789')
>>> [('1', '3'), ('4', '6'), ('7', '9')]

=> re.finditer

返回 match object 的迭代器，由此可以得到所有的匹配项，用于处理较长的文本。

=> re.sub

这个方法用来从源字符串中匹配部分内容，然后通过一个模板构成新的字符串。

re.subn(pattern, repl, string, count=0, flags=0)

举例子，比如用 123->456 构造出 '456<-123'

pattern 部分匹配 123 和 456，然后 repl 部分使用匹配项，拼凑出目标字符串。

这里 repl 是一个字符串，要想使用 pattern 匹配的分组，可以使用 \1 这样的写法，对于命名分组，可以使用 \g<name> 这样的写法。

>>> re.sub(r'(?P<from>\d{3})->(?P<to>\d{3})', '\g<to><-\g<from>', '123->456')
<<< '456<-123'

Match 对象

=> Match.expand

<<< match = re.search(r'<(?P<name>\w+)@(\w+)\.(\w+)>', '<wangyu@163.com>')
<<< match.expand('mailto:\g<name>[at]\g<2>[dot]\g<3>')
>>> 'mailto:wangyu[at]163[dot]com'

=> Match.group

<<< match = re.search(r'<(?P<name>\w+)@(\w+)\.(\w+)>', '<wangyu@163.com>')
<<< match.group(0) # or match.group()
>>> '<wangyu@163.com>'

<<< match.group(0,1,2,3)
>>> ('<wangyu@163.com>', 'wangyu', '163', 'com')

=> Match.groups

以元组的形式返回所有分组。

<<< match.groups()
>>> ('wangyu', '163', 'com')

=> Match.groupdict

以字典的形式返回所有命名分组，未命名的分组会被忽略。

<<< match = re.search(r'<(?P<name>\w+)@(?P<host>\w+.\w+)>', '<wangyu@163.com>')
<<< match.groupdict()
>>> {'name': 'wangyu', 'host': '163.com'}

=> Match.start / Match.end

返回匹配的分组的起始和结束位置。

<< match = re.search(r'<(?P<name>\w+)@(?P<host>\w+.\w+)>', '<wangyu@163.com>')
<<< match.start(),match.start(1),match.start(2)
>>> (0, 1, 8)

<<< match.end(),match.end(1),match.end(2)
>>> (16, 7, 15)

=> Match.span

返回指定分组的起始和结束位置。

<<< match.span(),match.span(1),match.span(2)
>>> ((0, 16), (1, 7), (8, 15))