Recording the Way

Regex Syntax

2018/01/08

'.'

(Dot.) Matches any character except a newline.

'$'

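Matches the end of the string, or the position just before a newline at the end of the string; in MULTILINE mode it also matches just before every newline. It marks the end of the current line without consuming the newline character itself.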

'^'

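(Caret.) Matches the start of the string; in MULTILINE mode it also matches immediately after every newline, that is, right after the previous line's newline. It marks a position and does not consume any character.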

'?'

Matches the preceding character (or group) either once or not at all.

*?, +?, ??

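The '*', '+', and '?' quantifiers are greedy: they match as much text as possible. The forms *?, +? and ?? are non-greedy: they match as little text as possible and stop as soon as the rest of the pattern can match.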

{m,n}?

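Same as above: the non-greedy form of {m,n}, which matches between m and n repetitions but as few as possible.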
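A minimal sketch of the difference between greedy and non-greedy quantifiers. The sample strings are made up for illustration.

```python
import re

text = "<a> b <c>"

# Greedy: '.*' runs on to the last '>' in the string.
greedy = re.match(r"<.*>", text).group(0)

# Non-greedy: '.*?' stops at the first '>' that lets the pattern succeed.
lazy = re.match(r"<.*?>", text).group(0)

# The same applies to {m,n} versus {m,n}?.
greedy_count = re.match(r"a{2,4}", "aaaaa").group(0)
lazy_count = re.match(r"a{2,4}?", "aaaaa").group(0)

print(greedy, lazy, greedy_count, lazy_count)
```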

'\'

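Escape character: removes the special meaning of the character immediately after it, so that the character matches itself literally.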

[]

Special characters (except '\') lose their special meaning inside sets. A leading '^', a '-' between two characters, and ']' are handled specially, as described below.

Character classes such as \w or \S (defined below) are also accepted inside a set.

If the first character of the set is '^', all the characters that are not in the set will be matched. ^ has no special meaning if it’s not the first character in the set.

To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set.
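The set rules above can be checked with a few small patterns. This is only a sketch, and the input strings are invented for the example.

```python
import re

# Inside a set, '.', '*' and '+' are literal characters.
punct = re.findall(r"[.*+]", "a.b*c+d")

# A leading '^' negates the set.
non_digits = re.findall(r"[^0-9]", "a1b2")

# A '^' that is not the first character is literal.
carets = re.findall(r"[a^]", "a^b")

# A literal ']' can be escaped or placed first in the set.
escaped = re.findall(r"[\]x]", "x]y")
leading = re.findall(r"[]x]", "x]y")

# Character classes such as \w are accepted inside a set.
mixed = re.findall(r"[\w-]+", "foo-bar baz")

print(punct, non_digits, carets, escaped, leading, mixed)
```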

(?aiLmsux)

(One or more letters from the set 'a', 'i', 'L', 'm', 's', 'u', 'x'.)

The group matches the empty string (that is, the group itself consumes no characters); the letters set the corresponding flags: re.A (ASCII-only matching), re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all, including newlines), re.U (Unicode matching), and re.X (verbose), for the entire regular expression.

This is useful if you wish to include the flags (one or several) as part of the regular expression, instead of passing a flag argument to the re.compile() function. Flags should be used first in the expression string.
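A short sketch comparing inline flags with the flags argument. The test strings are illustrative only.

```python
import re

# Inline flag at the start of the pattern ...
inline = re.search(r"(?i)hello", "Say HELLO")

# ... behaves like passing the flag as an argument.
with_arg = re.search(r"hello", "Say HELLO", flags=re.I)

# Several letters can be combined: ignore case + multi-line.
multi = re.findall(r"(?im)^b\w*", "apple\nBanana\nberry")

print(inline.group(0), with_arg.group(0), multi)
```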

(?:...)

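A non-capturing group: the pattern after the colon is matched, but the substring it matches is not captured, so it cannot be retrieved after the match or referenced later in the pattern.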
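A small sketch of how a non-capturing group changes what groups() returns. The pattern and input are invented for the example.

```python
import re

# A capturing group shows up in groups() ...
capturing = re.match(r"(ab)+(\d)", "abab7").groups()

# ... a non-capturing group still matches but is not reported.
non_capturing = re.match(r"(?:ab)+(\d)", "abab7").groups()

print(capturing, non_capturing)
```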

(?imsx-imsx:...)

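Zero or more letters from the set 'i', 'm', 's', 'x', optionally followed by '-' followed by one or more letters from the same set. Because the colon is what marks the group, the flag letters before it are optional.

The result is still a non-capturing group: the letters set (or, after '-', remove) the corresponding flags for the part of the expression inside the group only.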
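A sketch of flags scoped to a group, assuming Python 3.7 or later for the '-' form. The sample text is illustrative.

```python
import re

text = "ab Ab aB AB"

# Turn IGNORECASE on for 'a' only; 'b' must still be lowercase.
scoped_on = re.findall(r"(?i:a)b", text)

# Start with IGNORECASE everywhere, then switch it off for 'b' only.
scoped_off = re.findall(r"a(?-i:b)", text, flags=re.I)

print(scoped_on, scoped_off)
```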

(?P<name>...)

Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group name name, which must be unique within the pattern. A symbolic group is also a numbered group, just as if the group were not named, so it can be referenced by number as well as by name.

If the pattern is (?P<quote>['"]).*?(?P=quote) (i.e. matching a string quoted with either single or double quotes):

Context of reference to group "quote" | Ways to reference it
in the same pattern itself | (?P=quote) (as shown), \1
when processing match object m | m.group('quote'), m.end('quote') (etc.)
in a string passed to the repl argument of re.sub() | \g<quote>, \g<1>, \1
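The three contexts in the table above can be tried out as follows. This is a sketch only: the input text and the replacement templates are made up for the example.

```python
import re

# Backreference inside the pattern itself: (?P=quote)
pattern = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")
text = "say \"hi\" or 'bye'"

# Processing the match object: by name or by number.
m = pattern.search(text)
whole = m.group(0)
by_name = m.group("quote")
by_number = m.group(1)
quote_end = m.end("quote")

# In the repl argument of re.sub(): \g<quote>, \g<1> or \1.
emptied = pattern.sub(r"\g<quote>\g<quote>", text)
emptied_numbered = pattern.sub(r"\1\g<1>", text)

print(whole, by_name, quote_end, emptied)
```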

(?P=name)

A backreference to a named group; it matches whatever text was matched by the earlier group named name.

(?#...)

A comment

(?=...)

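Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'. In other words, the group works as a zero-width check: the match succeeds only if the text after '=' immediately follows what was matched before the group. This is the positive form of lookahead.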

(?!...)

Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it’s not followed by 'Asimov'. It is the opposite of (?=...).
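Both lookahead forms, using the Isaac example from the text. A minimal sketch; the second name is just a convenient counterexample.

```python
import re

positive = re.compile(r"Isaac (?=Asimov)")
negative = re.compile(r"Isaac (?!Asimov)")

# Positive lookahead: only matches when 'Asimov' follows.
pos_hit = positive.search("Isaac Asimov")
pos_miss = positive.search("Isaac Newton")

# Negative lookahead: only matches when 'Asimov' does not follow.
neg_hit = negative.search("Isaac Newton")
neg_miss = negative.search("Isaac Asimov")

# The lookahead consumes nothing: the match ends right after 'Isaac '.
print(pos_hit.group(0), pos_hit.end(), neg_hit.group(0))
```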

(?<=...)

Matches if the current position in the string is preceded by a match for ... that ends at the current position. This is called a positive lookbehind assertion. (?<=abc)def will find a match in abcdef. Compare (?=...). The contained pattern must only match strings of some fixed length, meaning that abc or a|b are allowed, but a* and a{3,4} are not. (That is, the pattern after '=' must have a fixed length, which is not required in (?=...).)

(?<!...)

Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. The syntax and the fixed-length restriction are the same as for (?<=...).
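A sketch of both lookbehind forms and of the fixed-length restriction. The strings other than abcdef are invented for illustration.

```python
import re

# Positive lookbehind: 'def' must be preceded by 'abc'.
m = re.search(r"(?<=abc)def", "abcdef")
matched, start = m.group(0), m.start()

# Negative lookbehind: 'def' must NOT be preceded by 'abc'.
blocked = re.search(r"(?<!abc)def", "abcdef")
allowed = re.search(r"(?<!abc)def", "xyzdef")

# A variable-width lookbehind is rejected when the pattern is compiled.
fixed_width_error = None
try:
    re.compile(r"(?<=a*)b")
except re.error as exc:
    fixed_width_error = str(exc)

print(matched, start, allowed.group(0), fixed_width_error)
```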

(?(id/name)yes-pattern|no-pattern)

Will try to match with yes-pattern if the group with given id or name exists, and with no-pattern if it doesn’t. no-pattern is optional and can be omitted. For example, (<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$) is a poor email matching pattern, which will match with '<user@host.com>' as well as 'user@host.com', but not with '<user@host.com' nor 'user@host.com>'.
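The email pattern above can be run against the four strings mentioned in the text. A sketch using re.match(), so that each attempt is anchored at the start of the string.

```python
import re

pattern = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

candidates = [
    "<user@host.com>",  # opening and closing bracket
    "user@host.com",    # no brackets at all
    "<user@host.com",   # missing closing bracket
    "user@host.com>",   # stray closing bracket
]

# True where the pattern matches from the start of the string.
results = [pattern.match(s) is not None for s in candidates]
print(results)
```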


The special sequences consist of '\' and a character from the list below. If the ordinary character is not an ASCII digit or an ASCII letter, then the resulting RE will match the second character. For example, \$ matches the character '$'.

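In other words, apart from the sequences listed below, '\' simply escapes the character that follows it. It keeps this escaping role inside [] as well.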

\number

Matches the contents of the group of the same number.

\A

Matches only at the start of the string (the zero-width position before the first character). The counterpart of \Z.

\b

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of Unicode alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore Unicode character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'. In other words, \b does not match a character at all; it matches the zero-width position where a word character (letter, digit, or underscore) meets a non-word character or the edge of the string.

\B

Matches the empty string, but only when it is not at the beginning or end of a word. This means that r'py\B' matches 'python', 'py3', 'py2', but not 'py', 'py.', or 'py!'. \B is just the opposite of \b: it matches the zero-width position where the characters on both sides are word characters, or both are non-word characters.
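The word lists from the \b and \B descriptions can be checked directly. A minimal sketch.

```python
import re

foo_inputs = ["foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3"]
py_inputs = ["python", "py3", "py2", "py", "py.", "py!"]

# \b: 'foo' must stand as a whole word.
foo_hits = [s for s in foo_inputs if re.search(r"\bfoo\b", s)]

# \B: 'py' must be followed by another word character.
py_hits = [s for s in py_inputs if re.search(r"py\B", s)]

# \b is zero-width: the span covers only 'foo' itself.
span = re.search(r"\bfoo\b", "bar foo baz").span()

print(foo_hits, py_hits, span)
```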

\d

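Essentially matches [0-9]. This is exact when the ASCII flag is used; for Unicode (str) patterns it also matches decimal digits from other scripts.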

\D

The opposite of \d: matches any character that is not a decimal digit.

\s

Essentially matches only whitespace, i.e. characters that print nothing visible, such as space, newline, horizontal tab and vertical tab: [ \t\n\r\f\v].

\S

The opposite of the above: [^ \t\n\r\f\v].

\w

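Essentially matches [a-zA-Z0-9_]. It uses the same notion of a word character as \b, except that \b only marks a boundary whereas \w actually matches (consumes) a character.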

\W

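The opposite of the above: [^a-zA-Z0-9_]. It relates to \B in the same way: \B only asserts a position, whereas \W actually matches (consumes) a character.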

\Z

Matches only at the end of the string. The counterpart of \A.
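A sketch contrasting \A and \Z with '^' and '$' in MULTILINE mode, where the difference becomes visible. The two-line sample string is invented for the example.

```python
import re

text = "alpha\nbeta"

# In MULTILINE mode '^' and '$' match at every line ...
caret_hits = re.findall(r"^\w+", text, flags=re.M)
dollar_hits = re.findall(r"\w+$", text, flags=re.M)

# ... while \A and \Z still match only at the very start and end.
start_hits = re.findall(r"\A\w+", text, flags=re.M)
end_hits = re.findall(r"\w+\Z", text, flags=re.M)

print(caret_hits, dollar_hits, start_hits, end_hits)
```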


Module Contents

re.compile(pattern, flags=0)

The sequence

prog = re.compile(pattern)
result = prog.match(string)

is equivalent to

result = re.match(pattern, string)
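A small sketch of the equivalence, with a pattern and inputs chosen only for illustration. Compiling once is mainly worthwhile when the same expression is reused many times.

```python
import re

inputs = ("42", "x42")

# Compile once, reuse the pattern object.
prog = re.compile(r"\d+")
compiled = [prog.match(s) is not None for s in inputs]

# Same result via the module-level function.
direct = [re.match(r"\d+", s) is not None for s in inputs]

print(compiled, direct)
```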

re.A

re.ASCII

Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching.

re.DEBUG

Display debug information about compiled expression.

re.I

re.IGNORECASE

re.L

re.LOCALE

re.M

re.MULTILINE

When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline), and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline).

re.S

re.DOTALL

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
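A sketch showing the effect of re.M and re.S on a two-line string. The string is made up for the example.

```python
import re

text = "first\nsecond"

# Without MULTILINE, '^' and '$' only anchor to the whole string.
plain_lines = re.findall(r"^\w+$", text)

# With MULTILINE, they anchor to every line.
multi_lines = re.findall(r"^\w+$", text, flags=re.M)

# Without DOTALL, '.' refuses to match the newline.
plain_dot = re.search(r"first.second", text)

# With DOTALL, '.' matches the newline too.
dotall_dot = re.search(r"first.second", text, flags=re.S)

print(plain_lines, multi_lines, plain_dot, dotall_dot is not None)
```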

re.X

re.VERBOSE

re.search(pattern, string, flags=0)

Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object.

re.match(pattern, string, flags=0)

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object.

re.fullmatch(pattern, string, flags=0)

If the whole string matches the regular expression pattern, return a corresponding match object.
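The three functions differ only in where the match is allowed to start and end. A minimal sketch with an invented sample string:

```python
import re

s = "abc123"

# search(): the match may start anywhere.
found = re.search(r"\d+", s)

# match(): the match must start at the beginning of the string.
at_start_digits = re.match(r"\d+", s)
at_start_letters = re.match(r"[a-z]+", s)

# fullmatch(): the match must cover the whole string.
partial = re.fullmatch(r"[a-z]+", s)
complete = re.fullmatch(r"[a-z]+\d+", s)

print(found.group(0), at_start_letters.group(0), complete.group(0))
```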

re.split(pattern, string, maxsplit=0, flags=0)

Split string by the occurrences of pattern.

re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings.

re.finditer(pattern, string, flags=0)

Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string.
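A sketch of split(), findall() and finditer() on the same illustrative input:

```python
import re

text = "one, two,three"

# split(): cut the string at every occurrence of the pattern.
parts = re.split(r",\s*", text)

# maxsplit limits the number of cuts.
first_cut = re.split(r",\s*", text, maxsplit=1)

# findall(): a list of matched strings.
words = re.findall(r"\w+", text)

# finditer(): match objects, so positions are available too.
positions = [(m.group(0), m.start()) for m in re.finditer(r"\w+", text)]

print(parts, first_cut, words, positions)
```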

re.purge()

Clear the regular expression cache.

Regular Expression Objects

regex.search(string[, pos[, endpos]])

regex.match(string[, pos[, endpos]])

regex.fullmatch(string[, pos[, endpos]])

regex.split(string, maxsplit=0)

Identical to the split() function, using the compiled pattern.

regex.findall(string[, pos[, endpos]])

Similar to the findall() function, using the compiled pattern, but also accepts optional pos and endpos parameters that limit the search region.

regex.finditer(string[, pos[, endpos]])

Similar to the finditer() function, using the compiled pattern, but also accepts optional pos and endpos parameters that limit the search region.
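A sketch of the pos and endpos arguments on a compiled pattern. The input strings are illustrative.

```python
import re

word = re.compile(r"\w+")
text = "one two three"

# pos: start searching at index 4.
from_pos = word.findall(text, 4)

# endpos: only the region text[4:7] is searched.
in_region = word.findall(text, 4, 7)

# pos shifts where search() may start looking.
d = re.compile(r"d")
hit = d.search("dog")
miss = d.search("dog", 1)

print(from_pos, in_region, hit is not None, miss)
```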

regex.pattern

The pattern string from which the RE object was compiled.

Match Objects

match.expand(template)

match.group([group1, ...])

Returns one or more subgroups of the match.

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0) # The entire match
'Isaac Newton'
>>> m.group(1) # The first parenthesized subgroup.
'Isaac'
>>> m.group(2) # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2) # Multiple arguments give us a tuple.
('Isaac', 'Newton')

If the pattern uses the (?P<name>...) syntax, groups can also be requested by name:

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.group('first_name')
'Malcolm'
>>> m.group('last_name')
'Reynolds'

Named groups can still be referred to by their index:

>>> m.group(1)
'Malcolm'
>>> m.group(2)
'Reynolds'

If a group matches multiple times, only the last match is accessible:

>>> m = re.match(r"(..)+", "a1b2c3") # Matches 3 times (greedy), two characters each time.
>>> m.group(1) # Returns only the last match.
'c3'

match.__getitem__(g)

This is identical to m.group(g).

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m[0] # The entire match
'Isaac Newton'
>>> m[1] # The first parenthesized subgroup.
'Isaac'
>>> m[2] # The second parenthesized subgroup.
'Newton'

match.groups(default=None)

Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern.

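>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
>>> m.groups()
('24', '1632')

If the decimal point and everything after it are made optional, not every group may take part in the match. Such groups default to None unless the default argument is given:

>>> m = re.match(r"(\d+)\.?(\d+)?", "24")
>>> m.groups() # Second group defaults to None.
('24', None)
>>> m.groups('0') # Now, the second group defaults to '0'.
('24', '0')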

match.groupdict(default=None)

Return a dictionary containing all the named subgroups of the match, keyed by the subgroup name.

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.groupdict()
{'first_name': 'Malcolm', 'last_name': 'Reynolds'}

match.start([group])

match.end([group])

Return the indices of the start and end of the substring matched by group; group defaults to zero (meaning the whole matched substring). Return -1 if group exists but did not contribute to the match.

match.span([group])

For a match m, return the 2-tuple (m.start(group), m.end(group)).
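A sketch of start(), end() and span(), including the -1 case for a group that did not take part in the match. The email string and patterns are illustrative.

```python
import re

# Use start() and end() to cut the matched text out of a string.
email = "tony@tiremove_thisger.net"
m = re.search(r"remove_this", email)
cleaned = email[:m.start()] + email[m.end():]

# A group that exists but did not contribute reports -1.
m2 = re.match(r"(a)?(b)", "b")
missing_start = m2.start(1)
second_span = m2.span(2)

print(cleaned, m.span(), missing_start, second_span)
```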

match.pos

The value of pos which was passed to the search() or match() method of a regex object.

match.endpos

match.lastindex

match.lastgroup

match.re

The regular expression object whose match() or search() method produced this match instance.

match.string

The string passed to match() or search().
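The match attributes listed above can be inspected on a single match object. A minimal sketch reusing the named-group pattern from the earlier examples:

```python
import re

pattern = re.compile(r"(?P<first_name>\w+) (?P<last_name>\w+)")
m = pattern.match("Malcolm Reynolds")

# The pattern object and input string that produced the match.
same_pattern = m.re is pattern
source = m.string

# Search region: defaults to the whole string.
region = (m.pos, m.endpos)

# Index and name of the last group that matched.
last = (m.lastindex, m.lastgroup)

print(same_pattern, source, region, last)
```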
