# Introduction to Regular Expressions in Python


1. Introduction to regular expressions

2. What is a regular expression, and how do you compile one?

import re
regex = re.compile(r'\s+')  # raw string keeps the backslash intact
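
As a small illustration of what compiling gives you, the sketch below inspects the compiled object and reuses it. The sample string `'a  b\tc'` is made up for this example and is not part of the original text.

```python
import re

regex = re.compile(r'\s+')

# The compiled object remembers its pattern and exposes the matching methods.
print(regex.pattern)
#> \s+
parts = regex.split('a  b\tc')
print(parts)
#> ['a', 'b', 'c']

# Module-level functions accept the same pattern string directly.
same = re.split(r'\s+', 'a  b\tc') == parts
print(same)
#> True
```

Compiling once is mostly a convenience when the same pattern is used in several places.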


3. How do you split a string with a regular expression?

text = """101 COM    Computers
205 MAT   Mathematics
189 ENG   English"""


1. Use re.split()
2. Call the split() method of the compiled regex object, i.e. regex.split()
# split the text around 1 or more whitespace characters
re.split(r'\s+', text)
# or
regex.split(text)
#> ['101', 'COM', 'Computers', '205', 'MAT', 'Mathematics', '189', 'ENG', 'English']
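
Splitting the whole text at once flattens every record into one list. A minimal sketch, assuming the same three-line `text` as above, that splits line by line and uses the `maxsplit` argument so that anything after the course code stays in one field:

```python
import re

text = """101 COM    Computers
205 MAT   Mathematics
189 ENG   English"""

regex = re.compile(r'\s+')

# Split each line separately; maxsplit=2 stops after the first two separators,
# so everything after the course code stays together in the last field.
rows = [regex.split(line, maxsplit=2) for line in text.splitlines()]
print(rows)
#> [['101', 'COM', 'Computers'], ['205', 'MAT', 'Mathematics'], ['189', 'ENG', 'English']]
```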


4. Pattern matching with findall, search, and match

#### 4.1 re.findall()

# find all numbers within the text
print(text)
regex_num = re.compile(r'\d+')
regex_num.findall(text)
#> 101 COM    Computers
#> 205 MAT   Mathematics
#> 189 ENG   English
#> ['101', '205', '189']
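
findall() returns only the matched strings. When the positions matter too, `finditer()` can be used instead; the following is a minimal sketch on the same text, with the variable name `positions` chosen only for illustration.

```python
import re

text = """101 COM    Computers
205 MAT   Mathematics
189 ENG   English"""

regex_num = re.compile(r'\d+')

# finditer() yields one match object per hit instead of a plain string,
# so both the matched text and its (start, end) span are available.
positions = [(m.group(), m.span()) for m in regex_num.finditer(text)]
print(positions[0])
#> ('101', (0, 3))

# Every span slices back to exactly the text that was matched.
print(all(text[start:end] == number for number, (start, end) in positions))
#> True
```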


#### 4.2 re.search() vs re.match()

# define the text
text2 = """COM    Computers
205 MAT   Mathematics 189"""

# compile the regex and search the pattern
regex_num = re.compile(r'\d+')
s = regex_num.search(text2)

print('Starting Position: ', s.start())
print('Ending Position: ', s.end())
print(text2[s.start():s.end()])
#> Starting Position:  17
#> Ending Position:  20
#> 205


print(s.group())
#> 205



m = regex_num.match(text2)
print(m)
#> None
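
match() returns None above because `text2` does not begin with a digit. The contrast can be sketched with two short strings (made up for this example): match() only looks at position 0, while search() scans forward.

```python
import re

regex_num = re.compile(r'\d+')

# match() succeeds only when the pattern matches at position 0.
m = regex_num.match('205 MAT Mathematics')
print(m.group())
#> 205

# search() scans forward and returns the first match anywhere in the string.
s = regex_num.search('MAT 205 Mathematics')
print(s.group(), s.span())
#> 205 (4, 7)

# Both return None on failure, so check before calling .group().
missing = regex_num.search('no digits here')
print(missing is None)
#> True
```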


5. How do you substitute text with a regular expression?

# define the text
text = """101   COM \t  Computers
205   MAT \t  Mathematics
189   ENG  \t  English"""
print(text)
#> 101   COM    Computers
#> 205   MAT     Mathematics
#> 189   ENG     English


# replace one or more whitespace characters with a single space
regex = re.compile(r'\s+')
print(regex.sub(' ', text))
# or
print(re.sub(r'\s+', ' ', text))
#> 101 COM Computers 205 MAT Mathematics 189 ENG English


# get rid of all extra whitespace except newlines
regex = re.compile(r'((?!\n)\s+)')
print(regex.sub(' ', text))
#> 101 COM Computers
#> 205 MAT Mathematics
#> 189 ENG English
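
One caveat: the lookahead `(?!\n)` only checks the first character of each whitespace run, so a run that starts with a space and then reaches a newline is still replaced as a whole. The sketch below shows this on a hypothetical sample with a trailing space, and offers the character class `[^\S\n]` as one possible alternative (a suggestion, not part of the original).

```python
import re

# Hypothetical sample with a trailing space before the newline.
sample = "101   COM \t  Computers \n205   MAT \t  Mathematics"

# (?!\n)\s+ : the run starting at the trailing space also consumes the newline.
lookahead = re.sub(r'(?!\n)\s+', ' ', sample)
print(repr(lookahead))
#> '101 COM Computers 205 MAT Mathematics'

# [^\S\n]+ : "not non-whitespace and not newline", i.e. whitespace except \n.
no_newline = re.sub(r'[^\S\n]+', ' ', sample)
print(repr(no_newline))
#> '101 COM Computers \n205 MAT Mathematics'
```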


6. Regex groups

text = """101   COM   Computers
205   MAT   Mathematics
189   ENG    English"""

# 1. extract all course numbers
re.findall('[0-9]+', text)

# 2. extract all course codes
re.findall('[A-Z]{3}', text)

# 3. extract all course names
re.findall('[A-Za-z]{4,}', text)

#> ['101', '205', '189']
#> ['COM', 'MAT', 'ENG']
#> ['Computers', 'Mathematics', 'English']


# define the course text pattern groups and extract
course_pattern = r'([0-9]+)\s*([A-Z]{3})\s*([A-Za-z]{4,})'
re.findall(course_pattern, text)
#> [('101', 'COM', 'Computers'), ('205', 'MAT', 'Mathematics'), ('189', 'ENG', 'English')]
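
Tuples are compact, but the reader has to remember which position holds which field. As a sketch, the same pattern can be written with named groups; the group names `number`, `code`, and `name` are chosen here for illustration only.

```python
import re

text = """101   COM   Computers
205   MAT   Mathematics
189   ENG    English"""

# (?P<name>...) gives each group a name; the names here are illustrative.
course_pattern = re.compile(
    r'(?P<number>[0-9]+)\s*(?P<code>[A-Z]{3})\s*(?P<name>[A-Za-z]{4,})'
)

# finditer() yields match objects, so groups can be read by name.
courses = [m.groupdict() for m in course_pattern.finditer(text)]
print(courses[0])
#> {'number': '101', 'code': 'COM', 'name': 'Computers'}

codes = [m.group('code') for m in course_pattern.finditer(text)]
print(codes)
#> ['COM', 'MAT', 'ENG']
```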


7. Greedy matching

text = "<body>Regex Greedy Matching Example </body>"
re.findall('<.*>', text)
#> ['<body>Regex Greedy Matching Example </body>']


re.findall('<.*?>', text)
#> ['<body>', '</body>']


re.search('<.*?>', text).group()
#> '<body>'
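
The lazy `?` suffix is one way to stop at the first closing bracket. Another common approach, sketched here as an alternative, is a negated character class; the second half shows that the same suffix also applies to counted quantifiers.

```python
import re

text = "<body>Regex Greedy Matching Example </body>"

# [^>]* cannot run past a closing bracket, so no lazy quantifier is needed.
tags = re.findall(r'<[^>]*>', text)
print(tags)
#> ['<body>', '</body>']

# The ? suffix makes counted quantifiers lazy as well.
greedy = re.search(r'\d{2,4}', '2015').group()
lazy = re.search(r'\d{2,4}?', '2015').group()
print(greedy, lazy)
#> 2015 20
```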


8. Common regular expression syntax

Basic expressions

.             One character except new line
\.            A period. \ escapes a special character.
\d            One digit
\D            One non-digit
\w            One word character including digits
\W            One non-word character
\s            One whitespace
\S            One non-whitespace
\b            Word boundary
\n            Newline
\t            Tab

$             End of string
^             Start of string
ab|cd         Matches ab or cd.
[ab-d]        One character of: a, b, c, d
[^ab-d]       One character except: a, b, c, d
()            Items within parentheses are captured
(a(bc))       Items within the nested parentheses are captured as well

[ab]{2}       Exactly 2 continuous occurrences of a or b
[ab]{2,5}     2 to 5 continuous occurrences of a or b
[ab]{2,}      2 or more continuous occurrences of a or b
+             One or more
*             Zero or more
?             0 or 1
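
A few of the entries above are easier to see in action. The snippet below is a short sketch with made-up sample strings, covering the anchors, nested groups, and a character class that mixes a literal with a range.

```python
import re

# ^ and $ anchor to the whole string by default...
whole = re.findall(r'^\d+', '101 COM\n205 MAT')
print(whole)
#> ['101']

# ...and to every line when re.MULTILINE is set.
per_line = re.findall(r'^\d+', '101 COM\n205 MAT', flags=re.MULTILINE)
print(per_line)
#> ['101', '205']

# Nested parentheses: groups are numbered by their opening parenthesis.
m = re.search(r'(a(bc))', 'xabcx')
print(m.group(1), m.group(2))
#> abc bc

# [ab-d] is the literal a plus the range b-d.
chars = re.findall(r'[ab-d]', 'abcdef')
print(chars)
#> ['a', 'b', 'c', 'd']
```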


9. Regular expression examples

#### 9.1. Any character (except newline)

text = 'ziiai.com'
print(re.findall('.', text))  # .   Any character except for a new line
print(re.findall('...', text))
#> ['z', 'i', 'i', 'a', 'i', '.', 'c', 'o', 'm']
#> ['zii', 'ai.', 'com']


#### 9.2. A period (".")

text = 'ziiai.com'
print(re.findall(r'\.', text))  # matches a period
print(re.findall(r'[^\.]', text))  # matches anything but a period
#> ['.']
#> ['z', 'i', 'i', 'a', 'i', 'c', 'o', 'm']


#### 9.3. Any digit

text = '01, Jan 2015'
print(re.findall(r'\d+', text))  # \d  Any digit. The + requires at least 1 digit.
#> ['01', '2015']


#### 9.4. Any non-digit

text = '01, Jan 2015'
print(re.findall(r'\D+', text))  # \D  Anything but a digit
#> [', Jan ']


#### 9.5. Any word character (including digits)

text = '01, Jan 2015'
print(re.findall(r'\w+', text))  # \w  Any word character (letter, digit, or underscore)
#> ['01', 'Jan', '2015']


#### 9.6. Any non-word character

text = '01, Jan 2015'
print(re.findall(r'\W+', text))  # \W  Anything but a word character
#> [', ', ' ']


#### 9.7. Character sets

text = '01, Jan 2015'
print(re.findall('[a-zA-Z]+', text))  # [] Matches any character inside
#> ['Jan']


#### 9.8. Number of consecutive occurrences

text = '01, Jan 2015'
print(re.findall(r'\d{4}', text))  # {n} matches exactly n repetitions
print(re.findall(r'\d{2,4}', text))  # {m,n} matches m to n repetitions
#> ['2015']
#> ['01', '2015']


#### 9.9. One or more occurrences

print(re.findall(r'Co+l', 'So Cooool'))  # Match for 1 or more occurrences
#> ['Cooool']


#### 9.10. Zero or more occurrences

print(re.findall(r'Pi*lani', 'Pilani'))
#> ['Pilani']


#### 9.11. Zero or one occurrence

print(re.findall(r'colou?r', 'color'))
#> ['color']


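#### 9.12. Matching word boundaries

"\b" is commonly used to detect and match the start or end of a word. It matches a position where one side is a word character and the other side is a non-word character (such as whitespace) or the edge of the string.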

re.findall(r'\btoy\b', 'play toy broke toys')  # match toy with boundary on both sides
#> ['toy']
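
Two details of `\b` are worth a quick check: it also matches next to punctuation and at the edges of the string, and it only works inside a raw string. The sample sentences below are made up for this sketch.

```python
import re

# \b also matches next to punctuation and at the very start or end of the string.
bounded = re.findall(r'\btoy\b', 'toy, toy-box, toys, my toy')
print(bounded)
#> ['toy', 'toy', 'toy']

# Without the raw-string prefix, '\b' is a backspace character, not a boundary.
backspace = re.findall('\btoy\b', 'play toy broke toys')
print(backspace)
#> []
```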


10. Summary