Matching text patterns
Support
Python supports regular expressions through the built-in re module.
Overview of Regex
Search expression
A regular expression search in Python can be written as:
match = re.search(pattern, string)
re.search() takes the following parameters:
a) pattern: a regular expression pattern
b) string: the string in which we want to find the pattern
This method returns:
a) a match object if the search is successful
b) None otherwise
Let's consider an example:
>>> import re # Import the re module
>>> pattern = r'word' # Define the regex pattern
>>> string = 'Test word' # Define the string to search in
>>> match = re.search(pattern, string) # Search the string for the pattern
>>> match # A match object is returned when the search succeeds
<re.Match object; span=(5, 9), match='word'>
>>> if match:
...     print('Matched,', match.group()) # If matched, print the matched text
... else:
...     print('Oops, not found')
...
Matched, word
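Because re.search() returns None when nothing matches, calling .group() on the result without a check raises AttributeError. The following is a minimal sketch of a guard, assuming Python 3; the helper name first_match is hypothetical and not part of the re module.

```python
import re

def first_match(pattern, text, default=None):
    """Return the first matched substring, or `default` if there is no match.

    `first_match` is a hypothetical helper used for illustration only.
    """
    match = re.search(pattern, text)
    # re.search() returns None on failure, so test before calling .group()
    return match.group() if match else default

print(first_match(r'word', 'Test word'))             # word
print(first_match(r'missing', 'Test word', 'n/a'))   # n/a
```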
Did you notice the 'r' prefix when defining the pattern? It marks a raw string, which passes backslashes through unchanged. In a raw string you don't have to escape backslashes with \\ (e.g. "\\s+" equals r"\s+"). In the Python interpreter:
>>> "\\s+"
'\\s+'
>>> r'\s+'
'\\s+'
So while it's a very good convention to prefix all regular expressions with r, it's not technically necessary when an expression doesn't have any backslashes (e.g. "[0-9a-z]").
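To make the raw-string point concrete, here is a small sketch, assuming Python 3, that compares the two spellings and shows one case where dropping the r prefix silently changes the pattern. The sample text 'room 42' is an assumption made for the demonstration.

```python
import re

escaped = "\\d+"   # regular string: the backslash must be doubled
raw = r"\d+"       # raw string: the backslash is kept as typed

# Both spellings produce the same three characters: \ d +
assert escaped == raw
assert len(raw) == 3

# Without the r prefix, "\b" is a single backspace character,
# not the two-character regex code for a word boundary
assert len("\b") == 1 and len(r"\b") == 2

same_result = (re.search(escaped, "room 42").group()
               == re.search(raw, "room 42").group())
print(same_result)   # True
```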
Escape sequences are:
Basic Patterns
The following are the basic patterns in regex:
- a, X, 9, < - ordinary characters just match themselves. Metacharacters do not match themselves because they have special meanings: . ^ $ * + ? { } [ ] \ | ( ). To match a metacharacter literally, precede it with a backslash (\).
- . (a period) - matches any single character except newline '\n'
- \w (lowercase w) - matches a "word" character: a letter, digit, or underscore [a-zA-Z0-9_]
- \b - boundary between word and non-word
- \s (lowercase s) - matches a single whitespace character: space, newline, return, tab, form feed [ \n\r\t\f]
- \t, \n, \r - tab, newline, return (respectively)
- \d - decimal digit [0-9]
- ^ = start, $ = end - match the start or end of the string
>>> match = re.search(r'.', 'Test word')
>>> match.group()
'T'
>>> match = re.search(r'..', 'Test word')
>>> match.group()
'Te'
>>> match = re.search(r'...', 'Test word')
>>> match.group()
'Tes'
>>> match = re.search(r'.w', 'Test word')
>>> match.group()
' w'
>>> match = re.search(r't.', 'Test word')
>>> match.group()
't '
>>> match = re.search(r'.w.', 'Test word')
>>> match.group()
' wo'
>>> match = re.search(r'wor..', 'Test word')
>>> match.group()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> match = re.search(r'wor.', 'Test word')
>>> match.group()
'word'
>>> match = re.search(r'\d', '1001 Test word')
>>> match.group()
'1'
>>> match = re.search(r'\d\d', '1001 Test word')
>>> match.group()
'10'
>>> match = re.search(r'\d\d\d', '1001 Test word')
>>> match.group()
'100'
>>> match = re.search(r'\w\w', '1001 Test word')
>>> match.group()
'10'
>>> match = re.search(r'\d\d\d', '1001 Test word @!#')
>>> match = re.search(r'\w\w', '1001 Test word @!#')
>>> match.group()
'10'
>>> match = re.search(r'\w\w\w', '@@abcd!!')
>>> match.group()
'abc'
>>> match = re.search(r'\w\w', '@#1$ 1001 Test word')
>>> match.group()
'10'
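The examples above exercise ., \d and \w. As a complement, this short sketch (assuming Python 3) tries the remaining codes from the list: \b, \s, ^ and $. The variable names and the 'swordfish' sample are chosen only for illustration.

```python
import re

text = 'Test word'

# \b matches the empty position between a word and a non-word character
boundary = re.search(r'\bword\b', text)
inside = re.search(r'\bword\b', 'swordfish')   # 'word' sits inside a larger word

# \s matches one whitespace character (here, the space at index 4)
space = re.search(r'\s', text)

# ^ anchors the pattern at the start of the string, $ at the end
starts = re.search(r'^Test', text)
ends = re.search(r'word$', text)
not_start = re.search(r'^word', text)          # 'word' is not at the start

print(boundary.group(), inside, space.start(),
      starts.group(), ends.group(), not_start)
# word None 4 Test word None
```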
Repetition
- (+) - one or more occurrences of the pattern to its left
- (*) - zero or more occurrences of the pattern to its left
- (?) - zero or one occurrence of the pattern to its left
Examples
>>> match = re.search(r'Te+', 'Teest Word')
>>> match.group()
'Tee'
>>> match = re.search(r'Te+', 'Teeeast Word')
>>> match.group()
'Teee'
>>> match = re.search(r'Te+', 'Test Word')
>>> match.group()
'Te'
>>> match = re.search(r'\d\s*\d\s*\d', 'sfad1 2 3asdxx')
>>> match.group()
'1 2 3'
>>> match = re.search(r'\d\s*\d\s*\d', ' 1234 ')
>>> match.group()
'123'
>>> match = re.search(r'\d\s*\d\s*\d\s*', '1 2 3 ')
>>> match.group()
'1 2 3 '
>>> match = re.search(r'\d\s*\d\s*\d\s*', ' 123 ')
>>> match.group()
'123 '
>>> match = re.search(r'\s*\d\s*\d\s*\d', ' 1234 ')
>>> match.group()
' 123'
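The examples above use + and * but not ?. The sketch below, assuming Python 3, adds a ? example and shows two consequences of the definitions: + takes as many characters as it can, and * can succeed by matching nothing at all. The 'colour' sample strings are assumptions chosen for the demonstration.

```python
import re

# ? makes the preceding item optional: zero or one occurrence
us = re.search(r'colou?r', 'color')
uk = re.search(r'colou?r', 'colour')

# + takes as many repetitions as it can find
longest = re.search(r'e+', 'Teeeast')

# * may match zero characters, so this succeeds with an empty match at index 0
empty = re.search(r'e*', 'Test')

print(us.group(), uk.group(), longest.group(), repr(empty.group()))
# color colour eee ''
```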
Finding e-mails: an example
>>> match = re.search(r'\w+@\w+', 'This is the search for abc@domian example')
>>> match.group()
'abc@domian'
Usage of square brackets
- Used to indicate a set of characters, i.e. [abc] matches a or b or c
- The codes \w, \s etc. work inside square brackets too, with the one exception that dot (.) just means a literal dot. For the e-mail problem, square brackets are an easy way to add '.' and '-' to the set of characters that can appear around the @, giving the pattern r'[\w.-]+@[\w.-]+' to match the whole e-mail address.
>>> match = re.search(r'[\w.]+@[\w.]', 'This is the search for abc@domian example')
>>> match.group()
'abc@d'
>>> match = re.search(r'[\w.]+@[\w.]+', 'This is the search for abc@domian example')
>>> match.group()
'abc@domian'
>>> match = re.search(r'[\w.]+@[\w.]+', 'This is the search for abc@domian.com example')
>>> match.group()
'abc@domian.com'
>>> match = re.search(r'[\w.-]+@[\w.]+', 'This is the search for abc-def@domian.com example')
>>> match.group()
'abc-def@domian.com'
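As a side-by-side comparison, this sketch (assuming Python 3) runs the plain \w+ pattern and the bracketed pattern against the same text, so the effect of adding '.' and '-' to the set is visible. The sample sentence is an assumption made for the demonstration.

```python
import re

text = 'Contact abc-def@domain.com today.'

# \w+ alone stops at '-' and '.', so only part of the address is matched
partial = re.search(r'\w+@\w+', text)

# Adding '.' and '-' to the sets on both sides captures the whole address.
# Inside [], '.' is a literal dot; '-' is literal because it is placed last.
full = re.search(r'[\w.-]+@[\w.-]+', text)

# [abc] matches exactly one character out of a, b, or c
one_of = re.search(r'[abc]', 'xyzb')

print(partial.group(), full.group(), one_of.group())
# def@domain abc-def@domain.com b
```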
Usage of Parentheses ()
- Allow you to pick out parts of the matching text.
- Do not change what the pattern matches.
- Establish groups within the matched text, available as match.group(1), match.group(2), etc.
- Plain match.group() still returns the whole matched text, as usual.
>>> match = re.search(r'([\w.-]+)@([\w.-]+)', 'This is an example of group matching abc-def@gmail.com')
>>> match.group() # Whole match
'abc-def@gmail.com'
>>> match.group(1) # Text captured by the first group
'abc-def'
>>> match.group(2) # Text captured by the second group
'gmail.com'
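When a pattern has several groups, it is often convenient to fetch them all at once. A minimal sketch, assuming Python 3, using the match object's groups() method; the variable names user and host are chosen only for illustration.

```python
import re

match = re.search(r'([\w.-]+)@([\w.-]+)', 'Write to abc-def@gmail.com')

if match:
    # groups() returns every captured group as a tuple, in order
    user, host = match.groups()
    # group(0) is the same as group(): the whole matched text
    whole = match.group(0)
    print(user, host, whole)
    # abc-def gmail.com abc-def@gmail.com
```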
findall Function
Matches every occurrence of a pattern in a text.
- re.search() - finds the first match for a pattern.
- re.findall() - finds all matches and returns them as a list of strings.
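>>> match = re.findall(r'[\w.-]+@[\w.-]+', 'An example of findall abc-def@gmail.com, xyz-uvw@yahoo.com')
>>> match
['abc-def@gmail.com', 'xyz-uvw@yahoo.com']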
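The difference between the two functions can be sketched as follows, assuming Python 3. Note that findall returns an empty list rather than None when nothing matches, so its result can be looped over without a check. The sample strings are assumptions made for the demonstration.

```python
import re

text = 'Send to abc-def@gmail.com, xyz-uvw@yahoo.com'
pattern = r'[\w.-]+@[\w.-]+'

first = re.search(pattern, text).group()   # only the first address
every = re.findall(pattern, text)          # all addresses, as a list of strings
none_found = re.findall(pattern, 'no addresses here')  # empty list, not None

print(first)        # abc-def@gmail.com
print(every)        # ['abc-def@gmail.com', 'xyz-uvw@yahoo.com']
print(none_found)   # []
```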
findall & Groups
- () can be combined with findall.
- If the pattern has two or more groups, findall returns a list of tuples. Each tuple holds the groups of one match.
Let's take an example to see this:
>>> match = re.findall(r'([\w.-]+)@([\w.-]+)', 'An example of findall abc-def@gmail.com, xyz-uvw@yahoo.com')
>>> match # A list of tuples, one per match, with each group as a separate element
[('abc-def', 'gmail.com'), ('xyz-uvw', 'yahoo.com')]
>>> match[0]
('abc-def', 'gmail.com')
>>> match[0][0]
'abc-def'
>>> match[0][1]
'gmail.com'
>>> match[1][0]
'xyz-uvw'
>>> match[1][1]
'yahoo.com'
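Instead of indexing into the list, the tuples can be unpacked directly in a loop. The sketch below, assuming Python 3, also shows the one-group case, where findall returns a plain list of strings rather than tuples. The names pairs and hosts are chosen only for illustration.

```python
import re

text = 'abc-def@gmail.com, xyz-uvw@yahoo.com'

# Two groups: each result is a (user, host) tuple
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)
for user, host in pairs:
    print(user, 'at', host)

# One group: findall returns a plain list of that group's text, not tuples
hosts = re.findall(r'[\w.-]+@([\w.-]+)', text)
print(hosts)   # ['gmail.com', 'yahoo.com']
```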