Sunday, 23 February 2014

Python Regular Expressions (Regex!!)



Introduction
Regular expressions (regex) are a way of matching text patterns.

Support
Python supports regular expressions through its built-in re module.
A Tour Of Regex
Search expression
A regular expression search in Python is written as
match = re.search(pattern, string)

re.search() takes the following parameters:
a) pattern: a regular expression pattern
b) string: the string in which we want to find the pattern

Return value of this method:
a) A match object if the search is successful
b) None otherwise

Let's consider an example:
>>> import re                           # Import the re module
>>> pattern = r'word'                   # Define the regex pattern
>>> string = 'Test word'                # Define the string to search in
>>> match = re.search(pattern, string)  # Search the string for the pattern
>>> match                               # A match object is returned on success
<_sre.SRE_Match object at 0x018A1B10>
>>> if match:
...     print('Matched, ' + match.group())  # If matched, print the matched text
... else:
...     print('Oops, not found.')
...
Matched, word

Did you notice the 'r' in front of the pattern? It marks a raw string, which passes backslashes through unchanged. In a raw string you don't have to escape backslashes with \\ (e.g. "\\s+" equals r"\s+"). On the Python terminal:
>>> "\\s+"
'\\s+'
>>> r'\s+'
'\\s+'
So while it's a very good convention to prefix all regular expressions with r, it's not technically necessary when an expression doesn't have any backslashes (e.g. "[0-9a-z]").
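
The difference between raw and regular strings can be checked directly. The following is a minimal sketch; the variable names and the sample text 'a   b' are made up for illustration.

```python
import re

# A regular string needs doubled backslashes; a raw string does not.
escaped = "\\s+"
raw = r"\s+"
same_pattern = (escaped == raw)   # both are the three characters \ s +

# Without the r prefix, "\n" is ONE character (a newline);
# with it, r"\n" is TWO characters (a backslash and the letter n).
newline_len = len("\n")
raw_newline_len = len(r"\n")

# Both spellings give the same regex and find the same run of spaces.
text = "a   b"                    # hypothetical sample text
found_escaped = re.search(escaped, text).group()
found_raw = re.search(raw, text).group()
```

The raw form is usually easier to read, because the pattern looks exactly the way the regex engine will see it.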

Escape sequences are:
     
Basic Patterns
The following are the basic patterns in regex:

  • a, X, 9, < - Ordinary characters just match themselves. Metacharacters do not match themselves because they have special meanings: . ^ $ * + ? { } [ ] \ | ( ). To match one of them literally, precede it with a backslash, \.
  • . (a period) - matches any single character except newline '\n'
  • \w (lowercase w) - matches a "word" character: a letter, digit or underscore [a-zA-Z0-9_]
  • \b - boundary between word and non-word
  • \s (lowercase s) - matches a single whitespace character: space, newline, return, tab, form feed [ \n\r\t\f]
  • \t, \n, \r - tab, newline, return (respectively)
  • \d - decimal digit [0-9]
  • ^ = start, $ = end - match the start or end of the string
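
The boundary, whitespace and anchor patterns from the list can be tried out as follows. This is only a sketch: the sample text is invented, and it uses re.findall(), which is covered later in this post, to collect every match.

```python
import re

text = "cat concat cats"   # hypothetical sample text

# \b marks a word boundary, so r'\bcat\b' matches only the standalone word.
whole_word = re.findall(r'\bcat\b', text)

# Without \b, 'cat' is also found inside 'concat' and 'cats'.
anywhere = re.findall(r'cat', text)

# ^ anchors the match at the start of the string, $ at the end.
starts_with_cat = re.search(r'^cat', text) is not None
ends_with_cat = re.search(r'cat$', text) is not None   # the text ends in 'cats'

# \s matches one whitespace character (here, the tab).
whitespace = re.search(r'\s', "a\tb").group()
```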
Some basic examples on the Python terminal:

>>> match = re.search(r'.', 'Test word')
>>> match.group()
'T'
>>> match = re.search(r'..', 'Test word')
>>> match.group()
'Te'
>>> match = re.search(r'...', 'Test word')
>>> match.group()
'Tes'
>>> match = re.search(r'.w', 'Test word')
>>> match.group()
' w'
>>> match = re.search(r't.', 'Test word')
>>> match.group()
't '
>>> match = re.search(r'.w.', 'Test word')
>>> match.group()
' wo'
>>> match = re.search(r'wor..', 'Test word')
>>> match.group()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> match = re.search(r'wor.', 'Test word')
>>> match.group()
'word'
>>> match = re.search(r'\d', '1001 Test word')
>>> match.group()
'1'
>>> match = re.search(r'\d\d', '1001 Test word')
>>> match.group()
'10'
>>> match = re.search(r'\d\d\d', '1001 Test word')
>>> match.group()
'100'
>>> match = re.search(r'\w\w', '1001 Test word')
>>> match.group()
'10'
>>> match = re.search(r'\d\d\d', '1001 Test word @!#')
>>> match.group()
'100'
>>> match = re.search(r'\w\w', '1001 Test word @!#')
>>> match.group()
'10'
>>> match = re.search(r'\w\w\w', '@@abcd!!')
>>> match.group()
'abc'
>>> match = re.search(r'\w\w', '@#1$ 1001 Test word')
>>> match.group()
'10'
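
The AttributeError in the examples above appears because re.search() returns None when nothing matches. One way to guard against that is sketched below; first_match is a hypothetical helper name, not part of the re module.

```python
import re

def first_match(pattern, string):
    """Return the matched text, or None when the pattern is not found."""
    match = re.search(pattern, string)
    if match is None:
        return None
    return match.group()

found = first_match(r'wor.', 'Test word')      # the pattern fits 'word'
missing = first_match(r'wor..', 'Test word')   # one character too many
```

Checking for None before calling group() keeps a failed search from raising an exception.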

Repetition

  • (+) - One (1) or more occurrences of the pattern to its left
  • (*) - Zero (0) or more occurrences of the pattern to its left
  • (?) - Zero (0) or one (1) occurrence of the pattern to its left
So basically + and * find the leftmost match and then extend it as far as possible (they are greedy).
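
The three repetition operators can be compared side by side. This is a small sketch with made-up sample strings:

```python
import re

# + needs at least one 'e', so there is no match in 'Tst'.
plus_match = re.search(r'Te+', 'Tst')

# * also accepts zero occurrences, so just the 'T' is matched.
star_match = re.search(r'Te*', 'Tst').group()

# ? takes at most one 'e', even when more are available.
optional = re.search(r'Te?', 'Teest').group()

# Greedy: + grabs every 'e' it can reach.
greedy = re.search(r'e+', 'Teeest').group()

# Leftmost first: the search stops at the first place the pattern fits,
# even though a longer run of 'e' appears later in the string.
leftmost = re.search(r'e+', 'Te Teee').group()
```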

Examples

>>> match = re.search(r'Te+', 'Teest Word')
>>> match.group()
'Tee'
>>> match = re.search(r'Te+', 'Teeeast Word')
>>> match.group()
'Teee'
>>> match = re.search(r'Te+', 'Test Word')
>>> match.group()
'Te'
>>> match = re.search(r'\d\s*\d\s*\d', 'sfad1 2   3asdxx')
>>> match.group()
'1 2   3'
>>> match = re.search(r'\d\s*\d\s*\d', '    1234         ')
>>> match.group()
'123'
>>> match = re.search(r'\d\s*\d\s*\d\s*', '1 2   3    ')
>>> match.group()
'1 2   3    '
>>> match = re.search(r'\d\s*\d\s*\d\s*', '     123             ')
>>> match.group()
'123             '
>>> match = re.search(r'\s*\d\s*\d\s*\d', '    1234         ')
>>> match.group()
'    123'

Finding e-mails: an example

>>> match = re.search(r'\w+@\w+', 'This is the search for abc@domian example')
>>> match.group()
'abc@domian'
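
The pattern \w+@\w+ has a limitation: \w does not match '.' or '-', so a realistic address is only partly captured. The sketch below shows this on a made-up sentence.

```python
import re

text = 'Contact abc-def@domian.com today'   # hypothetical sample text

# \w+ cannot cross the '-' before the @ or the '.' after it,
# so only the middle part of the address is matched.
match = re.search(r'\w+@\w+', text)
partial = match.group()
```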

Usage of square brackets

  • Used to indicate a set of characters, i.e. [abc] matches a or b or c
  • The codes \w, \s etc. work inside square brackets too, with the one exception that dot (.) just means a literal dot. For the e-mail problem, square brackets are an easy way to add '.' and '-' to the set of characters that can appear around the @, using the pattern r'[\w.-]+@[\w.-]+' to get the whole e-mail address.
>>> match = re.search(r'[\w.]+@[\w.]', 'This is the search for abc@domian example')
>>> match.group()
'abc@d'
>>> match = re.search(r'[\w.]+@[\w.]+', 'This is the search for abc@domian example')
>>> match.group()
'abc@domian'
>>> match = re.search(r'[\w.]+@[\w.]+', 'This is the search for abc@domian.com example')
>>> match.group()
'abc@domian.com'
>>> match = re.search(r'[\w.-]+@[\w.]+', 'This is the search for abc-def@domian.com example')
>>> match.group()
'abc-def@domian.com'
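
To see that a dot inside square brackets is literal, compare it with a dot outside them. This sketch uses invented sample strings and re.findall(), covered later in this post, to list every match.

```python
import re

sample = 'abc a.c a-c'   # hypothetical sample text

# Outside brackets, '.' matches any character except newline.
outside = re.findall(r'a.c', sample)

# Inside brackets, '.' matches only a literal dot.
inside = re.findall(r'a[.]c', sample)

# A dash placed last in the set is a literal dash.
literal_dash = re.findall(r'[\w.-]+', 'abc-def x.y')

# A dash between two characters forms a range instead: [a-c] is a, b or c.
range_only = re.findall(r'[a-c]', 'a-d')
```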

Usage Of Parentheses ()

  • Let you pick out parts of the matching text.
  • Do not change what the pattern matches.
  • Establish groups within the matched text, available as match.group(1), match.group(2), etc.
  • Plain match.group() still returns the whole matched text as usual.
For example,

>>> match = re.search(r'([\w.-]+)@([\w.-]+)', 'This is an example of group matching abc-def@gmail.com')
>>> match.group()    # The whole match
'abc-def@gmail.com'
>>> match.group(1)   # Text matched by the first group
'abc-def'
>>> match.group(2)   # Text matched by the second group
'gmail.com'
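
Groups are handy when wrapped in a small function. The following is a sketch only; split_email is a hypothetical helper name, and the pattern is the simple one used in this post, not a full e-mail validator.

```python
import re

def split_email(text):
    """Return (user, host) for the first e-mail-like match in text, or None."""
    match = re.search(r'([\w.-]+)@([\w.-]+)', text)
    if match is None:
        return None
    return match.group(1), match.group(2)

parts = split_email('This is an example of group matching abc-def@gmail.com')
nothing = split_email('no address here')
```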

findall Function
Finds all matches of a pattern in a text.

  • re.search() - Finds the first match for a pattern.
  • re.findall() - Finds all matches and returns them as a list of strings.
>>> match = re.findall(r'[\w.-]+@[\w.-]+', 'An example of findall abc-def@gmail.com, xyz-uvw@yahoo.com')
>>> match
['abc-def@gmail.com', 'xyz-uvw@yahoo.com']
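
One practical difference between the two functions is what they return when nothing matches. A short sketch, where the EMAIL constant and the sample strings are made up for illustration:

```python
import re

EMAIL = r'[\w.-]+@[\w.-]+'   # the simple pattern used in this post
text = 'abc-def@gmail.com, xyz-uvw@yahoo.com'

# findall() returns every match as a list of strings.
found = re.findall(EMAIL, text)

# With no match, findall() returns an empty list, not None.
empty = re.findall(EMAIL, 'no address here')

# search() stops at the first match only.
first_only = re.search(EMAIL, text).group()
```

Because findall() always returns a list, its result can be looped over directly without a None check.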

findall & Groups

  • () can be combined with findall.
  • If the pattern has two or more (), findall returns a list of tuples. Each tuple represents one match of the pattern, with one item per group.

Let's take an example to see this scenario:
>>> match = re.findall(r'([\w.-]+)@([\w.-]+)', 'An example of findall abc-def@gmail.com, xyz-uvw@yahoo.com')
>>> match                  # A list of tuples, one per match, with each group as a separate item
[('abc-def', 'gmail.com'), ('xyz-uvw', 'yahoo.com')]
>>> match[0]        
('abc-def', 'gmail.com')
>>> match[0][0]
'abc-def'
>>> match[0][1]
'gmail.com'
>>> match[1][0]
'xyz-uvw'
>>> match[1][1]
'yahoo.com'
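
Instead of indexing into the list, the tuples can be unpacked in a loop. The sketch below also shows that with exactly one group, findall() returns plain strings rather than tuples. The variable names are made up for illustration.

```python
import re

text = 'abc-def@gmail.com, xyz-uvw@yahoo.com'

# Two groups: each match becomes a (user, host) tuple.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)

# Unpack each tuple while looping.
hosts_by_user = {}
for user, host in pairs:
    hosts_by_user[user] = host

# One group: findall() returns a list of that group's text only.
hosts = re.findall(r'[\w.-]+@([\w.-]+)', text)
```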

Thanks
