Regular Expressions | Zehai Wang

Contents

1. Basic patterning
   1.1. Options
   1.2. Greedy and non-Greedy
2. Baby name Exercise

It has been a while since I started using regular expressions for NLP and text mining, so I decided to write a summary of the topic. The information can also be found in Google's Python Class materials.

Hopefully this gives you a useful overview of regular expressions for NLP work. Comments are welcome!

Basic patterning

It is a good habit to start pattern strings with ‘r’ to designate a Python ‘raw’ string, so that backslashes reach the regular expression engine unchanged.
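
To see why the raw prefix matters, here is a minimal sketch comparing a plain string with a raw string. The variable names and the sample text are only illustrative.

```python
import re

# In a plain string, '\b' is the single backspace character.
plain = '\bcat\b'
# In a raw string, r'\b' stays as backslash + 'b', which the regex
# engine reads as a word boundary.
raw = r'\bcat\b'

text = 'a cat sat'
plain_match = re.search(plain, text)  # looks for backspace, 'cat', backspace
raw_match = re.search(raw, text)      # looks for the whole word 'cat'

print(len('\b'), len(r'\b'))  # 1 2
print(plain_match)            # None
print(raw_match.group())      # cat
```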

Meta-characters, which do not match themselves: . ^ $ * + ? { } [ ] \ | ( )

Sign	Match usage
“.” (period)	any single character except newline ‘\n’
“\w” (lower)	“word” character: a letter, digit, or underscore, [a-zA-Z0-9_]
“\W” (upper)	any non-word character
“\b”	boundary between word and non-word
“\s” (lower)	a single whitespace character: space, newline, return, tab, form feed, [ \n\r\t\f]
“\t”, “\n”, “\r”	tab, newline, return
“\d”	decimal digit, [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
“^” = start, “$” = end	match the start or end of the string
“\”	inhibit the ‘specialness’ of a character: use “\.” to match a period or “\\” to match a backslash
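
The entries in the table can be tried out on a short string. The following is a small sketch; the sample text and variable names are made up for illustration.

```python
import re

text = 'Order #42: 3 apples, 10 pears.\nTotal = 13'

digits = re.findall(r'\d', text)                 # every single decimal digit
words = re.findall(r'\w+', text)                 # runs of word characters
first_word = re.search(r'^\w+', text).group()    # '^' anchors at the start of the string
last_number = re.search(r'\d+$', text).group()   # '$' anchors at the end of the string
literal_dot = re.findall(r'\.', text)            # escaped dot matches only '.'
one_then_any = re.findall(r'1.', text)           # '1' followed by any non-newline character
lone_digit = re.findall(r'\b\d\b', text)         # a digit with a word boundary on both sides

print(digits)        # ['4', '2', '3', '1', '0', '1', '3']
print(words)         # ['Order', '42', '3', 'apples', '10', 'pears', 'Total', '13']
print(first_word)    # Order
print(last_number)   # 13
print(literal_dot)   # ['.']
print(one_then_any)  # ['10', '13']
print(lone_digit)    # ['3']
```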

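‘re.search’ returns the first match of the pattern in the string, or None if there is none. Unlike ‘re.match’, it does not require the match to begin at the start of the string.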

import re

# match three consecutive digits
match = re.search(r'\d\d\d', '123456')

if match:
    print('found', match.group())

output:
    found 123
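
The difference between re.search and re.match is easy to miss, so here is a short sketch of it. The sample string is an assumption chosen so that the digits do not come first.

```python
import re

text = 'abc123456'

# re.match only succeeds if the pattern matches at the very start of the string.
at_start = re.match(r'\d\d\d', text)
# re.search scans forward and returns the first location where the pattern matches.
anywhere = re.search(r'\d\d\d', text)

print(at_start)          # None
print(anywhere.group())  # 123
print(anywhere.span())   # (3, 6)
```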

The preferred way to express “Repetition”:

# find the leftmost run of digits
match = re.search(r'\d+', '123456acc123')
if match:
    print('found', match.group())

output:
    found 123456

# find every run of digits
match = re.findall(r'\d+', '123456acc123')
if match:
    print('found', match)

output:
    found ['123456', '123']

Repetition
- “+” – 1 or more occurrences of the pattern to its left
- “*” – 0 or more occurrences of the pattern to its left
- “?” – 0 or 1 occurrences of the pattern to its left
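
A few small cases make the three repetition operators concrete. This is only a sketch with made-up sample strings.

```python
import re

# '+' needs at least one occurrence and then takes as many as it can.
plus = re.search(r'pi+', 'piiig').group()
# The search finds the leftmost match first, then extends it as far as possible.
leftmost = re.search(r'i+', 'piigiiii').group()
# '*' allows zero occurrences, so the whitespace between digits is optional.
star = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx').group()
star_zero = re.search(r'ab*', 'ac').group()
# '?' makes the preceding character optional.
optional = re.findall(r'colou?r', 'color and colour')

print(plus)       # piii
print(leftmost)   # ii
print(star)       # 1 2   3
print(star_zero)  # a
print(optional)   # ['color', 'colour']
```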

findall with files

# Open the file; 'with' closes it again automatically
with open('test.txt', 'r') as f:
    # Feed the file text into findall(); it returns a list of all the found strings
    strings = re.findall(r'some pattern', f.read())
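
For a version that can be run as-is, the sketch below first writes a small throwaway file and then feeds its contents to findall(). The file contents and the phone-number-like pattern are assumptions made up for the example.

```python
import os
import re
import tempfile

# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('call 555-1234 or 555-9876\n')
    path = tmp.name

# 'with' closes the file automatically, even if an error occurs.
with open(path, 'r') as f:
    numbers = re.findall(r'\d+-\d+', f.read())

os.remove(path)  # clean up the temporary file
print(numbers)   # ['555-1234', '555-9876']
```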

Square brackets

Square brackets “[ ]” indicate a set of characters, any one of which may match, so [abc] matches ‘a’ or ‘b’ or ‘c’. Inside “[ ]” the dot ‘.’ just means a literal dot.

# extract all email address in the file
f = open ('test.txt','r')
emails = re.findall(r'[\w.-]+@[\w.-]+', f.read())

if emails:
    print ('found', emails)

output:
    found ['simple@example.com', 'very.common@example.com', 'symbol@example.com', 'other.email-with-hyphen@example.com']
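
The same idea can be checked on an in-memory string instead of a file. The sketch below uses a made-up sentence to show why the character set is needed and how the dot behaves inside brackets.

```python
import re

text = 'purple alice-b@google.com monkey dishwasher'

# \w+ alone stops at '-' and '.', so the address is cut short.
short = re.search(r'\w+@\w+', text).group()
# [\w.-] accepts a word character, a literal dot, or a dash.
full = re.search(r'[\w.-]+@[\w.-]+', text).group()

# Outside brackets the dot matches any character; inside it matches only '.'.
any_char = re.findall(r'.', 'a.b')
dot_only = re.findall(r'[.]', 'a.b')

print(short)     # b@google
print(full)      # alice-b@google.com
print(any_char)  # ['a', '.', 'b']
print(dot_only)  # ['.']
```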

Group extraction

The “group” feature of a regular expression allows you to pick out parts of the matching text. Suppose for the emails problem that we want to extract the username and host separately. To do this, add parentheses ( ) around the username and host in the pattern, like this: r’([\w.-]+)@([\w.-]+)’. In this case, the parentheses do not change what the pattern will match; instead they establish logical “groups” inside the match text. On a successful search, match.group(1) is the match text corresponding to the 1st left parenthesis, and match.group(2) is the text corresponding to the 2nd left parenthesis. The plain match.group() is still the whole match text as usual.

# extract user name and the host site as tuples
f = open ('test.txt','r')
emails = re.findall(r'([\w.-]+)@([\w.-]+)', f.read())

if emails:
    for email in emails:
        print ('username:', email[0],'\n'
            'host:', email[1],'\n')

output:
    username: simple 
    host: example.com 

    username: very.common 
    host: example.com 

    username: symbol 
    host: example.com 

    username: other.email-with-hyphen 
    host: example.com
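
The match.group(1) and match.group(2) calls described above apply to a single match object from re.search, whereas findall() returns the groups as tuples. A minimal sketch with made-up addresses:

```python
import re

text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', text)
if match:
    print(match.group())   # alice-b@google.com (the whole match)
    print(match.group(1))  # alice-b (the username)
    print(match.group(2))  # google.com (the host)

# With two groups in the pattern, findall() returns a list of 2-tuples.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', 'a@x.org, b.c@y.net')
print(pairs)  # [('a', 'x.org'), ('b.c', 'y.net')]
```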

Options

The option flag is added as an extra argument to the search() or findall() etc., e.g. re.search(pat, str, re.IGNORECASE).

IGNORECASE – ignore upper/lowercase differences for matching, so ‘a’ matches both ‘a’ and ‘A’.
DOTALL – allow dot (.) to match newline; normally it matches anything but newline. This can trip you up: you think .* matches everything, but by default it does not go past the end of a line. Note that \s (whitespace) includes newlines, so if you want to match a run of whitespace that may include a newline, you can just use \s*
MULTILINE – within a string made of many lines, allow ^ and $ to match the start and end of each line. Normally ^/$ would just match the start and end of the whole string.
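
The effect of each flag can be seen on a three-line string. This is a small sketch; the sample text and variable names are assumptions.

```python
import re

text = 'First line\nsecond LINE\nthird line'

# IGNORECASE: 'line' also matches 'LINE'.
ignore_case = re.findall(r'line', text, re.IGNORECASE)

# DOTALL: without it, '.*' cannot cross the newline.
no_dotall = re.search(r'First.*second', text)
with_dotall = re.search(r'First.*second', text, re.DOTALL)

# MULTILINE: '^' and '$' also match at the start and end of each line.
start_default = re.findall(r'^\w+', text)
start_multiline = re.findall(r'^\w+', text, re.MULTILINE)
end_multiline = re.findall(r'\w+$', text, re.MULTILINE)

print(ignore_case)            # ['line', 'LINE', 'line']
print(no_dotall)              # None
print(with_dotall.group())    # prints 'First line' and 'second' on two lines
print(start_default)          # ['First']
print(start_multiline)        # ['First', 'second', 'third']
print(end_multiline)          # ['line', 'LINE', 'line']
```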

Greedy and non-Greedy

This is an optional section which shows a more advanced regular expression technique not needed for the exercises.

Suppose you have text with tags in it: <b>foo</b> and <i>so on</i>

Suppose you are trying to match each tag with the pattern ‘(<.*>)’ – what does it match first?
The result is a little surprising, but the greedy aspect of the .* causes it to match the whole ‘<b>foo</b> and <i>so on</i>’ as one big match. The problem is that the .* goes as far as it can, instead of stopping at the first > (aka it is “greedy”).

There is an extension to regular expressions where you add a ? at the end, such as .*? or .+?, changing them to be non-greedy. Now they stop as soon as they can. So the pattern ‘(<.*?>)’ will get just ‘<b>’ as the first match, and ‘</b>’ as the second match, and so on getting each <..> pair in turn. The style is typically that you use a .*?, and then immediately to its right look for some concrete marker (> in this case) that forces the end of the .*? run.
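
Greedy and non-greedy matching can be compared side by side. The sketch below assumes a sample text with two pairs of tags, '<b>foo</b> and <i>so on</i>'.

```python
import re

text = '<b>foo</b> and <i>so on</i>'

# Greedy: '.*' runs to the last '>' it can reach.
greedy = re.findall(r'<.*>', text)
# Non-greedy: '.*?' stops at the first '>' it meets.
non_greedy = re.findall(r'<.*?>', text)

print(greedy)      # ['<b>foo</b> and <i>so on</i>']
print(non_greedy)  # ['<b>', '</b>', '<i>', '</i>']
```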

Baby name Exercise

The files for the exercise are in google-python-exercises.zip. My solution follows:

#!/usr/bin/python
# Copyright 2010 Google Inc.
# Licensed under the Apache License, Version 2.0
# http://www.apache.org/licenses/LICENSE-2.0

# Google's Python Class
# http://code.google.com/edu/languages/google-python-class/

# solution from zehai wang 07.2018 at RPI
# copy right reserved
# Licensed under the Apache License, Version 2.0
import sys
import re

"""Baby Names exercise

Define the extract_names() function below and change main()
to call it.

For writing regex, it's nice to include a copy of the target
text for inspiration.

Here's what the html looks like in the baby.html files:
...
<h3 align="center">Popularity in 1990</h3>
....
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td>
<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>
...

Suggested milestones for incremental development:
 -Extract the year and print it
 -Extract the names and rank numbers and just print them
 -Get the names data into a dict and print it
 -Build the [year, 'name rank', ... ] list and print it
 -Fix main() to use the extract_names list
"""
def extract_names(filename):
  """
  Given a file name for baby.html, returns a list starting with the year string
  followed by the name-rank strings in alphabetical order.
  ['2006', 'Aaliyah 91', 'Aaron 57', 'Abagail 895', ...]
  """
  ###------- code start here ------------------###
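  with open(filename) as f:
    text = f.read()  # read once and reuse the text for both searches
  # The year comes from the heading, e.g. 'Popularity in 1990'
  name_list = re.findall(r'Popularity\sin\s+(\d\d\d\d)</h3>', text)
  # Each table row holds (rank, boy name, girl name)
  rows = re.findall(r'<td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>', text)
  name_dic = {}
  for rank, boy, girl in rows:
    # The sample rows are in rank order, so keep the first rank seen for a name
    name_dic.setdefault(boy, rank)
    name_dic.setdefault(girl, rank)
  for name in sorted(name_dic):
    name_list.append('%s %s' % (name, name_dic[name]))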
  ###------- code end here ------------------###
  return name_list

def main():
  # This command-line parsing code is provided.
  # Make a list of command line arguments, omitting the [0] element
  # which is the script itself.
  args = sys.argv[1:]

  if not args:
    print ('usage: [--summaryfile] file [file ...]')
    sys.exit(1)

  # Notice the summary flag and remove it from args if it is present.
  summary = False
  if args[0] == '--summaryfile':
    summary = True
    del args[0]
  
  ###------- code start here ------------------###
  if not summary:
    for file in args:
        # matches = extract_names(file)
        names = extract_names(file)
        text = '\n'.join(names) + '\n'
        print (text)
  else:
    for file in args:
        f = open('%s.summary'%file,'w')
        names = extract_names(file)
        text = '\n'.join(names) + '\n'
        f.write(text)
        f.close()
  ###------- code end here ------------------###


  # For each filename, get the names, then either print the text output
  # or write it to a summary file
if __name__ == '__main__':
  main()
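
The patterns can be checked without downloading anything by running them on the sample HTML quoted in the docstring. This is a minimal sketch, not the exercise's reference solution. The names html, ranks and result are illustrative, and capturing both the boy and the girl name in each row is an assumption about what the exercise expects.

```python
import re

# Sample rows copied from the docstring of the exercise.
html = '''<h3 align="center">Popularity in 1990</h3>
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td>
<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>
'''

# The year is the four digits after 'Popularity in'.
year = re.search(r'Popularity\sin\s+(\d\d\d\d)', html).group(1)
# Three groups per row: rank, boy name, girl name.
rows = re.findall(r'<td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>', html)

ranks = {}
for rank, boy, girl in rows:
    # Keep the first rank seen for each name.
    ranks.setdefault(boy, rank)
    ranks.setdefault(girl, rank)

result = [year] + ['%s %s' % (name, ranks[name]) for name in sorted(ranks)]
print(result)
# ['1990', 'Ashley 2', 'Brittany 3', 'Christopher 2', 'Jessica 1', 'Matthew 3', 'Michael 1']
```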
