Regular Expressions
It has been a while since I started using regular expressions for NLP and text mining, so I decided to write a summary of the topic. The same information can also be found in Google's learning materials.
I hope this gives you a useful overview of regular expressions for NLP work. Comments are welcome!
Basic patterning
It is a good habit to start pattern strings with ‘r’ to designate a Python ‘raw’ string, so that backslashes reach the regular expression engine unchanged.
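As a minimal sketch of why the raw prefix matters (the sample text and variable names are made up for illustration), compare the same pattern written with and without ‘r’:

```python
import re

# Without the r prefix, Python itself turns "\b" into a backspace character
# before the regex engine ever sees it.
plain = '\bcat\b'
raw = r'\bcat\b'

text = 'the cat sat'  # hypothetical sample text

print(len(plain), len(raw))           # 5 7: the raw string keeps both backslashes
print(re.search(plain, text))         # None: the pattern looks for backspace characters
print(re.search(raw, text).group())   # cat: \b is treated as a word boundary
```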
- Meta-characters that do not match themselves: . ^ $ * + ? { [ ] \ | ( )
Sign | Match usage |
---|---|
“.” (period) | any single character except newline ‘\n’ |
“\w” (lowercase) | a “word” character: a letter, digit, or underscore, [a-zA-Z0-9_] |
“\W” (uppercase) | any non-word character |
“\b” | boundary between word and non-word |
“\s” (lowercase) | a single whitespace character: space, newline, return, tab, form feed, [ \n\r\t\f] |
“\t”, “\n”, “\r” | tab, newline, return |
“\d” | decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s) |
“^” = start, “$” = end | match the start or end of the string |
“\” | inhibit the ‘specialness’ of the character that follows: “\.” matches a literal period, “\\” a literal backslash |
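A few of the patterns from the table can be tried out as follows. This is only a sketch; the sample string and the variable names are assumptions made up for illustration.

```python
import re

text = 'Order 66 shipped to unit_7.\nDone'  # made-up sample string

digits = re.findall(r'\d', text)             # every decimal digit
words = re.findall(r'\w+', text)             # runs of word characters
first_line = re.search(r'.+', text).group()  # '.' does not cross the newline
literal_dot = re.search(r'7\.', text).group()  # '\.' matches a real period

print(digits)       # ['6', '6', '7']
print(words)        # ['Order', '66', 'shipped', 'to', 'unit_7', 'Done']
print(first_line)   # Order 66 shipped to unit_7.
print(literal_dot)  # 7.
```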
- ‘re.search’ returns the first (leftmost) match found anywhere in the string, while ‘re.match’ only matches at the very beginning of the string
1 | # match continuous three numbers |
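A minimal sketch of matching three consecutive digits, assuming a made-up sample string. It also shows that re.search scans the whole string while re.match only tries the very first position:

```python
import re

text = 'call 555 or 1234 today'  # made-up sample string

# re.search scans forward and returns the first place the pattern fits.
found = re.search(r'\d\d\d', text)
if found:
    print(found.group())  # 555

# re.match only tries position 0, and the string starts with a letter.
anchored = re.match(r'\d\d\d', text)
print(anchored)  # None
```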
Correct way (preferred expression) for “Repetition”
1 | # find the left most repetition of digits |
Repetition
- “+” – 1 or more occurrences of the pattern to its left
- “*” – 0 or more occurrences of the pattern to its left
- “?” – 0 or 1 occurrences of the pattern to its left
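The three repetition operators can be sketched like this. The sample strings are invented for illustration; in each case the search first finds the leftmost match and then extends the repetition as far as it can.

```python
import re

# '+' needs at least one digit; the leftmost run is taken in full.
plus = re.search(r'\d+', 'id 12345 and 678').group()

# '*' allows zero occurrences, so the spaces between digits are optional.
star = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx').group()

# '?' makes the 'u' optional, so both spellings match.
optional = re.findall(r'colou?r', 'color colour')

# Leftmost wins: the first run of i's is returned even though a longer one follows.
leftmost = re.search(r'i+', 'piigiiii').group()

print(plus)      # 12345
print(star)      # 1 2   3
print(optional)  # ['color', 'colour']
print(leftmost)  # ii
```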
findall with files
1 | # Open file |
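A minimal sketch of running findall over a file's contents. So that it runs on its own, it first writes a small temporary file; the file name and its contents are assumptions, not part of the exercise.

```python
import re
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical sample file, created only to keep the sketch self-contained.
    sample_path = Path(tmp_dir) / 'sample.txt'
    sample_path.write_text('alpha 10\nbeta 20\ngamma 30\n', encoding='utf-8')

    # Open the file, read all of its text, and let findall collect every match.
    with open(sample_path, encoding='utf-8') as f:
        numbers = re.findall(r'\d+', f.read())

print(numbers)  # ['10', '20', '30']
```

Reading the whole file with f.read() and calling findall once is convenient for small files; for very large files, looping over lines may be the better choice.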
Square brackets
Square brackets “[ ]” indicate a set of characters, any one of which may match, so [abc] matches ‘a’ or ‘b’ or ‘c’. Inside “[ ]”, the dot ‘.’ simply means a literal dot.
1 | # extract all email address in the file |
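As a sketch of extracting email addresses with a character set (the sample text is made up, and the pattern is deliberately simple rather than a full email validator):

```python
import re

text = 'purple alice-b@google.com, blah monkey bob@abc.com blah'  # made-up sample

# \w alone misses the dash and the dots in an address.
too_narrow = re.findall(r'\w+@\w+', text)

# A set of word characters, dots, and dashes picks up the whole address.
emails = re.findall(r'[\w.-]+@[\w.-]+', text)

print(too_narrow)  # ['b@google', 'bob@abc']
print(emails)      # ['alice-b@google.com', 'bob@abc.com']
```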
Group extraction
The “group” feature of a regular expression allows you to pick out parts of the matching text. Suppose for the emails problem that we want to extract the username and host separately. To do this, add parentheses ( ) around the username and host in the pattern, like this: r'([\w.-]+)@([\w.-]+)'. In this case, the parentheses do not change what the pattern will match; instead they establish logical “groups” inside the match text. On a successful search, match.group(1) is the match text corresponding to the 1st left parenthesis, and match.group(2) is the text corresponding to the 2nd left parenthesis. The plain match.group() is still the whole match text as usual.
1 | # extract user name and the host site as tuples |
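A minimal sketch of group extraction, using a made-up sample string. With groups in the pattern, re.search exposes each part through group(n), and re.findall returns a list of tuples:

```python
import re

text = 'purple alice-b@google.com, blah monkey bob@abc.com blah'  # made-up sample
pattern = r'([\w.-]+)@([\w.-]+)'

match = re.search(pattern, text)
if match:
    print(match.group())   # alice-b@google.com (the whole match)
    print(match.group(1))  # alice-b (username)
    print(match.group(2))  # google.com (host)

# findall returns one (username, host) tuple per match.
pairs = re.findall(pattern, text)
print(pairs)  # [('alice-b', 'google.com'), ('bob', 'abc.com')]
```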
Options
The option flag is added as an extra argument to the search() or findall() etc., e.g. re.search(pat, str, re.IGNORECASE).
- IGNORECASE – ignore upper/lowercase differences for matching, so ‘a’ matches both ‘a’ and ‘A’.
- DOTALL – allow dot (.) to match newline; normally it matches anything but newline. This can trip you up: you think `.*` matches everything, but by default it does not go past the end of a line. Note that \s (whitespace) includes newlines, so if you want to match a run of whitespace that may include a newline, you can just use `\s*`.
- MULTILINE – within a string made of many lines, allow ^ and $ to match the start and end of each line. Normally ^/$ would just match the start and end of the whole string.
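The effect of each flag can be sketched as follows, assuming a made-up three-line sample string:

```python
import re

text = 'Hello\nworld\nHELLO again'  # made-up sample with two newlines

# IGNORECASE: 'hello' matches regardless of letter case.
any_case = re.findall(r'hello', text, re.IGNORECASE)

# DOTALL: without it, '.*' cannot cross the newline between the two words.
no_dotall = re.search(r'Hello.*world', text)
with_dotall = re.search(r'Hello.*world', text, re.DOTALL)

# MULTILINE: '^' matches at the start of every line, not just the string.
first_only = re.findall(r'^\w+', text)
each_line = re.findall(r'^\w+', text, re.MULTILINE)

print(any_case)                   # ['Hello', 'HELLO']
print(no_dotall)                  # None
print(repr(with_dotall.group()))  # 'Hello\nworld'
print(first_only)                 # ['Hello']
print(each_line)                  # ['Hello', 'world', 'HELLO']
```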
Greedy and non-Greedy
This optional section shows a more advanced regular expression technique that is not needed for the exercises.
Suppose you have text with tags in it: `<b>foo</b> and <i>so on</i>`
Suppose you are trying to match each tag with the pattern `(<.*>)`. What does it match first?
The result is a little surprising, but the greedy aspect of the `.*` causes it to match the whole `<b>foo</b> and <i>so on</i>` as one big match. The problem is that the `.*` goes as far as it can, instead of stopping at the first > (that is, it is “greedy”). There is an extension to regular expressions where you add a ? at the end, such as `.*?` or `.+?`, changing them to be non-greedy. Now they stop as soon as they can. So the pattern `(<.*?>)` will get just `<b>` as the first match, and `</b>` as the second match, and so on, getting each <..> pair in turn. The style is typically that you use a `.*?`, and then immediately to its right look for some concrete marker (> in this case) that forces the end of the `.*?` run.
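The difference between the greedy and non-greedy forms can be sketched like this. The tagged sample string is an assumption chosen for illustration:

```python
import re

text = '<b>foo</b> and <i>so on</i>'  # assumed sample text with tags

# Greedy: '.*' runs to the last '>' in the string, giving one big match.
greedy = re.findall(r'(<.*>)', text)

# Non-greedy: '.*?' stops at the first '>' it can, giving one match per tag.
non_greedy = re.findall(r'(<.*?>)', text)

print(greedy)      # ['<b>foo</b> and <i>so on</i>']
print(non_greedy)  # ['<b>', '</b>', '<i>', '</i>']
```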
Baby name Exercise
File for the exercise can be downloaded from google-python-exercises.zip. Attached is the solution:
1 | #!/usr/bin/python |