Lecture 26 - Regular Expressions

Lecture Date: Friday, March 25

The string libraries work well when you read a body of text from top to bottom and make decisions based on what you find. They are also good for parsing data into another data structure.

But sometimes you just want to find all of one particular kind of data. Or you want to verify that some piece of text follows a very particular format. For those instances, we can use regular expressions.

A regular expression is a way of describing a pattern in text. Using a defined set of tokens, we build a pattern, and only text that fits the pattern is accepted and returned.
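
To make the two uses concrete (finding every instance of one kind of data, and verifying that a piece of text follows a format), here is a minimal sketch using Python's re module. The five-digit ZIP code pattern and the names ZIP_PATTERN, find_all_zip_codes, and is_zip_code are made up for illustration; they are not part of the lecture code.

```python
import re

# Hypothetical pattern for illustration: five digits in a row.
ZIP_PATTERN = re.compile(r"[0-9][0-9][0-9][0-9][0-9]")

def find_all_zip_codes(text):
    # findall() collects every non-overlapping match in the text.
    return ZIP_PATTERN.findall(text)

def is_zip_code(text):
    # fullmatch() succeeds only if the ENTIRE string fits the pattern.
    return ZIP_PATTERN.fullmatch(text) is not None

print(find_all_zip_codes("Send it to 22904 or 22903."))
print(is_zip_code("22904"))
print(is_zip_code("22904x"))
```

Searching is forgiving about what surrounds a match, while verifying is not, which is why the second function uses fullmatch() instead of search() or findall().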

Here’s a starter list of tokens that you might use:

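[abc] - any one of those three characters
[a-z] - any one lowercase letter
[a-z0-9] - any one lowercase letter or digit
. - any one character (except a newline)
\. - an actual period
* - zero or more of the preceding item
? - zero or one of the preceding item
+ - one or more of the preceding item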
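
If you would rather try the tokens in Python than in a browser, the sketch below checks each one against a small piece of text. The helper name matches and the EXAMPLES table are assumptions made for this illustration. The helper uses re.fullmatch, so the whole string has to fit the pattern.

```python
import re

def matches(pattern, text):
    # True only if the whole of text fits the pattern.
    return re.fullmatch(pattern, text) is not None

# Each entry is (pattern, text to test, whether we expect a match).
EXAMPLES = [
    (r"[abc]", "b", True),      # one of a, b, or c
    (r"[abc]", "d", False),
    (r"[a-z]", "q", True),      # one lowercase letter
    (r"[a-z]", "Q", False),
    (r"[a-z0-9]", "7", True),   # lowercase letter or digit
    (r".", "!", True),          # any one character
    (r"\.", "!", False),        # only an actual period
    (r"\.", ".", True),
    (r"[a-z]*", "", True),      # zero or more letters
    (r"[a-z]?", "ab", False),   # at most one letter
    (r"[a-z]+", "", False),     # at least one letter
    (r"[a-z]+", "abc", True),
]

for pattern, text, expected in EXAMPLES:
    print(pattern, repr(text), matches(pattern, text), "expected:", expected)
```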

Let’s test some regular expressions at https://regex101.com/!

# Mark Sherriff (mss2x)

import re
import urllib.request

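def is_valid_password(password):
    # A valid password has at least 8 characters and includes a digit,
    # a lowercase letter, and an uppercase letter. Each (?=...) is a
    # lookahead: it checks a condition without consuming any text.
    regex = re.compile(r"^(?=.{8,})(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*$")
    return regex.search(password) is not None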

def find_possible_names(text):
    # search() returns only the FIRST match, or None if there is none.
    regex = re.compile(r"[A-Z][a-z]*")
    results = regex.search(text)
    if results is not None:
        return results.group()
    return None

def find_all_possible_names(text):
    # findall() returns a list of every non-overlapping match
    # (an empty list if nothing matches).
    regex = re.compile(r"[A-Z][a-z]*")
    return regex.findall(text)

def find_all_phone_numbers(text):
    # Matches numbers shaped like 434-982-2688.
    # {3} means "exactly 3 of the preceding item".
    regex = re.compile(r"[0-9]{3}-[0-9]{3}-[0-9]{4}")
    return regex.findall(text)

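def find_phone_numbers_on_website(url):
    numbers = []

    # The with block closes the connection once we finish reading.
    with urllib.request.urlopen(url) as stream:
        for line in stream:
            # errors="replace" keeps a stray non-UTF-8 byte from crashing us.
            decoded = line.decode("UTF-8", errors="replace").strip()
            numbers += find_all_phone_numbers(decoded)
    return numbers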

print(is_valid_password("password"))
print(is_valid_password("abCd723223"))
print(find_possible_names("some text before Mark and more"))
print(find_possible_names("some text before Mark and Bob stuff"))
print(find_all_possible_names("some text before Mark and Bob stuff"))
print(find_all_phone_numbers("askljaslkdf 434-982-2688 alksdjfla"))
print(find_phone_numbers_on_website("http://www.archives.gov/about/organization/telephone-list.html"))

Where might we look for some emails to harvest…
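
As a starting point for that question, here is a rough sketch of an email finder in the same style as find_all_phone_numbers. The pattern is a deliberate simplification and an assumption of this sketch, not a complete definition of a valid address, and the names EMAIL_PATTERN and find_all_emails are made up for illustration.

```python
import re

# Simplified, hypothetical pattern: some letters, digits, dots, underscores,
# or hyphens, then @, then a domain with at least one dot in it.
# (?:...) groups without capturing, so findall() returns the whole match.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

def find_all_emails(text):
    # Returns every non-overlapping match, or an empty list.
    return EMAIL_PATTERN.findall(text)

print(find_all_emails("Write to alice@example.com or bob@mail.example.org."))
```

Each line of a page could be passed through find_all_emails the same way find_phone_numbers_on_website passes lines to find_all_phone_numbers.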