Lecture Date: Friday, October 21

Parsing a body of text with the string methods works well when you want to read information from top to bottom and make decisions based on it. It's also good for parsing data into another data structure.

But sometimes you just want to find all of one particular kind of data. Or you want to verify that some piece of text follows a very particular format. For those instances, we can use regular expressions.

A regular expression is a way to describe a pattern in text. Using a defined set of tokens, we build a pattern that a piece of text must match before it is accepted and returned.
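As a minimal sketch of the two uses mentioned above (verifying a format and finding every piece of one kind of data), the example below uses the pattern [0-9]+, which means "one or more digits". The sample strings are made up for illustration.

```python
import re

# Pattern: one or more digits. The sample strings are invented for illustration.
digits = re.compile(r"[0-9]+")

# Verifying a format: fullmatch succeeds only if the whole string fits the pattern.
print(digits.fullmatch("2016") is not None)       # True
print(digits.fullmatch("year 2016") is not None)  # False

# Finding data: findall returns every non-overlapping match in the text.
print(digits.findall("Room 120, Friday, October 21"))  # ['120', '21']
```

Note that search, used later in these notes, looks for the pattern anywhere in the text, while fullmatch requires the entire string to match.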

Here's a starter list of tokens that you might use:

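[abc] - any one of those three characters
[a-z] - any one lowercase letter
[a-z0-9] - any one lowercase letter or digit
. - any one character (except a newline, by default)
\. - an actual period
* - 0 to many of the preceding item
? - 0 or 1 of the preceding item
+ - 1 to many of the preceding item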
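To see the tokens in action, here is a short sketch that runs each one through re.findall. The sample strings are invented for illustration; they are not from the lecture.

```python
import re

# Sample strings below are invented for illustration.
print(re.findall(r"[abc]", "back"))            # ['b', 'a', 'c']
print(re.findall(r"[a-z]+", "Hello World"))    # ['ello', 'orld']
print(re.findall(r"[a-z0-9]+", "room 120B"))   # ['room', '120']
print(re.findall(r"c.t", "cat cot cut"))       # ['cat', 'cot', 'cut']
print(re.findall(r"3.14", "3.14 and 3514"))    # ['3.14', '3514']
print(re.findall(r"3\.14", "3.14 and 3514"))   # ['3.14']
print(re.findall(r"ab*", "a ab abbb"))         # ['a', 'ab', 'abbb']
print(re.findall(r"colou?r", "color colour"))  # ['color', 'colour']
print(re.findall(r"ab+", "a ab abbb"))         # ['ab', 'abbb']
```

The 3.14 pair shows why the backslash matters: an unescaped . matches any character, so 3514 also matches.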

Let's test some regular expressions at https://regex101.com/!

What are some ways we can use regular expressions?

import re
import urllib.request

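def is_valid_password(password):
    # Each (?=...) lookahead checks one rule without consuming any text:
    # at least 8 characters, a digit, a lowercase letter, an uppercase letter.
    regex = re.compile(r"^(?=.{8,})(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*$")
    return regex.search(password) is not None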

def find_possible_names(text):
    # search returns only the first match, or None if there is none.
    regex = re.compile(r"[A-Z][a-z]*")
    results = regex.search(text)
    if results is not None:
        return results.group()
    return None

def find_all_possible_names(text):
    # findall returns a list of every match (an empty list if there are none).
    regex = re.compile(r"[A-Z][a-z]*")
    return regex.findall(text)

def find_all_phone_numbers(text):
    # Matches numbers written as ###-###-####.
    regex = re.compile(r"[0-9]{3}-[0-9]{3}-[0-9]{4}")
    return regex.findall(text)

def find_phone_numbers_on_website(url):
    numbers = []

    # The with block closes the connection when we are done reading.
    with urllib.request.urlopen(url) as stream:
        for line in stream:
            decoded = line.decode("UTF-8").strip()
            numbers += find_all_phone_numbers(decoded)
    return numbers

print(is_valid_password("password"))
print(is_valid_password("abCd723223"))
print(find_possible_names("some text before Mark and more"))
print(find_possible_names("some text before Mark and Bob stuff"))
print(find_all_possible_names("some text before Mark and Bob stuff"))
print(find_all_phone_numbers("askljaslkdf 434-982-2688 alksdjfla"))
print(find_phone_numbers_on_website("http://www.archives.gov/about/organization/telephone-list.html"))

Can we improve our name finder for Alice in Wonderland?

import urllib.request
import re

url = "http://cs1110.cs.virginia.edu/alice.txt"
possible_names = {}

# Compile the pattern once, outside the loop.
regex = re.compile(r"[A-Z][a-z]+")

with urllib.request.urlopen(url) as stream:
    for line in stream:
        decoded = line.decode("UTF-8").strip()
        for result in regex.findall(decoded):
            # Count how many times each capitalized word appears.
            possible_names[result] = possible_names.get(result, 0) + 1

print(possible_names)

for item in possible_names:
    if possible_names[item] > 30:
        print(item, possible_names[item])
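One possible answer to the question of improving the name finder, offered as a sketch and not as the lecture's solution: a capitalized word at the start of a sentence ("The", "Then") is usually not a name, so we can count only capitalized words that appear mid-sentence. The function name count_mid_sentence_names and the sample text are hypothetical, and the sketch works on a string so that it runs without a network connection.

```python
import re
from collections import Counter

def count_mid_sentence_names(text):
    # Hypothetical helper, not from the lecture.
    # (?<=[a-z,;] ) is a lookbehind: the capitalized word must come right after
    # a lowercase letter, comma, or semicolon plus a space, so it cannot be the
    # first word of a sentence.
    regex = re.compile(r"(?<=[a-z,;] )[A-Z][a-z]+")
    return Counter(regex.findall(text))

# Made-up sample text for illustration.
sample = ("Alice was beginning to get very tired. The Rabbit looked at Alice, "
          "and Alice ran. Then the Queen shouted at Alice.")

counts = count_mid_sentence_names(sample)
print(counts.most_common())  # [('Alice', 3), ('Rabbit', 1), ('Queen', 1)]
```

The trade-off is that a real name is missed whenever it starts a sentence (the first "Alice" in the sample is not counted), so the totals are undercounts. In a long text the true names should still rise to the top.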

Where might we look for some emails to harvest...