
Lecture Date: Monday, October 24

We start by reviewing regular expressions with a few more examples. One good use for regular expressions is password validation:

^.{4,8}$ - Any set of characters, between 4 and 8 characters long.

^[a-zA-Z]\w{3,14}$ - Must start with a letter, followed by 3 to 14 word characters (letters, digits, or underscores), for a total length of 4 to 15 characters.
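To try the two patterns above in Python, here is a minimal sketch using the standard re module. The helper name is_valid and the sample passwords are made up for illustration; fullmatch is used so that the entire string has to match the pattern.

```python
import re

# The two patterns from the notes, compiled once so they can be reused.
ANY_4_TO_8 = re.compile(r"^.{4,8}$")
LETTER_THEN_WORD = re.compile(r"^[a-zA-Z]\w{3,14}$")


def is_valid(pattern, password):
    """Return True if the entire password matches the compiled pattern."""
    # fullmatch succeeds only when the pattern covers the whole string,
    # so a trailing newline cannot slip past the $ anchor.
    return pattern.fullmatch(password) is not None


# Hypothetical sample passwords to experiment with.
for candidate in ["abc", "abc123", "1abcd", "a_b2"]:
    print(candidate,
          is_valid(ANY_4_TO_8, candidate),
          is_valid(LETTER_THEN_WORD, candidate))
```

Changing the numbers inside the braces is the easiest way to experiment with different length rules.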

Reading plain text is one thing; reading an HTML web page is a bit trickier. We could ignore the HTML markup and just grab the text we want. But what if the content we want is identified by the HTML tags themselves?

BeautifulSoup is an HTML parsing library. It lets you search an HTML document for particular tags and pull out their contents.

BeautifulSoup4 Docs - http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Using Beautiful Soup:

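# First install BeautifulSoup4, for example with: pip install beautifulsoup4

import urllib.request
import bs4

# Download the raw HTML of the test page
web = urllib.request.urlopen("https://cs1110.cs.virginia.edu/souptest.html")
page = web.read()

# Parse the HTML using Python's built-in parser
parsed_page = bs4.BeautifulSoup(page, "html.parser")

# Print every <h1> element (the tag and its contents)
for tag in parsed_page.find_all('h1'):
    print(tag)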

HTML you can parse:

<html>
    <head>
        <title>Test Page!</title>
    </head>
    <body>
        <h1>Section 1</h1>
            Here is some really interesting information!
        <h1>Section 2</h1>
            Some more really cool stuff!
        <h1>Section 3</h1>
            Even more stuff you care about!
    </body>
</html>
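As a sketch of how the sample page above could be processed without a network connection, the HTML can be handed to BeautifulSoup as a string. The names SAMPLE_HTML and sections are made up for this example, and it assumes the text for each section sits directly after its h1 tag, as it does in this sample.

```python
import bs4

# The sample page from these notes, stored as a string (no download needed).
SAMPLE_HTML = """
<html>
    <head>
        <title>Test Page!</title>
    </head>
    <body>
        <h1>Section 1</h1>
            Here is some really interesting information!
        <h1>Section 2</h1>
            Some more really cool stuff!
        <h1>Section 3</h1>
            Even more stuff you care about!
    </body>
</html>
"""


def sections(html):
    """Map each <h1> heading's text to the plain text that follows it."""
    parsed_page = bs4.BeautifulSoup(html, "html.parser")
    result = {}
    for heading in parsed_page.find_all("h1"):
        # next_sibling is whatever comes right after the tag; here it is
        # the text between this heading and the next one.
        following = heading.next_sibling
        body = following.strip() if isinstance(following, str) else ""
        result[heading.text] = body
    return result


for title, body in sections(SAMPLE_HTML).items():
    print(title, "->", body)
```

On a real page the text is usually wrapped in its own tags (such as p), so the next_sibling step would need to be adapted.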

Reading the UVa Men's Basketball schedule:

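import urllib.request
import bs4

# Download the schedule page
web = urllib.request.urlopen("http://www.virginiasports.com/sports/m-baskbl/sched/va-m-baskbl-sched.html")
page = web.read()

parsed_page = bs4.BeautifulSoup(page, "html.parser")

# Find only the <td> cells whose CSS class is "sch-col-1"
# (class_ has a trailing underscore because class is a Python keyword)
for tag in parsed_page.find_all('td', class_="sch-col-1"):
    print(tag)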

Reading Lou's List:

import urllib.request
import bs4

# Download the Computer Science course listing
web = urllib.request.urlopen("http://rabi.phys.virginia.edu/mySIS/CS2/page.php?Semester=1168&Type=Group&Group=CompSci")
page = web.read()

parsed_page = bs4.BeautifulSoup(page, "html.parser")

# .text gives only the text inside each matching cell, without the tags
for section_tag in parsed_page.find_all('td', class_="CourseName"):
    print(section_tag.text)