Lecture 27 - Reading the Web

Lecture Date: Monday, March 28

We will start by reviewing regular expressions and working through a few more examples. For instance, let’s see if we can find a list of characters from Alice in Wonderland.

Reading plain text is one thing. Reading through an HTML website is a bit trickier.

As we did in this week's lab, we could ignore the HTML markup and grab only the text we want. But what if the content we want is identified by the HTML tags themselves?

BeautifulSoup is an HTML parsing library. It lets you search an HTML document for particular tags and pull out their contents.

BeautifulSoup4 Docs

import urllib.request
import re

url = "http://cs1110.cs.virginia.edu/alice.txt"
possible_names = {}

# Compile once, outside the loop: one capital letter followed by lowercase letters
regex = re.compile(r"[A-Z][a-z]+")

stream = urllib.request.urlopen(url)
for line in stream:
    # Each line arrives as bytes, so decode it before matching
    decoded = line.decode("UTF-8").strip()
    for result in regex.findall(decoded):
        if result in possible_names:
            possible_names[result] += 1
        else:
            possible_names[result] = 1

print(possible_names)

# Frequent capitalized words are the likeliest candidates for character names
for item in possible_names:
    if possible_names[item] > 30:
        print(item, possible_names[item])
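
The counting loop above can be tried without a network connection. The following is a minimal sketch, assuming a few made-up sample lines in place of the downloaded text; the helper name count_capitalized and the use of collections.Counter are illustrative choices, not part of the lecture code.

```python
import re
from collections import Counter

# Same pattern as the lecture code: a capital letter, then lowercase letters
CAPITALIZED = re.compile(r"[A-Z][a-z]+")


def count_capitalized(lines):
    """Count every capitalized word found in an iterable of strings."""
    counts = Counter()
    for line in lines:
        counts.update(CAPITALIZED.findall(line))
    return counts


# Made-up sample lines standing in for the real text file
sample = [
    "Alice was beginning to get very tired of sitting by her sister.",
    "The Rabbit took a watch out of its pocket, and Alice ran after it.",
    "Down, down, down went Alice.",
]

counts = count_capitalized(sample)
for word, n in counts.most_common():
    print(word, n)
```

Note that words such as "The" and "Down" are counted too, because the pattern matches any capitalized word, including the first word of a sentence. That is why the lecture code only prints words above a frequency threshold.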

# First install BeautifulSoup4, e.g. with: pip install beautifulsoup4

import urllib.request
import bs4

# Download the test page and parse it
web = urllib.request.urlopen("https://cs1110.cs.virginia.edu/souptest.html")
page = web.read()

parsedPage = bs4.BeautifulSoup(page, "html.parser")

# Print every <h1> tag on the page
for tag in parsedPage.find_all('h1'):
    print(tag)

# Same idea on a real page: download and parse the basketball schedule
web = urllib.request.urlopen("http://www.virginiasports.com/sports/m-baskbl/sched/va-m-baskbl-sched.html")
page = web.read()

parsedPage = bs4.BeautifulSoup(page, "html.parser")

# class_ (with a trailing underscore) filters on the HTML class attribute
for tag in parsedPage.find_all('td', class_="row-text"):
    print(tag)
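
Filtering by class can also be tried on a string, without downloading anything. The following is a minimal sketch, assuming a made-up table snippet; the real schedule page's markup, dates, and opponents will differ, and only the class name "row-text" comes from the code above.

```python
import bs4

# Hypothetical snippet: invented rows, not copied from the real schedule page
html = """
<table>
  <tr>
    <td class="row-text">Nov 13</td>
    <td class="row-text">Opponent A</td>
    <td class="score">W</td>
  </tr>
  <tr>
    <td class="row-text">Nov 16</td>
    <td class="row-text">Opponent B</td>
    <td class="score">L</td>
  </tr>
</table>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# Keep only <td> tags whose class attribute is "row-text",
# and take their text instead of the whole tag
cells = [td.get_text(strip=True) for td in soup.find_all("td", class_="row-text")]
print(cells)
```

Printing a tag shows its full markup, while get_text() returns only the text inside it, which is usually what you want to keep.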

Example HTML for Parsing
<html>
  <head>
      <title>Test Page!</title>
  </head>
  <body>
      <h1>Section 1</h1>
          Here is some really interesting information!
      <h1>Section 2</h1>
          Some more really cool stuff!
      <h1>Section 3</h1>
          Even more stuff you care about!
  </body>
</html>
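
The example page above can be parsed directly from a string. The following is a minimal sketch, assuming BeautifulSoup4 is installed; the helper name section_texts is hypothetical, and pairing each heading with the text right after it is one possible approach rather than the only one.

```python
import bs4

# The example page from above, stored as a string
html = """<html>
  <head>
      <title>Test Page!</title>
  </head>
  <body>
      <h1>Section 1</h1>
          Here is some really interesting information!
      <h1>Section 2</h1>
          Some more really cool stuff!
      <h1>Section 3</h1>
          Even more stuff you care about!
  </body>
</html>"""

soup = bs4.BeautifulSoup(html, "html.parser")


def section_texts(soup):
    """Pair each <h1> heading with the plain text that follows it."""
    sections = []
    for heading in soup.find_all("h1"):
        following = heading.next_sibling
        # The text after a heading is a string node; anything else is skipped
        body = following.strip() if isinstance(following, str) else ""
        sections.append((heading.get_text(), body))
    return sections


sections = section_texts(soup)
for title, body in sections:
    print(title, "->", body)
```

Here the text we care about is not inside any tag of its own, so the headings act as landmarks: find each h1, then look at what comes immediately after it.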