
Lecture Date: Wednesday, October 26

Let's review what we've done and do some more examples!

First, back to that weather example we ended with on Monday... Yeah... that was slightly harder than I thought it would be. We have to use a slightly different method to find the one particular tag that holds the temperature. The page marks that tag with a tag attribute, and we can match on attributes by passing find_all a dictionary of attribute names and values (that's the part in curly braces), as shown below.

# https://www.wunderground.com/US/VA/Charlottesville.html

import urllib.request
import bs4

web = urllib.request.urlopen("https://www.wunderground.com/US/VA/Charlottesville.html")
page = web.read()

parsedPage = bs4.BeautifulSoup(page, "html.parser")

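# Match <span> tags by attribute: the dictionary maps attribute names to required values.
for tag in parsedPage.find_all("span", {"data-variable" : "temperature"}, class_="wx-data"):
    # Print the text of the first <span> nested inside each matching tag.
    print(tag.span.text)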
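Since the live page can change or be unreachable, here is a small offline sketch of the same attribute search. The HTML string is made up for illustration (it is not the real Weather Underground markup), and the names html, soup, matches, and temperatures are just placeholders.

```python
import bs4

# Made-up HTML for illustration only; the real page's markup may differ.
html = """
<div id="current">
  <span class="wx-data" data-variable="temperature">
    <span class="wx-value">61.5</span><span class="wx-unit">F</span>
  </span>
  <span class="wx-data" data-variable="humidity">
    <span class="wx-value">48</span><span class="wx-unit">%</span>
  </span>
</div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# The dictionary says: only tags whose data-variable attribute is "temperature".
# class_ narrows the search further to tags with the CSS class "wx-data".
matches = soup.find_all("span", {"data-variable": "temperature"}, class_="wx-data")

# tag.span is the first nested <span>; skip any match that has none.
temperatures = [tag.span.text for tag in matches if tag.span is not None]
print(temperatures)
```

Only the temperature span matches, so the humidity reading is skipped. Checking tag.span against None keeps the loop from crashing if a matching tag has no nested span.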

What could you do with this? Well, it's actually pretty easy to have this code continuously running on a Raspberry Pi with one of these hooked up to it: https://www.adafruit.com/products/306. And then you can have a light bar changing color based on the temperature! Neat!
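As a rough sketch of the color part of that idea, the function below turns a temperature into a red/green/blue color. The function name temperature_to_rgb and the 32 to 100 degree range are assumptions made up for this example, and actually sending the color to a light bar would need hardware-specific code that isn't shown here.

```python
def temperature_to_rgb(temp_f, cold=32.0, hot=100.0):
    """Map a Fahrenheit temperature to an (r, g, b) tuple.

    Temperatures at or below `cold` are pure blue, at or above `hot`
    are pure red, and anything in between is a blend of the two.
    The default range is an arbitrary choice for this sketch.
    """
    # How far along the cold-to-hot range is this temperature?
    fraction = (temp_f - cold) / (hot - cold)
    # Clamp to [0, 1] so out-of-range temperatures still give valid colors.
    fraction = max(0.0, min(1.0, fraction))
    red = int(255 * fraction)
    blue = 255 - red
    return (red, 0, blue)


for reading in (20.0, 61.5, 110.0):
    print(reading, temperature_to_rgb(reading))
```

Each color channel stays between 0 and 255, which is the usual range for RGB values.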

Have you ever gone to a website and wanted to download, say, all of the pictures on that page without having to click all of them? Maybe you want to download a whole bunch of .mp3 files?

# Code based on https://automatetheboringstuff.com/chapter11/
import urllib.request, os, bs4

count = 10 # how many comics to download

url = 'http://xkcd.com'              # starting url
os.makedirs('xkcd', exist_ok=True)   # store comics in ./xkcd

while count > 0:

    # First, download the page.
    print('Downloading page', url)
    webpage = urllib.request.urlopen(url)

    parsed_page = bs4.BeautifulSoup(webpage.read(), "html.parser")

    # Use BeautifulSoup to find the URL of the comic image.
    comic_elem = parsed_page.select('#comic img')
    if not comic_elem:
        print('Could not find comic image.')
    else:
        comic_url = 'http:' + comic_elem[0].get('src')
        # Download the image.
        print('Downloading image', comic_url)
        comic_page = urllib.request.urlopen(comic_url)
        count -= 1

        # Save the image to ./xkcd.
        image_file = os.path.join('xkcd', os.path.basename(comic_url))
        # 'wb' = write in binary mode; the with block closes the file for us.
        with open(image_file, 'wb') as file:
            file.write(comic_page.read())

    # Get the Prev button's url.
    prev_link = parsed_page.select('a[rel="prev"]')[0]
    url = 'http://xkcd.com' + prev_link.get('href')

print('Done.')
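The program above builds its URLs by gluing strings together, which works because of how xkcd happens to write its links. A slightly more general approach is urljoin from the standard library, which resolves a link relative to the page it came from. The sample values below (page_url, image_src, prev_href) are made up to resemble what the program sees and are not taken from a real page.

```python
import os
from urllib.parse import urljoin

# Made-up sample values for illustration.
page_url = 'http://xkcd.com/2000/'
image_src = '//imgs.xkcd.com/comics/example.png'   # protocol-relative link
prev_href = '/1999/'                               # site-relative link

# urljoin fills in whatever the link is missing (scheme, host) from page_url.
comic_url = urljoin(page_url, image_src)
prev_url = urljoin(page_url, prev_href)

# The last path component makes a reasonable file name for the download.
file_name = os.path.basename(comic_url)

print(comic_url)
print(prev_url)
print(file_name)
```

A link that is already a full URL passes through urljoin unchanged, so the same call handles all three kinds of link.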

If time permits, we'll pick a dataset from https://github.com/fivethirtyeight/data and write some more example programs.