Lecture 28 - Reading the Web

Lecture Date: Wednesday, March 30

Let’s start by doing another data analysis exercise. We’ll use country population data from the World Health Organization.

I’ve hosted the data file here: http://cs1110.cs.virginia.edu/pop.csv

Have you ever gone to a website and wanted to download, say, all of the pictures on that page without having to click all of them? Maybe you want to download a whole bunch of .mp3 files?

Today we’re going to take what we learned last time about how to read and parse webpages and put it to good use!
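
As a quick refresher on the "find all of the pictures on a page" idea, here is a minimal sketch that collects the src of every img tag in a piece of HTML. It uses the standard library's html.parser module rather than BeautifulSoup, so it runs without installing anything. The ImageCollector class and the sample HTML are made up for illustration and are not part of the lecture code.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Hypothetical helper: remembers the src of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.image_urls.append(value)


# Made-up HTML standing in for a downloaded page.
sample_html = """
<html><body>
  <img src="/pics/one.png">
  <p>Some text</p>
  <img src="/pics/two.jpg" alt="second">
</body></html>
"""

collector = ImageCollector()
collector.feed(sample_html)
print(collector.image_urls)
```

Once you have a list like this, downloading every picture is just a loop over the list.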

# Mark Sherriff (mss2x)

import urllib.request

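countries = []

# urlopen gives us a stream of bytes; the with block closes it when we're done.
with urllib.request.urlopen("http://cs1110.cs.virginia.edu/pop.csv") as stream:
    next(stream)  # skip the first line (assumed to be the column headers)

    for line in stream:
        # Decode the bytes to text, then split the line into its fields.
        decoded = line.decode("UTF-8").strip().split(",")
        countries.append((decoded[0], int(decoded[1])))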

# Sort by population (the second item in each tuple), largest first.
countries.sort(key=lambda country: country[1], reverse=True)

for country in countries:
    print(country)
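
If you want to try the parsing and sorting steps without a network connection, the sketch below runs the same logic on a small list of bytes objects in place of the web stream. The rows are made up, and the layout (a header line, then name,population on each line) is an assumption based on how the program above reads the file.

```python
# Made-up lines in the format the program above assumes:
# a header line, then "name,population" on each line, as bytes.
sample_lines = [
    b"Country,Population\n",
    b"Exampleland,1200\n",
    b"Samplestan,45000\n",
    b"Testonia,300\n",
]

stream = iter(sample_lines)
next(stream)  # skip the header line

countries = []
for line in stream:
    # Same steps as before: bytes -> text -> list of fields.
    decoded = line.decode("UTF-8").strip().split(",")
    countries.append((decoded[0], int(decoded[1])))

# Largest population first.
countries.sort(key=lambda country: country[1], reverse=True)

for country in countries:
    print(country)
```

Converting the population with int() matters here: if the values were left as strings, the sort would compare them character by character instead of numerically.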

# Mark Sherriff (mss2x)
# Code based on https://automatetheboringstuff.com/chapter11/
import urllib.request, os, bs4

count = 10 # how many comics to download

url = 'http://xkcd.com'              # starting url
os.makedirs('xkcd', exist_ok=True)   # store comics in ./xkcd

while count > 0:

    # First, download the page.
    print('Downloading page', url)
    webpage = urllib.request.urlopen(url)

    parsed_page = bs4.BeautifulSoup(webpage.read(), "html.parser")

    # Use BeautifulSoup to find the URL of the comic image.
    comic_elem = parsed_page.select('#comic img')
    if comic_elem == []:
        print('Could not find comic image.')
    else:
        comic_url = 'http:' + comic_elem[0].get('src')
        # Download the image.
        print('Downloading image', comic_url)
        comic_page = urllib.request.urlopen(comic_url)
        count -= 1

        # Save the image to ./xkcd.
        image_file = os.path.join('xkcd', os.path.basename(comic_url))
        # 'wb' writes raw bytes; the with block closes the file for us.
        with open(image_file, 'wb') as file:
            file.write(comic_page.read())

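    # Get the Prev button's url; stop if there is no earlier comic to visit.
    prev_links = parsed_page.select('a[rel="prev"]')
    if prev_links == [] or prev_links[0].get('href') == '#':
        break
    url = 'http://xkcd.com' + prev_links[0].get('href')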

print('Done.')
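
The program above builds full URLs by gluing strings together ('http:' + src, and 'http://xkcd.com' + href). One alternative, sketched here, is urljoin from the standard library's urllib.parse module, which resolves a link relative to the page it came from. The src and href values below are hypothetical examples of the two shapes the program handles, not values taken from the real site.

```python
from urllib.parse import urljoin

page_url = "http://xkcd.com/"

# Hypothetical attribute values of the kinds the program above handles.
image_src = "//imgs.example.com/comics/sample.png"   # no scheme, starts with //
prev_href = "/1234/"                                  # path only

# urljoin fills in whatever the link is missing from the page's own URL.
comic_url = urljoin(page_url, image_src)
prev_url = urljoin(page_url, prev_href)

print(comic_url)
print(prev_url)
```

The advantage is that the same call works whether a link is a full URL, a path, or a scheme-less address, so the code does not need to know in advance which form a site uses.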