Lecture 31 - Reviewing Functions

Lecture Date: Friday, November 6

Today, we’re going to review a lot of what we’ve done up to this point to get ready for the test!

# Mark Sherriff (mss2x)

import urllib.request, bs4, random

def get_links(url):
    web = urllib.request.urlopen(url)
    page = web.read()

    parsedPage = bs4.BeautifulSoup(page, "html.parser")
    links = []

    for tag in parsedPage.find_all('a', {"href" : True}):
        if tag['href'].startswith('/wiki/'):
            links.append(tag['href'])
    return links

def pick_random_link(_links):
    rand_int = random.randint(0,len(_links)-1)

    # Keep picking until the chosen link is not a "File:" page
    while _links[rand_int].find("File:") != -1:
        rand_int = random.randint(0,len(_links)-1)

    return _links[rand_int]

print("Welcome to the Wikipedia Game!")
url = input("What page do you want to start with?: ")
hops = int(input("What is the maximum number of hops?: "))
url = 'https://en.wikipedia.org/wiki/' + url
url = url.replace(' ', '_')
path = []
path.append(url)
while hops > 0:
    link_list = get_links(url)
    url = 'https://en.wikipedia.org' + pick_random_link(link_list)
    path.append(url)
    hops -= 1

print("Game created!")
print("You need to get from:", path[0])
print("To:", path[-1])
input("Press Enter to see the answer!")

for link in path:
    print(link)

No audio today


Test 2 Review Guide

Test date: Wednesday, November 11, 2015
Test location: normal lecture hall (you MUST go to your assigned section)
Test duration: 50 minutes (for 1110 and 1111)
Test format: writing on paper; bring pen/pencil (and nothing else)
Review session: Tuesday, November 10, 6:00-7:00 PM in Minor 125

Review Sheet

General Points:

  • The format of the test will largely match that of Test 1 - some short answer, some code reading, some multiple choice, and a large coding question at the end.
  • Everything from the first part of the semester is still fair game. It’s hard to ask a coding question that doesn’t require you to know how to write an if statement, for instance.

Not on the Test:

  • Specifics of the encryption chase.

Functions:

  • You WILL be asked to write functions on this test (a short refresher sketch follows this list)
  • Know what the function header / signature is
  • Know the difference between a void return and a function that returns a value of some kind (and when you would use each)
  • Know what named / optional parameters are and how they are used
  • Know what positional parameters are and that they must come first in a function call
  • Know the difference between pass by value and pass by reference
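
For example, here is a small refresher sketch (the function names and values are made up for illustration):

# Hypothetical examples for review - not from lecture
def area_of_rectangle(width, height):
    return width * height                      # returns a value the caller can use

def print_receipt(total, store="Bookstore"):   # store is a named/optional parameter
    print("Thanks for shopping at", store)     # no return statement - a "void" function
    print("Your total is", total)

def add_bonus(scores):                         # mutable arguments (like lists) behave like
    scores.append(100)                         # "pass by reference" - the caller sees the change

result = area_of_rectangle(3, 4)               # 3 and 4 are positional arguments
print_receipt(result)                          # uses the default value for store
print_receipt(result, store="Corner Market")   # a named argument overrides the default

grades = [90, 85]
add_bonus(grades)
print(grades)                                  # prints [90, 85, 100]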

Files:

  • Know how to read structured files (e.g., CSV) from a local file or over the Internet (see the sketch after this list)
  • Know how to write data to a file
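
For example, here is a minimal sketch, assuming a local file called grades.csv with lines like "Alice,92":

# Read a structured (CSV) file - the file name and format are hypothetical
scores = []
datafile = open("grades.csv", "r")
for line in datafile:
    parts = line.strip().split(",")    # e.g. ["Alice", "92"]
    scores.append(int(parts[1]))
datafile.close()

# Write a small report back out to disk
outfile = open("report.txt", "w")
outfile.write("Number of students: " + str(len(scores)) + "\n")
outfile.write("Average score: " + str(sum(scores) / len(scores)) + "\n")
outfile.close()

To read the same kind of data over the Internet, you would use urllib.request.urlopen() and decode the bytes instead of calling open(), as in the web-reading lectures below.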

String Processing:

  • Know some of the basic methods for parsing strings (.find(), .replace(), etc.) - see the small sketch after this list
  • Know what regular expressions are, how to read a simple one, and when you might use them (but we will not ask you to write a regular expression)
  • Know what BeautifulSoup does and when you would use it (but we will not ask you to use it in any code on the test)
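
For instance, here is a tiny sketch (the string is made up) of the kind of parsing and regular-expression reading you should be comfortable with:

import re

s = "Office hours: Rice 401, Fall 2015"
print(s.find("Rice"))                  # index where "Rice" starts, or -1 if missing
print(s.replace("Fall", "Spring"))     # replace one substring with another
print(re.findall(r"[0-9]+", s))        # a simple regex: one or more digits -> ['401', '2015']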

Sending and Reading Email:

  • Know what SMTP and IMAP are (no coding)

Study Hints

Practice coding on paper

You’ll be writing code on paper. This feels different from writing it in Eclipse, and you don’t want to be surprised by how different it feels. Practice writing code by hand.

A few small syntax errors are OK, but if you are really off we will take off points. Try to write correct code.

We’ll give you any imports you might need - so don’t worry about memorizing those.

Try re-solving the POTD and Lab problems on paper without looking at your code or textbook.

You can find more sample problems in Programming Challenges in the textbook. We do not, however, have the answer key to share with you.

Also remember that speed matters. 50 minutes is not a long time.

Practice reading code

We will show you code and ask you what it does. You won’t be able to have Python run it. Practice thinking through code without running it.

Review the Lectures

Not everything in the book is equally important. Review the lecture notes to see what we emphasized. If you are confused by some point, check the audio. You might want to listen to the audio of the other instructor (the one you didn’t hear in class) so that you can get a different perspective on the material.


Lecture 30 - Sending Email

Lecture Date: Wednesday, November 4

We can harvest emails. We can parse webpages for data. We’re really good at gathering info from other people. Let’s take a look at generating some data!

Today we’re going to learn how to send email via Python. We’ll look at SMTP and IMAP.

SMTP stands for the Simple Mail Transfer Protocol. It’s one of the earliest protocols used on the Internet. Mail servers use this protocol to send and receive messages between one another - that’s how mail is passed from one network to the next. Mail is sent from server to server until it reaches the destination mail server, where it is placed in a queue waiting to be downloaded by the user.

IMAP stands for the Internet Message Access Protocol. IMAP messages tend to “live” on the mail server, and the mail client simply reads and manipulates them remotely. The idea is that people can use different mail clients from different locations to manage their mail. This is different from POP (the Post Office Protocol), in which messages are downloaded to a single machine (not many clients still use POP). So, IMAP is what lets you see the same email on your laptop, your phone, and in a browser.

# Mark Sherriff (mss2x)
# Send email through Gmail

import smtplib

login = input("My Gmail account: ")
password = input("My Gmail password: ")
to_address = input("Send email to: ")
message = input("The message is: ")

smtp_conn = smtplib.SMTP('smtp.gmail.com', 587)

smtp_conn.starttls()

smtp_conn.login(login, password)

smtp_conn.sendmail(login, to_address, message)

smtp_conn.quit()
# Mark Sherriff (mss2x)
# Downloads your newest email

import imapclient
import pyzmail

login = input("My Gmail account: ")
password = input("My Gmail password: ")

imap_conn = imapclient.IMAPClient('imap.gmail.com', ssl=True)
imap_conn.login(login, password)
imap_conn.select_folder('INBOX', readonly=True)

UIDs = imap_conn.search() # Get a list of messages in your Inbox
newest_message = UIDs[-1] # Get the last one in the list - the newest

# Get the binary text of the message
raw_messages = imap_conn.fetch(newest_message, ['BODY[]', 'FLAGS'])

# Process the message into something readable
message = pyzmail.PyzMessage.factory(raw_messages[newest_message][b'BODY[]'])

print(message.get_subject())
print(message.get_address('to'))
print(message.get_address('from'))
print(message.text_part.get_payload().decode())

imap_conn.logout()


Lecture 29 - Reading the Web 2

Lecture Date: Monday, November 2

Have you ever gone to a website and wanted to download, say, all of the pictures on that page without having to click all of them? Maybe you want to download a whole bunch of .mp3 files?

Today we’re going to take what we learned last time about how to read and parse webpages and put it to good use!

# Mark Sherriff (mss2x)
# Code based on https://automatetheboringstuff.com/chapter11/
import urllib.request, os, bs4

count = 10 # how many comics to download

url = 'http://xkcd.com'              # starting url
os.makedirs('xkcd', exist_ok=True)   # store comics in ./xkcd

while count > 0:

    # First, download the page.
    print('Downloading page', url)
    webpage = urllib.request.urlopen(url)

    parsed_page = bs4.BeautifulSoup(webpage.read(), "html.parser")

    # Use BeautifulSoup to find the URL of the comic image.
    comic_elem = parsed_page.select('#comic img')
    if comic_elem == []:
        print('Could not find comic image.')
    else:
        comic_url = 'http:' + comic_elem[0].get('src')
        # Download the image.
        print('Downloading image', comic_url)
        comic_page = urllib.request.urlopen(comic_url)
        count -= 1

        # Save the image to ./xkcd.
        image_file = os.path.join('xkcd', os.path.basename(comic_url))
        with open(image_file, 'wb') as file:
            file.write(comic_page.read())

    # Get the Prev button's url.
    prev_link = parsed_page.select('a[rel="prev"]')[0]
    url = 'http://xkcd.com' + prev_link.get('href')

print('Done.')
# Mark Sherriff (mss2x)
# Code based on https://automatetheboringstuff.com/chapter11/
import urllib.request, os, bs4

count = 0 # how many images downloaded so far

url = 'http://imgur.com/topic/Halloween'              # starting url
os.makedirs('imgur', exist_ok=True)   # store images in ./imgur

# First, download the page.
print('Downloading page', url)
webpage = urllib.request.urlopen(url)

parsed_page = bs4.BeautifulSoup(webpage.read(), "html.parser")

# Use BeautifulSoup to find the links to the images.
comic_elem = parsed_page.select('.image-list-link')
if comic_elem == []:
    print('Could not find any images.')
else:
    while count < 10:
        image_name = comic_elem[count].get('href')
        comic_url = 'http://i.imgur.com' + image_name[16:] + '.jpg'
        # Download the image.
        print('Downloading image', comic_url)
        comic_page = urllib.request.urlopen(comic_url)
        count += 1

        # Save the image to ./imgur.
        image_file = os.path.join('imgur', os.path.basename(comic_url))
        with open(image_file, 'wb') as file:
            file.write(comic_page.read())

print('Done.')


Lecture 28 - Reading the Web

Lecture Date: Friday, October 30

Reading plain text is one thing. Reading through an HTML website is a bit trickier.

Like we did for the lab this week, we could basically ignore all the HTML and just grab the text we want. But what if the stuff we want is marked by the HTML itself?

BeautifulSoup is an HTML parsing library. It lets you search through HTML for specific tags and pull out just the elements you want.

BeautifulSoup4 Docs

# First install BeautifulSoup4

import urllib.request
import bs4

web = urllib.request.urlopen("https://cs1110.cs.virginia.edu/souptest.html")
page = web.read()

parsedPage = bs4.BeautifulSoup(page, "html.parser")

for tag in parsedPage.find_all('h1'):
    print(tag)

web = urllib.request.urlopen("http://www.virginiasports.com/sports/m-baskbl/sched/va-m-baskbl-sched.html")
page = web.read()

parsedPage = bs4.BeautifulSoup(page, "html.parser")

for tag in parsedPage.find_all('td', class_="row-text"):
    print(tag)

Example HTML for Parsing
<html>
  <head>
      <title>Test Page!</title>
  </head>
  <body>
      <h1>Section 1</h1>
          Here is some really interesting information!
      <h1>Section 2</h1>
          Some more really cool stuff!
      <h1>Section 3</h1>
          Even more stuff you care about!
  </body>
</html>


Lecture 27 - Regular Expressions

Lecture Date: Wednesday, October 28

Using the string libraries to parse a body of text is really good for reading information from top to bottom and making decisions based on that information. It’s also good for parsing data into another data structure.

But sometimes you just want to find all of one particular kind of data. Or you want to verify that some piece of text follows a very particular format. For those instances, we can use regular expressions.

A regular expression is a form of pattern matching. Using a defined set of tokens, we can describe a pattern that a piece of text must match before it is accepted and returned.

Here’s a starter list of tokens that you might use (a short example combining them follows the list):

[abc] - any one of those three characters
[a-z] - any one lowercase letter
[a-z0-9] - any one lowercase letter or digit
. - any one character
\. - an actual period
* - 0 to many of the previous token
? - 0 or 1 of the previous token
+ - 1 to many of the previous token
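
For example, here is a small sketch (with made-up data) that combines several of these tokens into one pattern:

import re

# [a-z0-9]+ means one or more lowercase letters/digits; \. is a literal period
email_pattern = re.compile(r"[a-z0-9]+@[a-z0-9]+\.[a-z]+")
print(email_pattern.findall("Write to mss2x@virginia.edu or cs1110ta@gmail.com"))
# prints ['mss2x@virginia.edu', 'cs1110ta@gmail.com']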

Let’s test some regular expressions at https://regex101.com/!

# Mark Sherriff (mss2x)

import re

def is_valid_password(password):
    regex = re.compile(r"^.*(?=.{8,})(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*$")
    results = regex.search(password)
    if results != None:
        return True
    return False

def find_possible_names(text):
    regex = re.compile(r"[A-Z][a-z]*")
    results = regex.search(text)
    if results != None:
        return results.group()
    return None

def find_all_possible_names(text):
    names = []

    regex = re.compile(r"[A-Z][a-z]*")
    results = regex.findall(text)
    if len(results) != 0:
        for name in results:
            names.append(name)
    return names


print(is_valid_password("password"))
print(is_valid_password("abCd723223"))
print(find_possible_names("some text before Mark and more"))
print(find_possible_names("some text before Mark and Bob stuff"))
print(find_all_possible_names("some text before Mark and Bob stuff"))


Lecture 26 - String Processing

Lecture Date: Monday, October 26

Now that you have your partner project, we should look more closely at how to parse text and find the information you want. Often, text is messy - it isn’t nicely laid out like a CSV file where each data point is cleanly separated from the next. Sometimes you have to figure out ways to hunt through a lot of information to pull out just the one nugget you want.

Let’s look through the string API to see what we can find!

Python str API

Python string API

Full Text of the Last Debate of 2012 Presidential Campaign

Functions to know (a short example using several of these follows the list):

  • startswith()
  • endswith()
  • strip(), rstrip(), lstrip()
  • count()
  • find(), rfind()
  • index(), rindex()
  • join()
  • replace()
  • split()
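
As a quick warm-up, here is a small sketch showing several of these methods on a made-up line of text:

line = "  SCHIEFFER: Welcome to the debate, America!  "
line = line.strip()                    # remove leading/trailing whitespace
print(line.startswith("SCHIEFFER"))    # True
print(line.count("e"))                 # how many times "e" appears
print(line.find("America"))            # index of the first occurrence, or -1
print(line.replace("Welcome", "Good evening"))
words = line.split(" ")                # break the line into a list of words
print("-".join(words))                 # glue the list back together with dashes

The full in-class example below walks through the 2012 debate transcript line by line.
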
# Mark Sherriff (mss2x)

speaking = ""
count = 0
line_number = 1

debate_file = open("2012debate.txt", "r")

for line in debate_file:
    line = line.strip()
    if line.startswith("SCHIEFFER"):
        speaking = "SCHIEFFER"
    elif line.startswith("OBAMA"):
        speaking = "OBAMA"
    elif line.startswith("ROMNEY"):
        speaking = "ROMNEY"

    if "America" in line:
        words = line.split(" ")
        for word in words:
            if "America" in word:
                print(speaking, "said America on line", line_number)
                count += 1
    line_number += 1

print(count)
text = '<a href="mailto:sherriff@virginia.edu">Email Me!</a>'

at_sign = text.index('@')
colon = text.index(":")
end_quote = text.index('"', at_sign)

print(text[colon+1:end_quote])


Lecture 25 - Writing Files 2

Lecture Date: Friday, October 23

Reading files is great, but what if we want to write stuff to disk? How can we do this… and not blow up our computers? Consider what would happen if we created an infinite loop….

# Mark Sherriff (mss2x)

import os

print(os.listdir(os.getcwd()))

total_size = 0
for filename in os.listdir(os.getcwd()):
    total_size = total_size + os.path.getsize(os.path.join(os.getcwd(), filename))

print(total_size)

output_file = open("output_file.txt", "w")

output_file.write(str(os.listdir(os.getcwd())))
output_file.write('\n')
output_file.write(str(total_size))
output_file.close()
shopping_list = []

# Read the file into the list
datafile = open("shopping_list.txt", "r")

for line in datafile:
    line = line.strip()
    shopping_list.append(line)
datafile.close()

print("Your current shopping list is:")
for item in shopping_list:
    print(item)

print()
while True:
    item_to_add = input("Items to add (NONE to stop): ")
    if item_to_add.upper() == "NONE":
        break
    shopping_list.append(item_to_add)

print()
while True:
    item_to_remove = input("Items to remove (NONE to stop): ")
    if item_to_remove.upper() == "NONE":
        break
    shopping_list.remove(item_to_remove)  # note: this raises an error if the item is not in the list

print()
print("Your current shopping list is:")
for item in shopping_list:
    print(item)

print()
print("Saving to shopping_list.txt...")
datafile = open("shopping_list.txt", "w")

for item in shopping_list:
    datafile.write(item + "\n")
datafile.close()