Date

Due: Friday, November 4, 11:00 AM

Have you ever wondered how spammers get all their email addresses to send those lovely messages about foreign princes with banking troubles or the latest and greatest medication for... whatever?

We here in CS 111X want to equip you with the skills to become a spammer yourself! Well, not really, but the techniques used to farm email addresses can be really useful!

The goal of this assignment is to write a program that will scan a web page and harvest as many email addresses as possible. Many of these email addresses will be obfuscated in some way. It's up to you to get the computer to recognize the obfuscation and return a good result!

Your grade will be based on how many email addresses, and which types, you can find in the page. We will provide examples of most types of obfuscation, but not necessarily all. Bonus points may be earned for some really tricky ones.

Here are some examples to get you started (in the form "obfuscated email" => "what your program should interpret the email as"):

  • mst3k@Virginia.EDU => mst3k@Virginia.EDU
  • thomas.jefferson@cs.virginia.edu => thomas.jefferson@cs.virginia.edu
  • mst3k at virginia.edu => mst3k@virginia.edu
  • mst3k at virginia dot edu => mst3k@virginia.edu
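As a rough illustration of the last two examples, the sketch below rewrites the words "at" and "dot" into the symbols they stand for. The helper name deobfuscate is hypothetical, and the approach is deliberately naive: it would also rewrite ordinary phrases such as "look at this", so a real solution needs to be more selective about where it applies the substitution.

```python
import re

def deobfuscate(text):
    """Hypothetical helper: turn ' at ' into '@' and ' dot ' into '.'."""
    # Replace the word "at" (with whitespace on both sides) by an @ sign.
    text = re.sub(r"\s+at\s+", "@", text, flags=re.IGNORECASE)
    # Replace the word "dot" (with whitespace on both sides) by a period.
    text = re.sub(r"\s+dot\s+", ".", text, flags=re.IGNORECASE)
    return text
```

With this sketch, deobfuscate("mst3k at virginia dot edu") gives "mst3k@virginia.edu", and an address that is already in plain form passes through unchanged.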

Tips

You can read the entire web page line by line to make it easier to search.

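import urllib.request

# Open the page; each item the stream yields is one line of raw bytes.
stream = urllib.request.urlopen("https://cs1110.cs.virginia.edu/emails.html")

for line in stream:
    # Decode the bytes into a str before searching or printing it.
    decoded = line.decode("UTF-8")
    print(decoded.strip())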

Once you have a line from the web site, you have a couple of different options:

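  1. You can manually look for particular symbols by using the in keyword. For example, with the decoded line from the loop above, you could try if "@" in decoded: to see if there is an @ sign in the line you are looking at (the raw line is bytes, so decode it before comparing it with a string). If so, you might want to take a closer look.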
  2. You can come up with regular expressions that will look for particular patterns in a line that could be an email address. You can test regular expressions against test data you provide here: http://www.regexr.com/ or https://regex101.com/
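The two options can also be combined. The sketch below first uses in as a cheap check and then applies a regular expression to the lines that pass. The pattern and the helper name emails_in_line are assumptions made for illustration; the pattern only recognizes plain, unobfuscated addresses and is not meant as a complete answer.

```python
import re

# A deliberately simple pattern (an assumption, not the required one):
# a run of name characters, an @ sign, then two or more dot-separated labels.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

def emails_in_line(line):
    """Hypothetical helper: return the plain addresses in one decoded line."""
    # Option 1: a quick membership test skips lines that cannot match.
    if "@" not in line:
        return []
    # Option 2: a regular expression pulls out every match in the line.
    return EMAIL_PATTERN.findall(line)
```

Because the domain part must end in a letter, digit, or hyphen, a period that merely ends a sentence is left out of the match.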

You cannot use BeautifulSoup or any other HTML parsing library for this, as most of the email addresses are not within HTML tags that you can identify. So, we're going to save you some time here and just say don't try it. Further, the submission server will reject any assignment that uses one.

No one method or one regular expression will get every email address. As mentioned above, we've intentionally put some extremely difficult addresses in the page just to see what you can do!

Your program must implement a particular function:

find_emails_in_website(url): This function takes as input a string representation of the URL of a website that you want to search and should return a list of the email addresses you find.
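One possible shape for this function is sketched below, assuming the line-by-line reading approach from the Tips section. Only the name find_emails_in_website and its url parameter come from the assignment; the helper find_emails_in_line and the PLAIN_EMAIL pattern are hypothetical, and the pattern handles only plain addresses, so treat this as a starting skeleton rather than a solution.

```python
import re
import urllib.request

# Hypothetical, intentionally simple pattern: it only matches plain addresses.
PLAIN_EMAIL = re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

def find_emails_in_line(line):
    """Hypothetical helper: return the plain addresses found in one line of text."""
    return PLAIN_EMAIL.findall(line)

def find_emails_in_website(url):
    """Return a list of the email addresses found on the page at url."""
    emails = []
    with urllib.request.urlopen(url) as stream:
        for raw_line in stream:
            # Each raw line is bytes; decode it before searching.
            line = raw_line.decode("UTF-8")
            emails.extend(find_emails_in_line(line))
    return emails
```

Keeping the per-line search in its own function makes it easy to add more patterns later without touching the code that downloads the page.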

You can create as many other functions as you like, but this is the function that we will call with different sites to see how well your program works.

We have a couple of pages that you can use to test your code. These pages have a set of example emails you should be able to find (and some that you can look for but we are not requiring). Do not "hardcode" your solutions! In other words, if you write your code to look for these exact emails, then it won't work when we test it. These are just examples.

Example 1 is at: https://cs1110.cs.virginia.edu/emails.html

For the example page, you should hopefully find:

basic@virginia.edu
link-only@virginia.edu
multi-domain@cs.virginia.edu
Mr.N0body@cand3lwick-burnERS.rentals
a@b.ca
no-at-sign@virginia.edu
no-at-or-dot@virginia.edu
first.last.name@cs.virginia.edu
with-parenthesis@Virginia.EDU
added-words1@virginia.edu
added-words2@virginia.edu
may.end@with-a-period.com
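
To check your progress against this list, a small comparison helper can live in a separate testing file (which is not submitted). The sketch below is one way to do it; the names EXPECTED_EXAMPLE_1 and missing_addresses are hypothetical, and the comparison is exact, including letter case, which is an assumption about how results are judged.

```python
# Expected addresses for Example 1, copied from the list above.
EXPECTED_EXAMPLE_1 = [
    "basic@virginia.edu",
    "link-only@virginia.edu",
    "multi-domain@cs.virginia.edu",
    "Mr.N0body@cand3lwick-burnERS.rentals",
    "a@b.ca",
    "no-at-sign@virginia.edu",
    "no-at-or-dot@virginia.edu",
    "first.last.name@cs.virginia.edu",
    "with-parenthesis@Virginia.EDU",
    "added-words1@virginia.edu",
    "added-words2@virginia.edu",
    "may.end@with-a-period.com",
]

def missing_addresses(found, expected):
    """Hypothetical helper: return the expected addresses that are not in found."""
    found_set = set(found)
    return [address for address in expected if address not in found_set]
```

Calling missing_addresses(find_emails_in_website(url), EXPECTED_EXAMPLE_1) from the testing file then shows which kinds of obfuscation your program does not handle yet. The list is used only for checking, never inside email_finder.py itself, so it does not count as hardcoding.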

Example 2 is at: https://cs1110.cs.virginia.edu/emails2.html

For the example page, you should hopefully find:

abasicemail@wfu.edu
a-link-only@unc.edu
so-many-domains@ece.berkeley.edu
SomE.CRAZY343@ea.info
w@x.yz
an-at-sign@ncsu.edu
some-other-email@gt.com
so.many.periods@why.do.this
parensarecool@duke.edu
morewords@place.net
extrawords@coolrunnings.ja
period.at@at.the.end

Submission: Submit your file email_finder.py to the POTD submission system at https://archimedes.cs.virginia.edu/cs1110/. DO NOT submit any testing files. We only want your email_finder.py file.

NOTE: Make sure to remove all print() statements from your code before submitting. We will not run any tests on any file that still has print() statements in the code!