Date

October 27, 2016

Activity 1: Login and Record Attendance

We will be taking roll in lab each week! Please come to your assigned lab to be counted present!

If you are in an Olsson lab, click "Lab Attendance" on the left-hand menu in Collab to register your attendance and keep up with your lab grade.
YOU MUST CLICK THE LARGE GREEN OR YELLOW BUTTON FOR YOUR ATTENDANCE TO COUNT!!

You must do this from a machine in Olsson 001 and not your laptop. If you have trouble, talk to your lab TA. Students in Lab 109 will do attendance via direction from the TA.

Activity 2: Take Quiz 5

While you are waiting for lab to start, click on Tests & Quizzes in Collab from either the desktop or laptop and take Quiz 5 - a short review quiz on the material we have covered thus far. If you do not complete it today, you have until Sunday to do so.

Activity 3: Starting Out

First, find a partner to work with! The TAs will quickly review regular expressions again. Remember - you can test your expressions with a website like http://www.regexr.com/ or https://regex101.com/.
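You can also try an expression directly in Python before pointing it at real data. The snippet below is only a minimal sketch: the sample string and the time-matching pattern are made up for illustration and are not part of the lab.

```python
import re

# Hypothetical sample text, invented for this illustration.
sample = "Office hours are at 10:30 and 14:45 in Olsson 001."

# One or two digits, a colon, then exactly two digits (a clock time).
time_pattern = re.compile(r"\d{1,2}:\d{2}")

# findall returns every non-overlapping match as a list of strings.
times = time_pattern.findall(sample)
print(times)  # ['10:30', '14:45']
```

If the list that comes back is not what you expected, adjust the pattern and run it again.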

Activity 4: Combing Through Data

For lab today, we're going to look through a relatively large data set - the log file from the server running our course website.

The server's name is stardock.cs.virginia.edu. This machine is hosting the websites for multiple classes, the office hours queue for 1110, 1113, and 2110, and all of Sherriff's own personal projects. Can you figure out what takes up the most traffic? Where are people accessing the server from - on grounds or at home?

We're going to look at one 24-hour time period - noon on Sunday, Nov 1, 2015 through noon on Monday, Nov 2, 2015. Please download this file and put it into your PyCharm project directory. The file is too large to download every time you run your program.

access.log file - http://cs1110.cs.virginia.edu/code/access.log - 19.2 MB

Open the file in a text editor if you want to see what the data looks like. Here is an example:

172.26.30.213 - - [01/Nov/2015:12:00:14 -0500] "GET /oh/queue_count.php HTTP/1.1" 302 1886 "https://stardock.cs.virginia.edu/oh/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36"

The important bits here are:

  • 172.26.30.213 - This is the IP address where the request is coming from.
  • [01/Nov/2015:12:00:14 -0500] - The date and time of the request.
  • GET /oh/queue_count.php - The HTTP command being executed. This says "go find me this page and send it to me."
  • https://stardock.cs.virginia.edu/oh/ - This is the referring URL. It shows what page someone was on when the command executed.
  • The rest of the line - This provides OS and browser information.
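To see how these pieces line up, here is a rough sketch that pulls the fields out of the example line with a single regular expression. The pattern and the group names (ip, timestamp, request, status, size, referrer, agent) are assumptions made for this illustration; you do not need to parse the whole line this way for the lab.

```python
import re

# The example line from above, split across string literals for readability.
sample_line = (
    '172.26.30.213 - - [01/Nov/2015:12:00:14 -0500] '
    '"GET /oh/queue_count.php HTTP/1.1" 302 1886 '
    '"https://stardock.cs.virginia.edu/oh/" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36"'
)

# Hypothetical pattern with one named group per field of interest.
line_pattern = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ '              # client IP, then two fields we skip
    r'\[(?P<timestamp>[^\]]+)\] '         # date and time inside [ ]
    r'"(?P<request>[^"]*)" '              # the HTTP command
    r'(?P<status>\d{3}) (?P<size>\S+) '   # status code and response size
    r'"(?P<referrer>[^"]*)" '             # referring URL
    r'"(?P<agent>[^"]*)"'                 # OS and browser information
)

match = line_pattern.match(sample_line)
fields = match.groupdict() if match else {}

print(fields["ip"])        # 172.26.30.213
print(fields["timestamp"]) # 01/Nov/2015:12:00:14 -0500
print(fields["request"])   # GET /oh/queue_count.php HTTP/1.1
print(fields["referrer"])  # https://stardock.cs.virginia.edu/oh/
```

Lines in the real file that do not follow this exact layout would not match, which is why the sketch falls back to an empty dictionary.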

Activity 5: Answering Questions

Create a new Python file called log_hunt.py where you will write your program.

There are a number of interesting questions we could answer by looking at this data. When was the most traffic? Was something going on then (office hours, an assignment due, etc.)? Where did most of the traffic come from?

We'll limit ourselves to two specific questions:

  • What referrer was generating the most traffic?
  • Which wireless network were more people on at the time - wahoo or cavalier?

For the first question, you should create a regular expression that can find all of the URLs on a given line. Feel free to do some Google searching to help you with this! Then, using whatever method you want (dictionary, list, etc.), figure out which URL was generating the most traffic.
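One possible shape for the counting step is sketched below. It is not the only way to do it: the URL pattern is a simple assumption (anything starting with http:// or https:// up to the next space or quote), the three log lines are shortened and partly made up for illustration, and collections.Counter is used here in place of a hand-built dictionary.

```python
import re
from collections import Counter

# Shortened log lines; the second and third are invented for this sketch.
sample_lines = [
    '172.26.30.213 - - [01/Nov/2015:12:00:14 -0500] "GET /oh/queue_count.php HTTP/1.1" '
    '302 1886 "https://stardock.cs.virginia.edu/oh/" "Mozilla/5.0"',
    '172.27.1.2 - - [01/Nov/2015:12:00:15 -0500] "GET /oh/queue_count.php HTTP/1.1" '
    '302 1886 "https://stardock.cs.virginia.edu/oh/" "Mozilla/5.0"',
    '172.25.9.9 - - [01/Nov/2015:12:00:16 -0500] "GET /index.html HTTP/1.1" '
    '200 512 "http://cs1110.cs.virginia.edu/" "Mozilla/5.0"',
]

# Assumed pattern: http or https, then everything up to whitespace or a quote.
url_pattern = re.compile(r'https?://[^\s"]+')

# Counter is a dictionary that starts every missing key at zero.
url_counts = Counter()
for line in sample_lines:
    for url in url_pattern.findall(line):
        url_counts[url] += 1

# most_common(1) returns a list holding the single (url, count) pair with the highest count.
top_url, top_count = url_counts.most_common(1)[0]
print(top_url, top_count)  # https://stardock.cs.virginia.edu/oh/ 2
```

In log_hunt.py you would loop over the lines of access.log instead of sample_lines.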

For the second question, you should create a regular expression that can find all of the IP addresses on a given line. Be careful - the Google Chrome browser version looks suspiciously like an IP address, so you'll have to account for that. By checking the network info page at ITS at http://its.virginia.edu/network/ipspace.html, we can see which IP ranges are assigned to the various networks. To make things slightly simpler, we'll say that any IP address that starts with 172.25. is on cavalier and any that starts with 172.27. is on wahoo. Report back how many requests come from each domain set.
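One way to sidestep the Chrome version problem is sketched here, under the assumption that the requesting IP address is always the first thing on the line (as it is in the example above). Anchoring the pattern to the start of the line is a different technique from finding every address on the line, so treat it as one option rather than the expected answer. The function name classify_network and the sample lines are made up for illustration.

```python
import re

# ^ anchors the match to the start of the line, so a version number
# later in the line (such as Chrome/46.0.249.80) is never considered.
client_ip_pattern = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\b")


def classify_network(ip_address):
    """Return 'cavalier', 'wahoo', or 'other' using the lab's simplified prefixes."""
    if ip_address.startswith("172.25."):
        return "cavalier"
    if ip_address.startswith("172.27."):
        return "wahoo"
    return "other"


# Invented log lines; note the IP-like Chrome version in the first one.
sample_lines = [
    '172.25.4.10 - - [01/Nov/2015:12:01:00 -0500] "GET / HTTP/1.1" 200 100 "-" "Chrome/46.0.249.80"',
    '172.27.8.20 - - [01/Nov/2015:12:01:01 -0500] "GET / HTTP/1.1" 200 100 "-" "Mozilla/5.0"',
    '172.27.8.21 - - [01/Nov/2015:12:01:02 -0500] "GET / HTTP/1.1" 200 100 "-" "Mozilla/5.0"',
    '10.0.0.5 - - [01/Nov/2015:12:01:03 -0500] "GET / HTTP/1.1" 200 100 "-" "Mozilla/5.0"',
]

network_counts = {"cavalier": 0, "wahoo": 0, "other": 0}
for line in sample_lines:
    match = client_ip_pattern.match(line)
    if match:
        network_counts[classify_network(match.group(1))] += 1

print(network_counts)  # {'cavalier': 1, 'wahoo': 2, 'other': 1}
```

Each request is counted exactly once here, because only the address at the start of the line is examined.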

Submission

Each partner should submit one .py file named log_hunt.py to the submission system at https://archimedes.cs.virginia.edu/cs1110/. Please put both partners' names and IDs in two comments at the top of the file.

You must submit on time! Even if you don't finish, submit what you have.

Solution Code

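import re

# How many times each IP address and each URL appears in the log
ip_addresses = {}
urls = {}

# Request counts for the two wireless networks
wahoo_address = 0
cavalier_address = 0

# Four dot-separated groups of 1-3 digits, e.g. 172.26.30.213
regex = re.compile(r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}")
# An https URL made up of word characters, dots, and slashes
regex_url = re.compile(r"https://[\w./]*")

# The with block closes the file automatically when the loop is done
with open("access.log", "r") as log_file:
    for line in log_file:
        results = regex.findall(line)
        url_results = regex_url.search(line)
        for address in results:
            if address.startswith("172.25."):
                cavalier_address += 1
            elif address.startswith("172.27."):
                wahoo_address += 1

            # Tally every address, whichever network it is on
            ip_addresses[address] = ip_addresses.get(address, 0) + 1
        if url_results is not None:
            # search() stops at the first https URL on the line
            url = url_results.group()
            urls[url] = urls.get(url, 0) + 1

print(ip_addresses)
print(urls)
print(wahoo_address)
print(cavalier_address)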