1 Background

A lot of information is freely available on the web, but not much of it is in a form that computers can easily read. A common use of regular expressions is to parse data out of a human-centric format, such as a web page, into a computer-centric format, such as CSV.

One source of data is the Richmond Times-Dispatch’s summary of Virginia state salaries, obtained using the state’s Freedom of Information statute. To avoid overloading the newspaper’s website with 600 students testing their code, we host a mirror of the 2015–2016 UVA salary data on our servers; have your code access that mirror for this assignment.

http://cs1110.cs.virginia.edu/files/uva2015/

When you visit a page, you see a rendering of the content, but under the hood most web pages are written in a language called HTML. Like Python, this is a text-based representation that describes what the computer is supposed to do. Unlike Python, it is a markup language, not a programming language, meaning it is interested primarily in describing data rather than specifying processes. The nuances of HTML are not important for our task; all you need to do is write regular expressions to find specific information within it.

You can view the HTML of any web page using the view source option of your web browser (typically either via the keyboard shortcut Ctrl+U or Cmd+Opt+U). The source is what urllib.request will retrieve, what your regular expressions will need to look at, and what we discuss below.

2 Task

Write a file called salary.py that contains a single function, get_info, accepting a single string parameter, a person’s name. The function should

  • Find the URL for the name. Normalize the name first: "Teresa Sullivan" and "Sullivan, Teresa" and teresa-sullivan should all turn into the URL http://cs1110.cs.virginia.edu/files/uva2015/teresa-sullivan.

  • Read the contents of the URL into a string.

  • Use regular expressions to find the following:

    • Job title, e.g. the President - University of Virginia, which can be found in multiple locations on the page:

      <meta property="og:description" content="Job title: President - University of Virginia<br /> 2015 total gross pay: $733,800" />

      and

      <span class="small text-muted" id="personjob">President - University of Virginia</span>

      It doesn’t matter where you get it from in your code.

      If the job title contains &amp;, replace it with just &; likewise replace &lt; with < and &gt; with >.

      Note: a third location has the wrong case:

      <title>Teresa Sullivan salary - President - University Of Virginia - 2015_University of Virginia -  - Richmond.com - Richmond Times-Dispatch</title>

      Don’t use that one; we’ll be looking for the version from the meta or span lines, not the title line.

    • Total compensation, converted to a float. Again, this appears in multiple places:

      <meta property="og:description" content="Job title: President - University of Virginia<br /> 2015 total gross pay: $733,800" />

      and

      <h2 class="pay" id="paytotal">$733,800</h2>

      and

      <div style="margin:0; float:left; background:#337ab7; height:100%; width:<%= getPct(paytype.amount, 733800.00) %>%;"></div>
    • Compensation ranking compared to other UVA employees, e.g. the 1 in

      <tr><td>University of Virginia rank</td><td>1 of 7,474</td></tr>

      Turn this into an int (after removing a comma, if any). Not all people will have a rank listed. If there is no rank, do not include 'rank' in the returned dict.

    • Compensation breakdown as a dict with str keys and float values; e.g.

      {'Base salary': 188617.0, 'Additional compensation': 4100.0, 'Non-state salary': 346083.0, 'Deferred compensation': 180000.0}

      based on

      var pay = [ {'name': 'Base salary', 'amount': 188617 }, {'name': 'Additional compensation', 'amount': 4100 }, {'name': 'Non-state salary', 'amount': 346083 }, {'name': 'Deferred compensation', 'amount': 180000 } ];

      Note that the set of keys in the breakdown dict will depend on the employee; for example, 151028368’s line looks like just

      var pay = [ {'name': 'Base salary', 'amount': 41000 } ];

      and would create the dict

      {'Base salary': 41000.0}
  • Return the results in a dict, like so:

    {'title': 'President - University of Virginia', 'pay': 733800.0, 'rank': 1, 'breakdown': {'Base salary': 188617.0, 'Additional compensation': 4100.0, 'Non-state salary': 346083.0, 'Deferred compensation': 180000.0}}

    If the name does not map into a URL we have, instead return an empty dict, {}
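
The small conversions described in the list above (undoing the three HTML escapes and turning a dollar string into a float) can be sketched with plain string methods. This is only one possible approach; the helper names clean_title and money_to_float are hypothetical, and the sample title is made up for illustration.

```python
def clean_title(raw):
    """Undo the three HTML escapes the task mentions."""
    # Replace &amp; last, so that text like "&amp;lt;" becomes "&lt;"
    # rather than being unescaped twice into "<".
    return raw.replace('&lt;', '<').replace('&gt;', '>').replace('&amp;', '&')


def money_to_float(text):
    """Turn a string such as '$733,800' into the float 733800.0."""
    return float(text.replace('$', '').replace(',', ''))


print(clean_title('Research &amp; Development'))  # Research & Development
print(money_to_float('$733,800'))                 # 733800.0
```

The same remove-the-comma idea applies to the rank, with int in place of float.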

3 Example runs

In addition to "Teresa Sullivan", noted above, you should try out

  • an unranked anonymous employee like "151028368",
  • a hyphenated surname like "Polanowska-Grabow, Renata",
  • a multiple-spaces name like "Ali Reza Forghani Esfahani",
  • a job title with a special character like "pamela-neff", and
  • a person not listed like "thomas-jefferson".

In each case, your results should match what you find on the Richmond Times-Dispatch website and our mirror copy of it (except for the person not listed, for whom get_info should return {}).

4 Troubleshooting

You will probably find it easier to create an empty dict first and add new key/value pairs to it as you find them. This helps particularly with optional keys like rank and with the varying set of keys in the breakdown dict.


There are a lot of things going on in this function. Good coding practice would be to write a few extra functions to help out; for example, it might make sense to have a name_to_url(name) function.

Incidentally, the name_to_url process does not need regular expressions; it is enough to

  1. find a comma (using in) and reorder the text if there is one (move what was before the comma to be after it)
  2. make it lower-case
  3. change spaces to hyphens
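
The three steps above might be sketched as follows. This is a minimal illustration rather than the required implementation; BASE_URL is a name introduced here for the mirror address given in the Background section.

```python
# Mirror address from the Background section (constant name is our own).
BASE_URL = 'http://cs1110.cs.virginia.edu/files/uva2015/'


def name_to_url(name):
    """Normalize a name such as 'Sullivan, Teresa' into a mirror URL."""
    name = name.strip()
    # Step 1: if there is a comma, move what was before it to after it.
    if ',' in name:
        last, first = name.split(',', 1)
        name = first.strip() + ' ' + last.strip()
    # Step 2: make it lower-case.
    name = name.lower()
    # Step 3: change spaces to hyphens.
    return BASE_URL + name.replace(' ', '-')


print(name_to_url('Sullivan, Teresa'))
# http://cs1110.cs.virginia.edu/files/uva2015/teresa-sullivan
```

A name that is already in lower-case hyphenated form, such as teresa-sullivan, passes through all three steps unchanged.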

The easiest way to tell if the URL does not exist is to open it in a try: block and handle the failure in an except: block.
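
One way this could look is sketched below. The function name read_page is hypothetical, and the sketch assumes the page is encoded as UTF-8; it catches urllib.error.URLError specifically rather than using a bare except:.

```python
import urllib.error
import urllib.request


def read_page(url):
    """Return the page at url as a str, or None if it cannot be opened."""
    try:
        with urllib.request.urlopen(url) as stream:
            # Assumption: the page is UTF-8 text.
            return stream.read().decode('utf-8')
    except urllib.error.URLError:
        # HTTPError (for example a 404 for a person who is not listed)
        # is a subclass of URLError, so it is handled here too.
        return None
```

With a helper like this, get_info can return {} as soon as read_page gives back None.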


Since the values are identifiable by their surrounding context, try making regular expressions that match that context and keep the answer in a group, such as <td>([0-9]+) of for finding rank (which does not quite work as written…).
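
As an illustration of matching the context and keeping the answer in a group, here is a small sketch that pulls the total pay out of the paytotal line shown in the Task section. The pattern is one possibility among many and assumes the line looks exactly like the sample.

```python
import re

html = '<h2 class="pay" id="paytotal">$733,800</h2>'

# The literal text around the parentheses is the context; the
# parentheses capture only the part we want to keep.
pay_pattern = re.compile(r'id="paytotal">\$([0-9,.]+)<')

match = pay_pattern.search(html)
print(match.group(1))  # 733,800
```

The captured text still contains the comma, so it needs cleaning before it can be converted to a float.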


When search fails to find something, it returns None.
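
A short sketch of checking for None before using a match, combined with the add-keys-as-you-find-them approach, is shown below. It reuses the rank pattern from the hint above unchanged (so it is still incomplete), and the helper name add_rank and the unranked sample line are made up for illustration.

```python
import re

# The pattern from the hint above, unchanged; it is known to be incomplete.
rank_pattern = re.compile(r'<td>([0-9]+) of')

ranked = '<tr><td>University of Virginia rank</td><td>1 of 7,474</td></tr>'
unranked = '<tr><td>No rank row on this page</td></tr>'  # made-up line


def add_rank(result, html):
    """Add a 'rank' key to result only if the pattern is found in html."""
    match = rank_pattern.search(html)
    if match is not None:
        result['rank'] = int(match.group(1))
    return result


print(add_rank({}, ranked))    # {'rank': 1}
print(add_rank({}, unranked))  # {}
```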

For most of the elements of the answer, search is the regular expression method you’re most likely to want. However, the compensation breakdown will probably benefit from a finditer, with a regular expression that matches the key in one group and the value in another group.


Case matters. If the website describes a person’s Base salary, report their Base salary, not their Base Salary.


For the compensation breakdown, try looking for all generic 'name': '...', 'amount': ... strings and adding each one to an initially-empty dict. Using groups and the match.group(number) method will make this much easier than not using those tools.
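
A minimal sketch of that finditer approach, run on the var pay line from the Task section, might look like this. It assumes the spacing and quoting are exactly as in the sample and that amounts are written as plain digits, optionally with a decimal point.

```python
import re

line = ("var pay = [ {'name': 'Base salary', 'amount': 188617 }, "
        "{'name': 'Additional compensation', 'amount': 4100 }, "
        "{'name': 'Non-state salary', 'amount': 346083 }, "
        "{'name': 'Deferred compensation', 'amount': 180000 } ];")

# Group 1 captures the key; group 2 captures the numeric value.
entry = re.compile(r"'name': '([^']+)', 'amount': ([0-9.]+)")

breakdown = {}
for match in entry.finditer(line):
    breakdown[match.group(1)] = float(match.group(2))

print(breakdown)
```

Because finditer yields one match per entry, the same loop handles a page with a single entry and a page with several.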