Lecture 25 - String Processing

Lecture Date: Wednesday, March 23

We are going to look more closely at how to parse text and find the information you want after you download or open a file. Text is often messy: it isn’t nicely laid out like a CSV file, where each data point is cleanly separated from the next. Sometimes you have to hunt through a lot of information to pull out the one nugget you want.

Let’s look through the string API to see what we can find!

Python str API

Python string API

Full Text of the Last Debate of 2012 Presidential Campaign

Functions to know:

  • startswith()
  • endswith()
  • strip(), rstrip(), lstrip()
  • count()
  • find(), rfind()
  • index(), rindex()
  • join()
  • replace()
  • split()
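
As a rough sketch of what the functions above do, here they are applied to a single made-up line. The sample string is an assumption for illustration and is not taken from the debate file.

```python
# Hypothetical sample line (not from the actual debate transcript).
line = "  OBAMA: America is strong, and America remains strong.\n"

clean = line.strip()            # remove whitespace from both ends
left_only = line.lstrip()       # remove leading whitespace only
right_only = line.rstrip()      # remove trailing whitespace only

starts = clean.startswith("OBAMA")    # True
ends = clean.endswith("strong.")      # True

mentions = clean.count("America")     # non-overlapping occurrences
first = clean.find("America")         # index of the first occurrence
last = clean.rfind("America")         # index of the last occurrence
missing = clean.find("Canada")        # find() returns -1 when absent

# index() and rindex() behave like find() and rfind(),
# but raise ValueError instead of returning -1.
try:
    clean.index("Canada")
    index_failed = False
except ValueError:
    index_failed = True

words = clean.split(" ")              # list of words
rejoined = " ".join(words)            # back to a single string
replaced = clean.replace("America", "the U.S.")

print(mentions, first, last, missing, index_failed)
```

Note that find() and index() differ only in how they report failure, so pick find() when "not found" is an expected case and index() when it would be a bug.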
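# Mark Sherriff (mss2x)

speaking = ""      # name of whoever is currently talking
count = 0          # total number of words containing "America"
line_number = 1

# "with" closes the file automatically when the block ends
with open("2012debate.txt", "r") as debate_file:
    for line in debate_file:
        line = line.strip()

        # A line that starts with a speaker's name marks a change of speaker
        if line.startswith("SCHIEFFER"):
            speaking = "SCHIEFFER"
        elif line.startswith("OBAMA"):
            speaking = "OBAMA"
        elif line.startswith("ROMNEY"):
            speaking = "ROMNEY"

        # Report every word on this line that contains "America"
        # (this also matches "American", "Americans", "America's", ...)
        if "America" in line:
            words = line.split(" ")
            for word in words:
                if "America" in word:
                    print(speaking, "said America on line", line_number)
                    count += 1
        line_number += 1

print(count)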
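
One possible variation, not from the lecture: keep a separate tally for each speaker and use count() instead of splitting each line into words. The function name `count_mentions` and the sample lines are made up for illustration. The sketch assumes, as the program above does, that a speaker's turn begins with their name in capitals.

```python
def count_mentions(lines, term="America",
                   speakers=("SCHIEFFER", "OBAMA", "ROMNEY")):
    """Return a dict mapping each speaker to how often they said term."""
    tallies = {name: 0 for name in speakers}
    speaking = ""
    for line in lines:
        line = line.strip()
        # A line starting with a speaker's name switches the current speaker
        for name in speakers:
            if line.startswith(name):
                speaking = name
        # count() tallies substring matches, so "Americans" counts too
        if speaking:
            tallies[speaking] += line.count(term)
    return tallies


# Hypothetical sample lines standing in for the transcript file
sample = [
    "SCHIEFFER: Good evening.\n",
    "OBAMA: America is strong.\n",
    "Americans know that America leads.\n",
    "ROMNEY: I love America.\n",
]
tallies = count_mentions(sample)
print(tallies)
```

Because a file object can be looped over line by line just like a list, the same function should also accept an open file in place of `sample`.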
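text = '<a href="mailto:sherriff@virginia.edu">Email Me!</a>'

# Find the landmarks around the email address
at_sign = text.index('@')                # the @ inside the address
colon = text.index(":")                  # the colon at the end of "mailto:"
end_quote = text.index('"', at_sign)     # first quote after the @ closes the href

# The address sits between the colon and the closing quote
print(text[colon+1:end_quote])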
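
The example above uses index(), which raises a ValueError when the character is not found. As a minimal sketch of an alternative, assuming we would rather get None back for text without a mailto link, the same idea can be written with find(). The function name `extract_mailto` is hypothetical.

```python
def extract_mailto(text):
    """Return the address from the first mailto: link, or None if absent."""
    marker = "mailto:"
    start = text.find(marker)
    if start == -1:
        return None               # no mailto link at all
    start += len(marker)          # skip past "mailto:" itself
    end = text.find('"', start)   # closing quote of the href attribute
    if end == -1:
        return None               # malformed link: no closing quote
    return text[start:end]


link = '<a href="mailto:sherriff@virginia.edu">Email Me!</a>'
print(extract_mailto(link))
print(extract_mailto('<a href="index.html">Home</a>'))
```

Searching for the whole "mailto:" marker rather than the first colon also avoids picking up an earlier colon elsewhere in the text.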