Solution, Part 2: Examining the Names
Now comes a small Python program that takes care of the rest. I'll simply append it and hope the comments are enough to make sense of it:
(I would have liked to post it as a downloadable file, but WordPress won't let me!)
#!/usr/bin/env python2
#A short script (Python 2) that reads a raw list of physicians and
#prints out the ones whose last name has all its characters in
#ascending alphabetical order, e.g.
#John Ace
import string

f = open("/home/martin/physician_list.txt","r")
lines = f.readlines()
f.close()

#We'll examine the file line by line
for line in lines:
    #Step One: Remove all garbage like whitespace and newlines in each line
    clean_one = string.strip(line)

    #Step Two: Remove all additional info in each line that may be added in
    #parentheses (e.g.: John Doe (physician))
    #string.find() returns -1 when nothing is found!
    garbage_start_index = string.find(clean_one, "(")
    if garbage_start_index < 0:
        garbage_start_index = len(clean_one)
    clean_two = clean_one[:garbage_start_index]

    #We will get a bit more whitespace, let's remove it again and make
    #everything lowercase
    clean_three = string.strip(string.lower(clean_two))

    #Now let's get the last word in each string - assuming it's the last name
    #Let's hope that the -1 for unsuccessful rfind doesn't mess things up.
    #NOTE: OF COURSE IT DOES, STOOPID!
    last_name_start_index = string.rfind(clean_three, " ")
    if last_name_start_index < 0:
        last_name_start_index = 0
    clean_four = clean_three[last_name_start_index:]
    clean_four = string.strip(clean_four)

    #Now the final loop that checks for alphabetical order
    #We use the ASCII table for that - the ord() function gives us the
    #ASCII index. The ASCII table is alphabetical by default (the lowercase
    #letters start at 97).
    #What about umlauts?
    #Names containing umlauts cannot be alphabetical, because umlauts are not part
    #of the alphabet!
    #When Python 2 reads the lines, an umlaut shows up as two bytes, something
    #like \xc3\xa4 (UTF-8).
    #Luckily, this means that all names containing umlauts will be discarded
    #automatically, since the second byte is always lower than the first!
    #How does the algorithm work?
    #The idea is to examine the word letter by letter, and when we come to a
    #letter that is lower in the alphabet than the previous one, we flip
    #match=1 to match=0.
    #The first letter has to be an 'a' at least!
    last_character = ord('a')
    match = 1
    for i in range(len(clean_four)):
        #Just in case ord() gives any errors, we use try: except: - but I don't
        #think it's really necessary.
        try:
            current_character = ord(clean_four[i])
            if current_character < last_character:
                match = 0
            last_character = current_character
        except:
            match = 0

    #Anything that still has match==1 must be alphabetical, so we'll display it.
    #Empty lines never enter the loop and would slip through, so we skip them.
    #I use repr('string') for output, because I want to check if there are any
    #weird control characters left in the string.
    if match == 1 and clean_four:
        print repr(clean_one)
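
For comparison, here is a more compact sketch of the same check in Python 3. It is not the script above, just one possible way to write it: the function names `last_name` and `is_alphabetical` and the sample lines are made up for illustration, and it assumes that only the plain letters a-z count as alphabetical, so names with umlauts or hyphens are rejected explicitly instead of falling out through byte order.

```python
def last_name(line):
    """Return the lowercased last word of a line, ignoring anything from '(' on."""
    # partition() hands back the whole string as the head when "(" is absent,
    # so there is no -1 index to trip over
    head = line.partition("(")[0].strip().lower()
    # split() on whitespace; an empty line yields an empty list
    words = head.split()
    return words[-1] if words else ""


def is_alphabetical(name):
    """True if name is non-empty, made only of a-z, and its letters never descend."""
    if not name or not all("a" <= ch <= "z" for ch in name):
        return False
    # Compare each letter with its successor; repeated letters are allowed,
    # just as in the original loop
    return all(a <= b for a, b in zip(name, name[1:]))


if __name__ == "__main__":
    # Hypothetical sample lines standing in for the physician list
    sample = ["John Ace\n", "  Jane Doe (physician)  \n", "Max Müller\n", "\n"]
    for line in sample:
        if is_alphabetical(last_name(line)):
            print(repr(line.strip()))
```

With the sample lines, only the John Ace entry passes the check. Splitting the work into two small functions makes it easy to test the name extraction and the ordering check separately.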