Solution, Part 2: Examining the Names

Posted on 13 Apr 2009
Now comes a small Python program that does the rest. I'll simply attach it and hope that the comments are enough to understand it:
(I would like to post it as a downloadable file, but WordPress won't let me!)
#!/usr/bin/env python

#A short script that reads a raw list of physicians and
#prints out the ones whose last name has all of its
#characters in ascending alphabetical order, e.g.
#John Ace

import string

f = open("/home/martin/physician_list.txt","r")
lines = f.readlines()

#We'll examine the file line by line

for line in lines:
    
    #Step One: Remove all garbage like whitespace and newlines in each line
    clean_one = string.strip(line)

    #Step Two: Remove all additional info in each line that may be added in 
    #parentheses (e.g. John Doe (physician))
    #string.find() returns -1 when nothing is found!
    garbage_start_index = string.find(clean_one, "(")
    if garbage_start_index < 0:
        garbage_start_index = len(clean_one)    
    clean_two = clean_one[:garbage_start_index]

    #This leaves some more whitespace behind, so let's remove it again and make 
    #everything lowercase
    clean_three = string.strip(string.lower(clean_two))

    #Now let's get the last word in each string - assuming it's the last name
    #Let's hope that the -1 for unsuccessful rfind doesn't mess things up.
    #NOTE: OF COURSE IT DOES, STOOPID!
    last_name_start_index = string.rfind(clean_three, " ")
    if last_name_start_index < 0:
        last_name_start_index = 0   
    clean_four = clean_three[last_name_start_index:]
    clean_four = string.strip(clean_four)

    #Now the final loop that checks for alphabetical order
    #We use the ASCII table for that - the ord() function gives us the 
    #ASCII code of a character. The ASCII table is alphabetical by default 
    #(the lowercase letters start at 97).
    #What about umlauts? 
    #Names containing umlauts cannot be alphabetical, because umlauts are not part
    #of the alphabet!
    #When python reads the lines, an umlaut arrives as two bytes like \xc3\xa4 
    #Luckily, this means that all names containing umlauts will be discarded 
    #automatically, since the second byte (\xa4) is lower than the first (\xc3)!
    #How does the algorithm work?
    #The idea is to examine the word letter by letter and, when we come to a 
    #letter that is lower in the alphabet than the previous one, we flip 
    #match=1 to match=0.
    #The first letter has to be an 'a' at least!
    last_character = ord('a')   
    match = 1   
    for i in range(len(clean_four)):
        #Just in case ord() gives any errors, we use try: except: - but I don't 
        #think it's really necessary.   
        try:
            current_character = ord(clean_four[i])          
            if current_character < last_character:
                match = 0
            last_character = current_character
        except:
            match = 0
    
    #Anything that still has match==1 must be alphabetical, so we'll display it.
    #I use repr('string')  for output, because I want to check if there are any
    #weird control characters left in the string.
    if match == 1:
        print repr(clean_one)
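
For comparison, here is a minimal sketch of the same idea for Python 3, where the functions from the string module used above no longer exist. The names extract_last_name and is_alphabetical and the sample lines are my own inventions for illustration, not part of the script above. Two things are assumptions that differ from the original: Python 3 reads umlauts as single characters that sort after 'z', so the sketch explicitly accepts only the letters a to z in order to keep discarding such names, and it also rejects empty lines.

```python
def extract_last_name(line):
    """Return the lowercased last word of a line, ignoring anything from '(' on."""
    # partition() returns the whole string as the head when "(" is absent,
    # so no special case for a "not found" index is needed.
    head = line.partition("(")[0]
    words = head.lower().split()
    return words[-1] if words else ""


def is_alphabetical(name):
    """True if name is non-empty, consists of a-z only, and never descends."""
    if not name or not all("a" <= ch <= "z" for ch in name):
        return False
    # Compare each letter with its successor; equal neighbours are allowed,
    # just as in the original loop.
    return all(a <= b for a, b in zip(name, name[1:]))


if __name__ == "__main__":
    # Hypothetical sample lines standing in for physician_list.txt
    sample = ["John Ace (physician)\n", "Jane Doe\n", "Max Müller\n", "\n"]
    for line in sample:
        if is_alphabetical(extract_last_name(line)):
            print(repr(line.strip()))
```

Splitting on whitespace replaces the rfind() bookkeeping, so the -1 case that tripped me up in the script cannot occur here.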