[ parsing text file without splitting up multi-word names ]

I am trying to use a text file to take the numerical output from an XML file and turn it into a name that is then written to a separate text file. My problem is that the name sometimes has spaces in it, and I'm not sure how to deal with this.

The program accesses several txt and xml files online in which items are referred to by a reference number. It compares info from these files against criteria I set and then adds the reference number to a list. What I am trying to do is convert this reference number to the associated name. I have a text file with the reference numbers and names, and I want to use the index() function to find the reference number and then write the associated name to a text file. I'm having trouble parsing the file so that I can do the conversion. My problem is that the list is written like this:

number    name\n
14        apple\n
27        anjou pear\n
36        asian pear\n
7645      langsat\n

So if I just use .split(), I end up with some of the names being split. I have tried replacing the white space between the numbers and names with a '\n' and splitting on that, but that didn't work either. If I replace the space with ' , ' and split on that, I end up with the names including the next line's number, e.g. ['apple\n15'], which writes to the text file as two lines and leaves me with the same problem of not being able to split it on the white space...

Any advice?

I have now implemented the dict() approach suggested below, which works EXCEPT that I still have the \n at the end of the name...

ttn = dict()
f = open('typeid2.txt', 'r')
for line in f:
    number, name = line.split(None, 1)
    ttn[number] = name

If I call ttn['14'] I get 'apple\n'.

Answer 1


You can use re.findall() for this purpose.

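import re

# Named 'text' rather than 'input' to avoid shadowing the built-in input().
text = "number    name\n14        apple\n27        anjou pear\n36        asian pear\n7645      langsat\n"
# '.' does not match '\n', so each name stops at the end of its line.
print(re.findall(r"(\w+)\s+(.+)", text))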

output:

[('number', 'name'), ('14', 'apple'), ('27', 'anjou pear'), ('36', 'asian pear'), ('7645', 'langsat')]
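
Since the original goal was a lookup from number to name, the list of pairs can be handed straight to dict(). This is only a sketch: the name `lookup` is illustrative, and it assumes the first pair is always the header row and should be dropped.

```python
import re

text = "number    name\n14        apple\n27        anjou pear\n36        asian pear\n7645      langsat\n"

# Each match is a (number, name) tuple; '.' does not match '\n',
# so the name never includes the line ending.
pairs = re.findall(r"(\w+)\s+(.+)", text)

# Assumption: the first pair is the header row ('number', 'name'), so skip it.
lookup = dict(pairs[1:])

print(lookup["27"])
```

The keys stay as strings here, so the lookup is lookup["27"], not lookup[27].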

Answer 2


Regular expressions are very powerful and useful, but it takes a lot of practice to use them with any authority. I would suggest that you stick with split instead. Here is the help text for split, which describes how to use a maxsplit value to limit the number of splits.

Help on built-in function split:

split(...)
S.split([sep [,maxsplit]]) -> list of strings

Return a list of the words in the string S, using sep as the
delimiter string.  If maxsplit is given, at most maxsplit
splits are done. If sep is not specified or is None, any
whitespace string is a separator and empty strings are removed
from the result.
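
To make the effect of maxsplit concrete, here is a small comparison on one line taken from the sample data in the question. The variable names are only illustrative.

```python
line = "27        anjou pear\n"

# Plain split() breaks at every run of whitespace, so the name is split too.
all_parts = line.split()

# With sep=None and maxsplit=1, only the first run of whitespace is used as a
# separator; the rest of the line, including the trailing newline, is kept as is.
two_parts = line.split(None, 1)

print(all_parts)
print(two_parts)
```

The second result still carries the '\n' from the end of the line, so it has to be stripped separately if it is not wanted.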

So for your code, assuming you have some lines to split

mytest = dict()
for each_line in data:
    # maxsplit=1: split only at the first run of whitespace
    number, name = each_line.split(None, 1)
    mytest[number] = name

so that mytest looks something like this:

{'27': 'anjou pear', '7645': 'langsat', 'number': 'name', '36': 'asian pear', '14': 'apple'}

To access the help text, suppose you have some string mystring; then just type

help(mystring.split)

The difference between my first attempt and this one was due to the comment below. In my first attempt the leading spaces on the name value were retained. By passing None as the separator, the whole run of whitespace is consumed at the first split, which gets closer to what you are looking for.
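
As for the trailing '\n' mentioned in the question's edit: one way to drop it is to strip each line before splitting. The following is a rough sketch, not a definitive fix. It uses io.StringIO as a stand-in for open('typeid2.txt') so that it runs on its own, and it assumes the first line of the file is a header that should be skipped.

```python
import io

# Stand-in for open('typeid2.txt'); contents mirror the sample in the question.
f = io.StringIO(
    "number    name\n"
    "14        apple\n"
    "27        anjou pear\n"
    "36        asian pear\n"
    "7645      langsat\n"
)

ttn = {}
next(f)  # assumption: the first line is the 'number name' header
for line in f:
    line = line.strip()  # removes the trailing '\n' and any surrounding spaces
    if not line:
        continue  # ignore blank lines
    number, name = line.split(None, 1)
    ttn[number] = name

print(repr(ttn["14"]))
```

With a real file, replacing the StringIO object with `with open('typeid2.txt') as f:` and indenting the loop under it would also close the file when done.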