[ Pulling text from file with python ]

I have a text file that I am opening in my Python code. I want to search through the file and pull out the text that follows a specific symbol. For instance, my text file, named File.txt, contains:

Hello, this is just a dummy file that has information with no substance at all and I want to pull the information between the dollar sign symbols. So all of this $ in between here should be pulled out so I can do what ever I want to with it $ and the rest of this will be a second group.

Here is a sample of my code:

class FileExtract(object):
    def __init__(self):
        pass

    def extractFile(self):
        file = open("File.txt")
        wholeFile = file.read()
        file.close()
        symCount = wholeFile.count("$")
        count = 0 #counts each $ as it is found
        begin = False #True once a $ has been found and copying should start
        myWant = [] #collects the words I want
        for word in wholeFile.split():
            while(count != symCount):
                if word != "$" and begin == False:
                    break
                if word == "$" and begin == False:
                    myWant.append(word)
                    begin = True
                    count = count + 1 #it found one of the total symbols
                    break
                elif word != "$" and begin == True:
                    myWant.append(word)
                    break
                elif word == "$" and begin == True:
                    begin = False
                    break
        print(myWant)

I would like for it to print:

"$ in between here should be pulled out so I can do what ever I want to with it" 
"$ and the rest of this will be a second group."

This is the only way I can think of to pull the text out (which I know is horrible; please take it easy, I'm just learning). The problem is that my approach puts the words into a list, and I would like it to just print the string with spaces, newlines, and all. Any suggestions, or built-in functions/methods that I am overlooking, that would help me?

Answer 1


s = "Hello, this is just a dummy file that has information with no substance at all and I want to pull the information between the dollar sign symbols. So all of this $ in between here should be pulled out so I can do what ever I want to with it $ and the rest of this will be a second group."

a = s.split("$")[1:]
print(a)

http://ideone.com/tt9np

Of course, the delimiter won't be in the results, but it's trivial to add it back yourself.
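As a rough sketch of putting the delimiter back, each piece after the first can be prefixed with "$". The string here is a shortened stand-in for the file contents, and `groups` is just an illustrative name:

```python
# Shortened stand-in for the contents of File.txt
s = "So all of this $ in between here $ and the rest of this"

# split() drops the "$", so prepend it to every piece after the first one
groups = ["$" + piece for piece in s.split("$")[1:]]

for group in groups:
    print(group)
```

Each element of `groups` is an ordinary string, so spaces and newlines inside a piece are kept as they were in the file.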

Answer 2


Well, you could do wholeFile.split('$') and then have a 3-element list: what's before the first $, the stuff between the $'s, and the stuff after the second $ (with no $'s in any of them).

Or even print('\n$'.join(wholeFile.split('$')))

And as a minimal function:

def extract_file(filename):
    # "with" closes the file even if reading fails
    with open(filename) as f:
        return '\n$'.join(f.read().split('$'))
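To make the split/join idea concrete, here is a small sketch on an in-memory string (the sample text and the names `parts`, `joined` and `groups_only` are made up for illustration). Note that the join keeps the text before the first $ as its own first line:

```python
text = "leading text $ first group $ second group"

# Three pieces: before the first $, between the two $, after the second $
parts = text.split("$")

# Re-join so that every $ starts a new line
joined = "\n$".join(parts)
print(joined)

# To print only the groups, skip parts[0] (the text before the first $)
groups_only = "\n".join("$" + piece for piece in parts[1:])
print(groups_only)
```

If the leading text is not wanted, the second form is probably closer to what the question asks for.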

Answer 3


This is what flex is for, though you don't need flex to do the same thing in Python.

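firstFlag = False   # set once the first $ has been seen
secondFlag = False  # set once the second $ has been seen
outputFile1 = open('first.txt', 'wb')
outputFile2 = open('second.txt', 'wb')
yourFile = open('thefile.txt', 'rb')
while True:
    char = yourFile.read(1)
    if not char:
        break
    # the file is opened in binary mode, so compare against a bytes literal
    if char == b'$':
        if firstFlag:
            secondFlag = True
        firstFlag = True
    if firstFlag and not secondFlag:
        outputFile1.write(char)
    elif secondFlag:
        outputFile2.write(char)
yourFile.close()
outputFile1.close()
outputFile2.close()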

Since this isn't native C code, it won't be exactly fast. I'd suggest you take a look at flex not just for the handy tool, but also for the learning experience.

The above code in flex:

%option 8bit outfile="scanner.c"
%option nounput main noyywrap
%option warn

%x First
%x Second
%%

\$ { BEGIN First; yyout = fopen("first.txt", "wb"); }
. { ECHO; }
<First>\$ { BEGIN Second; fclose(yyout); yyout = fopen("second.txt", "wb");}
<First>. { ECHO; }
<Second>. { ECHO; }

%%

The dot rules echo each character to yyout, which initially points to stdout. To compile this:

flex -Cf scanner.l 
gcc -O -o flexer.exe scanner.c

It will take its input from stdin.
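Back on the Python side, if reading one character at a time turns out to be too slow, one possible shortcut is to read the whole text once and let split do the scanning. This is only a sketch under the assumption that the file fits in memory and is handled as text; `split_groups` is a hypothetical helper name, not something from the code in this answer:

```python
def split_groups(data):
    """Return (first, second): the text from the first $ up to the second $,
    and the text from the second $ to the end (later $ characters included)."""
    parts = data.split("$", 2)  # at most two splits, so at most three pieces
    first = "$" + parts[1] if len(parts) > 1 else ""
    second = "$" + parts[2] if len(parts) > 2 else ""
    return first, second


first, second = split_groups("intro $ group one $ group two $ still group two")
print(first)
print(second)
```

The two returned strings could then be written to first.txt and second.txt in a single write call each.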

Answer 4


It's actually quite simple, without using split or storing the results in a list:

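def extractFile(self):
    file = open("File.txt")
    wholeFile = file.read()
    file.close()

    # find() returns -1 when there is no match; 0 is a valid position
    pos = wholeFile.find("$")
    while pos != -1:
        # search for the next $ after the current one
        pos2 = wholeFile.find("$", pos + 1)

        if pos2 != -1:
            print(wholeFile[pos:pos2])
        else:
            print(wholeFile[pos:])
        pos = pos2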
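
For trying the same find-based idea without a file or a class, a standalone sketch could look like the following. It assumes the text is already in a string; `iter_groups`, `symbol` and `sample` are illustrative names. The key detail is passing a start offset to find so that each search moves past the $ just found:

```python
def iter_groups(text, symbol="$"):
    """Yield each piece of text that starts at a symbol and runs to the next one."""
    pos = text.find(symbol)
    while pos != -1:  # find() returns -1 when nothing is left to match
        next_pos = text.find(symbol, pos + 1)
        if next_pos != -1:
            yield text[pos:next_pos]
        else:
            yield text[pos:]
        pos = next_pos


sample = "skip me $ first\ngroup $ second group"
for group in iter_groups(sample):
    print(group)
```

Because the pieces are slices of the original string, any spaces and newlines inside a group come out unchanged.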