TAGS :Viewed: 22 - Published at: a few seconds ago

[ Remove all white spaces inside specific delimiters ]

I'm trying to process a xml file containing wrongly formed elements.

A wrongly formed elemement is one which doesn't respect the following pattern : <name attribute1=value1 attribute2=value2 ... attributeN=valueN>

There can be 0 to n attributes.

As a consequence, <my element number> is invalid, while <my element=number> is not.

Here is a sample of my text :

&lt;product_name&gt;
    A high wind in Jamaica &lt;The innocent voyage&gt;  The modern library of the world s best books   Books  Richard Arthur Warren Hughes
&lt;/product_name&gt;

Here, <product_name> is a good element, while <The innocent voyage> is not.

When an incorrect element is spotted, I would like to have the <> replaced with neutral characters, such as +.

Since the file containing these tags is pretty big (1.5 GB), I would rather not use a brute force approach.

Would you guys see an fast (and if possible, elegant) way to solve this problem ?

Answer 1


As you state that you would rather stay away from regex, I was able to create the following code that doesn't use regex (although I'm sure regex would be quite useful)

def valid_tag(tag):
    temp = tag.split()
    for word in temp[1:]:
        if "=" not in word:
            return False
    return True

Here you pass in a tag as a string as the parameter. For example: "<hello test=test>"

You can run this test on each tag by creating another method for getting a tag by finding a "<" and then the first ">" that follows and creating a substring from that which will be the tag that you pass into this method.

NOTE: This assumes that your tags are written as follows: <hello test=test> and not < hello test = test >

This method is still very primitive and makes a few assumptions as I stated above but hopefully it will give you the start you need.