[ Remove all white spaces inside specific delimiters ]
I'm trying to process a xml file containing wrongly formed elements.
A wrongly formed elemement is one which doesn't respect the following pattern : <name attribute1=value1 attribute2=value2 ... attributeN=valueN>
There can be 0 to n attributes.
As a consequence, <my element number>
is invalid, while <my element=number>
is not.
Here is a sample of my text :
<product_name>
A high wind in Jamaica <The innocent voyage> The modern library of the world s best books Books Richard Arthur Warren Hughes
</product_name>
Here, <product_name>
is a good element, while <The innocent voyage>
is not.
When an incorrect element is spotted, I would like to have the <>
replaced with neutral characters, such as +
.
Since the file containing these tags is pretty big (1.5 GB), I would rather not use a brute force approach.
Would you guys see an fast (and if possible, elegant) way to solve this problem ?
Answer 1
As you state that you would rather stay away from regex
, I was able to create the following code that doesn't use regex
(although I'm sure regex
would be quite useful)
def valid_tag(tag):
temp = tag.split()
for word in temp[1:]:
if "=" not in word:
return False
return True
Here you pass in a tag as a string as the parameter. For example: "<hello test=test>"
You can run this test on each tag by creating another method for getting a tag by finding a "<"
and then the first ">"
that follows and creating a substring from that which will be the tag that you pass into this method.
NOTE: This assumes that your tags are written as follows: <hello test=test>
and not < hello test = test >
This method is still very primitive and makes a few assumptions as I stated above but hopefully it will give you the start you need.