TAGS :Viewed: 14 - Published at: a few seconds ago

[ Pull out content between two strings including strings ]

I am trying to do the following in python.

I have a file with the following content...

<VirtualHost>
  ServerName blah.com
  DocumentRoot /var/www/blah.com
</Virtualhost>

<VirtualHost>
  ServerName blah2.com
  DocumentRoot /var/www/blah2.com
</Virtualhost>

... etc

I want to take each of these virtual host containers and place them in a separate file(or variable and I can work from there)...

I have been able to take the data in between the strings but not including them. So the output would be ...

<VirtualHost>
  ServerName blah2.com
  DocumentRoot /var/www/blah2.com
</Virtualhost>

...iterated through each container and not...
ServerName blah2.com
DocumentRoot /var/www/blah2.com

Please let me know if this is something that can easily be done. Thanks!

Answer 1


A findall regex might work:

import re

d = """
<VirtualHost>
  ServerName blah.com
  DocumentRoot /var/www/blah.com
</Virtualhost>
<VirtualHost>
  ServerName blah2.com
  DocumentRoot /var/www/blah2.com
</Virtualhost>
"""

matches = re.findall(r'<VirtualHost>(.*?)</Virtualhost>', d, re.I|re.DOTALL)

#['\n  ServerName blah.com\n  DocumentRoot /var/www/blah.com\n',
# '\n  ServerName blah2.com\n  DocumentRoot /var/www/blah2.com\n']

Or to include the <VirtualHost> parts:

matches = re.findall(r'<VirtualHost>.*?</Virtualhost>', d, re.I|re.DOTALL)

#['<VirtualHost>\n  ServerName blah.com\n  DocumentRoot /var/www/blah.com\n</Virtualhost>',
# '<VirtualHost>\n  ServerName blah2.com\n  DocumentRoot /var/www/blah2.com\n</Virtualhost>']

Answer 2


Assuming that your input data is in XML, you can use minidom (as suggested by @Aesthete), or ElementTree:

import xml.dom.minidom as MD
import xml.etree.ElementTree as ET

input = """
<Document>
    <VirtualHost>
        ServerName blah.com
        DocumentRoot /var/www/blah.com
    </VirtualHost>
    <VirtualHost>
        ServerName blah2.com
        DocumentRoot /var/www/blah2.com
    </VirtualHost>
</Document>"""

domDoc = MD.parseString(input)
etreeDoc = ET.fromstring(input)

# list for Python 3.x
miniDomOutput = list(map(lambda f: f.toxml(), domDoc.getElementsByTagName('VirtualHost')))
elementTreeOutput = list(map(lambda f: ET.tostring(f), etreeDoc.findall('VirtualHost')))

print(miniDomOutput)
print(elementTreeOutput)

Outputs:

#['<VirtualHost>\n        ServerName blah.com\n        DocumentRoot /var/www/blah.com\n    </VirtualHost>', '<VirtualHost>\n        ServerName blah2.com\n        DocumentRoot /var/www/blah2.com\n    </VirtualHost>']
#[b'<VirtualHost>\n        ServerName blah.com\n        DocumentRoot /var/www/blah.com\n    </VirtualHost>\n    ', b'<VirtualHost>\n        ServerName blah2.com\n        DocumentRoot /var/www/blah2.com\n    </VirtualHost>\n']