[ Pull out content between two strings including strings ]
I am trying to do the following in Python.
I have a file with the following content:
<VirtualHost>
ServerName blah.com
DocumentRoot /var/www/blah.com
</Virtualhost>
<VirtualHost>
ServerName blah2.com
DocumentRoot /var/www/blah2.com
</Virtualhost>
... etc
I want to take each of these virtual host containers and place it in a separate file (or in a variable, and I can work from there).
I have been able to extract the data between the two strings, but not the strings themselves. The output I want is:
<VirtualHost>
ServerName blah2.com
DocumentRoot /var/www/blah2.com
</Virtualhost>
...for each container in turn, and not just:
ServerName blah2.com
DocumentRoot /var/www/blah2.com
Please let me know if this is something that can easily be done. Thanks!
Answer 1
A findall regex might work:
import re
d = """
<VirtualHost>
ServerName blah.com
DocumentRoot /var/www/blah.com
</Virtualhost>
<VirtualHost>
ServerName blah2.com
DocumentRoot /var/www/blah2.com
</Virtualhost>
"""
matches = re.findall(r'<VirtualHost>(.*?)</Virtualhost>', d, re.I|re.DOTALL)
#['\n ServerName blah.com\n DocumentRoot /var/www/blah.com\n',
# '\n ServerName blah2.com\n DocumentRoot /var/www/blah2.com\n']
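Two details in the pattern above carry most of the weight: the non-greedy `.*?` and the `re.DOTALL` flag. The following is a minimal sketch of what happens when either one is dropped; the variable names (`text`, `lazy`, `greedy`, `no_dotall`) are made up for illustration and the sample data is shortened.

```python
import re

# Shortened sample in the same shape as the question's file.
text = (
    "<VirtualHost>\nServerName blah.com\n</Virtualhost>\n"
    "<VirtualHost>\nServerName blah2.com\n</Virtualhost>\n"
)

pattern_lazy = r'<VirtualHost>.*?</VirtualHost>'
pattern_greedy = r'<VirtualHost>.*</VirtualHost>'

# Non-greedy + DOTALL: stops at the first closing tag, so one match per container.
lazy = re.findall(pattern_lazy, text, re.I | re.DOTALL)

# Greedy + DOTALL: runs on to the last closing tag, so everything is one match.
greedy = re.findall(pattern_greedy, text, re.I | re.DOTALL)

# Without DOTALL, "." does not match newlines, so multi-line containers never match.
no_dotall = re.findall(pattern_lazy, text, re.I)

print(len(lazy), len(greedy), len(no_dotall))
```

`re.I` matters here as well, because the sample closes each container with `</Virtualhost>` (lowercase "h") while the pattern spells it `</VirtualHost>`.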
Or to include the <VirtualHost> parts:
matches = re.findall(r'<VirtualHost>.*?</Virtualhost>', d, re.I|re.DOTALL)
#['<VirtualHost>\n ServerName blah.com\n DocumentRoot /var/www/blah.com\n</Virtualhost>',
# '<VirtualHost>\n ServerName blah2.com\n DocumentRoot /var/www/blah2.com\n</Virtualhost>']
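Since the question asks for each container in its own file, here is one possible sketch built on the same `findall` pattern. The function name `split_vhosts`, the `.conf` suffix, and naming each file after its `ServerName` value are all assumptions made for illustration, not something the question specifies. It also assumes the `ServerName` values are safe to use as file names.

```python
import re
import tempfile
from pathlib import Path

# Same pattern as above: whole container, tags included.
BLOCK_RE = re.compile(r'<VirtualHost>.*?</VirtualHost>', re.I | re.DOTALL)
# Hypothetical helper pattern: picks the host name out of a container.
NAME_RE = re.compile(r'^\s*ServerName\s+(\S+)', re.I | re.M)

def split_vhosts(config_text, out_dir):
    """Write each <VirtualHost> container to its own file; return the paths."""
    out_dir = Path(out_dir)
    written = []
    for index, block in enumerate(BLOCK_RE.findall(config_text), start=1):
        name_match = NAME_RE.search(block)
        # Fall back to a numbered name when a container has no ServerName line.
        stem = name_match.group(1) if name_match else "vhost{}".format(index)
        path = out_dir / (stem + ".conf")
        path.write_text(block + "\n")
        written.append(path)
    return written

sample = """
<VirtualHost>
ServerName blah.com
DocumentRoot /var/www/blah.com
</Virtualhost>
<VirtualHost>
ServerName blah2.com
DocumentRoot /var/www/blah2.com
</Virtualhost>
"""

# A temporary directory keeps the example from touching real config paths.
out_dir = tempfile.mkdtemp()
paths = split_vhosts(sample, out_dir)
print([p.name for p in paths])
```

To work from variables instead of files, the list returned by `BLOCK_RE.findall(...)` can be used directly.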
Answer 2
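Assuming that your input data is well-formed XML (a single root element, and closing tags that match the opening tags exactly, including case), you can use minidom (as suggested by @Aesthete) or ElementTree: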
import xml.dom.minidom as MD
import xml.etree.ElementTree as ET
# Named xml_data rather than input, which would shadow the built-in function.
xml_data = """
<Document>
<VirtualHost>
ServerName blah.com
DocumentRoot /var/www/blah.com
</VirtualHost>
<VirtualHost>
ServerName blah2.com
DocumentRoot /var/www/blah2.com
</VirtualHost>
</Document>"""
domDoc = MD.parseString(xml_data)
etreeDoc = ET.fromstring(xml_data)
# Serialize each <VirtualHost> element, tags included.
miniDomOutput = [node.toxml() for node in domDoc.getElementsByTagName('VirtualHost')]
# ET.tostring() returns bytes on Python 3 unless encoding='unicode' is passed.
elementTreeOutput = [ET.tostring(elem) for elem in etreeDoc.findall('VirtualHost')]
print(miniDomOutput)
print(elementTreeOutput)
Outputs:
#['<VirtualHost>\n ServerName blah.com\n DocumentRoot /var/www/blah.com\n </VirtualHost>', '<VirtualHost>\n ServerName blah2.com\n DocumentRoot /var/www/blah2.com\n </VirtualHost>']
#[b'<VirtualHost>\n ServerName blah.com\n DocumentRoot /var/www/blah.com\n </VirtualHost>\n ', b'<VirtualHost>\n ServerName blah2.com\n DocumentRoot /var/www/blah2.com\n </VirtualHost>\n']
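One caveat with the XML approach: the file in the question has no root element and closes each container with `</Virtualhost>`, so an XML parser would reject it as posted. Real Apache configuration files also tend to use forms such as `<VirtualHost *:80>`, which are not XML either. The sketch below, with a hypothetical helper `extract_virtual_hosts`, wraps a fragment in a root element (the `Document` name is borrowed from the sample above) and returns `None` when the parser rejects the input, so the caller can fall back to the regex approach.

```python
import xml.etree.ElementTree as ET

def extract_virtual_hosts(fragment):
    """Return each <VirtualHost> element as a string, or None if parsing fails."""
    # XML needs exactly one root element, so wrap the bare containers.
    wrapped = "<Document>{}</Document>".format(fragment)
    try:
        root = ET.fromstring(wrapped)
    except ET.ParseError:
        # Raised, for example, when a closing tag differs in case from its opener.
        return None
    # encoding="unicode" yields str instead of bytes; strip() drops trailing whitespace.
    return [
        ET.tostring(elem, encoding="unicode").strip()
        for elem in root.findall("VirtualHost")
    ]

well_formed = (
    "<VirtualHost>\nServerName blah.com\n</VirtualHost>\n"
    "<VirtualHost>\nServerName blah2.com\n</VirtualHost>\n"
)
# Same shape as the question's data: lowercase "h" in the closing tag.
mismatched = "<VirtualHost>\nServerName blah.com\n</Virtualhost>\n"

print(extract_virtual_hosts(well_formed))
print(extract_virtual_hosts(mismatched))
```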