[ Extract HTML Links using Python ]
I am trying to extract the iframe src for a set of sites using Python. For instance, my input would be A.com, B.com, C.com, and if each of these sites has an iframe linking to D.com, E.com, F.com respectively ('None' if the site does not have an iframe), then I would like the output to be something of the form:
Site   Iframe Src
A.com  D.com
B.com  E.com
C.com  F.com
Currently, I have something like this:
from collections import defaultdict
import urllib2
import re

def PrintLinks(website):
    counter = 0
    regexp_link = r'''<iframe src=((http|ftp)s?://.*?)'''
    pattern = re.compile(regexp_link)
    links = [None] * len(website)
    for x in website:
        html_page = urllib2.urlopen(website[counter])
        html = html_page.read()
        links[counter] = re.findall(pattern, html)
        counter += 1
    return links

def main():
    website = ["A.com", "B.com", "C.com"]
Is this the best way to do it, and how would I get the output in the format I would like? Thanks!
Answer 1
You don't need to reinvent the wheel with regex; there are excellent Python packages that do this for you, the most famous probably being BeautifulSoup.
Install BeautifulSoup and httplib2 with pip, and try this:
import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

sites = ['http://www.site1.com', 'http://www.site2.com', 'http://www.site3.com']
http = httplib2.Http()
for site in sites:
    status, response = http.request(site)
    for iframe in BeautifulSoup(response, parseOnlyThese=SoupStrainer('iframe')):
        print site + ' ' + iframe['src']
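To get the exact tabular output asked for in the question, including 'None' for sites without an iframe, the parsing step can be sketched with only the standard library. This is a Python 3 sketch (the thread's code is Python 2), and the sample HTML snippets are hypothetical stand-ins for the pages you would actually fetch with urllib or httplib2:

```python
from html.parser import HTMLParser

class IframeSrcParser(HTMLParser):
    """Collects the src attribute of every <iframe> tag it sees."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)

def first_iframe_src(html):
    """Return the first iframe src in the page, or None if there is none."""
    parser = IframeSrcParser()
    parser.feed(html)
    return parser.srcs[0] if parser.srcs else None

# Hypothetical pages standing in for fetched HTML responses.
pages = {
    "A.com": '<html><body><iframe src="D.com"></iframe></body></html>',
    "B.com": '<html><body><iframe src="E.com"></iframe></body></html>',
    "C.com": '<html><body><p>no iframe here</p></body></html>',
}

print("Site   Iframe Src")
for site, html in pages.items():
    print(site, " ", first_iframe_src(html))
```

Note that if you use the modern BeautifulSoup 4 package instead, the import is `from bs4 import BeautifulSoup` and the strainer keyword is `parse_only` rather than `parseOnlyThese`.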