[ Extract HTML Links using Python ]

I am trying to extract the iframe src from a given set of sites using Python. For instance, my input would be A.com, B.com, C.com, and if these sites have iframes that link to D.com, E.com, F.com respectively ('None' if a site does not have an iframe), then I would like the output to look like this:

Site    Iframe Src
A.com    D.com
B.com    E.com
C.com    F.com

Currently, I have something like this:

import re
import urllib2

def PrintLinks(website):
    # Match the src attribute of each <iframe> tag, quoted or unquoted.
    regexp_link = r'''<iframe[^>]*?\ssrc\s*=\s*["']?((?:http|ftp)s?://[^"'\s>]+)'''
    pattern = re.compile(regexp_link, re.IGNORECASE)
    links = []
    for url in website:
        html = urllib2.urlopen(url).read()
        # One list of iframe URLs per site (empty if the site has none).
        links.append(pattern.findall(html))
    return links

def main():
    website = ["A.com", "B.com", "C.com"]
    links = PrintLinks(website)

Is this the best way to do it, and how would I get the output into the format I would like? Thanks!

Answer 1


You don't need to reinvent the wheel with regex. There are excellent Python packages that do this for you, the best known probably being BeautifulSoup.

Install BeautifulSoup and httplib2 with pip, and try this:


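import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

sites = ['http://www.site1.com', 'http://www.site2.com', 'http://www.site3.com']
http = httplib2.Http()

for site in sites:
    # http.request returns a (response headers, body) pair.
    response, content = http.request(site)
    # SoupStrainer restricts parsing to <iframe> tags only.
    for iframe in BeautifulSoup(content, parseOnlyThese=SoupStrainer('iframe')):
        # .get avoids a KeyError when an iframe has no src attribute.
        print site, iframe.get('src')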
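
The loop above prints one line per iframe, but it prints nothing for a site without an iframe and does not add the header row the question asks for. Here is a rough sketch of how that table could be built. It is not the approach from the answer: it assumes Python 3 and uses only the standard library (html.parser and urllib.request) instead of BeautifulSoup and httplib2. The names IframeSrcParser, iframe_sources, collect_rows, format_table and fetch_html are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class IframeSrcParser(HTMLParser):
    """Collects the src attribute of every <iframe> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us lowercased tag and attribute names.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_sources(html):
    """Return the list of iframe src values found in an HTML string."""
    parser = IframeSrcParser()
    parser.feed(html)
    parser.close()
    return parser.sources


def collect_rows(sites, fetch):
    """Build one row per iframe, or a single 'None' row for a site without any.

    `fetch` is any callable that maps a URL to its HTML text (hypothetical
    hook, so the parsing can be tried without network access).
    """
    rows = []
    for site in sites:
        sources = iframe_sources(fetch(site))
        if not sources:
            rows.append((site, "None"))
        for src in sources:
            rows.append((site, src))
    return rows


def format_table(rows):
    """Render (site, src) pairs as the two-column layout from the question."""
    lines = ["Site    Iframe Src"]
    for site, src in rows:
        lines.append("%s    %s" % (site, src))
    return "\n".join(lines)


def fetch_html(url):
    """Download a page and decode it leniently (assumes UTF-8 content)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

With a list of full URLs in sites, something like print(format_table(collect_rows(sites, fetch_html))) should give the requested layout. Passing the fetch function in as an argument is just a convenience, so the parsing can be tried on HTML strings without hitting the network.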