python - Scrape Alexa and show results in a table in Django
I want to create a simple (one-page) web application using Django that shows the top 20 websites from alexa.com/topsites/global. The page should render a table with 21 rows (1 header and 20 websites) and 3 columns (rank, website, and description).
My knowledge of Django is limited, and I need some help if possible.
I've used a template to create the table using Bootstrap, but I don't have any idea how to parse the rank, website name, and description.
Could anyone lead me in the right direction with some useful websites or code snippets?
I know that I have to use HTMLParser and implement something like:
    from HTMLParser import HTMLParser

    # create a subclass and override the handler methods
    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print "Encountered a start tag:", tag
        def handle_endtag(self, tag):
            print "Encountered an end tag :", tag
        def handle_data(self, data):
            print "Encountered some data  :", data

    # instantiate the parser and feed it some HTML
    parser = MyHTMLParser()
    parser.feed('<html><head><title>Test</title></head>'
                '<body><h1>Parse me!</h1></body></html>')
But I don't know how to use it for the requirements of my application.
So, I'm coming back with an update. I've tried this (just printing the results to see if I get what I want), but I only get the links.
Any help?
    import urllib2, HTMLParser

    class MyHTMLParser(HTMLParser.HTMLParser):
        def reset(self):
            HTMLParser.HTMLParser.reset(self)
            # count div is the rank of the website
            self.in_count_div = False
            # description div is the description of the website
            self.in_description_div = False
            # a tag is the url
            self.in_link_a = False

            self.count_items = None
            self.a_link_items = None
            self.description_items = None

        def handle_starttag(self, tag, attrs):
            if tag == 'div':
                if ('class', 'count') in attrs:
                    self.in_count_div = True

            if tag == 'a':
                for name, value in attrs:
                    if name == 'href':
                        self.a_link_items = [value, '']
                        self.in_link_a = True
                        break

            if tag == 'div':
                if ('class', 'description') in attrs:
                    self.in_description_div = True

        # handle data for each section
        def handle_data_count(self, data):
            if self.in_count_div:
                self.count_items[1] += data

        def handle_data_url(self, data):
            if self.in_link_a:
                self.a_link_items[1] += data

        def handle_data_description(self, data):
            if self.in_description_div:
                self.description_items[1] += data

        # endtag
        def handle_endtag(self, tag):
            if tag == 'div':
                if self.count_items is not None:
                    print self.count_items
                self.count_items = None
                self.in_count_div = False

            if tag == 'a':
                if self.a_link_items is not None:
                    print self.a_link_items
                self.a_link_items = None
                self.in_link_a = False

    if __name__ == '__main__':
        myhtml = MyHTMLParser()
        myhtml.feed(urllib2.urlopen('http://www.alexa.com/topsites/global').read())
If you want an API, there is one from Alexa here.
If you want to scrape, I'd suggest BeautifulSoup
(Scrapy is too heavy, since the only thing you'll be doing is reading one URL).
Doing it is simple:
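- Make a Python module that deals with pulling the data from the Alexa link using BeautifulSoup. In that module, make something that runs the task every 5 minutes (or whatever time span your application is efficient with) and saves the results to the database.
- To display the data, retrieve it from the database and pass it to the template in a template variable. The HTML should look like this (don't use tables):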
    <table>
      {% for site in top_20_sites %}
      <tr>
        <td>{{site.rank}}</td>
        <td>{{site.name}}</td>
        <td>{{site.description}}</td>
      </tr>
      {% endfor %}
    </table>
As for how to scrape, see this awesome tutorial here.
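As a rough sketch of the parsing step only: the code below assumes each rank sits in a `div` with class `count` and each description in a `div` with class `description` (the class names used in the question; the real page structure may differ). It uses the Python 3 standard-library `html.parser` instead of BeautifulSoup so that it runs without extra packages. `TopSitesParser`, `parse_top_sites`, and `SAMPLE_HTML` are hypothetical names, and the sample markup is made up for illustration.

```python
from html.parser import HTMLParser

# Made-up markup for illustration; the real page layout is an assumption.
SAMPLE_HTML = """
<a href="/">Home</a>
<ul>
  <li><div class="count">1</div>
      <a href="/siteinfo/example.com">Example.com</a>
      <div class="description">An example site.</div></li>
  <li><div class="count">2</div>
      <a href="/siteinfo/example.org">Example.org</a>
      <div class="description">Another <a href="#">example</a> site.</div></li>
</ul>
"""


class TopSitesParser(HTMLParser):
    """Collect rank, name and description for each listed site."""

    # Tag whose end marks the end of each field.
    CLOSING_TAG = {"rank": "div", "name": "a", "description": "div"}

    def __init__(self):
        super().__init__()
        self.sites = []      # one dict per site, in page order
        self._field = None   # field currently being filled, if any

    def handle_starttag(self, tag, attrs):
        if self._field is not None:
            return  # already inside a field; ignore nested tags
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "count" in classes:
            # Assumption: a rank div opens each new site entry.
            self.sites.append({"rank": "", "name": "", "description": ""})
            self._field = "rank"
        elif not self.sites:
            return  # skip anything that appears before the first entry
        elif tag == "a" and not self.sites[-1]["name"]:
            self._field = "name"
        elif tag == "div" and "description" in classes:
            self._field = "description"

    def handle_data(self, data):
        # Text is only kept while a field is open.
        if self._field is not None:
            self.sites[-1][self._field] += data

    def handle_endtag(self, tag):
        if self._field is not None and tag == self.CLOSING_TAG[self._field]:
            site = self.sites[-1]
            site[self._field] = site[self._field].strip()
            self._field = None


def parse_top_sites(html, limit=20):
    """Return up to `limit` dicts with rank, name and description keys."""
    parser = TopSitesParser()
    parser.feed(html)
    parser.close()
    return parser.sites[:limit]
```

The list returned by `parse_top_sites` has the `rank`, `name`, and `description` keys that the template above reads, so it could be saved to the database or passed to the template as `top_20_sites`. A `div` nested inside a description would end that field early, so treat this as a starting point rather than a robust scraper.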