python - Scrape alexa and show results in table in django -

- May 15, 2010

i want create simple (one page) web application using django, , see top 20 websites alexa.com/topsites/global. page should render table 21 rows (1 header , 20 websites) , 3 columns (rank, website , description).

my knowledge using django limitted , need if possible.

i've used template create table using bootstrap don't have idea on how parse: rank / website name / , description.

could lead me in right direction usefull websites / code snippets ?

i know have use htmlparser , implement like:

from htmlparser import htmlparser  # create subclass , override handler methods class myhtmlparser(htmlparser):     def handle_starttag(self, tag, attrs):         print "encountered start tag:", tag     def handle_endtag(self, tag):         print "encountered end tag :", tag     def handle_data(self, data):         print "encountered data  :", data  # instantiate parser , fed html parser = myhtmlparser() parser.feed('<html><head><title>test</title></head>'             '<body><h1>parse me!</h1></body></html>')

but don't know how use on requirements in application.

so, comming update. i've tried (just print results see if want) links.

any ?

import urllib2, htmlparser  class myhtmlparser(htmlparser.htmlparser):     def reset(self):         htmlparser.htmlparser.reset(self)         #count div rank of website         self.in_count_div = false         #description div description of website         self.in_description_div = false         #a tag url         self.in_link_a = false          self.count_items = none         self.a_link_items = none         self.description_items = none      def handle_starttag(self, tag, attrs):         if tag == 'div':             if('class', 'count') in attrs:                 self.in_count_div = true          if tag == 'a':             name, value in attrs:                 if name == 'href':                     self.a_link_items = [value,'']                     self.in_link_a = true                     break          if tag == 'div':             if('class', 'description') in attrs:                 self.in_description_div = true      #handle data each section     def handle_data_count(self, data):         if self.in_count_div:             self.count_items[1] += data      def handle_data_url(self, data):         if self.in_link_a:             self.a_link_items[1] += data      def handle_data_description(self, data):         if self.in_description_div:             self.description_items[1] += data      #endtag     def handle_endtag(self, tag):         if tag =='div':             if self.count_items not none:                 print self.count_items             self.count_items = none             self.in_count_div = false          if tag =='a':             if self.a_link_items not none:                 print self.a_link_items             self.a_link_items = none             self.in_link_a = false   if __name__ == '__main__':     myhtml = myhtmlparser()     myhtml.feed(urllib2.urlopen('http://www.alexa.com/topsites/global').read())

if want api there 1 alexa here

if want scrape, i'd suggest beautifulsoup
(scrapy heavy since thing you'll doing reading 1 url.)

doing simple:

make python module deals pulling data alexa link using beautifulsoup, in module make runs task every 5 minutes or time span application efficient with, save database.
to display data retrieve database pass template in template variable, , html should (don't use tables):

<table>     {% site in top_20_sites %}     <tr>         <td>{{site.rank}}</td>         <td>{{site.name}}</td>         <td>{{site.description}}</td>     <\tr>     {% endfor %} </table>

as how scrape see awesome tutorial here

Search This Blog

Unity

python - Scrape alexa and show results in table in django -

Comments

Post a Comment

Popular posts from this blog

angularjs - Showing an empty as first option in select tag -

qt - Change color of QGraphicsView rubber band -

string - Writing a Java program to encrypt and decrypt a ADFGVX cipher -