python - Best practise to apply several rules on 1 string

- June 15, 2011

I'm getting a URL string and need to apply several rules to it. The first rule is to remove anchors, the next is to remove the '../' notation (because urljoin joins URLs incorrectly in some cases), and the last is to remove the trailing slash. I have this code:

from urlparse import urljoin

def construct_url(parent_url, child_url):
    url = urljoin(parent_url, child_url)
    url = url.split('#')[0]
    url = url.replace('../', '')
    url = url.rstrip('/')
    return url

But I don't think this is best practice. I think it can be done more simply. Could you help me, please? Thanks.

Unfortunately, there isn't much you can do to make your function simpler here, since you're dealing with pretty odd cases.

But you can make it more robust by using Python's urlparse.urlsplit() to split the URL into well-defined components, doing your processing on those, and then putting it back together using urlparse.urlunsplit():

from urlparse import urljoin
from urlparse import urlsplit
from urlparse import urlunsplit

def construct_url(parent_url, child_url):
    url = urljoin(parent_url, child_url)
    scheme, netloc, path, query, fragment = urlsplit(url)
    path = path.replace('../', '')
    path = path.rstrip('/')
    url = urlunsplit((scheme, netloc, path, query, ''))
    return url

parent_url = 'http://user:pw@google.com'
child_url = '../../../chrome/#foo'

print construct_url(parent_url, child_url)

Output:

http://user:pw@google.com/chrome
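On Python 3 the urlparse module no longer exists under that name; its functions live in urllib.parse. Here is a minimal sketch of the same function there, assuming the same example inputs as above. The logic is unchanged; only the imports and the print call differ. Newer Python 3 releases of urljoin may already discard surplus '..' segments, in which case the replace step simply has nothing left to do.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def construct_url(parent_url, child_url):
    """Join two URLs, then drop the fragment, '../' runs and trailing slash."""
    url = urljoin(parent_url, child_url)
    scheme, netloc, path, query, _fragment = urlsplit(url)
    path = path.replace('../', '')   # remove any leftover parent references
    path = path.rstrip('/')          # remove trailing slash(es)
    # Passing an empty fragment drops the '#...' anchor from the result.
    return urlunsplit((scheme, netloc, path, query, ''))


print(construct_url('http://user:pw@google.com', '../../../chrome/#foo'))
```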

Using tools like urlparse has the advantage that you know exactly what your processing operates on (the path and the fragment in this case), and it handles things like user credentials, query strings, parameters etc. for you.
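To illustrate the point about knowing what the processing operates on, here is a small sketch comparing the two approaches on a URL whose query string happens to contain '../'. The example URL and the helper names strip_whole_string and strip_path_only are made up for this illustration, and the imports assume Python 3's urllib.parse.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_whole_string(url):
    """String-level cleanup, applied to the entire URL at once."""
    url = url.split('#')[0]
    url = url.replace('../', '')
    return url.rstrip('/')


def strip_path_only(url):
    """Component-level cleanup: only the path and the fragment are touched."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    path = path.replace('../', '').rstrip('/')
    return urlunsplit((scheme, netloc, path, query, ''))


url = 'http://user:pw@example.com/../docs/?next=../home#top'
print(strip_whole_string(url))  # the query value is altered as a side effect
print(strip_path_only(url))     # the query string is left as it was
```

In the string-level version the '../' inside the query is removed too, and the slash before '?' survives because it is not at the end of the string. The component-level version avoids both surprises.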

Note: Contrary to what was suggested in the comments, urljoin does in fact normalize URLs:

>>> from urlparse import urljoin
>>> urljoin('http://google.com/foo/bar', '../qux')
'http://google.com/qux'

But it does so by strictly following RFC 1808.

From RFC 1808, Section 5.2, Abnormal Examples:

Within an object with a well-defined base URL of

Base: <URL:http://a/b/c/d;p?q#f>

[...]

Parsers must be careful in handling the case where there are more relative path ".." segments than there are hierarchical levels in the base URL's path. Note that the ".." syntax cannot be used to change the <net_loc> of a URL.

../../../g    = <URL:http://a/../g>
../../../../g = <URL:http://a/../../g>

So urljoin does the right thing by preserving the extraneous ../, and you therefore need to remove them with manual processing.
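One possible way to do that manual processing is to work on whole path segments rather than on the raw '../' substring, so that a segment which merely ends in two dots (such as 'b..') is left alone. The following is only a sketch: the helper name drop_parent_segments is hypothetical, and the imports assume Python 3's urllib.parse.

```python
from urllib.parse import urlsplit, urlunsplit


def drop_parent_segments(url):
    """Remove every path segment that is exactly '..'; keep the rest of the URL."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    segments = [seg for seg in path.split('/') if seg != '..']
    return urlunsplit((scheme, netloc, '/'.join(segments), query, fragment))


# The two results quoted from RFC 1808 above:
print(drop_parent_segments('http://a/../g'))
print(drop_parent_segments('http://a/../../g'))
# A segment that only ends in dots is not a parent reference and is preserved:
print(drop_parent_segments('http://a/b../g'))
```

A plain replace('../', '') would turn the last path into '/bg', which is why comparing complete segments is the more cautious choice when such names can occur.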
