python - Best practice to apply several rules to one string


I'm getting a URL string and need to apply several rules to it. The first rule is to remove anchors, the next is to remove the '../' notation (because urljoin joins the URL incorrectly in some cases), and the last is to remove the trailing slash. I have this code:

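    from urlparse import urljoin  # Python 2; on Python 3 use urllib.parse

    def construct_url(parent_url, child_url):
        url = urljoin(parent_url, child_url)
        url = url.split('#')[0]        # drop the anchor
        url = url.replace('../', '')   # drop leftover '../' notation
        url = url.rstrip('/')          # drop the trailing slash
        return url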

But I don't think this is best practice. I think it can be done more simply. Could you help me, please? Thanks.

Unfortunately, there isn't much you can do to make the function simpler here, since you're dealing with pretty odd cases.

But you can make it more robust by using Python's urlparse.urlsplit() to split the URL into well-defined components, doing your processing, and then putting it back together using urlparse.urlunsplit():

    # Python 2; on Python 3 import these names from urllib.parse instead
    from urlparse import urljoin
    from urlparse import urlsplit
    from urlparse import urlunsplit

    def construct_url(parent_url, child_url):
        url = urljoin(parent_url, child_url)
        scheme, netloc, path, query, fragment = urlsplit(url)
        path = path.replace('../', '')   # only the path is touched
        path = path.rstrip('/')
        # Reassemble with an empty fragment to drop the anchor
        url = urlunsplit((scheme, netloc, path, query, ''))
        return url

    parent_url = 'http://user:pw@google.com'
    child_url = '../../../chrome/#foo'

    print(construct_url(parent_url, child_url))

Output:

    http://user:pw@google.com/chrome
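To see which component each rule touches, it can help to look at what urlsplit returns. The following is a small sketch, assuming Python 3 (where the urlparse module lives at urllib.parse); the input is the joined URL from the example above before any cleanup.

```python
from urllib.parse import urlsplit  # Python 3 home of the urlparse module

# The joined URL from the example, before any cleanup.
parts = urlsplit('http://user:pw@google.com/../../../chrome/#foo')

print(parts.scheme)    # 'http'
print(parts.netloc)    # 'user:pw@google.com'
print(parts.path)      # '/../../../chrome/'  (the '../' rule applies here)
print(parts.query)     # ''
print(parts.fragment)  # 'foo'  (the anchor rule applies here)

# The result also exposes the pieces of netloc as attributes.
print(parts.username, parts.hostname)  # 'user' 'google.com'
```

Note that urlsplit only splits; it does not resolve or remove the ".." segments in the path.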

Using tools like urlparse has the advantage that you know exactly what your processing operates on (the path and the fragment in this case), and it handles things like user credentials, query strings, parameters, etc. for you.
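As an illustration of that advantage, here is a hedged side-by-side sketch, assuming Python 3's urllib.parse. The function names construct_url_naive and construct_url_split and the example.com URLs are made up for this comparison. The query string deliberately contains '../' to show how whole-string processing can touch parts it was never meant to.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

def construct_url_naive(parent_url, child_url):
    """Whole-string processing, as in the question."""
    url = urljoin(parent_url, child_url)
    url = url.split('#')[0]
    url = url.replace('../', '')   # also rewrites the query string
    return url.rstrip('/')         # misses the slash before '?'

def construct_url_split(parent_url, child_url):
    """Component-wise processing: only the path and fragment change."""
    url = urljoin(parent_url, child_url)
    scheme, netloc, path, query, _fragment = urlsplit(url)
    path = path.replace('../', '').rstrip('/')
    return urlunsplit((scheme, netloc, path, query, ''))

# Hypothetical inputs chosen to expose the difference.
parent = 'http://user:pw@example.com/docs/'
child = 'page/?next=../other#top'

print(construct_url_naive(parent, child))
# http://user:pw@example.com/docs/page/?next=other
print(construct_url_split(parent, child))
# http://user:pw@example.com/docs/page?next=../other
```

The naive version changes the query value and leaves the path's trailing slash in place, while the split version leaves the query alone and trims only the path.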


Note: Contrary to what was suggested in the comments, urljoin does in fact normalize URLs:

    >>> from urlparse import urljoin
    >>> urljoin('http://google.com/foo/bar', '../qux')
    'http://google.com/qux'

But it does so only by strictly following RFC 1808.

From RFC 1808, Section 5.2, Abnormal Examples:

Within an object with a well-defined base URL of

    Base: <URL:http://a/b/c/d;p?q#f>

[...]

Parsers must be careful in handling the case where there are more relative path ".." segments than there are hierarchical levels in the base URL's path. Note that the ".." syntax cannot be used to change the <net_loc> of a URL.

    ../../../g    = <URL:http://a/../g>
    ../../../../g = <URL:http://a/../../g>

So urljoin does the right thing by preserving the extraneous ../ segments, and you therefore need to remove them with manual processing.
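One possible refinement of that manual processing is to drop ".." segment by segment rather than with a blanket replace('../', ''), which would also alter a segment that merely ends in two dots (for example 'b../c'). The sketch below assumes Python 3's urllib.parse; strip_parent_refs is a hypothetical helper name, and it works on an already-joined URL. Newer Python 3 releases document urljoin as following RFC 3986 rather than RFC 1808, so the exact output of urljoin for excess ".." segments may depend on the Python version.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_parent_refs(url):
    """Drop '..' path segments and the fragment; trim trailing slashes.

    This discards the '..' segments instead of resolving them, which
    mirrors the intent of replace('../', '') in the answer above.
    """
    scheme, netloc, path, query, _fragment = urlsplit(url)
    # Compare whole segments, so 'b..' is left untouched.
    segments = [seg for seg in path.split('/') if seg != '..']
    path = '/'.join(segments).rstrip('/')
    return urlunsplit((scheme, netloc, path, query, ''))

print(strip_parent_refs('http://a/../../g'))     # http://a/g
print(strip_parent_refs('http://a/b../c/#foo'))  # http://a/b../c
```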

