Purell

Purell is a tiny Go library to normalize URLs. It returns a pure URL. Pure-ell. Sanitizer and all. Yeah, I know...

Based on the Wikipedia article on URL normalization and RFC 3986.

Install

go get github.com/PuerkitoBio/purell

Changelog

  • v1.1.1 : Fix failing test due to Go1.12 changes (thanks to @ianlancetaylor).
  • 2016-11-14 (v1.1.0) : IDN: Conform to RFC 5895: Fold character width (thanks to @beeker1121).
  • 2016-07-27 (v1.0.0) : Normalize IDN to ASCII (thanks to @zenovich).
  • 2015-02-08 : Add fix for relative paths issue (PR #5) and add fix for unnecessary encoding of reserved characters (see issue #7).
  • v0.2.0 : Add benchmarks; attempt IDN support.
  • v0.1.0 : Initial release.

Examples

From example_test.go (note that in your code, you would import "github.com/PuerkitoBio/purell", and would prefix references to its functions and constants with "purell."):

package purell

import (
  "fmt"
  "net/url"
)

func ExampleNormalizeURLString() {
  if normalized, err := NormalizeURLString("hTTp://someWEBsite.com:80/Amazing%3f/url/",
    FlagLowercaseScheme|FlagLowercaseHost|FlagUppercaseEscapes); err != nil {
    panic(err)
  } else {
    fmt.Print(normalized)
  }
  // Output: http://somewebsite.com:80/Amazing%3F/url/
}

func ExampleMustNormalizeURLString() {
  normalized := MustNormalizeURLString("hTTpS://someWEBsite.com:443/Amazing%fa/url/",
    FlagsUnsafeGreedy)
  fmt.Print(normalized)

  // Output: http://somewebsite.com/Amazing%FA/url
}

func ExampleNormalizeURL() {
  if u, err := url.Parse("Http://SomeUrl.com:8080/a/b/.././c///g?c=3&a=1&b=9&c=0#target"); err != nil {
    panic(err)
  } else {
    normalized := NormalizeURL(u, FlagsUsuallySafeGreedy|FlagRemoveDuplicateSlashes|FlagRemoveFragment)
    fmt.Print(normalized)
  }

  // Output: http://someurl.com:8080/a/c/g?c=3&a=1&b=9&c=0
}

API

As seen in the examples above, purell offers three functions: NormalizeURLString(string, NormalizationFlags) (string, error), MustNormalizeURLString(string, NormalizationFlags) string, and NormalizeURL(*url.URL, NormalizationFlags) string. All three normalize the provided URL based on the specified flags. Here are the available flags:

const (
	// Safe normalizations
	FlagLowercaseScheme           NormalizationFlags = 1 << iota // HTTP://host -> http://host, applied by default in Go1.1
	FlagLowercaseHost                                            // http://HOST -> http://host
	FlagUppercaseEscapes                                         // http://host/t%ef -> http://host/t%EF
	FlagDecodeUnnecessaryEscapes                                 // http://host/t%41 -> http://host/tA
	FlagEncodeNecessaryEscapes                                   // http://host/!"#$ -> http://host/%21%22#$
	FlagRemoveDefaultPort                                        // http://host:80 -> http://host
	FlagRemoveEmptyQuerySeparator                                // http://host/path? -> http://host/path

	// Usually safe normalizations
	FlagRemoveTrailingSlash // http://host/path/ -> http://host/path
	FlagAddTrailingSlash    // http://host/path -> http://host/path/ (should choose only one of these add/remove trailing slash flags)
	FlagRemoveDotSegments   // http://host/path/./a/b/../c -> http://host/path/a/c

	// Unsafe normalizations
	FlagRemoveDirectoryIndex   // http://host/path/index.html -> http://host/path/
	FlagRemoveFragment         // http://host/path#fragment -> http://host/path
	FlagForceHTTP              // https://host -> http://host
	FlagRemoveDuplicateSlashes // http://host/path//a///b -> http://host/path/a/b
	FlagRemoveWWW              // http://www.host/ -> http://host/
	FlagAddWWW                 // http://host/ -> http://www.host/ (should choose only one of these add/remove WWW flags)
	FlagSortQuery              // http://host/path?c=3&b=2&a=1&b=1 -> http://host/path?a=1&b=1&b=2&c=3

	// Normalizations not in the wikipedia article, required to cover tests cases
	// submitted by jehiah
	FlagDecodeDWORDHost           // http://1113982867 -> http://66.102.7.147
	FlagDecodeOctalHost           // http://0102.0146.07.0223 -> http://66.102.7.147
	FlagDecodeHexHost             // http://0x42660793 -> http://66.102.7.147
	FlagRemoveUnnecessaryHostDots // http://.host../path -> http://host/path
	FlagRemoveEmptyPortSeparator  // http://host:/path -> http://host/path

	// Convenience set of safe normalizations
	FlagsSafe NormalizationFlags = FlagLowercaseHost | FlagLowercaseScheme | FlagUppercaseEscapes | FlagDecodeUnnecessaryEscapes | FlagEncodeNecessaryEscapes | FlagRemoveDefaultPort | FlagRemoveEmptyQuerySeparator

	// For convenience sets, "greedy" uses the "remove trailing slash" and "remove www. prefix" flags,
	// while "non-greedy" uses the "add (or keep) the trailing slash" and "add www. prefix".

	// Convenience set of usually safe normalizations (includes FlagsSafe)
	FlagsUsuallySafeGreedy    NormalizationFlags = FlagsSafe | FlagRemoveTrailingSlash | FlagRemoveDotSegments
	FlagsUsuallySafeNonGreedy NormalizationFlags = FlagsSafe | FlagAddTrailingSlash | FlagRemoveDotSegments

	// Convenience set of unsafe normalizations (includes FlagsUsuallySafe)
	FlagsUnsafeGreedy    NormalizationFlags = FlagsUsuallySafeGreedy | FlagRemoveDirectoryIndex | FlagRemoveFragment | FlagForceHTTP | FlagRemoveDuplicateSlashes | FlagRemoveWWW | FlagSortQuery
	FlagsUnsafeNonGreedy NormalizationFlags = FlagsUsuallySafeNonGreedy | FlagRemoveDirectoryIndex | FlagRemoveFragment | FlagForceHTTP | FlagRemoveDuplicateSlashes | FlagAddWWW | FlagSortQuery

	// Convenience set of all available flags
	FlagsAllGreedy    = FlagsUnsafeGreedy | FlagDecodeDWORDHost | FlagDecodeOctalHost | FlagDecodeHexHost | FlagRemoveUnnecessaryHostDots | FlagRemoveEmptyPortSeparator
	FlagsAllNonGreedy = FlagsUnsafeNonGreedy | FlagDecodeDWORDHost | FlagDecodeOctalHost | FlagDecodeHexHost | FlagRemoveUnnecessaryHostDots | FlagRemoveEmptyPortSeparator
)

For convenience, the flag sets FlagsSafe, FlagsUsuallySafe[Greedy|NonGreedy], FlagsUnsafe[Greedy|NonGreedy] and FlagsAll[Greedy|NonGreedy] are provided, matching the similarly grouped normalizations on Wikipedia's URL normalization page. You can add individual flags to a set (using the bitwise OR operator, |) or remove them (using the bitwise AND NOT operator, &^) to build your own custom set, as sketched below.
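
A minimal sketch of building such a custom set (the input URL is made up for illustration; the expected output follows from the flag comments above):

package main

import (
  "fmt"

  "github.com/PuerkitoBio/purell"
)

func main() {
  // Start from the usually-safe greedy set, drop the trailing-slash
  // removal with &^, and add query sorting with |.
  flags := (purell.FlagsUsuallySafeGreedy &^ purell.FlagRemoveTrailingSlash) | purell.FlagSortQuery

  normalized, err := purell.NormalizeURLString("HTTP://Example.com/a/b/../c/?b=2&a=1", flags)
  if err != nil {
    panic(err)
  }
  fmt.Println(normalized)
  // Expected: http://example.com/a/c/?a=1&b=2
}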

The full godoc reference is available on gopkgdoc.

Some things to note:

  • FlagDecodeUnnecessaryEscapes, FlagEncodeNecessaryEscapes, FlagUppercaseEscapes and FlagRemoveEmptyQuerySeparator are always implicitly set, because internally the URL string is parsed into a URL object, which automatically decodes unnecessary escapes, uppercases and encodes necessary ones, and removes empty query separators (an unnecessary ? at the end of the URL). These operations cannot be skipped. For this reason, FlagRemoveEmptyQuerySeparator (as well as the other three) has been included in the FlagsSafe convenience set, instead of FlagsUnsafe, where Wikipedia puts it.

  • FlagDecodeUnnecessaryEscapes decodes the following escapes (from -> to):
    • %24 -> $
    • %26 -> &
    • %2B-%3B -> +,-./0123456789:;
    • %3D -> =
    • %40-%5A -> @ABCDEFGHIJKLMNOPQRSTUVWXYZ
    • %5F -> _
    • %61-%7A -> abcdefghijklmnopqrstuvwxyz
    • %7E -> ~

  • When the NormalizeURL function is used (passing a URL object), the source URL object is modified in place; after the call, it reflects the normalized URL.

  • The "replace IP with domain name" normalization (http://208.77.188.166/ → http://www.example.com/) is obviously not possible for a library without making network requests, so it is not implemented in purell.

  • The "remove unused query string parameters" and "remove default query parameters" normalizations are also not implemented, since they are very case-specific, and quite trivial to do with a URL object, as sketched below.
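
A minimal sketch of such case-specific cleanup, using only the standard library (removeQueryParams and the tracking-parameter names are hypothetical, for illustration):

package main

import (
  "fmt"
  "net/url"
)

// removeQueryParams deletes the named query parameters from u in place.
func removeQueryParams(u *url.URL, names ...string) {
  q := u.Query()
  for _, n := range names {
    q.Del(n)
  }
  u.RawQuery = q.Encode()
}

func main() {
  u, err := url.Parse("http://host/path?id=42&utm_medium=rss&utm_source=feed")
  if err != nil {
    panic(err)
  }
  removeQueryParams(u, "utm_source", "utm_medium")
  fmt.Println(u.String()) // http://host/path?id=42
}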

Safe vs Usually Safe vs Unsafe

Purell lets you control the level of risk you take while normalizing a URL. You can normalize aggressively, play it totally safe, or anything in between.

Consider the following URL:

HTTPS://www.RooT.com/toto/t%45%1f///a/./b/../c/?z=3&w=2&a=4&w=1#invalid

Normalizing with FlagsSafe gives:

https://www.root.com/toto/tE%1F///a/./b/../c/?z=3&w=2&a=4&w=1#invalid

With FlagsUsuallySafeGreedy:

https://www.root.com/toto/tE%1F///a/c?z=3&w=2&a=4&w=1#invalid

And with FlagsUnsafeGreedy:

http://root.com/toto/tE%1F/a/c?a=4&w=1&w=2&z=3
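
This comparison can be reproduced with a short program; a minimal sketch (the three lines it prints are the results shown above):

package main

import (
  "fmt"

  "github.com/PuerkitoBio/purell"
)

func main() {
  const raw = "HTTPS://www.RooT.com/toto/t%45%1f///a/./b/../c/?z=3&w=2&a=4&w=1#invalid"

  for _, flags := range []purell.NormalizationFlags{
    purell.FlagsSafe,
    purell.FlagsUsuallySafeGreedy,
    purell.FlagsUnsafeGreedy,
  } {
    fmt.Println(purell.MustNormalizeURLString(raw, flags))
  }
}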

TODOs

  • Add a class/default instance to allow specifying custom directory index names? At the moment, removing directory index removes (^|/)((?:default|index)\.\w{1,4})$, as the sketch below illustrates.
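
For illustration, a quick standalone check of that pattern against a few sample paths (the paths are made up):

package main

import (
  "fmt"
  "regexp"
)

func main() {
  // The directory-index pattern quoted in the TODO above.
  rxDirIndex := regexp.MustCompile(`(^|/)((?:default|index)\.\w{1,4})$`)

  for _, path := range []string{"/a/index.html", "/a/default.asp", "/a/home.html"} {
    fmt.Println(path, "->", rxDirIndex.ReplaceAllString(path, "$1"))
  }
  // /a/index.html -> /a/
  // /a/default.asp -> /a/
  // /a/home.html -> /a/home.html
}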

Thanks / Contributions

@rogpeppe @jehiah @opennota @pchristopher1275 @zenovich @beeker1121

License

The BSD 3-Clause license.

Issues
  • This patch allows FlagRemoveDotSegments to work with relative links

    This patch allows FlagRemoveDotSegments to work with relative links, i.e. URLs that have the Host field set to the empty string. Just to be clear, relative links will typically be stored in URLs if the user expects to later call URL.ResolveReference to create a full URL. That seems to be a relatively common use case.

    opened by pchristopher1275 10
  • Add FlagRemoveQuery (strips entire query part)

    I won't be at all offended if you think this flag has no place in purell, but it's one I would use almost exclusively :-)

    Context: I'm dealing with lots of news article URLs, where the unique part of the URL is almost always a slug, and the query is used only for tracking (e.g. ?from=rssfeed). Stripping the query is always my first step when normalizing URLs for comparison.

    E.g. (apologies for the Daily Mail link): http://www.dailymail.co.uk/news/article-3656841/It-s-Democrats-end-25-hour-sit-Congress-gun-control-walk-cheering-crowd-without-vote-demanded.html?ITO=1490&ns_mchannel=rss&ns_campaign=1490

    Just one of the many frustrations when trying to decide if two URLs are referring to the same article - there are loads more. Newspaper sites can commit some real atrocities when dealing with URLs!

    opened by bcampbell 4
  • Add flag to force the default scheme

    Hi, I'd like a flag to force the default scheme (e.g. http) when no scheme is given.

    str, _ := purell.NormalizeURLString("example.com/foo.html", purell.FlagsForceDefaultHttpScheme)
    
    if str == "http://example.com/foo.html" {
      // cool, I have "http" default scheme set now
    }
    

    Would it make sense? If yes, I'd be glad to submit a patch.

    opened by VojtechVitek 4
  • Opaque URLs should be normalized too

    purell doesn't normalize "opaque" URLs (see documentation for url.URL).

    package main
    
    import (
        "fmt"
        "net/url"
    
        "github.com/PuerkitoBio/purell"
    )
    
    func main() {
        u := &url.URL{Scheme: "http", Opaque: "//eXAMPLe.com/%3f"}
        fmt.Println(purell.NormalizeURL(u, purell.FlagLowercaseHost|purell.FlagUppercaseEscapes))
    }
    

    Output: http://eXAMPLe.com/%3f

    opened by opennota 4
  • Pull code from urlesc repo

    Pulled code from the urlesc repo because the repo has been archived. I can remove QueryEscape but just replacing urlesc.Escape with url.String changes the current behavior, so using urlesc for now.

    Signed-off-by: Yuki Okushi [email protected]

    opened by JohnTitor 3
  • Make normalizing functions public for external consumption

    I'd like to be able to use some of the normalizing functions in an external project without needing purell's flags and the NormalizeURL*() functions.

    Would you be OK with this change?

    opened by VojtechVitek 3
  • Reserved characters should not be percent-encoded

    purell should not normalize reserved characters, as per RFC3986:

      reserved    = gen-delims / sub-delims
    
      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    
      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="
    
    package main
    
    import (
        "fmt"
    
        "github.com/PuerkitoBio/purell"
    )
    
    func main() {
        fmt.Println(purell.MustNormalizeURLString("my_(url)", purell.FlagsSafe))
    }
    

    The above code outputs my_%28url%29, whereas it should be my_(url). This is due to a bug in Go stdlib (issue 5684).

    opened by opennota 3
  • more tests, and some bugs

    I've been interested in URL normalization for quite some time (as I should be, since I work at bitly). As a result I've written code to do this in Python and Tcl, plus various incomplete bits of code that dream of doing this right. So, I know where you are coming from.

    As I've been using Go more recently, I was happy to see your project, and hope that this will be one less thing that I'll need to do. To that end, I'm contributing test cases that I've found useful in helping improve my Python library.

    It's not all that many test cases, but it highlights a few things I would love to see

    • IDNA normalization
    • percent unescaping (where possible) to unicode values
    • remove trailing dots from domains

    Beyond that, there are a few cases that I think highlight bugs.

    Additionally, while I appreciate the configurability of some of the flags, it would be nice to see one function that just does the "right" thing up until some of the more editorial items (like dropping www, trailing slash, etc.). It might just be me, but as a user that's what I'd like to see.

    It might be odd to open a pull request that just adds failing tests, but it's all I have time for tonight. When I get around to needing this in a project, I'll undoubtedly open additional pull requests with some fixes.

    (Also, I'm lazy, so I added a .travis.yml file and a badge to the readme; you should be able to go to travis-ci.org and turn on automated tests.)

    opened by jehiah 3
  • A problem with the "golang.org/x" packages in purell.go

    In file purell.go, some packages are imported as: "golang.org/x/net/idna", "golang.org/x/text/unicode/norm", "golang.org/x/text/width"

    However, I find their real download URLs are: https://github.com/golang/net and https://github.com/golang/text

    So after I download them using the 'go get' command, by default, paths '/github.com/golang/net' and '/github.com/golang/text' are generated in my $GOPATH directory. I have to copy them into path 'golang.org/x', which is inconvenient, I think.

    Why not import them directly as: "github.com/golang/net/idna", "github.com/golang/text/unicode/norm", "github.com/golang/text/width"?

    opened by sdghchj 2
  • go1.1 test error

    purell_test.go:680: running LowerScheme...
    purell_test.go:680: running LowerScheme2...
    purell_test.go:680: running LowerHost...
    purell_test.go:696: LowerHost - FAIL expected 'HTTP://www.src.ca/', got 'http://www.src.ca/'
    purell_test.go:680: running UpperEscapes...
    purell_test.go:680: running UnnecessaryEscapes...
    purell_test.go:680: running RemoveDefaultPort...
    purell_test.go:696: RemoveDefaultPort - FAIL expected 'HTTP://www.SRC.ca/', got 'http://www.SRC.ca/'
    purell_test.go:680: running RemoveDefaultPort2...
    purell_test.go:696: RemoveDefaultPort2 - FAIL expected 'HTTP://www.SRC.ca', got 'http://www.SRC.ca'
    purell_test.go:680: running RemoveDefaultPort3...
    purell_test.go:696: RemoveDefaultPort3 - FAIL expected 'HTTP://www.SRC.ca:8080', got 'http://www.SRC.ca:8080'
    ...

    opened by ghost 2
  • Looking for a new maintainer

    Hello,

    After more than 8 years (according to the git logs), I'd like to hand over maintenance of this library. I don't use it personally, and haven't dedicated much care and attention to it in a long time. It would be best for someone with an interest in it (as in, someone that relies on this library as part of their project(s)) to take over.

    I believe the Hugo project uses this? If so, that could be a good fit, but anyone interested, please reach out!

    Thanks, Martin

    opened by mna 1
  • NormalizeURL does not perform IDNA normalization

    IDNA normalization is only performed in NormalizeURLString, but that function returns a string. If you need a url.URL, you must then parse the result of NormalizeURLString, which means you are parsing the URL yet again, which is wasteful.

    As NormalizeURLString calls NormalizeURL, the IDNA normalization in the former should be moved to the latter, resulting in the URL passed to NormalizeURL having its host field IDNA-normalized.

    opened by eliaslevy 1