Extract urls from text

Extract urls from text using regular expressions. Requires Go 1.13 or later.

package main

import "mvdan.cc/xurls/v2"

func main() {
	// Relaxed matches URLs with or without a scheme.
	rxRelaxed := xurls.Relaxed()
	rxRelaxed.FindString("Do gophers live in golang.org?")  // "golang.org"
	rxRelaxed.FindString("This string does not have a URL") // ""

	// Strict only matches URLs that include a scheme.
	rxStrict := xurls.Strict()
	rxStrict.FindAllString("must have scheme: http://foo.com/.", -1) // []string{"http://foo.com/"}
	rxStrict.FindAllString("no scheme, no match: foo.com", -1)       // nil (FindAllString returns nil on no match)
}

Since the API is centered around regexp.Regexp, many other methods are available, such as finding the byte indexes for all matches.
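
As a rough illustration of the byte-index methods, the sketch below calls FindAllStringIndex from the standard regexp package. The pattern strictLike and the helper urlSpans are hypothetical stand-ins so the example runs with the standard library only; they are not the pattern that xurls.Strict() compiles, which is far more thorough.

```go
package main

import (
	"fmt"
	"regexp"
)

// strictLike is a simplified, hypothetical stand-in for the *regexp.Regexp
// returned by xurls.Strict(). It requires an http(s) scheme and refuses to
// end a match on common sentence punctuation.
var strictLike = regexp.MustCompile(`https?://[^\s]+[^\s.,;:!?]`)

// urlSpans returns the [start, end) byte offsets of every match in text.
func urlSpans(text string) [][]int {
	return strictLike.FindAllStringIndex(text, -1)
}

func main() {
	text := "see http://foo.com/ and https://bar.org/x."
	for _, span := range urlSpans(text) {
		// span[0] and span[1] index into text, so slicing recovers the match.
		fmt.Printf("%d-%d %q\n", span[0], span[1], text[span[0]:span[1]])
	}
}
```

The same call works on any *regexp.Regexp, so swapping strictLike for a matcher obtained from the library should only change which spans are reported.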

Note that calling the exposed functions means compiling a regular expression, so repeated calls should be avoided.
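
One way to follow this advice is to compile the expression once, at package level, and reuse it everywhere. The sketch below assumes a hypothetical newMatcher function standing in for a constructor such as xurls.Strict(); its pattern is a deliberately simple placeholder, not the library's.

```go
package main

import (
	"fmt"
	"regexp"
)

// newMatcher is a hypothetical stand-in for a constructor such as
// xurls.Strict(): every call compiles a fresh regular expression.
func newMatcher() *regexp.Regexp {
	return regexp.MustCompile(`https?://[^\s]+`)
}

// urlRx is compiled once, at package initialization, and shared by all
// callers. A *regexp.Regexp is safe for concurrent use by multiple goroutines.
var urlRx = newMatcher()

// extractAll reuses the shared matcher for every line instead of
// recompiling the expression inside the loop.
func extractAll(lines []string) []string {
	var urls []string
	for _, line := range lines {
		urls = append(urls, urlRx.FindAllString(line, -1)...)
	}
	return urls
}

func main() {
	fmt.Println(extractAll([]string{
		"first http://foo.com/a",
		"nothing here",
		"second https://bar.org/b and http://baz.net/c",
	}))
}
```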


To install the tool globally:

cd $(mktemp -d); go mod init tmp; GO111MODULE=on go get mvdan.cc/xurls/v2/cmd/xurls

Then pipe text to the tool on standard input:

$ echo "Do gophers live in http://golang.org?" | xurls

  • Support Arbitrary Protocols

    It looks like xurls only checks for https? which grabs http and https instead of allowing arbitrary protocols (e.g., file://, ftp://, steam://, etc.).

    Any interest in adding support for arbitrary protocols?

    opened by HalosGhost 36
  • Arch Linux PKGBUILDs separation

    Hi, I just wanted to suggest a change to the Arch Linux PKGBUILDs. Wouldn't it be better and cleaner to split the package into xurls (which uses the already compiled Go releases from the Releases tab) and xurls-git for the latest upstream? That way users won't have to install Go just to use your pretty cool piece of software. :)

    Thank you!

    opened by pellettiero 17
  • Invalid prefixes for URLs are matched

    This is probably due to adding support for arbitrary protocols.

    $ echo "systems.https://google.com" | xurls

    I am unsure, but is it ever actually valid for punctuation to appear in the protocol (scheme) portion of a URL?

    I know that xurls wasn't really focusing on being a URL validator. But honestly, we aren't too far from accomplishing that and it would be helpful to know that matches are valid (in terms of the specification).

    opened by HalosGhost 10
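
On the question of punctuation in the scheme: RFC 3986, section 3.1 defines a scheme as a letter followed by any number of letters, digits, "+", "-" or ".", so a dot is syntactically allowed. The sketch below encodes that grammar with the standard regexp package; schemeRx and isValidScheme are hypothetical names for illustration and are not part of xurls.

```go
package main

import (
	"fmt"
	"regexp"
)

// schemeRx encodes the RFC 3986 section 3.1 grammar:
//
//	scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
var schemeRx = regexp.MustCompile(`^[A-Za-z][A-Za-z0-9+.-]*$`)

// isValidScheme reports whether s is a syntactically valid URI scheme.
// It says nothing about whether the scheme is registered or in real use.
func isValidScheme(s string) bool {
	return schemeRx.MatchString(s)
}

func main() {
	for _, s := range []string{"https", "systems.https", "git+ssh", "1http", ".https"} {
		fmt.Printf("%-16q %v\n", s, isValidScheme(s))
	}
}
```

Under this grammar "systems.https" is a well-formed scheme, so rejecting it would need a check beyond syntax, such as a list of known schemes.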
  • Add file support

    Useful little util.

    It would be nice if it, in addition to the stdin support, could work with file(s) as arguments.

    Currently this works:

    cat myfile.txt | xurls

    This just hangs there, waiting:

    xurls myfile.txt

    An example of a Go tool that works as expected for a *nix CLI tool would be ccat.

    opened by bep 7
  • Email support

    "Hi, this is my email [email protected]"

    This extracts example.com, which isn't useful on its own. I would expect either the complete email address to be extracted or the match to be skipped entirely.

    What can be done for email addresses?

    opened by ferhatelmas 6
  • Improvement suggestion with multiple domains in one single URL.

    Hi, thanks for providing xurls.

    I came across the following case:

    $ echo "http://www.fakedomain.com/account/legitdomain.com" | bin/xurls -r

    I wonder if there is an easy (and still fast) way for xurls to identify that there are 2 "URLs" inside. It could then report something like:

    $ echo "http://www.fakedomain.com/account/legitdomain.com/folder" | bin/xurls -r

    Possibly by adding an additional option to support it on demand only.

    If there is a space in the string, both are found, as expected:

    echo "http://www.fakedomain.com/        account/legitdomain.com/folder" | bin/xurls -r

    This is only a suggestion. If it hurts performance badly, it is probably better not to implement it.

    opened by uggyuggy 6
  • Error with Input containing long lines

    Hi, thanks for providing xurls.

    I came across the following error when the input file contains very long lines:

    $ printf 'tototutu%.0s' {1..9000} > /tmp/a
    $ xurls -r  /tmp/a
    bufio.Scanner: token too long
    $ printf 'tototutu%.0s' {1..5000} > /tmp/b
    $ xurls -r /tmp/b

    I just wanted to report that this strange case can happen with long lines. As I'm not a good Go coder, it's better that I don't submit a PR.

    opened by uggyuggy 6
  • Matching returns wrong url

    I'm playing with the library and tried it with a simple example. I'm surprised by the result: https://play.golang.org/p/4BF3UXE4x87

    Is it expected? Shouldn't "|" be treated as a wrong character?

    opened by jlory 6
  • Adding standard schemes with semiStrict and semiRelaxed option

    Currently the scheme regex we're using is very generic (rightfully so, according to https://tools.ietf.org/html/rfc3986#section-3.1), causing us to miss certain URLs.

    The idea is to be able to catch urls that are hidden in obfuscated text, example:

    "aatesthttp://www.google.com should get me google's page"

    Having a set of standard schemes in our regex allows this.

    opened by skhare-r7 6
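
To make the difference concrete, the sketch below compares a generic RFC 3986 scheme pattern with one anchored on a fixed list of schemes, using the example string from this issue. The list knownSchemes and both patterns are illustrative assumptions; they are not the lists or expressions used by xurls.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// knownSchemes is a short illustrative list, not the one xurls ships.
// Longer alternatives come first so "https" is tried before "http".
var knownSchemes = []string{"https", "http", "ftp", "file"}

// genericRx accepts any syntactically valid RFC 3986 scheme.
var genericRx = regexp.MustCompile(`[A-Za-z][A-Za-z0-9+.-]*://[^\s]+`)

// knownRx only accepts schemes from knownSchemes, so a match can start
// in the middle of a run of letters.
var knownRx = regexp.MustCompile(`(?:` + strings.Join(knownSchemes, "|") + `)://[^\s]+`)

func main() {
	text := "aatesthttp://www.google.com should get me google's page"
	fmt.Printf("generic: %q\n", genericRx.FindString(text))
	fmt.Printf("known:   %q\n", knownRx.FindString(text))
}
```

The generic pattern swallows the "aatest" prefix as part of the scheme, while the list-based pattern recovers http://www.google.com.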
  • Add brackets to the allowed path chars

    Brackets are widely used in paths, so add them.

    For example, http://bgp.he.net/search?search[search]=vortex.data.microsoft.com&commit=Search was not matched.

    opened by sztanpet 6
  • Dangling dots, mid-string, are seen as domains

    Here I have two small edge-cases:

    • <[email protected]> yields []string{"some.gu", "domain.com"}
    • [cid:programmer-thumb-shield-32x32.v2_fe0f1423-2d7d-484b-b624-6b7545ab4311.png] yields []string{"fe0f1423-2d7d-484b-b624-6b7545ab4311.pn"}

    I'm just wondering about the dropped character before the symbol. This is email, so I can cross-reference against the filenames of inline attachments and also double-check against a known list of TLDs, but dropping that last character makes this difficult.

    Any ideas on why that last char is being dropped?

    opened by requaos 5
  • make a deterministic variant of "go generate" and have CI check it's up to date

    To prevent issues like https://github.com/mvdan/xurls/pull/67 in the future.

    Two changes should be done:

    1. Use clearer filenames for generated files, so they stand out in file change summaries. For example, schemes_gen.go rather than schemes.go.

    2. Split go generate into two phases; one to download the latest TLD and scheme lists from the internet and write them to files in the git repo (but outside the module zip), and another to take those files and generate the code. The default go generate would do both, but we would add a go generate -tags=noupdate to only do the second. CI would enforce the latter has an empty git diff.

    opened by mvdan 0
  • Issue with Email Addresses

    I am using the xurls code to pull out possible urls from a message body string. The urls can be in either strict or relaxed format so I need to use the relaxed method of xurls to find the possible urls in the string. The issue is that email addresses can also be in the string and the relaxed method of xurls is pulling those out too.

    For example my string might be: "Hello from http://www.google.com, please check the www.test.com webpage for further information. If you have any questions please email [email protected] or [email protected]"

    What I would like xurls to do is just pull the http://www.google.com or www.test.com.

    Instead it pulls the 2 URLs, plus John.Sm, test.com and test.com. Is there anything that can be done so that only URLs are pulled?

    opened by JimmyGalar 5
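
One caller-side workaround, sketched here under stated assumptions, is to take the byte offsets of each relaxed match and drop any match that touches an "@" on either side. The pattern relaxedLike is a simplified stand-in for xurls.Relaxed(), the helper withoutEmailParts is hypothetical, and the address in main is invented for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// relaxedLike is a simplified, hypothetical stand-in for xurls.Relaxed():
// an optional http(s) scheme followed by dot-separated labels.
var relaxedLike = regexp.MustCompile(`(?:https?://)?[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+`)

// withoutEmailParts returns the matches in text that are not directly
// adjacent to an '@', which filters out both halves of an email address.
func withoutEmailParts(text string) []string {
	var out []string
	for _, loc := range relaxedLike.FindAllStringIndex(text, -1) {
		start, end := loc[0], loc[1]
		if start > 0 && text[start-1] == '@' {
			continue // domain half of an address
		}
		if end < len(text) && text[end] == '@' {
			continue // local part of an address
		}
		out = append(out, text[start:end])
	}
	return out
}

func main() {
	// jane.doe@example.org is a made-up address for the demonstration.
	text := "Hello from http://www.google.com, see www.test.com or email jane.doe@example.org"
	fmt.Println(withoutEmailParts(text))
}
```

This is a heuristic: it would also drop a URL that contains userinfo, such as user@host, so it suits message bodies better than arbitrary input.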
Daniel Martí
I work on stuff in Go.
Decode / encode XML to/from map[string]interface{} (or JSON); extract values with dot-notation paths and wildcards. Replaces x2j and j2x packages.

mxj - to/from maps, XML and JSON Decode/encode XML to/from map[string]interface{} (or JSON) values, and extract/modify values from maps by key or key-

Charles Banning 537 Dec 29, 2022
A general purpose application and library for aligning text.

align A general purpose application that aligns text The focus of this application is to provide a fast, efficient, and useful tool for aligning text.

John Moore 78 Sep 27, 2022
Parse placeholder and wildcard text commands

allot allot is a small Golang library to match and parse commands with pre-defined strings. For example use allot to define a list of commands your CL

Sebastian Müller 55 Nov 24, 2022
Guess the natural language of a text in Go

guesslanguage This is a Go version of python guess-language. guesslanguage provides a simple way to detect the natural language of unicode string and

Nikita Vershinin 56 Dec 26, 2022
omniparser: a native Golang ETL streaming parser and transform library for CSV, JSON, XML, EDI, text, etc.

omniparser Omniparser is a native Golang ETL parser that ingests input data of various formats (CSV, txt, fixed length/width, XML, EDI/X12/EDIFACT, JS

JF Technology 532 Jan 4, 2023
Produces a set of tags from given source. Source can be either an HTML page, Markdown document or a plain text. Supports English, Russian, Chinese, Hindi, Spanish, Arabic, Japanese, German, Hebrew, French and Korean languages.

Tagify Gets STDIN, file or HTTP address as an input and returns a list of most popular words ordered by popularity as an output. More info about what

ZoomIO 26 Dec 19, 2022
Easy AWK-style text processing in Go

awk Description awk is a package for the Go programming language that provides an AWK-style text processing capability. The package facilitates splitt

Scott Pakin 94 Jul 25, 2022
Change the color of console text.

go-colortext package This is a package to change the color of the text and background in the console, working both under Windows and other systems. Un

Yi Deng 215 Oct 26, 2022
Templating system for HTML and other text documents - go implementation

FAQ What is Kasia.go? Kasia.go is a Go implementation of the Kasia templating system. Kasia is primarily designed for HTML, but you can use it for any

Michał Derkacz 74 Mar 15, 2022
Package sanitize provides functions for sanitizing text in golang strings.

sanitize Package sanitize provides functions to sanitize html and paths with go (golang). FUNCTIONS sanitize.Accents(s string) string Accents replaces

Kenny Grant 322 Dec 5, 2022
Small and fast FTS (full text search)

Microfts A small full text indexing and search tool focusing on speed and space. Initial tests seem to indicate that the database takes about twice as

Bill Burdick 27 Jul 30, 2022
text to speech bot for discord

text to speech bot for discord

takanakahiko 20 Oct 1, 2022
A diff3 text merge implementation in Go

Diff3 A diff3 text merge implementation in Go based on the awesome paper below. "A Formal Investigation of Diff3" by Sanjeev Khanna, Keshav Kunal, and

Keenan Nemetz 20 Nov 5, 2022
gomtch - find text even if it doesn't want to be found

gomtch - find text even if it doesn't want to be found Do your users have clever ways to hide some terms from you? Sometimes it is hard to find forbid

Nicolas Augusto Sassi 28 Sep 28, 2022
Unified text diffing in Go (copy of the internal diffing packages the official Go language server uses)

gotextdiff - unified text diffing in Go This is a copy of the Go text diffing packages that the official Go language server gopls uses internally to g

Hexops 96 Dec 26, 2022
Convert scanned image PDF file to text annotated PDF file

Jisui (自炊) This tool is PoC (Proof of Concept). Jisui is a helper tool to create e-book. Ordinary the scanned book have not text information, so you c

Takumasa Sakao 28 Dec 11, 2022
A modern text indexing library for go

bleve modern text indexing in go - blevesearch.com Features Index any go data structure (including JSON) Intelligent defaults backed up by powerful co

bleve 8.8k Jan 4, 2023
Paranoid text spacing in Go (Golang)

pangu.go Paranoid text spacing for good readability, to automatically insert whitespace between CJK (Chinese, Japanese, Korean) and half-width charact

Vinta Chen 85 Oct 15, 2022
Diff, match and patch text in Go

go-diff go-diff offers algorithms to perform operations required for synchronizing plain text: Compare two texts and return their differences. Perform

Sergi Mansilla 1.4k Dec 25, 2022