A full-featured regex engine in pure Go based on the .NET engine

Overview

regexp2 - full featured regular expressions for Go

Regexp2 is a feature-rich RegExp engine for Go. It doesn't have the linear-time guarantees of the built-in regexp package, but it allows backtracking and is compatible with Perl5 and .NET. You'll likely be better off with the RE2 engine from the regexp package and should only use this if you need to write very complex patterns or require compatibility with .NET.

Basis of the engine

The engine is ported from the .NET framework's System.Text.RegularExpressions.Regex engine. That engine was open sourced in 2015 under the MIT license. There are some fundamental differences between .NET strings and Go strings that required a bit of borrowing from the Go framework regex engine as well. I cleaned up a couple of the dirtier bits during the port (regexcharclass.cs was terrible), but the parse tree, code emitted, and therefore patterns matched should be identical.

Installing

This is a go-gettable library, so install is easy:

go get github.com/dlclark/regexp2/...

Usage

Usage is similar to the Go regexp package. Just like in regexp, you start by converting a regex into a state machine via the Compile or MustCompile functions. They ultimately do the same thing, but Compile returns an error if the regex is invalid, while MustCompile panics. You can then use the resulting Regexp struct to find matches repeatedly. A Regexp struct is safe to use across goroutines.

re := regexp2.MustCompile(`Your pattern`, 0)
if isMatch, _ := re.MatchString(`Something to match`); isMatch {
    //do something
}

The only error that the *Match* methods should return is a Timeout if you set the re.MatchTimeout field. Any other error is a bug in the regexp2 package. If you need more details about capture groups in a match then use the FindStringMatch method, like so:

if m, _ := re.FindStringMatch(`Something to match`); m != nil {
    // the whole match is always group 0
    fmt.Printf("Group 0: %v\n", m.String())

    // you can get all the groups too
    gps := m.Groups()

    // a group can be captured multiple times, so each cap is separately addressable
    fmt.Printf("Group 1, first capture: %v\n", gps[1].Captures[0].String())
    fmt.Printf("Group 1, second capture: %v\n", gps[1].Captures[1].String())
}

Group 0 is embedded in the Match. It is an automatically-assigned group that encompasses the whole pattern. This means that m.String() is the same as m.Group.String() and m.Groups()[0].String().

The last capture is embedded in each group, so g.String() will return the same thing as g.Capture.String() and g.Captures[len(g.Captures)-1].String().

If you want to find multiple matches from a single input string you should use the FindNextMatch method. For example, to implement a function similar to regexp.FindAllString:

func regexp2FindAllString(re *regexp2.Regexp, s string) []string {
	var matches []string
	m, _ := re.FindStringMatch(s)
	for m != nil {
		matches = append(matches, m.String())
		m, _ = re.FindNextMatch(m)
	}
	return matches
}

FindNextMatch is optimized so that it re-uses the underlying string/rune slice.

The internals of regexp2 always operate on []rune so Index and Length data in a Match always reference a position in runes rather than bytes (even if the input was given as a string). This is a dramatic difference between regexp and regexp2. It's advisable to use the provided String() methods to avoid having to work with indices.
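
If you do need to slice the original string, a rune index has to be translated into a byte offset first. The following is a minimal sketch using only the standard library; runeIndexToByteOffset is a hypothetical helper written for this illustration, not part of regexp2.

```go
package main

import "fmt"

// runeIndexToByteOffset is a hypothetical helper (not part of regexp2).
// It converts an index counted in runes, such as the Index of a regexp2
// Match, into a byte offset usable for slicing a Go string.
// It returns -1 if runeIdx is out of range.
func runeIndexToByteOffset(s string, runeIdx int) int {
	if runeIdx < 0 {
		return -1
	}
	count := 0
	// Ranging over a string yields the byte offset of each rune.
	for byteOff := range s {
		if count == runeIdx {
			return byteOff
		}
		count++
	}
	if count == runeIdx {
		return len(s) // position just past the last rune
	}
	return -1
}

func main() {
	s := "<title>错<style"
	// "<style" starts at rune index 8 but at byte offset 10,
	// because 错 occupies three bytes in UTF-8.
	off := runeIndexToByteOffset(s, 8)
	fmt.Println(off, s[off:])
}
```

For pure ASCII input the two numbers coincide, which is why this difference often goes unnoticed until multi-byte text shows up.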

Compare regexp and regexp2

Category                                              regexp                                regexp2
Catastrophic backtracking possible                    no, linear execution time guarantees  yes, if your pattern is at risk you can use the re.MatchTimeout field
Python-style capture groups (?P<name>re)              yes                                   no (yes in RE2 compat mode)
.NET-style capture groups (?<name>re) or (?'name're)  no                                    yes
comments (?#comment)                                  no                                    yes
branch numbering reset (?|a|b)                        no                                    no
possessive match (?>re)                               no                                    yes
positive lookahead (?=re)                             no                                    yes
negative lookahead (?!re)                             no                                    yes
positive lookbehind (?<=re)                           no                                    yes
negative lookbehind (?<!re)                           no                                    yes
back reference \1                                     no                                    yes
named back reference \k'name'                         no                                    yes
named ascii character class [[:foo:]]                 yes                                   no (yes in RE2 compat mode)
conditionals (?(expr)yes|no)                          no                                    yes
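
One quick way to see the regexp column in practice is to ask the standard library whether it will compile a given construct. This is a small sketch that only exercises the standard regexp package; compilesWithStdlib is a hypothetical helper name, and the regexp2 column is not tested here.

```go
package main

import (
	"fmt"
	"regexp"
)

// compilesWithStdlib is a hypothetical helper that reports whether the
// standard library regexp package accepts the pattern.
func compilesWithStdlib(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	patterns := []string{
		`(?P<name>re)`, // Python-style named group
		`(?=re)`,       // positive lookahead
		`(?!re)`,       // negative lookahead
		`(re)\1`,       // back reference
		`(?>re)`,       // possessive match
	}
	for _, p := range patterns {
		fmt.Printf("%-14s compiles with stdlib regexp: %v\n", p, compilesWithStdlib(p))
	}
}
```

A pattern that fails this check is a candidate for regexp2; one that passes is usually better served by the standard package.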

RE2 compatibility mode

The default behavior of regexp2 is to match the .NET regexp engine. However, the RE2 option is provided to change the parsing and increase compatibility with RE2. Using the RE2 option when compiling a regexp will not take away any features, but will change the following behaviors:

  • add support for named ascii character classes (e.g. [[:foo:]])
  • add support for python-style capture groups (e.g. (?P<name>re))
  • change singleline behavior for $ to only match end of string (like RE2) (see #24)

re := regexp2.MustCompile(`Your RE2-compatible pattern`, regexp2.RE2)
if isMatch, _ := re.MatchString(`Something to match`); isMatch {
    //do something
}

This feature is a work in progress and I'm open to ideas for more things to put here (maybe more relaxed character escaping rules?).

Library features that I'm still working on

  • Regex split

Potential bugs

I've run a battery of tests against regexp2 from various sources and found the debug output matches the .NET engine, but .NET and Go handle strings very differently. I've attempted to handle these differences, but most of my testing deals with basic ASCII with a little bit of multi-byte Unicode. There's a chance that there are bugs in the string handling related to character sets with supplementary Unicode chars. Right-to-Left support is coded, but not well tested either.

Find a bug?

I'm open to new issues and pull requests with tests if you find something odd!

Comments
  • Performance issue matching against beginning of very large string

    I am tokenizing some text by matching a set of regexes against the beginning of a string holding the contents of a file. I noticed that regexp2 was extremely slow for this use-case, and after running the profiler found that the time was dominated by getRunes().

    This is occurring because, before every match, regexp2 converts the entire 22kb string to a slice of runes. I've worked around the issue by pre-converting the string to a slice of runes myself, then using FindRunesMatch(), but it was quite surprising and non-obvious.

    A solution would be to convert runes on the fly (as most matches are under 10 characters, converting the whole string each time is redundant). Looking at the code, it doesn't seem like it would be super painful to achieve. The runner would need to be modified to use DecodeRuneInString to advance the index into the string, rather than a direct index into a slice of runes.
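
The on-the-fly idea described above can be illustrated with a small standard-library sketch. It is not how regexp2's runner works, and decodePrefix is a hypothetical helper; it only shows that utf8.DecodeRuneInString lets you walk the first few runes of a string without converting the rest.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodePrefix is a hypothetical helper. It decodes at most n runes from
// the start of s and leaves the remainder of the string untouched.
// It returns the decoded runes and the number of bytes consumed.
func decodePrefix(s string, n int) ([]rune, int) {
	if n <= 0 {
		return nil, 0
	}
	runes := make([]rune, 0, n)
	pos := 0
	for len(runes) < n && pos < len(s) {
		r, size := utf8.DecodeRuneInString(s[pos:])
		runes = append(runes, r)
		pos += size // advance by the encoded width of the rune
	}
	return runes, pos
}

func main() {
	input := "héllo wörld, followed by a very large file body"
	prefix, consumed := decodePrefix(input, 5)
	fmt.Println(string(prefix), consumed)
}
```

The work done is proportional to the number of runes requested, not to the length of the input, which is the property the comment is asking for.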

    opened by alecthomas 26
  • Seems to fail a positive lookahead

    Hello, I was checking it out and it seems to fail a regular expression. For a given text like this one, the expression ((Art\.\s\d+)[\S\s]*?(?=Art\.\s\d+)) fails to match every Art. block in the text. I've tested the expression on this website and there it gives me the correct count of 12 matches.

    Am I missing something? Maybe a multiline flag?

    opened by rafaelcn 8
  • Bulk replace

    Hello,

    I'd just like to ask whether you have any plans to add bulk replace functions to regexp2, like those in the Go standard regexp package: https://golang.org/pkg/regexp/#Regexp.ReplaceAll

    • func (re *Regexp) ReplaceAll(src, repl []byte) []byte
      
    • func (re *Regexp) ReplaceAllFunc(src []byte, repl func([]byte) []byte) []byte
      
    • func (re *Regexp) ReplaceAllLiteral(src, repl []byte) []byte
      
    • func (re *Regexp) ReplaceAllLiteralString(src, repl string) string
      
    • func (re *Regexp) ReplaceAllString(src, repl string) string
      
    • func (re *Regexp) ReplaceAllStringFunc(src string, repl func(string) string) string
      

    Thank you,

    opened by hachi8833 6
  • Regex Multiline

    I have the regex ^(ac|bb)$\n and I am not using the Multiline option. I expected MustCompile to fail on it, but it does not, and the regex matches the string "ac\n". How can I make it throw an error?

    opened by SmallSmartMouse 4
  • Improve ECMAScript compatibility.

    Hi,

    This PR includes a couple of fixes to improve ECMAScript compatibility. The added test cases illustrate the issues fixed. Please consider merging.

    opened by dop251 4
  • Error while trying to match a string with a specific unicode against a RegExp that contains a space and a group

    When trying to match (phrase.MatchString(X)) messages like gg 󠀀 󠀀 (notice that these are not the regular spaces) against a phrase like regexp2.MustCompile("\\bcool (house)\\b", 0), the following error will be thrown:

    panic: runtime error: index out of range [917504] with length 128
    
    goroutine 1 [running]:
    github.com/dlclark/regexp2/syntax.(*BmPrefix).Scan(0xc000180540, {0xc000b70948, 0x6, 0x0?}, 0x0?, 0x0, 0x6)
            C:/Users/X/go/pkg/mod/github.com/dlclark/[email protected]/syntax/prefix.go:716 +0x3bb
    github.com/dlclark/regexp2.(*runner).findFirstChar(0xc000623a00)
            C:/Users/X/go/pkg/mod/github.com/dlclark/[email protected]/runner.go:1305 +0x366
    github.com/dlclark/regexp2.(*runner).scan(0xc000623a00, {0xc000b70948?, 0x6, 0xc000b70948?}, 0x6?, 0x1, 0xc00008f8e8?)
            C:/Users/X/go/pkg/mod/github.com/dlclark/[email protected]/runner.go:130 +0x1e5
    github.com/dlclark/regexp2.(*Regexp).run(0xc0000f6200, 0xf4?, 0xffffffffffffffff, {0xc000b70948, 0x6, 0x6})
            C:/Users/X/go/pkg/mod/github.com/dlclark/[email protected]/runner.go:91 +0xfa
    github.com/dlclark/regexp2.(*Regexp).MatchString(0x10f9c40?, {0x108f0f4?, 0xc00008fb48?})
            C:/Users/X/go/pkg/mod/github.com/dlclark/[email protected]/regexp.go:213 +0x45
    main.main()
            C:/Users/X/Desktop/GoRegExTests/test.go:127 +0xbdc
    

    The error is only thrown when:

    a. The message contains those unicode characters
    b. The RegExp contains a space and a group like (house)

    The RegExp above is just a very basic example to demonstrate this problem.

    opened by beatzzz 3
  • Improved the handling of named group references in ECMAScript mode.

    I have made a few changes to support named group references according to the modern ECMAScript specification. The changes only affect ECMAScript mode except one: the invalid references now cause errors whereas previously they were ignored. I've checked and the new behavior seems to match perl and .NET online regex tester (http://regexstorm.net/tester).

    Please consider merging.

    opened by dop251 3
  • Licensing and specific ATTRIB details

    As part of an effort that includes packaging your library for Debian, I'm wondering if it would be possible to have more details or information about which particular files are covered by each original license?

    In particular, could you provide some more details regarding these comments on ATTRIB:

    Some of this code is ported from dotnet/corefx, which was released under this license: ...

    Small pieces of code are copied from the Go framework under this license: ...

    I am aware it might be a bit difficult to retrieve that history, but any insight would be much appreciated in the hopes of making sure licenses and copyright are attributed as faithfully as possible. Thanks in advance!

    opened by diego-plan9 3
  • Problems with Negative Lookahead

    re := regexp2.MustCompile(`(?m)^.*(?!/bin/bash)$`,0)
    match,_ := re.FindStringMatch(string(passwd))
    

    I'm trying to match all the lines except the ones containing /bin/bash, but the result is just the first line of /etc/passwd that contains /bin/bash.

    opened by Nhoya 3
  • Continuous 4byte emoji would crash when ReplaceFunc()

    Hello, it's been a long time.

    Today I found an issue regarding some special "4byte" emojis on ReplaceFunc().

    • sample 4byte emojis: 📍😏️📣🍣🍺
    • sample 3byte emoji: ✔️⚾️

    You can inspect the above with http://r12a.github.io/apps/conversion/.

    Sample1: causes panic

    Please take a look at the following: You can reproduce the issue by uncommenting the str assignment lines one by one.

    As far as I checked, ReplaceFunc() panics under the following conditions:

    • target contains some continuous 4byte emojis, and
    • regex contains 3bytes UTF-8 characters and contains NO 4byte emojis

    package main
    
    import (
    	"github.com/dlclark/regexp2"
    	"github.com/k0kubun/pp"
    )
    
    func main() {
    	str := "高" // panic: Japanese Kanji
    	// str := "は" // panic: Japanese Hiragana
    	// str := "パ" // panic: Japanese Katakana
    	// str := "[a-zA-Z0-9]{,2}" // works fine: ASCII character class
    	// str := "峰起|烽起" // works fine: longer Japanese Kanji (I wonder why)
    	// str := "フトレス" // panic: longer Japanese Katakana
    	// str := "ALLWAYS|Allways|allways|AllWays" // works fine: Alphabet
    	// str := "📍" // works fine: 4byte emoji
    	// str := "📍📍" // works fine: continuous 4byte emoji
    	// str := "✔️" // panic: 3byte emoji
    	// str := "✔️✔️" // panic: continuous 3byte emoji
    	// str := "📍️✔️" // works fine: 4 and 3byte emoji
    	// str := "️✔📍️" // works fine: 3 and 4byte emoji
    	// str := "📍️は️" // works fine: 4byte emoji and Hiragana
    	// str := "️は📍️" // works fine: Hiragana and 4byte emoji
    
    	re := regexp2.MustCompile(str, 0)
    	result, _ := re.ReplaceFunc("📍✔️😏⚾️📣🍣🍺🍺 <- continuous 4byte emoji 寿司ビール文字あり", func(m regexp2.Match) string {
    		return "࿗" + "࿘" + string(m.Capture.Runes()) + "࿌"
    	}, -1, -1)
    
    	pp.Println(result)
    }
    

    Sample2: all works fine

    The following is a kind of control group that works fine. The key is that the target contains no "continuous 4byte emojis".

    package main
    
    import (
    	"github.com/dlclark/regexp2"
    	"github.com/k0kubun/pp"
    )
    
    func main() {
    	// All of the following patterns work fine, perhaps because "✔✔⚾⚾️ <- 3byte emoji 寿司ビール文字なし" contains no continuous 4byte emojis. You can check them by uncommenting them one by one.
    	str := "高"
    	// str := "は"
    	// str := "パ"
    	// str := "[a-zA-Z0-9]{,2}"
    	// str := "峰起|烽起"
    	// str := "フトレス"
    	// str := "ALLWAYS|Allways|allways|AllWays"
    	// str := "📍" 
    	// str := "📍📍" 
    	// str := "✔️" 
    	// str := "✔️✔️" 
    	// str := "📍️✔️" 
    	// str := "️✔📍️" 
    	// str := "📍️は️" 
    	// str := "️は📍️" 
    
    	re := regexp2.MustCompile(str, 0)
           // The following target works fine: there's no continuous 4byte emojis
    	result, _ := re.ReplaceFunc("✔✔⚾⚾️ <- 3byte emoji 寿司ビール文字なし", func(m regexp2.Match) string {
    		return "࿗" + "࿘" + string(m.Capture.Runes()) + "࿌"
    	}, -1, -1)
    
    	pp.Println(result)
    }
    

    FYI

    The issue looks a little bit similar to "sushi-beer" issue: https://gist.github.com/kamipo/37576ce436c564d8cc28

    I hope you'd check and fix it.

    Best regards, 🙇

    opened by hachi8833 3
  • bugs in scenarios of Chinese characters or incorrect using of match.Index

    The following code fails:

    package main
    
    import (
    	"fmt"
    	"github.com/dlclark/regexp2"
    )
    
    func main()  {
    	regex := regexp2.MustCompile("<style", regexp2.IgnoreCase|regexp2.Singleline)
    	match, err := regex.FindStringMatch(sample)
    	if err != nil {
    		panic(err)
    	}
    	if match != nil {
    		t, err := regex.Replace(sample, "xxx", match.Index, -1)
    		if err != nil {
    			panic(err)
    		}
    		fmt.Printf("%s", t)
    	}
    }
    
    var sample = "<title>错<style"
    

    If I search for some words/regex successfully and then replace starting from match.Index instead of -1, the code fails.

    However, if I remove the Chinese character 错, the code succeeds.

    So, in this scenario, what should the beginning index be if I want to replace all matches but don't want to start from -1 (the beginning)?

    opened by nako-ruru 2
  • error parsing regexp: unrecognized grouping construct: (?-1

    package parse
    
    import (
    	"fmt"
    	"github.com/dlclark/regexp2"
    	"testing"
    )
    
    func TestJsonRe2(t *testing.T) {
    	text := `{
      "code" : "0",
      "message" : "success",
      "responseTime" : 2,
      "traceId" : "a469b12c7d7aaca5",
      "returnCode" : null,
      "result" : {
        "total" : 0,
        "list" : [ ]
    }
    }`
    	reg := `/(\{(?:(?>[^{}"'\/]+)|(?>"(?:(?>[^\\"]+)|\\.)*")|(?>'(?:(?>[^\\']+)|\\.)*')|(?>\/\/.*\n)|(?>\/\*.*?\*\/)|(?-1))*\})/`
    	r, err := regexp2.Compile(reg, regexp2.RE2|regexp2.Multiline|regexp2.ECMAScript)
    	if err != nil {
    		fmt.Println(err)
    		return
    	}
    
    	matchedStrings, err := r.FindStringMatch(text)
    	if err != nil {
    		fmt.Println(err)
    		return
    	}
    	fmt.Println(matchedStrings)
    }
    
    

    output:

    error parsing regexp: unrecognized grouping construct: (?-1 in `/(\{(?:(?>[^{}"'\/]+)|(?>"(?:(?>[^\\"]+)|\\.)*")|(?>'(?:(?>[^\\']+)|\\.)*')|(?>\/\/.*\n)|(?>\/\*.*?\*\/)|(?-1))*\})/`
    

    But on https://regex101.com/ it works.

    opened by shadow1ng 0
  • fix: ecma ranges with set terminator

    Fix ECMAScript un-escaped literal '-' when followed by predefined character sets.

    Also:

    • Fixed missing error check on parseProperty() call.
    • Use addChar(ch) helper instead of addRange(ch, ch).

    Fixes #54

    opened by stevenh 4
  • ecmascript: cannot include class \s in character range

    When compiling using regexp2.ECMAScript the regexp [a-\s] fails with the following but it should pass:

    error parsing regexp: cannot include class \115 in character range in `[a-\s]`
    

    regex101 shows how it should be interpreted.

    opened by stevenh 0
  • Is it possible to get the name of the currently matched group?

    Say I have a regex to tokenize some language..

    # in python.
    regex = re.compile(
        "(?P<comment>#.*?$)|"
        "(?P<newline>\n)|"     # has to go ahead of the whitespace
        "(?P<comma>,)|"       
        "(?P<double_quote_string>\".*?\")|" 
        "(?P<single_quote_string>'.*?')|"   
        "(?P<whitespace>[ \t\r\f\v]+)|"    ... etc
    
    

    Here you expect to get multiple matches for each group name when tokenizing a file and you want to keep the ordering of the tokens.

    If I use the same approach using regexp2 can I go from match to group name? E.g. how do I get the last matched group name for a match? Is that possible?
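
For comparison, here is how the same tokenizing idea looks with the standard library regexp package, which is a swapped-in technique and not regexp2. It is a minimal sketch with a shortened, hypothetical token grammar; whether regexp2 offers an equivalent lookup is not shown here.

```go
package main

import (
	"fmt"
	"regexp"
)

// token pairs the name of the capture group that matched with its text.
type token struct {
	Kind string
	Text string
}

// tokenRe is a shortened, hypothetical grammar: one named group per token kind.
var tokenRe = regexp.MustCompile(
	`(?P<comment>#[^\n]*)|(?P<newline>\n)|(?P<comma>,)|(?P<whitespace>[ \t]+)|(?P<word>\w+)`)

// tokenize returns tokens in input order, labelling each one with the name
// of the first named group that participated in the match.
func tokenize(s string) []token {
	names := tokenRe.SubexpNames()
	var out []token
	for _, loc := range tokenRe.FindAllStringSubmatchIndex(s, -1) {
		for i := 1; i < len(names); i++ {
			// loc[2*i] is -1 when group i did not participate.
			if start := loc[2*i]; start >= 0 && names[i] != "" {
				out = append(out, token{names[i], s[start:loc[2*i+1]]})
				break
			}
		}
	}
	return out
}

func main() {
	for _, t := range tokenize("a, b # note\n") {
		fmt.Printf("%-10s %q\n", t.Kind, t.Text)
	}
}
```

Because each alternative is its own named group, exactly one group participates in every match, so the first participating group identifies the token kind.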

    opened by dementedhedgehog 0
  • Add support for Perl(PCRE) named and unnamed group capturing order

    In other words, maintain the order of capture groups when the MaintainCaptureOrder regexp option is set.

    I also added the inline option o. It's useful if you only have access to the pattern, but not the regex. It is only useful at the start of the pattern, though, and I couldn't find a way to prevent it from being used elsewhere.

    I've never liked, nor been good with, bitwise operations, so I don't know if I should've picked another number.

    opened by CIAvash 3
  • Can regexp2 provide the same APIs adapt to std.regexp?

    I wonder if regexp2 could provide the same API as the standard library regexp package, so that I can switch between regexp2 and the standard regexp easily by changing only the expression text.

    opened by vipally 5
Owner
Doug Clark