Go implementation to calculate Levenshtein Distance.

Overview

levenshtein Build Status Go Report Card PkgGoDev

Go package to calculate the Levenshtein Distance

The library is fully capable of working with non-ascii strings. But the strings are not normalized. That is left as a user-dependant use case. Please normalize the strings before passing it to the library if you have such a requirement.

Limitation

As a performance optimization, the library can handle strings only up to 65536 characters (runes). If you need to handle strings larger than that, please pin to version 1.0.3.

Install

go get github.com/agnivade/levenshtein

Example

package main

import (
	"fmt"
	"github.com/agnivade/levenshtein"
)

func main() {
	s1 := "kitten"
	s2 := "sitting"
	distance := levenshtein.ComputeDistance(s1, s2)
	fmt.Printf("The distance between %s and %s is %d.\n", s1, s2, distance)
	// Output:
	// The distance between kitten and sitting is 3.
}

Benchmarks

name              time/op
Simple/ASCII-4     330ns ± 2%
Simple/French-4    617ns ± 2%
Simple/Nordic-4   1.16µs ± 4%
Simple/Tibetan-4  1.05µs ± 1%

name              alloc/op
Simple/ASCII-4     96.0B ± 0%
Simple/French-4     128B ± 0%
Simple/Nordic-4     192B ± 0%
Simple/Tibetan-4    144B ± 0%

name              allocs/op
Simple/ASCII-4      1.00 ± 0%
Simple/French-4     1.00 ± 0%
Simple/Nordic-4     1.00 ± 0%
Simple/Tibetan-4    1.00 ± 0%

Comparisons with other libraries

name                     time/op
Leven/ASCII/agniva-4      353ns ± 1%
Leven/ASCII/arbovm-4      485ns ± 1%
Leven/ASCII/dgryski-4     395ns ± 0%
Leven/French/agniva-4     648ns ± 1%
Leven/French/arbovm-4     791ns ± 0%
Leven/French/dgryski-4    682ns ± 0%
Leven/Nordic/agniva-4    1.28µs ± 1%
Leven/Nordic/arbovm-4    1.52µs ± 1%
Leven/Nordic/dgryski-4   1.32µs ± 1%
Leven/Tibetan/agniva-4   1.12µs ± 1%
Leven/Tibetan/arbovm-4   1.31µs ± 0%
Leven/Tibetan/dgryski-4  1.16µs ± 0%
Comments
  • Optimization

    Optimization

    Hello,

    Mini optimization. Replace this:

    // swap to save some memory O(min(a,b)) instead of O(a)
    if len(s1) > len(s2) {
    	s1, s2 = s2, s1
    }
    lenS1 := len(s1)
    lenS2 := len(s2)
    

    To:

    lenS1 := len(s1)
    lenS2 := len(s2)
    if lenS1 > lenS2 {
    	s1, s2 = s2, s1
    	lenS1 = len(s1)
    	lenS2 = len(s2)
    }
    
    opened by nicola-spb 4
  • Optimising the minimum function for speed : revert to min(a,b)

    Optimising the minimum function for speed : revert to min(a,b)

    name old time/op new time/op delta Simple/ASCII-4 471ns ± 1% 393ns ± 2% -16.44% (p=0.000 n=25+28) Simple/French-4 842ns ± 2% 719ns ± 1% -14.66% (p=0.000 n=28+30) Simple/Nordic-4 1.67µs ± 1% 1.43µs ± 0% -14.59% (p=0.000 n=28+28) Simple/Tibetan-4 1.31µs ± 1% 1.15µs ± 1% -12.15% (p=0.000 n=27+29) [Geo mean] 966ns 826ns -14.48%

    name old alloc/op new alloc/op delta Simple/ASCII-4 96.0B ± 0% 96.0B ± 0% ~ (all equal) Simple/French-4 128B ± 0% 128B ± 0% ~ (all equal) Simple/Nordic-4 192B ± 0% 192B ± 0% ~ (all equal) Simple/Tibetan-4 144B ± 0% 144B ± 0% ~ (all equal) [Geo mean] 136B 136B +0.00%

    name old allocs/op new allocs/op delta Simple/ASCII-4 1.00 ± 0% 1.00 ± 0% ~ (all equal) Simple/French-4 1.00 ± 0% 1.00 ± 0% ~ (all equal) Simple/Nordic-4 1.00 ± 0% 1.00 ± 0% ~ (all equal) Simple/Tibetan-4 1.00 ± 0% 1.00 ± 0% ~ (all equal) [Geo mean] 1.00 1.00 +0.00%

    opened by psadac 4
  • Optimize: reduce allocs

    Optimize: reduce allocs

    Use fixed size slice to reduce allocs.

    benchmark                     old ns/op     new ns/op     delta
    BenchmarkSimple/ASCII-8       324           294           -9.09%
    BenchmarkSimple/French-8      620           584           -5.71%
    BenchmarkSimple/Nordic-8      1187          1188          +0.08%
    BenchmarkSimple/Tibetan-8     1044          1016          -2.68%
    
    benchmark                     old allocs     new allocs     delta
    BenchmarkSimple/ASCII-8       1              0              -100.00%
    BenchmarkSimple/French-8      1              0              -100.00%
    BenchmarkSimple/Nordic-8      1              0              -100.00%
    BenchmarkSimple/Tibetan-8     1              0              -100.00%
    
    benchmark                     old bytes     new bytes     delta
    BenchmarkSimple/ASCII-8       24            0             -100.00%
    BenchmarkSimple/French-8      32            0             -100.00%
    BenchmarkSimple/Nordic-8      48            0             -100.00%
    BenchmarkSimple/Tibetan-8     48            0             -100.00%
    
    
    opened by zhyon404 3
  • go version 1.13 causing build errors

    go version 1.13 causing build errors

    Getting a build error in one of my project which has an indirect dependency on this repo, because the go version in your go.mod is set to go version 1.13, but this version doesn't seem to exist yet, so I can't upgrade to match? Any advice appreciated!

    opened by EwanValentine 3
  • Suspicious score

    Suspicious score

    Hi,

    I was playing around with it, and I received these results:

    agw.be =? azuregateway-1faf587hg-daee-4d4f-8fb3-6c2c7458e30b-33abaad5af1b.vpn.chinacloudapi.cn (levenshtein score is80)
    

    Idea why this has such a high score?

    opened by hazcod 1
  • Use uint16 for storing count

    Use uint16 for storing count

    A slice of uint16s fit more snugly in L1 cache and results in fewer misses.

    Unfortunately, this limits the length of strings that can be used to 65536.

    Benchmark:

    name              old time/op  new time/op  delta
    Simple/ASCII-4     379ns ± 2%   330ns ± 2%  -13.03%  (p=0.000 n=10+10)
    Simple/French-4    692ns ± 2%   617ns ± 2%  -10.75%  (p=0.000 n=10+10)
    Simple/Nordic-4   1.28µs ± 1%  1.16µs ± 4%   -9.64%  (p=0.000 n=8+10)
    Simple/Tibetan-4  1.16µs ± 3%  1.05µs ± 1%   -9.25%  (p=0.000 n=10+10)
    
    opened by agnivade 0
  • Optimize: skip setting zero during slice init

    Optimize: skip setting zero during slice init

    Default value is already 0

    Good enough improvement for a one character change.

    name              old time/op    new time/op    delta
    Simple/ASCII-4       380ns ± 3%     363ns ± 1%  -4.63%  (p=0.000 n=10+9)
    Simple/French-4      694ns ± 2%     647ns ± 2%  -6.76%  (p=0.000 n=10+10)
    Simple/Nordic-4     1.30µs ± 5%    1.24µs ± 6%  -4.34%  (p=0.027 n=10+10)
    Simple/Tibetan-4    1.14µs ± 6%    1.09µs ± 1%  -4.40%  (p=0.000 n=10+9)
    
    name              old alloc/op   new alloc/op   delta
    Simple/ASCII-4       96.0B ± 0%     96.0B ± 0%    ~     (all equal)
    Simple/French-4       128B ± 0%      128B ± 0%    ~     (all equal)
    Simple/Nordic-4       192B ± 0%      192B ± 0%    ~     (all equal)
    Simple/Tibetan-4      144B ± 0%      144B ± 0%    ~     (all equal)
    
    name              old allocs/op  new allocs/op  delta
    Simple/ASCII-4        1.00 ± 0%      1.00 ± 0%    ~     (all equal)
    Simple/French-4       1.00 ± 0%      1.00 ± 0%    ~     (all equal)
    Simple/Nordic-4       1.00 ± 0%      1.00 ± 0%    ~     (all equal)
    Simple/Tibetan-4      1.00 ± 0%      1.00 ± 0%    ~     (all equal)
    
    opened by agnivade 0
  • [Optimized]: Removed several bounds checks

    [Optimized]: Removed several bounds checks

    • For the init loop, The compiler is unable to prove that "<= lenS1" is essentially same as "< len(x)". So we mention it explicitly.

    • Made a dummy bounds check to prevent further checks for x[lenS1] down below. And this also takes care of x[j-1] since j is <= lenS1.

    This effectively reduces the number of bounds checks from 7 to just 1, which is the dummy call to x[lenS1]. The compiler, as of 1.12, is unable to determine that lenS1 is indeed a positive number. So lenS1+1 will be positive, and hence x[lenS1] should be within bounds of x. As a result, this is unavoidable.

    name              old time/op  new time/op  delta
    Simple/ASCII-4     473ns ± 2%   365ns ± 1%  -22.85%  (p=0.000 n=10+10)
    Simple/French-4    897ns ± 2%   680ns ± 2%  -24.23%  (p=0.000 n=10+10)
    Simple/Nordic-4   1.83µs ± 1%  1.33µs ± 2%  -27.23%  (p=0.000 n=8+9)
    Simple/Tibetan-4  1.40µs ± 2%  1.15µs ± 2%  -17.56%  (p=0.000 n=10+10)
    
    opened by agnivade 0
  • Support Levenshtein substring matching

    Support Levenshtein substring matching

    It may become very useful to some to provide approximate substring matching. This reports the smallest edit distance of the needle in all possible substrings of haystack. Here are some examples:

    tests := []struct {
    		needle, haystack string
    		want int
    	}{
    		{"Machu Picchu", "Machu Picchu", 0},
    		{"Machu Picchu", "Where is Machu Picchu?", 0},
    		{"Machu Picchu", "Where is Machu Pichu?", 1},
    		{"Machu Picchu", "Where is Macchu Picchu?", 1},
    		{"Stonehenge", "How do I get to Stone Henge?", 2},
    		{"", "", 0},
    		{"", "Machu Picchu", 0},
    		{"u", "Machu Picchu", 0},
    		{"z", "Machu Picchu", 1},
    		{"Uluru", "", 5},
    	}
    

    All that is necessary is to initialize the row with all zeroes, run the main algorithm, and then return the smallest value in the row. I have developed some code and was wondering about your thoughts about including it (or some more optimized variant) in this package:

    func ComputeSubstringDistance(needle, haystack string) int {
    	if len(needle) == 0 {
    		return 0
    	}
    
    	if len(haystack) == 0 {
    		return utf8.RuneCountInString(needle)
    	}
    
    	if needle == haystack {
    		return 0
    	}
    
    	// We need to convert to []rune if the strings are non-ascii.
    	// This could be avoided by using utf8.RuneCountInString
    	// and then doing some juggling with rune indices.
    	// The primary challenge is keeping track of the previous rune.
    	// With needle range loop, its not that easy. And with needle for-loop
    	// we need to keep track of the inter-rune width using utf8.DecodeRuneInString
    	s1 := []rune(haystack)
    	s2 := []rune(needle)
    
    	lenS1 := len(s1)
    	lenS2 := len(s2)
    
    	// init the row
    	x := make([]int, lenS1+1)
    
    	// make needle dummy bounds check to prevent the 2 bounds check down below.
    	// The one inside the loop is particularly costly.
    	_ = x[lenS1]
    	// fill in the rest
    	for i := 1; i <= lenS2; i++ {
    		prev := i
    		var current int
    		for j := 1; j <= lenS1; j++ {
    			if s2[i-1] == s1[j-1] {
    				current = x[j-1] // match
    			} else {
    				current = min(min(x[j-1]+1, prev+1), x[j]+1)
    			}
    			x[j-1] = prev
    			prev = current
    		}
    		x[lenS1] = prev
    	}
    
    	min := x[0]
    	for i := 1; i <= lenS1; i++ {
    		if x[i] < min {
    			min = x[i]
    		}
    	}
    	return min
    }
    
    opened by dsamarin 3
Owner
Agniva De Sarker
Agniva De Sarker
Levenshtein distance and similarity metrics with customizable edit costs and Winkler-like bonus for common prefix.

A Go package for calculating the Levenshtein distance between two strings This package implements distance and similarity metrics for strings, based o

AGExt 73 Nov 20, 2022
Some algorithms in go: maxflow(min-cuts or graph-cuts), edit-distance.

Algorithms In this repository, some algorithms are implemented in go language. GoDoc link: ed maxflow About Max-flow problem: A flow network is repres

Yi Deng 15 Sep 8, 2022
Go implementation of Count-Min-Log

Count-Min-Log Count-Min-Log sketch: Approximately counting with approximate counters - Guillaume Pitel & Geoffroy Fouquier TL;DR: Count-Min-Log Sketch

Seif Lotfy 58 Sep 26, 2022
A Go implementation of the Elias-Fano encoding

go-ef A Go implementation of the Elias-Fano encoding Example package main import ( "fmt" "github.com/amallia/go-ef" "os" ) func main() {

Antonio Mallia 27 Nov 23, 2022
Set is a useful collection but there is no built-in implementation in Go lang.

goset Set is a useful collection but there is no built-in implementation in Go lang. Why? The only one pkg which provides set operations now is golang

zoumo 50 Sep 26, 2022
A skip list implementation in Go

About This is a library implementing skip lists for the Go programming language (http://golang.org/). Skip lists are a data structure that can be used

Ric (Ryszard) Szopa 236 Sep 21, 2022
Go implementation of C++ STL iterators and algorithms.

iter Go implementation of C++ STL iterators and algorithms. Less hand-written loops, more expressive code. README translations: 简体中文 Motivation Althou

disksing 172 Nov 16, 2022
A Merkle Tree implementation written in Go.

Merkle Tree in Golang An implementation of a Merkle Tree written in Go. A Merkle Tree is a hash tree that provides an efficient way to verify the cont

Cameron Bergoon 399 Nov 26, 2022
A prefix tree implementation in go

Trie (Prefix tree) This library is compatible with Go 1.11+ Please refer to CHANGELOG.md if you encounter breaking changes. Motivation Introduction Us

Viant, Inc 31 Nov 3, 2022
Package ring provides a high performance and thread safe Go implementation of a bloom filter.

ring - high performance bloom filter Package ring provides a high performance and thread safe Go implementation of a bloom filter. Usage Please see th

Tanner Ryan 130 Nov 20, 2022
A simple Set data structure implementation in Go (Golang) using LinkedHashMap.

Set Set is a simple Set data structure implementation in Go (Golang) using LinkedHashMap. This library allow you to get a set of int64 or string witho

Studio Sol Comunicação Digital Ltda 22 Sep 26, 2022
A Go library for an efficient implementation of a skip list: https://godoc.org/github.com/MauriceGit/skiplist

Fast Skiplist Implementation This Go-library implements a very fast and efficient Skiplist that can be used as direct substitute for a balanced tree o

Maurice Tollmien 239 Nov 25, 2022
A slice-based implementation of a stack. In Go!

Stackgo Stackgo is a slice-based implementation of a simple stack in Go. It uses a pre-alloc pagination strategy which adds little memory overhead to

Alessandro Diaferia 16 Nov 3, 2022
A Left-Leaning Red-Black (LLRB) implementation of balanced binary search trees for Google Go

GoLLRB GoLLRB is a Left-Leaning Red-Black (LLRB) implementation of 2-3 balanced binary search trees in Go Language. Overview As of this writing and to

Petar Maymounkov 742 Nov 21, 2022
Golang implementation of Radix trees

go-radix Provides the radix package that implements a radix tree. The package only provides a single Tree implementation, optimized for sparse nodes.

Armon Dadgar 759 Nov 18, 2022
gtreap is an immutable treap implementation in the Go Language

gtreap gtreap is an immutable treap implementation in the Go Language Overview gtreap implements an immutable treap data structure in golang. By treap

Steve Yen 84 May 17, 2022
An yet-another red-black tree implementation, with a C++ STL-like API.

A red-black tree with an API similar to C++ STL's. INSTALLATION go get github.com/yasushi-saito/rbtree EXAMPLE More examples can be fou

Yasushi Saito 18 Apr 25, 2022
A simple Bloom Filter implementation in Go

This is a simple Bloom filter implementation written in Go. For the theory behind Bloom filters, read http://en.wikipedia.org/wiki/Bloom_filter This

Damian Gryski 16 Apr 26, 2018
HyperLogLog and HyperLogLog++ implementation in Go/Golang.

HyperLogLog and HyperLogLog++ Implements the HyperLogLog and HyperLogLog++ algorithms. HyperLogLog paper: http://algo.inria.fr/flajolet/Publications/F

Clark DuVall 424 Nov 24, 2022