Golang string comparison and edit distance algorithms library, featuring : Levenshtein, LCS, Hamming, Damerau levenshtein (OSA and Adjacent transpositions algorithms), Jaro-Winkler, Cosine, etc...

Overview

Go-edlib : Edit distance and string comparison library

Travis CI Test coverage Go Report Card License: MIT Documentation link PkgGoDev

Golang string comparison and edit distance algorithms library featuring : Levenshtein, LCS, Hamming, Damerau levenshtein (OSA and Adjacent transpositions algorithms), Jaro-Winkler, Cosine, etc...


Table of Contents


Requirements

  • Go (v1.13+)

Introduction

Golang open-source library which includes most (and soon all) edit-distance and string comparision algorithms with some extra!
Designed to be fully compatible with Unicode characters!
This library is 100% test covered 😁

Features

  • Levenshtein

  • LCS (Longest common subsequence) with edit distance, backtrack and diff functions

  • Hamming

  • Damerau-Levenshtein, with following variants :

    • OSA (Optimal string alignment)
    • Adjacent transpositions
  • Jaro & Jaro-Winkler similarity algorithms

  • Cosine Similarity algorithm to compare strings

  • Computed similarity percentage functions based on all available edit distance algorithms in this lib

  • Fuzzy search functions based on edit distance with unique or multiples strings output

  • Unicode compatibility ! 🥳

  • And many more to come !

Benchmarks

You can check an interactive Google chart with few benchmark cases for all similarity algorithms in this library through StringSimilarity function here

However, if you want or need more details, you can also viewing benchmark raw output here, which also includes memory allocations and test cases output (similarity result and errors).

If you are on Linux and want to run them on your setup, you can run ./tests/benchmark.sh script.

Installation

Open bash into your project folder and run:

go get github.com/hbollon/go-edlib

And import it into your project:

import (
	"github.com/hbollon/go-edlib"
)

Run tests

If you are on Linux and want to run all unit tests just run ./tests/tests.sh script.

For Windows users you can run:

go test ./... # Add desired parameters to this command if you want

Documentation

You can find all the documentation here : Documentation

Examples

Calculate string similarity index between two string

You can use StringSimilarity(str1, str2, algorithm) function. algorithm parameter must one of the following constants:

// Algorithm identifiers
const (
	Levenshtein Algorithm = iota
	DamerauLevenshtein
	OSADamerauLevenshtein
	Lcs
	Hamming
	Jaro
	JaroWinkler
	Cosine
)

Example with levenshtein:

res, err := edlib.StringsSimilarity("string1", "string2", edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Similarity: %f", res)
}

Execute fuzzy search based on string similarity algorithm

1. Most matching unique result without threshold

You can use FuzzySearch(str, strList, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearch("testnig", strList, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result: %s", res)
}
Result: testing 

2. Most matching unique result with threshold

You can use FuzzySearchThreshold(str, strList, minSimilarity, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearchThreshold("testnig", strList, 0.7, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'testnig': %s", res)
}

res, err = edlib.FuzzySearchThreshold("hello", strList, 0.7, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'hello': %s", res)
}
Result for 'testnig': testing
Result for 'hello':

3. Most matching result set without threshold

You can use FuzzySearchSet(str, strList, resultQuantity, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearchSet("testnig", strList, 3, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Results: %s", strings.Join(res, ", "))
}
Results: testing, test, tester 

4. Most matching result set with threshold

You can use FuzzySearchSetThreshold(str, strList, resultQuantity, minSimilarity, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearchSetThreshold("testnig", strList, 3, 0.5, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'testnig' with '0.5' threshold: %s", strings.Join(res, " "))
}

res, err = edlib.FuzzySearchSetThreshold("testnig", strList, 3, 0.7, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'testnig' with '0.7' threshold: %s", strings.Join(res, " "))
}
Result for 'testnig' with '0.5' threshold: testing test tester
Result for 'testnig' with '0.7' threshold: testing

Get raw edit distance (Levenshtein, LCS, Damerau–Levenshtein, Hamming)

You can use one of the following function to get an edit distance between two strings :

Example with Levenshtein distance:

res := edlib.LevenshteinDistance("kitten", "sitting")
fmt.Printf("Result: %d", res)
Result: 3

LCS, LCS Backtrack and LCS Diff

1. Compute LCS(Longuest Common Subsequence) between two strings

You can use LCS(str1, str2) function.

lcs := edlib.LCS("ABCD", "ACBAD")
fmt.Printf("Length of their LCS: %d", lcs)
Length of their LCS: 3

2. Backtrack their LCS

You can use LCSBacktrack(str1, str2) function.

res, err := edlib.LCSBacktrack("ABCD", "ACBAD")
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("LCS: %s", res)
}
LCS: ABD

3. Backtrack all their LCS

You can use LCSBacktrackAll(str1, str2) function.

res, err := edlib.LCSBacktrackAll("ABCD", "ACBAD")
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("LCS: %s", strings.Join(res, ", "))
}
LCS: ABD, ACD

4. Get LCS Diff between two strings

You can use LCSDiff(str1, str2) function.

res, err := edlib.LCSDiff("computer", "houseboat")
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("LCS: \n%s\n%s", res[0], res[1])
}
LCS Diff: 
 h c o m p u s e b o a t e r
 + -   - -   + + + + +   - -

Author

👤 Hugo Bollon

🤝 Contributing

Contributions, issues and feature requests are welcome!
Feel free to check issues page.

Show your support

Give a ⭐️ if this project helped you!

📝 License

Copyright © 2020 Hugo Bollon.
This project is MIT License licensed.

Comments
  • Missing methods for string shingling

    Missing methods for string shingling

    Missing methods for string shingling(https://en.wikipedia.org/wiki/W-shingling) aka n-grams for strings. In general, Cosine similarity and Jaccard index, among other methods, are computed on n-grams of the string, instead of the string itself(eg: https://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/). Right now, we just split the string on spaces and compute the index based on the words in the string, which is fine but is probably not technically correct. I propose we introduce a method to shingle the given strings so that we can later use the shingled string in finding cosine similarity, etc. This has been done before here: Java: https://github.com/tdebatty/java-string-similarity Julia: https://github.com/matthieugomez/StringDistances.jl

    I think we should introduce a function that does something like:

    // Function signature
    shingle(str string, n int) map[string]int
    // Example usage:
    shingled_s1 := shingle("ABCD", 2) // returns map[AB:1 BC:1 CD:1]
    shingled_s2 := shingle("ABCE", 2) // returns map[AB:1 BC:1 CE:1]
    

    This will enable us to use these maps to later compute stuff, eg:

    res := JaccardSimilarityWithShingles(shingled_s1, shingled_s2) // should return 0.5.
    
    enhancement 
    opened by ShriprajwalK 5
  • Add Shingle and ShingleSlidingWindow

    Add Shingle and ShingleSlidingWindow

    Fixes #10 Added two methods, one that slices length n each time and the other uses sliding window. The first one was slightly faster for all values I checked, so I think we should remove the sliding window method. Sorry for the delay, I wanted to benchmark them both properly but couldn't find the time.

    opened by ShriprajwalK 4
  • Q-gram index and Sorensen-Dice coefficient

    Q-gram index and Sorensen-Dice coefficient

    Currently, we do not have methods for Q-gram and Sorensen-Dice coefficient(https://en.wikipedia.org/wiki/Sørensen–Dice_coefficient) which are fairly simple to implement since they're based on n-grams.

    opened by ShriprajwalK 3
  • LCSBacktrackAll not work

    LCSBacktrackAll not work

    My code:

    ress,err:=edlib.LCSBacktrackAll(" 是ab是cde22f123g", "222222是ab是cd123") if err!=nil{ fmt.Println(err) }else { fmt.Printf("%s", strings.Join(ress, "\r\n")) }

    Shoud be print ` 是ab是cd

    123 `

    But is print 是ab是cd123

    opened by djsousuo 3
  • Jaccard Similarity missing

    Jaccard Similarity missing

    Jaccard similarity coefficient(https://en.wikipedia.org/wiki/Jaccard_index) is a simple way to measure string similarity. I cannot find a function implementing this, is there one? If it is isn't there yet, it can be easily implemented using existing functions from cosine.go

    enhancement 
    opened by ShriprajwalK 2
  • reduce OSADamerau–Levenshtein space complexity from O(m*n) to O(3 * min(m*n))

    reduce OSADamerau–Levenshtein space complexity from O(m*n) to O(3 * min(m*n))

    Hi,this is the pr for issue. I run the benchmark on OSADamerauLevenshtein. Here is the result. | name | before | after | alloc_before | alloc_after | | :-----:| :----: | :----: | :----: | :----: | | sameLengthStringInput | 3389 ns/op |2416 ns/op |23 allocs/op |4 allocs/op | | pneumonoultramicroscopi | 15462 ns/op| 10253 ns/op |51 allocs/op| 8 allocs/op | |こにんちこにんちこに | 4027 ns/op | 3342 ns/op |18 allocs/op | 4 allocs/op| | I_love_horror_movi | 4282 ns/op | 3073 ns/op | 22 allocs/op | 4 allocs/op |

    enhancement 
    opened by tang-hi 1
  • Reduce OSADamerau–Levenshtein space complexity from O(m*n) to O(3 * min(m,n))

    Reduce OSADamerau–Levenshtein space complexity from O(m*n) to O(3 * min(m,n))

    Discussed in https://github.com/hbollon/go-edlib/discussions/15

    Originally posted by tang-hi July 2, 2022 When we calculate the distance of OSADamerau–Levenshtein, we actually calculate the matrix with dimension(m*n) Suppose we have a matrix looks like

    1656691342850 the value in (3,0) is always 3 the value in (3,1) depends on (3,0) (2,0) (2,1) the value in (3,2) depends on (3,1) (2,1) (2,2) and (1,1) we can found that the value in (row : i) only depends values in (row:i-1,i-2) eg. the value in (row : 3) only depends values in (row:2,1)

    So any row number >= 3, we could write it in (row %3, j), eg. write the value of (3,0) in (0,0) .... (3,j) in (0,j) write the value of (4,0) in (1,0) .... (4,j) in (1,j) write the value of (i,0) in (i%3,0) .... (i,j) in (i%3,j)

    In this way, we only need the matrix with dimension(3, min(m,n)) to calculate OSADamerau–Levenshtein distance Here is the code to illustrate my idea

    enhancement 
    opened by tang-hi 1
  • chore: release 1.5.0

    chore: release 1.5.0

    :robot: I have created a release *beep* *boop*

    1.5.0 (2021-11-21)

    Features

    • add k-gram shingle to Jaccard/Cosine sim (#11) (24d61a6)

    This PR was generated with Release Please. See documentation.

    autorelease: tagged 
    opened by github-actions[bot] 1
  • StringsSimilarity is broken for unicode

    StringsSimilarity is broken for unicode

    The test is failing

    func TestStringsSimilarity(t *testing.T) {
    	str1 := "abcde"
    	str2 := "бвгдж"
    	str3 := "fghjk"
    	sim12, _ := edlib.StringsSimilarity(str1, str2, edlib.Levenshtein)
    	sim13, _ := edlib.StringsSimilarity(str1, str3, edlib.Levenshtein)
    	if sim12 != sim13 {
    		t.Errorf("different similarities, %v != %v", sim12, sim13)
    	}
    }
    

    output: different similarities, 0.5 != 0

    Similarity between "abcde" and "бвгдж" should be the same as between "abcde" and "fghjk". Looks like bug in the matchingIndex() because it compares length of strings but ...Distance() functions use runes.

    bug 
    opened by burkostya 1
  • chore: release 1.6.0

    chore: release 1.6.0

    :robot: I have created a release *beep* *boop*

    1.6.0 (2022-07-03)

    Features

    • reduce OSADamerau–Levenshtein space complexity from O(m*n) to O(3 * min(m,n)) (#17) (37da2e1)

    This PR was generated with Release Please. See documentation.

    autorelease: pending 
    opened by github-actions[bot] 0
Releases(v1.6.0)
  • v1.6.0(Jan 31, 2022)

  • v1.5.0(Nov 21, 2021)

  • v1.4.0(Sep 6, 2021)

    Features

    • Add Jaccard Index algorithm (#9 )

    Miscellaneous Chores

    • Remove TravisCI
    • Add test & golangci workflows with Github Actions
    • Add golangci config file
    • Fix linters issues
    Source code(tar.gz)
    Source code(zip)
  • v1.3.4(Aug 12, 2021)

  • v1.3.0(Oct 10, 2020)

    Changelog :

    Added :

    • Cosine similarity algorithm
    • Benchmarks package used to benchmark all similarity algorithms
    • Benchmark bash script
    • Add output files to ./tests/outputs/ sub-directory for benchmarks and tests results

    Misc :

    • Unit tests for all new functions + new test cases for existing ones (still 100% covered)

    Changed :

    • Refactored AlgorithMethod type to Algorithm
    • Improved tests.sh script

    Fixed :

    • LCS edit distance issue with Unicode inputs
    Source code(tar.gz)
    Source code(zip)
  • v1.1.0(Sep 7, 2020)

    Changelog :

    Added :

    • Jaro & Jaro-Winkler similarity algorithms
    • Similarity percentage calculation for all algorithms
    • LCS Backtrack/BacktrackAll functions
    • LCS Diff function

    Misc :

    • Created Go module + TravisCI config
    • Unit tests for all new functions + new test cases for existing ones (still 100% covered)

    Changed :

    • Refactor few functions
    • Deleted useless matrix instanciations
    • Merged test files to separate folder
    Source code(tar.gz)
    Source code(zip)
Owner
Hugo Bollon
Student at University Savoie Mont-Blanc in CS Master. Plan to become DevOps Engineer. I don't know everything but I can learn anything
Hugo Bollon
Some algorithms in go: maxflow(min-cuts or graph-cuts), edit-distance.

Algorithms In this repository, some algorithms are implemented in go language. GoDoc link: ed maxflow About Max-flow problem: A flow network is repres

Yi Deng 15 Sep 8, 2022
Go implementation to calculate Levenshtein Distance.

levenshtein Go package to calculate the Levenshtein Distance The library is fully capable of working with non-ascii strings. But the strings are not n

Agniva De Sarker 236 Sep 30, 2022
Recursively searches a map[string]interface{} structure for another map[string]interface{} structure

msirecurse Recursively searches a map[string]interface{} structure for existence of a map[string]interface{} structure Motivation I wrote this package

Fred Moyer 1 Mar 3, 2022
Data structure and relevant algorithms for extremely fast prefix/fuzzy string searching.

Trie Data structure and relevant algorithms for extremely fast prefix/fuzzy string searching. Usage Create a Trie with: t := trie.New() Add Keys with:

Derek Parker 608 Sep 27, 2022
Go translations of the algorithms and clients in the textbook Algorithms, 4th Edition by Robert Sedgewick and Kevin Wayne.

Overview Go translations of the Java source code for the algorithms and clients in the textbook Algorithms, 4th Edition by Robert Sedgewick and Kevin

youngzy 170 Oct 1, 2022
An in-memory string-interface{} map with various expiration options for golang

TTLCache - an in-memory cache with expiration TTLCache is a simple key/value cache in golang with the following functions: Expiration of items based o

Rene Kroon 492 Sep 28, 2022
An in-memory string-interface{} map with various expiration options for golang

TTLCache - an in-memory cache with expiration TTLCache is a simple key/value cache in golang with the following functions: Expiration of items based o

Rene Kroon 490 Sep 24, 2022
Convert json string to Golang struct

json-to-go-cli Convert json string to Golang struct How to install git clone https://github.com/tiancheng92/json-to-go-cli.git cd json-to-go-cli go bu

TianCheng 7 May 10, 2022
Decode / encode XML to/from map[string]interface{} (or JSON); extract values with dot-notation paths and wildcards. Replaces x2j and j2x packages.

mxj - to/from maps, XML and JSON Decode/encode XML to/from map[string]interface{} (or JSON) values, and extract/modify values from maps by key or key-

Charles Banning 526 Sep 27, 2022
Multi-String Pattern Matching Algorithm Using TrieHashNode

Multi-String Pattern Matching algorithm. This implementation is inspired from Aho-Corasick algorithm Getting Started modelA = mspm.NewModel("mspm_mode

Sujit Shakya 18 Sep 26, 2022
Implementation of Boyer-Moore fast string search algorithm in Go

boyermoore Implementation of Boyer-Moore fast string search algorithm in Go

sarp dağ demirel 51 Aug 8, 2022
A fast (5x) string keyed read-only map for Go - particularly good for keys using a small set of nearby runes.

faststringmap faststringmap is a fast read-only string keyed map for Go (golang). For our use case it is approximately 5 times faster than using Go's

The Sensible Code Company 28 Jul 23, 2022
A memory-efficient trie for testing the existence/prefixes of string only(for now).

Succinct Trie A memory-efficient trie for testing the existence/prefixes of string only(for now). Install go get -u github.com/nobekanai/sutrie Docume

野辺かない 2 Mar 10, 2022
Convert arbitrary formats to Go Struct (including json, toml, yaml, etc.)

go2struct Convert arbitrary formats to Go Struct (including json, toml, yaml, etc.) Installation Run the following command under your project: go get

Afeyer 34 Jul 6, 2022
An interesting go struct tag expression syntax for field validation, etc.

An interesting go struct tag expression syntax for field validation, etc.

Bytedance Inc. 1.3k Sep 30, 2022
Generic types that are missing from Go, including sets, trees, sorted lists, etc.

go-typ Generic types that are missing from Go, including sets, trees, sorted lists, etc. All code is implemented with 0 dependencies and in pure Go co

null 14 Jul 26, 2022
Converts PDF, DOC, DOCX, XML, HTML, RTF, etc to plain text

docconv A Go wrapper library to convert PDF, DOC, DOCX, XML, HTML, RTF, ODT, Pages documents and images (see optional dependencies below) to plain tex

Search.io 1k Oct 2, 2022
Algorithms and Data Structures Solved in Golang

Algorithms and Data Structures Solved in Golang Hi! I'm Bruno Melo and this repository contains a lot challenges solved on many plataforms using go as

Bruno Melo 3 Aug 13, 2022
Some data structures and algorithms using golang

Some data structures and algorithms using golang

null 62 Aug 13, 2022