Golang string comparison and edit distance algorithms library, featuring : Levenshtein, LCS, Hamming, Damerau levenshtein (OSA and Adjacent transpositions algorithms), Jaro-Winkler, Cosine, etc...

Overview

Go-edlib : Edit distance and string comparison library

Travis CI Test coverage Go Report Card License: MIT Documentation link PkgGoDev

Golang string comparison and edit distance algorithms library featuring : Levenshtein, LCS, Hamming, Damerau levenshtein (OSA and Adjacent transpositions algorithms), Jaro-Winkler, Cosine, etc...


Table of Contents


Requirements

  • Go (v1.13+)

Introduction

Golang open-source library which includes most (and soon all) edit-distance and string comparision algorithms with some extra!
Designed to be fully compatible with Unicode characters!
This library is 100% test covered 😁

Features

  • Levenshtein

  • LCS (Longest common subsequence) with edit distance, backtrack and diff functions

  • Hamming

  • Damerau-Levenshtein, with following variants :

    • OSA (Optimal string alignment)
    • Adjacent transpositions
  • Jaro & Jaro-Winkler similarity algorithms

  • Cosine Similarity algorithm to compare strings

  • Computed similarity percentage functions based on all available edit distance algorithms in this lib

  • Fuzzy search functions based on edit distance with unique or multiples strings output

  • Unicode compatibility ! 🥳

  • And many more to come !

Benchmarks

You can check an interactive Google chart with few benchmark cases for all similarity algorithms in this library through StringSimilarity function here

However, if you want or need more details, you can also viewing benchmark raw output here, which also includes memory allocations and test cases output (similarity result and errors).

If you are on Linux and want to run them on your setup, you can run ./tests/benchmark.sh script.

Installation

Open bash into your project folder and run:

go get github.com/hbollon/go-edlib

And import it into your project:

import (
	"github.com/hbollon/go-edlib"
)

Run tests

If you are on Linux and want to run all unit tests just run ./tests/tests.sh script.

For Windows users you can run:

go test ./... # Add desired parameters to this command if you want

Documentation

You can find all the documentation here : Documentation

Examples

Calculate string similarity index between two string

You can use StringSimilarity(str1, str2, algorithm) function. algorithm parameter must one of the following constants:

// Algorithm identifiers
const (
	Levenshtein Algorithm = iota
	DamerauLevenshtein
	OSADamerauLevenshtein
	Lcs
	Hamming
	Jaro
	JaroWinkler
	Cosine
)

Example with levenshtein:

res, err := edlib.StringsSimilarity("string1", "string2", edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Similarity: %f", res)
}

Execute fuzzy search based on string similarity algorithm

1. Most matching unique result without threshold

You can use FuzzySearch(str, strList, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearch("testnig", strList, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result: %s", res)
}
Result: testing 

2. Most matching unique result with threshold

You can use FuzzySearchThreshold(str, strList, minSimilarity, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearchThreshold("testnig", strList, 0.7, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'testnig': %s", res)
}

res, err = edlib.FuzzySearchThreshold("hello", strList, 0.7, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'hello': %s", res)
}
Result for 'testnig': testing
Result for 'hello':

3. Most matching result set without threshold

You can use FuzzySearchSet(str, strList, resultQuantity, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearchSet("testnig", strList, 3, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Results: %s", strings.Join(res, ", "))
}
Results: testing, test, tester 

4. Most matching result set with threshold

You can use FuzzySearchSetThreshold(str, strList, resultQuantity, minSimilarity, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearchSetThreshold("testnig", strList, 3, 0.5, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'testnig' with '0.5' threshold: %s", strings.Join(res, " "))
}

res, err = edlib.FuzzySearchSetThreshold("testnig", strList, 3, 0.7, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'testnig' with '0.7' threshold: %s", strings.Join(res, " "))
}
Result for 'testnig' with '0.5' threshold: testing test tester
Result for 'testnig' with '0.7' threshold: testing

Get raw edit distance (Levenshtein, LCS, Damerau–Levenshtein, Hamming)

You can use one of the following function to get an edit distance between two strings :

Example with Levenshtein distance:

res := edlib.LevenshteinDistance("kitten", "sitting")
fmt.Printf("Result: %d", res)
Result: 3

LCS, LCS Backtrack and LCS Diff

1. Compute LCS(Longuest Common Subsequence) between two strings

You can use LCS(str1, str2) function.

lcs := edlib.LCS("ABCD", "ACBAD")
fmt.Printf("Length of their LCS: %d", lcs)
Length of their LCS: 3

2. Backtrack their LCS

You can use LCSBacktrack(str1, str2) function.

res, err := edlib.LCSBacktrack("ABCD", "ACBAD")
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("LCS: %s", res)
}
LCS: ABD

3. Backtrack all their LCS

You can use LCSBacktrackAll(str1, str2) function.

res, err := edlib.LCSBacktrackAll("ABCD", "ACBAD")
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("LCS: %s", strings.Join(res, ", "))
}
LCS: ABD, ACD

4. Get LCS Diff between two strings

You can use LCSDiff(str1, str2) function.

res, err := edlib.LCSDiff("computer", "houseboat")
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("LCS: \n%s\n%s", res[0], res[1])
}
LCS Diff: 
 h c o m p u s e b o a t e r
 + -   - -   + + + + +   - -

Author

👤 Hugo Bollon

🤝 Contributing

Contributions, issues and feature requests are welcome!
Feel free to check issues page.

Show your support

Give a ⭐️ if this project helped you!

📝 License

Copyright © 2020 Hugo Bollon.
This project is MIT License licensed.

Issues
  • Missing methods for string shingling

    Missing methods for string shingling

    Missing methods for string shingling(https://en.wikipedia.org/wiki/W-shingling) aka n-grams for strings. In general, Cosine similarity and Jaccard index, among other methods, are computed on n-grams of the string, instead of the string itself(eg: https://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/). Right now, we just split the string on spaces and compute the index based on the words in the string, which is fine but is probably not technically correct. I propose we introduce a method to shingle the given strings so that we can later use the shingled string in finding cosine similarity, etc. This has been done before here: Java: https://github.com/tdebatty/java-string-similarity Julia: https://github.com/matthieugomez/StringDistances.jl

    I think we should introduce a function that does something like:

    // Function signature
    shingle(str string, n int) map[string]int
    // Example usage:
    shingled_s1 := shingle("ABCD", 2) // returns map[AB:1 BC:1 CD:1]
    shingled_s2 := shingle("ABCE", 2) // returns map[AB:1 BC:1 CE:1]
    

    This will enable us to use these maps to later compute stuff, eg:

    res := JaccardSimilarityWithShingles(shingled_s1, shingled_s2) // should return 0.5.
    
    enhancement 
    opened by ShriprajwalK 5
  • Add Shingle and ShingleSlidingWindow

    Add Shingle and ShingleSlidingWindow

    Fixes #10 Added two methods, one that slices length n each time and the other uses sliding window. The first one was slightly faster for all values I checked, so I think we should remove the sliding window method. Sorry for the delay, I wanted to benchmark them both properly but couldn't find the time.

    opened by ShriprajwalK 4
  • Q-gram index and Sorensen-Dice coefficient

    Q-gram index and Sorensen-Dice coefficient

    Currently, we do not have methods for Q-gram and Sorensen-Dice coefficient(https://en.wikipedia.org/wiki/Sørensen–Dice_coefficient) which are fairly simple to implement since they're based on n-grams.

    opened by ShriprajwalK 3
  • LCSBacktrackAll not work

    LCSBacktrackAll not work

    My code:

    ress,err:=edlib.LCSBacktrackAll(" 是ab是cde22f123g", "222222是ab是cd123") if err!=nil{ fmt.Println(err) }else { fmt.Printf("%s", strings.Join(ress, "\r\n")) }

    Shoud be print ` 是ab是cd

    123 `

    But is print 是ab是cd123

    opened by djsousuo 3
  • Jaccard Similarity missing

    Jaccard Similarity missing

    Jaccard similarity coefficient(https://en.wikipedia.org/wiki/Jaccard_index) is a simple way to measure string similarity. I cannot find a function implementing this, is there one? If it is isn't there yet, it can be easily implemented using existing functions from cosine.go

    enhancement 
    opened by ShriprajwalK 2
  • chore: release 1.5.0

    chore: release 1.5.0

    :robot: I have created a release *beep* *boop*

    1.5.0 (2021-11-21)

    Features

    • add k-gram shingle to Jaccard/Cosine sim (#11) (24d61a6)

    This PR was generated with Release Please. See documentation.

    autorelease: tagged 
    opened by github-actions[bot] 1
  • StringsSimilarity is broken for unicode

    StringsSimilarity is broken for unicode

    The test is failing

    func TestStringsSimilarity(t *testing.T) {
    	str1 := "abcde"
    	str2 := "бвгдж"
    	str3 := "fghjk"
    	sim12, _ := edlib.StringsSimilarity(str1, str2, edlib.Levenshtein)
    	sim13, _ := edlib.StringsSimilarity(str1, str3, edlib.Levenshtein)
    	if sim12 != sim13 {
    		t.Errorf("different similarities, %v != %v", sim12, sim13)
    	}
    }
    

    output: different similarities, 0.5 != 0

    Similarity between "abcde" and "бвгдж" should be the same as between "abcde" and "fghjk". Looks like bug in the matchingIndex() because it compares length of strings but ...Distance() functions use runes.

    bug 
    opened by burkostya 1
Releases(v1.6.0)
  • v1.6.0(Jan 31, 2022)

  • v1.5.0(Nov 21, 2021)

  • v1.4.0(Sep 6, 2021)

    Features

    • Add Jaccard Index algorithm (#9 )

    Miscellaneous Chores

    • Remove TravisCI
    • Add test & golangci workflows with Github Actions
    • Add golangci config file
    • Fix linters issues
    Source code(tar.gz)
    Source code(zip)
  • v1.3.4(Aug 12, 2021)

  • v1.3.0(Oct 10, 2020)

    Changelog :

    Added :

    • Cosine similarity algorithm
    • Benchmarks package used to benchmark all similarity algorithms
    • Benchmark bash script
    • Add output files to ./tests/outputs/ sub-directory for benchmarks and tests results

    Misc :

    • Unit tests for all new functions + new test cases for existing ones (still 100% covered)

    Changed :

    • Refactored AlgorithMethod type to Algorithm
    • Improved tests.sh script

    Fixed :

    • LCS edit distance issue with Unicode inputs
    Source code(tar.gz)
    Source code(zip)
  • v1.1.0(Sep 7, 2020)

    Changelog :

    Added :

    • Jaro & Jaro-Winkler similarity algorithms
    • Similarity percentage calculation for all algorithms
    • LCS Backtrack/BacktrackAll functions
    • LCS Diff function

    Misc :

    • Created Go module + TravisCI config
    • Unit tests for all new functions + new test cases for existing ones (still 100% covered)

    Changed :

    • Refactor few functions
    • Deleted useless matrix instanciations
    • Merged test files to separate folder
    Source code(tar.gz)
    Source code(zip)
Owner
Hugo Bollon
Student at University Savoie Mont-Blanc in CS Master. Plan to become DevOps Engineer. I don't know everything but I can learn anything
Hugo Bollon
GNU Aspell spell checking library bindings for Go (golang)

Aspell library bindings for Go GNU Aspell is a spell checking tool written in C/C++. This package provides simplified Aspell bindings for Go. It uses

Vladimir Sibirov 44 Nov 29, 2021
A Go package for n-gram based text categorization, with support for utf-8 and raw text

A Go package for n-gram based text categorization, with support for utf-8 and raw text. To do: write documentation make it faster Keywords: text categ

Peter Kleiweg 67 Feb 13, 2022
Golang string comparison and edit distance algorithms library, featuring : Levenshtein, LCS, Hamming, Damerau levenshtein (OSA and Adjacent transpositions algorithms), Jaro-Winkler, Cosine, etc...

Go-edlib : Edit distance and string comparison library Golang string comparison and edit distance algorithms library featuring : Levenshtein, LCS, Ham

Hugo Bollon 329 Jun 21, 2022
Levenshtein distance and similarity metrics with customizable edit costs and Winkler-like bonus for common prefix.

A Go package for calculating the Levenshtein distance between two strings This package implements distance and similarity metrics for strings, based o

AGExt 68 Apr 12, 2022
Some algorithms in go: maxflow(min-cuts or graph-cuts), edit-distance.

Algorithms In this repository, some algorithms are implemented in go language. GoDoc link: ed maxflow About Max-flow problem: A flow network is repres

Yi Deng 14 Apr 9, 2021
Levenshtein distance for golang

levenshtein An implementation of the Levenshtein distance for Golang Sources Levenshtein distance Golang Install go get github.com/gitchander/levensht

Chander 0 Jun 8, 2021
Go implementation to calculate Levenshtein Distance.

levenshtein Go package to calculate the Levenshtein Distance The library is fully capable of working with non-ascii strings. But the strings are not n

Agniva De Sarker 206 Jun 20, 2022
View, edit, and save text files via http to the file system.

go-wiki View, edit, and save text files via http to the file system. (DONE) https://golang.org/doc/articles/wiki/ Instructions go run main.go In a web

Clint Vidler 0 Nov 25, 2021
go-editor is the clean go module that refractors from Kubernetes to help you edit resources in a command-line way.

go-editor The source code of go-editor comes from Kubernetes and refractor as the clean Go module. You can embed go-editor in your command-line tool l

Yong 0 Dec 5, 2021
Todo-cmd: an app you can add your tasks , edit or delete them

TODO CMD APP! ??‍♂️ Table of contents General info Update Requirements set-up usage General info todo-cmd is an app you can add your tasks , edit or d

dozheiny 3 Dec 13, 2021
Inflection is a string transformation library. It transforms strings from CamelCase to underscored string.

Inflection Inflection is a string transformation library. It transforms strings from CamelCase to underscored string. This is an implement of Inflecti

null 1 Nov 26, 2021
✨ #PTerm is a modern go module to beautify console output. Featuring charts, progressbars, tables, trees, and many more 🚀 It's completely configurable and 100% cross-platform compatible.

?? PTerm | Pretty Terminal Printer A golang module to print pretty text Show Demo Code PTerm.sh | Installation | Documentation | Quick Start | Example

null 2.8k Jun 30, 2022
Headless CMS with automatic JSON API. Featuring auto-HTTPS from Let's Encrypt, HTTP/2 Server Push, and flexible server framework written in Go.

Ponzu Watch the video introduction Ponzu is a powerful and efficient open-source HTTP server framework and CMS. It provides automatic, free, and secur

Ponzu 5.5k Jun 29, 2022
⚗ The most advanced CLI template on earth! Featuring automatic releases, website generation and a custom CI-System out of the box.

cli-template ✨ ⚗ A template for beautiful, modern, cross-platform compatible CLI tools written with Go! Getting Started | Wiki This template features

null 33 Jun 21, 2022
Jaken - A general purpose IRC bot featuring user acls and arbitrary plugins

Design principles This bot is based on the premise of a loosely coupling between

Lex van Roon 1 Feb 4, 2022
"We will game" blockchain featuring the "Willgame coin".

We will game "We will game" blockchain featuring the "Willgame coin". Our vision is to become the number one design company for play to earn decentral

Willis Ayres 2 Jan 2, 2022
Golang metrics for calculating string similarity and other string utility functions

strutil strutil provides string metrics for calculating string similarity as well as other string utility functions. Full documentation can be found a

Adrian-George Bostan 99 Jun 30, 2022
Recursively searches a map[string]interface{} structure for another map[string]interface{} structure

msirecurse Recursively searches a map[string]interface{} structure for existence of a map[string]interface{} structure Motivation I wrote this package

Fred Moyer 1 Mar 3, 2022
Extremely flexible golang deep comparison, extends the go testing package and tests HTTP APIs

go-testdeep Extremely flexible golang deep comparison, extends the go testing package. Latest news Synopsis Description Installation Functions Availab

Maxime Soulé 303 Jun 19, 2022
Extremely flexible golang deep comparison, extends the go testing package, tests HTTP APIs and provides tests suite

go-testdeep Extremely flexible golang deep comparison, extends the go testing package. Latest news Synopsis Description Installation Functions Availab

Maxime Soulé 305 Jun 29, 2022
Fastest levenshtein implementation in Go.

fast-levenshtein ?? Fastest levenshtein implementation in Go. Measure the difference between two strings. note: this implementation is currently not t

null 113 May 21, 2022
RoutePlanner suggests circular walks or runs based on start location and desired distance.

RoutePlanner Backend This repository contains code that powers the routeplanner app. The app suggests circular walks or runs based on start location a

null 0 Nov 5, 2021
A min-distance problem where it requires to find K nearest points to a specific location

Elizabeths-phaeton Elizabeths-Phaeton is a min-distance problem where it requires to find K nearest points to a specific location. Inorder to do so, w

Bardia Ardakanian 4 May 11, 2022
Simple load-balancer for npchat servers, based on the xor distance metric between node & user id

npchat-helmsman Simple load-balancer for npchat servers, based on the xor distance metric between node & user id. Local Development Clone this reposit

npchat 0 Jan 15, 2022
Go package to nicely calculate distance between coordinates using the Haversine formula.

go-haversine Heavily inspired by Umahmood's haversine, go-haversine provides a nice Go interface to calculate distance between coordinates using the h

Dametto Luca 2 Apr 2, 2022
A mining pool proxy tool, support BTC, ETH, ETC, XMR mining pool, etc.

Tier2Pool A mining pool proxy tool, support BTC, ETH, ETC, XMR mining pool, etc. Build I use Ubuntu as a demo. sudo update sudo apt install git make s

Tier2Pool 6 May 19, 2022
cmd tool for automatic storage and comparison of benchmarks results

prettybenchcmp prettybenchcmp is cmd tool for storage and comparison of benchmarks results. There is a standard tool benchcmp, but I don't think that

Petr 18 Apr 6, 2021
Comparison of Pixel and Ebiten API on the trees tutorial

Rewriting the trees tutorial of Pixel with Ebiten for API comparison I tried Pixel and really liked the clean API but the dev seems to be on pause sin

null 4 Dec 7, 2021
Plugs module to see different types of plug types needed in different countries, and a comparison tool between two countries plug socket types.

plugs Importing the module: go get github.com/matthewboyd/plugs "github.com/matthewboyd/plugs" How to use the module: There are two functions wi

Matthew Boyd 2 Dec 28, 2021