Golang string comparison and edit distance algorithms library, featuring : Levenshtein, LCS, Hamming, Damerau levenshtein (OSA and Adjacent transpositions algorithms), Jaro-Winkler, Cosine, etc...

Overview

Go-edlib : Edit distance and string comparison library

Travis CI Test coverage Go Report Card License: MIT Documentation link PkgGoDev

Golang string comparison and edit distance algorithms library featuring : Levenshtein, LCS, Hamming, Damerau levenshtein (OSA and Adjacent transpositions algorithms), Jaro-Winkler, Cosine, etc...


Table of Contents


Requirements

  • Go (v1.13+)

Introduction

Golang open-source library which includes most (and soon all) edit-distance and string comparision algorithms with some extra!
Designed to be fully compatible with Unicode characters!
This library is 100% test covered 😁

Features

  • Levenshtein

  • LCS (Longest common subsequence) with edit distance, backtrack and diff functions

  • Hamming

  • Damerau-Levenshtein, with following variants :

    • OSA (Optimal string alignment)
    • Adjacent transpositions
  • Jaro & Jaro-Winkler similarity algorithms

  • Cosine Similarity algorithm to compare strings

  • Computed similarity percentage functions based on all available edit distance algorithms in this lib

  • Fuzzy search functions based on edit distance with unique or multiples strings output

  • Unicode compatibility ! 🥳

  • And many more to come !

Benchmarks

You can check an interactive Google chart with few benchmark cases for all similarity algorithms in this library through StringSimilarity function here

However, if you want or need more details, you can also viewing benchmark raw output here, which also includes memory allocations and test cases output (similarity result and errors).

If you are on Linux and want to run them on your setup, you can run ./tests/benchmark.sh script.

Installation

Open bash into your project folder and run:

go get github.com/hbollon/go-edlib

And import it into your project:

import (
	"github.com/hbollon/go-edlib"
)

Run tests

If you are on Linux and want to run all unit tests just run ./tests/tests.sh script.

For Windows users you can run:

go test ./... # Add desired parameters to this command if you want

Documentation

You can find all the documentation here : Documentation

Examples

Calculate string similarity index between two string

You can use StringSimilarity(str1, str2, algorithm) function. algorithm parameter must one of the following constants:

// Algorithm identifiers
const (
	Levenshtein Algorithm = iota
	DamerauLevenshtein
	OSADamerauLevenshtein
	Lcs
	Hamming
	Jaro
	JaroWinkler
	Cosine
)

Example with levenshtein:

res, err := edlib.StringsSimilarity("string1", "string2", edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Similarity: %f", res)
}

Execute fuzzy search based on string similarity algorithm

1. Most matching unique result without threshold

You can use FuzzySearch(str, strList, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearch("testnig", strList, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result: %s", res)
}
Result: testing 

2. Most matching unique result with threshold

You can use FuzzySearchThreshold(str, strList, minSimilarity, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearchThreshold("testnig", strList, 0.7, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'testnig': %s", res)
}

res, err = edlib.FuzzySearchThreshold("hello", strList, 0.7, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'hello': %s", res)
}
Result for 'testnig': testing
Result for 'hello':

3. Most matching result set without threshold

You can use FuzzySearchSet(str, strList, resultQuantity, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearchSet("testnig", strList, 3, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Results: %s", strings.Join(res, ", "))
}
Results: testing, test, tester 

4. Most matching result set with threshold

You can use FuzzySearchSetThreshold(str, strList, resultQuantity, minSimilarity, algorithm) function.

strList := []string{"test", "tester", "tests", "testers", "testing", "tsting", "sting"}
res, err := edlib.FuzzySearchSetThreshold("testnig", strList, 3, 0.5, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'testnig' with '0.5' threshold: %s", strings.Join(res, " "))
}

res, err = edlib.FuzzySearchSetThreshold("testnig", strList, 3, 0.7, edlib.Levenshtein)
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("Result for 'testnig' with '0.7' threshold: %s", strings.Join(res, " "))
}
Result for 'testnig' with '0.5' threshold: testing test tester
Result for 'testnig' with '0.7' threshold: testing

Get raw edit distance (Levenshtein, LCS, Damerau–Levenshtein, Hamming)

You can use one of the following function to get an edit distance between two strings :

Example with Levenshtein distance:

res := edlib.LevenshteinDistance("kitten", "sitting")
fmt.Printf("Result: %d", res)
Result: 3

LCS, LCS Backtrack and LCS Diff

1. Compute LCS(Longuest Common Subsequence) between two strings

You can use LCS(str1, str2) function.

lcs := edlib.LCS("ABCD", "ACBAD")
fmt.Printf("Length of their LCS: %d", lcs)
Length of their LCS: 3

2. Backtrack their LCS

You can use LCSBacktrack(str1, str2) function.

res, err := edlib.LCSBacktrack("ABCD", "ACBAD")
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("LCS: %s", res)
}
LCS: ABD

3. Backtrack all their LCS

You can use LCSBacktrackAll(str1, str2) function.

res, err := edlib.LCSBacktrackAll("ABCD", "ACBAD")
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("LCS: %s", strings.Join(res, ", "))
}
LCS: ABD, ACD

4. Get LCS Diff between two strings

You can use LCSDiff(str1, str2) function.

res, err := edlib.LCSDiff("computer", "houseboat")
if err != nil {
  fmt.Println(err)
} else {
  fmt.Printf("LCS: \n%s\n%s", res[0], res[1])
}
LCS Diff: 
 h c o m p u s e b o a t e r
 + -   - -   + + + + +   - -

Author

👤 Hugo Bollon

🤝 Contributing

Contributions, issues and feature requests are welcome!
Feel free to check issues page.

Show your support

Give a ⭐️ if this project helped you!

📝 License

Copyright © 2020 Hugo Bollon.
This project is MIT License licensed.

Issues
  • LCSDiff prints himself to stdout & Unicode issues with LCS functions

    LCSDiff prints himself to stdout & Unicode issues with LCS functions

    Maybe you forgot to delete after debugging:

    fmt.Printf("%v\n", diff)

    And... thanks for your working on all of this

    bug 
    opened by aloska 7
  • benchmarks

    benchmarks

    Hi, it would be great to see benchmarks between the algorithms. Creating a graph and putting them in the README, would be really nice =)

    documentation 
    opened by andersfylling 5
  • Simplify Algorithm const definitions

    Simplify Algorithm const definitions

    Just shorter code, semantically the same. Tests passed.

    enhancement 
    opened by shogg 3
  • LCSBacktrackAll not work

    LCSBacktrackAll not work

    My code:

    ress,err:=edlib.LCSBacktrackAll(" 是ab是cde22f123g", "222222是ab是cd123") if err!=nil{ fmt.Println(err) }else { fmt.Printf("%s", strings.Join(ress, "\r\n")) }

    Shoud be print ` 是ab是cd

    123 `

    But is print 是ab是cd123

    opened by djsousuo 3
  • Fix a couple of minor erros in README.md

    Fix a couple of minor erros in README.md

    • Add an 's' in StringsSimilarity under Benchmarks
    • Point to Longest common subsequence wikipedia page for LCS
    opened by akosiaris 2
Releases(v1.3.3)
  • v1.3.0(Oct 10, 2020)

    Changelog :

    Added :

    • Cosine similarity algorithm
    • Benchmarks package used to benchmark all similarity algorithms
    • Benchmark bash script
    • Add output files to ./tests/outputs/ sub-directory for benchmarks and tests results

    Misc :

    • Unit tests for all new functions + new test cases for existing ones (still 100% covered)

    Changed :

    • Refactored AlgorithMethod type to Algorithm
    • Improved tests.sh script

    Fixed :

    • LCS edit distance issue with Unicode inputs
    Source code(tar.gz)
    Source code(zip)
  • v1.1.0(Sep 7, 2020)

    Changelog :

    Added :

    • Jaro & Jaro-Winkler similarity algorithms
    • Similarity percentage calculation for all algorithms
    • LCS Backtrack/BacktrackAll functions
    • LCS Diff function

    Misc :

    • Created Go module + TravisCI config
    • Unit tests for all new functions + new test cases for existing ones (still 100% covered)

    Changed :

    • Refactor few functions
    • Deleted useless matrix instanciations
    • Merged test files to separate folder
    Source code(tar.gz)
    Source code(zip)
Owner
Hugo Bollon
Student at University Savoie Mont-Blanc in CS Master. Plan to become DevOps Engineer. I don't know everything but I can learn anything
Hugo Bollon
Levenshtein distance and similarity metrics with customizable edit costs and Winkler-like bonus for common prefix.

A Go package for calculating the Levenshtein distance between two strings This package implements distance and similarity metrics for strings, based o

AGExt 57 Feb 23, 2021
Go implementation to calculate Levenshtein Distance.

levenshtein Go package to calculate the Levenshtein Distance The library is fully capable of working with non-ascii strings. But the strings are not n

Agniva De Sarker 142 Jun 12, 2021
CLRS study. Codes are written with golang.

algorithms CLRS study. Codes are written with golang. go version: 1.11 Heap BinaryHeap on array BinaryHeap on linkedlist LeftistHeap FibonacciHeap Tre

Apollo Li 527 Jun 12, 2021
Data Structure Libraries and Algorithms implementation

Algorithms Data Structure Libraries and Algorithms implementation in C++ Disclaimer This repository is meant to be used as a reference to learn data s

Priyank Chheda 627 Jun 11, 2021
Graph algorithms and data structures

Your basic graph Golang library of basic graph algorithms Topological ordering, image by David Eppstein, CC0 1.0. This library offers efficient and we

Algorithms to Go 453 Jun 11, 2021
Graph algorithms and data structures

Your basic graph Golang library of basic graph algorithms Topological ordering, image by David Eppstein, CC0 1.0. This library offers efficient and we

Algorithms to Go 9 Jan 25, 2021
Generates data structure definitions from JSON files for any kind of programming language

Overview Archivist generates data structure definitions from JSON files for any kind of programming language. It also provides a library for golang to

Kingsgroup 41 Jun 11, 2021
Some algorithms in go: maxflow(min-cuts or graph-cuts), edit-distance.

Algorithms In this repository, some algorithms are implemented in go language. GoDoc link: ed maxflow About Max-flow problem: A flow network is repres

Yi Deng 5.6k Jun 2, 2021
parody of some of the basic python core features (collections package)

collections import "github.com/marcsantiago/collections" Overview Index Subdirectories Overview Index func StringEncoder(encoder *bytes.Buffer, data D

Marc Santiago 17 Apr 24, 2020
A collection of useful, performant, and threadsafe Go datastructures.

go-datastructures Go-datastructures is a collection of useful, performant, and threadsafe Go datastructures. NOTE: only tested with Go 1.3+. Augmented

Workiva 6.1k Jun 11, 2021
An R-tree implementation for Go

rtree This package provides an in-memory R-Tree implementation for Go. It's designed for Tile38 and is optimized for fast rect inserts and replacement

Josh Baker 159 Jun 9, 2021
Surprisingly space efficient trie in Golang(11 bits/key; 100 ns/get).

Slim - surprisingly space efficient data types in Golang Slim is collection of surprisingly space efficient data types, with corresponding serializati

null 1.6k Jun 11, 2021
Graph algorithms written in Go

Graph Algorithms in Go This repository contains implementations of various graph algorithms written in Go. I’ve written them to learn about these algo

Thomas Cyron 1.1k May 27, 2021