agrep-like fuzzy matching, but made faster using Golang and precomputation.

Overview

Version 2.0 Coverage GoDoc

goagrep

There are situations where you want to take the user's input and match a primary key in a database. But, immediately a problem is introduced: what happens if the user spells the primary key incorrectly? This fuzzy string matching program solves this problem - it takes any string, misspelled or not, and matches to one a specified key list.

About

goagrep requires building a precomputed database from the file that has the target strings. Then, when querying, goagrep splits the search string into smaller subsets, and then finds the corresponding known target strings that contain each subset. It then runs Levenshtein's algorithm on the new list of target strings to find the best match to the search string. This greatly decreases the search space and thus increases the matching speed.

The subset length dictates how many pieces a word should be cut into, for purposes of finding partial matches for mispelled words. For instance example: a subset length of 3 for the word "olive" would index "oli", "liv", and "ive". This way, if one searched for "oliv" you could still return "olive" since subsets "oli" and "liv" can still grab the whole word and check its Levenshtein distance (which should be very close as its only missing the one letter).

A smaller subset length will be more forgiving (it allows more mispellings), thus more accurate, but it would require more disk and more time to process since there are more words for each subset. A bigger subset length will help save hard drive space and decrease the runtime since there are fewer words that have the same, longer, subset. You can get much faster speeds with longer subset lengths, but keep in mind that this will not be able to match strings that have an error in the middle of the string and are have a length < 2*subset length - 1.

Why use goagrep?

It seems that agrep really a comparable choice for most applications. It does not require any database and its comparable speed to goagrep. However, there are situations where goagrep is more useful:

  1. goagrep can search much longer strings: agrep is limited to 32 characters while goagrep is only limited to 500. tre-agrep is not limited, but is much slower.
  2. goagrep can handle many mistakes in a string: agrep is limited to edit distances of 8, while goagrep has no limit.
  3. goagrep is fast (see benchmarking below), and the speed can be tuned: You can set higher subset lengths to get faster speeds and less accuracy - leaving the tradeoff up to you

Benchmarking

Benchmarking using the 319,378 word dictionary (3.5 MB), run with perf stat -r 50 -d <CMD> or using go test -bench=Match using AMD FX(tm)-8350.

Program Runtime Memory usage
goagrep in memory, subset size = 5 0.2 ms 90 MB ram
goagrep DB, subset size = 5 0.9 ms 64 MB disk
goagrep in memory, subset size = 3 18 ms 90 MB ram
goagrep DB, subset size = 3 71 ms 64 MB disk
agrep 7 ms 3.5 MB disk
tre-agrep 613 ms 3.5 MB disk

Installation

go get github.com/schollz/goagrep

Usage

You can either build a hard-disk database or use it in memory (i.e. in a program or as a TCP client). See main.go or tests for examples.

TCP server example

Start a server with a list:

> $GOPATH/bin/goagrep serve -l ../example/testlist
2016/07/23 07:05:35 Creating server with address localhost:9992

And then you can match words using netcat or similar:

> echo heroint | nc localhost 9992
heroine

Standalone cli utility

Build a database with a list:

> $GOPATH/bin/goagrep build -l ../example/testlist
Generating 'words.db' from '../example/testlist' with subset size 3
Finished building db

And then you can match by telling the program where the database is:

> $GOPATH/bin/goagrep match -d words.db -w heroint
heroine|||1

History

  • Make commmand line stuff with github.com/codegangsta/cli
  • Command line help
  • Command line for generating cache
  • Convert to lowercase for converting
  • Vastly increased DB building by decreasing size of partials (make([]string, 500)) and making extra buckets
  • Handle case that word definetly does not exist
  • Save searches, so caching can be used to find common searches easily
  • Use channels for faster searching?
  • Add in agrep options
Comments
  • Decouple BoltDB

    Decouple BoltDB

    Scanning through the code, I'm wondering how difficult it would be to decouple BoltDB. If you made a simple interface, you should be able to swap out BoltDB for just a simple in memory map for smaller datasets. That wouldn't require any external assets and should perform even faster.

    opened by derekperkins 3
  • Add runtime option to build db

    Add runtime option to build db

    I'd love to be able to build a db on the fly without having to use the filesystem. This could be used in conjunction with #10 to create the index on instance startup for example.

    func GenerateDBFromStrings(strings []string, databasePath string, tupleLength int, verbosity bool)
    
    enhancement 
    opened by derekperkins 3
  • Is the benchmark db size accurate?

    Is the benchmark db size accurate?

    According to the readme and common sense, the smaller subsets seem like they would be larger file sizes, but the benchmark shows them getting smaller.

    image

    opened by derekperkins 2
  • Using an external db instead of file.

    Using an external db instead of file.

    Ok, I'm new to String Fizzy matching, go-lang and goagrep so please, bear with me.

    How can I search a list of names in a db with your package?

    I'm using POSTGRESQL db.

    opened by leyluj 1
  • Draft of Release 1.6

    Draft of Release 1.6

    New features:

    • New function, GetMatches, to return up to 100 matches and scores
    • Better error handling
    • More tests (>80% coverage!)
    • Verbosity switch
    • Faster DB building
    • Can be used as library
    opened by schollz 0
  • file resize error on Windows

    file resize error on Windows

    If you have lots of subsets, you can get the error:

    Loading subsets into db...
    173 / 8723  1.98 % 02016/02/04 07:11:18 file resize error: truncate words.db: The requested operation cannot be performed on a file with a user-mapped section open.
    

    Apparently this is a known issue in Bolt, and can be remedied setting NoGrowSync to true.

    bug 
    opened by schollz 0
  • ID vs Search Columns

    ID vs Search Columns

    It'd be helpful if I could search a text column how it works currently, but then return an ID value instead of the whole word, essentially the way that a select box works.

    opened by derekperkins 0
Releases(v1.6)
Owner
Zack
Tinkerer
Zack
Mdfmt - A Markdown formatter that follow the CommonMark. Like gofmt, but for Markdown

Introduction A Markdown formatter that follow the CommonMark. Like gofmt, but fo

杨英明 15 Aug 24, 2022
Develop Sites Faster with HTML-Includer!

HTML Includer Develop Sites Faster with HTML Includer! How to Install Install HTML Includer on your machine: go install github.com/GameWorkstore/html-

Game Workstore 0 Jan 1, 2022
This package provides Go (golang) types and helper functions to do some basic but useful things with mxGraph diagrams in XML, which is most famously used by app.diagrams.net, the new name of draw.io.

Go Draw - Golang MX This package provides types and helper functions to do some basic but useful things with mxGraph diagrams in XML, which is most fa

null 2 Aug 30, 2022
A clean, Markdown-based publishing platform made for writers. Write together, and build a community.

WriteFreely is a clean, minimalist publishing platform made for writers. Start a blog, share knowledge within your organization, or build a community

WriteFreely 2.8k Sep 25, 2022
Similar to Anki but uses the actual frequency of words

wordGame A program that uses a frequency-annotated vocabulary list to learn as efficiently as possible. Usage go run wordGame.go -freqTableFname=itali

null 3 Sep 21, 2021
NERV Editor - A simple but peculiar text editor

nerved a simple but peculiar text editor introduction nerved is a text editor bu

kiasaki 5 Apr 12, 2022
Online server tool to made markdown document.

An authenticated(basic auth with digest) docsify server for private markdown documents. Embedded docsify-mermaid as UML tool.

null 0 Oct 16, 2021
Note - A text editor for the Linux Terminal! (Mainly compatible with Arch, because I made it on there)

Note - A text editor for the Linux Terminal! (Mainly compatible with Arch, because I made it on there)

Awesome Sauce 5 Jul 15, 2022
Godown - Markdown to HTML converter made with Go

Godown Godown is a tiny-teeny utility that helps you convert your Markdown files

Kevin Suñer 0 Jan 18, 2022
A little like that j-thing, only in Go.

goquery - a little like that j-thing, only in Go goquery brings a syntax and a set of features similar to jQuery to the Go language. It is based on Go

null 11.9k Sep 28, 2022
Your dev tool to manage /etc/hosts like a pro!

hostctl This tool gives you more control over the use of your hosts file. You can have multiple profiles and switch them on/off as you need. Why? It i

Gustavo 807 Sep 27, 2022
Take screenshots of websites and create PDF from HTML pages using chromium and docker

gochro is a small docker image with chromium installed and a golang based webserver to interact wit it. It can be used to take screenshots of w

Christian Mehlmauer 49 Jul 13, 2022
Frongo is a Golang package to create HTML/CSS components using only the Go language.

Frongo Frongo is a Go tool to make HTML/CSS document out of Golang code. It was designed with readability and usability in mind, so HTML objects are c

Rewan_ 21 Jul 29, 2021
A simple json parser built using golang

jsonparser A simple json parser built using golang Installation: go get -u githu

Krisna Pranav 1 Dec 29, 2021
Match regex group into go struct using struct tags and automatic parsing

regroup Simple library to match regex expression named groups into go struct using struct tags and automatic parsing Installing go get github.com/oris

Ori Seri 120 Sep 29, 2022
[Crawler/Scraper for Golang]🕷A lightweight distributed friendly Golang crawler framework.一个轻量的分布式友好的 Golang 爬虫框架。

Goribot 一个分布式友好的轻量的 Golang 爬虫框架。 完整文档 | Document !! Warning !! Goribot 已经被迁移到 Gospider|github.com/zhshch2002/gospider。修复了一些调度问题并分离了网络请求部分到另一个仓库。此仓库会继续

null 208 Aug 26, 2022
:triangular_ruler:gofmtmd formats go source code block in Markdown. detects fenced code & formats code using gofmt.

gofmtmd gofmtmd formats go source code block in Markdown. detects fenced code & formats code using gofmt. Installation $ go get github.com/po3rin/gofm

po3rin 90 Apr 3, 2022
Search for Go code using syntax trees

gogrep GO111MODULE=on go get mvdan.cc/gogrep Search for Go code using syntax trees. Work in progress. gogrep -x 'if $x != nil { return $x, $*_ }' In

Daniel Martí 471 Jul 27, 2022
gofontrender renders text with different parameters using the go font renderer

gofontrender This simple program renders text using the go font render. It computes the anti-aliasing by computing the exact pixel coverage with the a

Christophe Meessen 6 Jun 4, 2021