A Go library implementing an FST (finite state transducer)

Related tags

vellum
Overview

vellum vellum

Tests Coverage Status GoDoc Go Report Card License

A Go library implementing an FST (finite state transducer) capable of:

  • mapping between keys ([]byte) and a value (uint64)
  • enumerating keys in lexicographic order

Some additional goals of this implementation:

  • bounded memory use while building the FST
  • streaming out FST data while building
  • mmap FST runtime to support very large FTSs (optional)

Usage

Building an FST

To build an FST, create a new builder using the New() method. This method takes an io.Writer as an argument. As the FST is being built, data will be streamed to the writer as soon as possible. With this builder you MUST insert keys in lexicographic order. Inserting keys out of order will result in an error. After inserting the last key into the builder, you MUST call Close() on the builder. This will flush all remaining data to the underlying writer.

In memory:

  var buf bytes.Buffer
  builder, err := vellum.New(&buf, nil)
  if err != nil {
    log.Fatal(err)
  }

To disk:

  f, err := os.Create("/tmp/vellum.fst")
  if err != nil {
    log.Fatal(err)
  }
  builder, err := vellum.New(f, nil)
  if err != nil {
    log.Fatal(err)
  }

MUST insert keys in lexicographic order:

err = builder.Insert([]byte("cat"), 1)
if err != nil {
  log.Fatal(err)
}

err = builder.Insert([]byte("dog"), 2)
if err != nil {
  log.Fatal(err)
}

err = builder.Insert([]byte("fish"), 3)
if err != nil {
  log.Fatal(err)
}

err = builder.Close()
if err != nil {
  log.Fatal(err)
}

Using an FST

After closing the builder, the data can be used to instantiate an FST. If the data was written to disk, you can use the Open() method to mmap the file. If the data is already in memory, or you wish to load/mmap the data yourself, you can instantiate the FST with the Load() method.

Load in memory:

  fst, err := vellum.Load(buf.Bytes())
  if err != nil {
    log.Fatal(err)
  }

Open from disk:

  fst, err := vellum.Open("/tmp/vellum.fst")
  if err != nil {
    log.Fatal(err)
  }

Get key/value:

  val, exists, err = fst.Get([]byte("dog"))
  if err != nil {
    log.Fatal(err)
  }
  if exists {
    fmt.Printf("contains dog with val: %d\n", val)
  } else {
    fmt.Printf("does not contain dog")
  }

Iterate key/values:

  itr, err := fst.Iterator(startKeyInclusive, endKeyExclusive)
  for err == nil {
    key, val := itr.Current()
    fmt.Printf("contains key: %s val: %d", key, val)
    err = itr.Next()
  }
  if err != nil {
    log.Fatal(err)
  }

How does the FST get built?

A full example of the implementation is beyond the scope of this README, but let's consider a small example where we want to insert 3 key/value pairs.

First we insert "are" with the value 4.

step1

Next, we insert "ate" with the value 2.

step2

Notice how the values associated with the transitions were adjusted so that by summing them while traversing we still get the expected value.

At this point, we see that state 5 looks like state 3, and state 4 looks like state 2. But, we cannot yet combine them because future inserts could change this.

Now, we insert "see" with value 3. Once it has been added, we now know that states 5 and 4 can longer change. Since they are identical to 3 and 2, we replace them.

step3

Again, we see that states 7 and 8 appear to be identical to 2 and 3.

Having inserted our last key, we call Close() on the builder.

step4

Now, states 7 and 8 can safely be replaced with 2 and 3.

For additional information, see the references at the bottom of this document.

What does the serialized format look like?

We've broken out a separate document on the vellum disk format v1.

What if I want to use this on a system that doesn't have mmap?

The mmap library itself is guarded with system/architecture build tags, but we've also added an additional build tag in vellum. If you'd like to Open() a file based representation of an FST, but not use mmap, you can build the library with the nommap build tag. NOTE: if you do this, the entire FST will be read into memory.

Can I use this with Unicode strings?

Yes, however this implementation is only aware of the byte representation you choose. In order to find matches, you must work with some canonical byte representation of the string. In the future, some encoding-aware traversals may be possible on top of the lower-level byte transitions.

How did this library come to be?

In my work on the Bleve project I became aware of the power of the FST for many search-related tasks. The obvious starting point for such a thing in Go was the mafsa project. While working with mafsa I encountered some issues. First, it did not stream data to disk while building. Second, it chose to use a rune as the fundamental unit of transition in the FST, but I felt using a byte would be more powerful in the end. My hope is that higher-level encoding-aware traversals will be possible when necessary. Finally, as I reported bugs and submitted PRs I learned that the mafsa project was mainly a research project and no longer being maintained. I wanted to build something that could be used in production. As the project advanced more and more techniques from the BurntSushi/fst were adapted to our implementation.

Are there tools to work with vellum files?

Under the cmd/vellum subdirectory, there's a command-line tool which features subcommands that can allow you to create, inspect and query vellum files.

How can I generate a state transition diagram from a vellum file?

The vellum command-line tool has a "dot" subcommand that can emit graphviz dot output data from an input vellum file. The dot file can in turn be converted into an image using graphviz tools. Example...

$ vellum dot myFile.vellum > output.dot
$ dot -Tpng output.dot -o output.png

Related Work

Much credit goes to two existing projects:

Most of the original implementation here started with my digging into the internals of mafsa. As the implementation progressed, I continued to borrow ideas/approaches from the BurntSushi/fst library as well.

For a great introduction to this topic, please read the blog post Index 1,600,000,000 Keys with Automata and Rust

Issues
Owner
bleve
modern text indexing for go - supported and sponsored by Couchbase
bleve
a cheat-sheet for mathematical notation in code form

math-as-code Chinese translation (中文版) Python version (English) This is a reference to ease developers into mathematical notation by showing compariso

Jam3 11.5k Jul 24, 2021
Learn how to design large-scale systems. Prep for the system design interview. Includes Anki flashcards.

English ∙ 日本語 ∙ 简体中文 ∙ 繁體中文 | العَرَبِيَّة‎ ∙ বাংলা ∙ Português do Brasil ∙ Deutsch ∙ ελληνικά ∙ עברית ∙ Italiano ∙ 한국어 ∙ فارسی ∙ Polski ∙ русский язы

Donne Martin 139.2k Jul 23, 2021
A reimplementation of AlphaGo in Go (specifically AlphaZero)

A reimplementation of AlphaGo in Go (specifically AlphaZero)

Gorgonia 187 Jul 10, 2021
Library for multi-armed bandit selection strategies, including efficient deterministic implementations of Thompson sampling and epsilon-greedy.

Mab Multi-Armed Bandits Go Library Description Installation Usage Creating a bandit and selecting arms Numerical integration with numint Documentation

Stitch Fix Technology 18 Jun 23, 2021
Ensembles of decision trees in go/golang.

CloudForest Google Group Fast, flexible, multi-threaded ensembles of decision trees for machine learning in pure Go (golang). CloudForest allows for a

Ryan Bressler 687 Jul 17, 2021
Distributed hyperparameter optimization framework, inspired by Optuna.

Goptuna Distributed hyperparameter optimization framework, inspired by Optuna [1]. This library is particularly designed for machine learning, but eve

Masashi SHIBATA 184 Jul 10, 2021
Bigmachine is a library for self-managing serverless computing in Go

Bigmachine Bigmachine is a toolkit for building self-managing serverless applications in Go. Bigmachine provides an API that lets a driver process for

GRAIL 170 Jun 18, 2021
Neural Network for Go.

gonet gonet is a Go module implementing multi-layer Neural Network. Install Install the module with: go get github.com/dathoangnd/gonet Import it in

Dat Hoang 67 Jul 7, 2021
Evolutionary Algorithms in Go

Evo Evo is a framework for implementing evolutionary algorithms in Go. go get github.com/cbarrick/evo Documentation https://godoc.org/github.com/cbar

Chris Barrick 108 Jul 11, 2021
Path to a Software Architect

Contents What is a Software Architect? Levels of Architecture Typical Activities Important Skills (1) Design (2) Decide (3) Simplify (4) Code (5) Docu

Justin Miller 7.1k Jul 27, 2021
k-means clustering algorithm implementation written in Go

kmeans k-means clustering algorithm implementation written in Go What It Does k-means clustering partitions a multi-dimensional data set into k cluste

Christian Muehlhaeuser 346 Jul 17, 2021
Object detection on multiple datasets with an automatically learned unified label space.

An object detector trained on multiple large-scale datasets with a unified label space; Winning solution of ECCV 2020 Robust Vision Challenges.

Xingyi Zhou 259 Jul 21, 2021
Gorgonia is a library that helps facilitate machine learning in Go.

Gorgonia is a library that helps facilitate machine learning in Go. Write and evaluate mathematical equations involving multidimensional arrays easily

Gorgonia 4.1k Jul 27, 2021
Gorgonia is a library that helps facilitate machine learning in Go.

Gorgonia is a library that helps facilitate machine learning in Go. Write and evaluate mathematical equations involving multidimensional arrays easily

Gorgonia 4.1k Jul 19, 2021