High Performance Porter2 Stemmer

Overview

porter2

GoDoc

Porter2 implements the english Porter2 stemmer. It is written completely using finite state machines to do suffix comparison, rather than the string-based or tree-based approaches. As a result, it is 660% faster compare to string comparison-based approach.

import "github.com/surgebase/porter2"

fmt.Println(porter2.Stem("seaweed")) // should get seawe

This implementation has been successfully validated with the dataset from http://snowball.tartarus.org/algorithms/english/

Performance

This implementation by far has the highest performance of the various Go-based implementations, AFAICT. I tested a few of the implementations and the results are below.

Implementation Time Algorithm
surgebase 319.009358ms Porter2
dchest 2.106912401s Porter2
kljensen 5.725917198s Porter2

To run the test again, you can run cmd/compare/compare.go (go run compare.go).

State Machines

Most of the implementations, like the ones in the table above, rely completely on suffix string comparison. Basically there's a list of suffixes, and the code will loop through the list to see if there's a match. Given most of the time you are looking for the longest match, so you order the list so the longest is the first one. So if you are luckly, the match will be early on the list. But regardless that's a huge performance hit.

This implementation is based completely on finite state machines to perform suffix comparison. You compare each chacter of the string starting at the last character going backwards. The state machines at each step will determine what the longest suffix is. You can think of the state machine as an unrolled tree.

However, writing large state machines can be very error-prone. So I wrote a quick tool to generate most of the state machines. The tool basically takes a file of suffixes, creates a tree, then unrolls the tree by dumping each of the nodes.

You can run the tool by go run suffixfsm.go <filename>.

License

Copyright (c) 2014 Dataence, LLC. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Issues
  • markR1R2 might be improved with string ==

    markR1R2 might be improved with string ==

    You are currently doing char by char equality checking. If you checked a subslice for equality to e.g. "gener" the runtime will use highly the optimized memcmp Instead of lots if one byte reads. But you run the risk of making a lit of garbage, have to check that too, I guess. ☺

    opened by jeffallen 5
  • Some words (actually non-words) causes panic

    Some words (actually non-words) causes panic

    Stem("aed") causes an error (a panic). Seems to be from step1b where rs[len(rs)-2] is used as an index, but len(rs)=1. From my testing, it only happens with a vowel followed by a word ending that would be stripped, like "aed" or "aing" or "eing", etc. Of course, those are not really words! But in real life those can show up in text anyway. Note: despite that, porter2 is an excellent product and is very fast.

    opened by john5000 1
Releases(v1.0)
🔑A high performance Key/Value store written in Go with a predictable read/write performance and high throughput. Uses a Bitcask on-disk layout (LSM+WAL) similar to Riak.

bitcask A high performance Key/Value store written in Go with a predictable read/write performance and high throughput. Uses a Bitcask on-disk layout

James Mills 7 Apr 15, 2022
the pluto is a gateway new time, high performance, high stable, high availability, easy to use

pluto the pluto is a gateway new time, high performance, high stable, high availability, easy to use Acknowledgments thanks nbio for providing low lev

mobus 2 Sep 19, 2021
Stemmer packages for Go programming language. Includes English, German and Dutch stemmers.

Stemmer package for Go Stemmer package provides an interface for stemmers and includes English, German and Dutch stemmers as sub-packages: porter2 sub

Dmitry Chestnykh 51 Jan 23, 2022
go-fastdfs 是一个简单的分布式文件系统(私有云存储),具有无中心、高性能,高可靠,免维护等优点,支持断点续传,分块上传,小文件合并,自动同步,自动修复。Go-fastdfs is a simple distributed file system (private cloud storage), with no center, high performance, high reliability, maintenance free and other advantages, support breakpoint continuation, block upload, small file merge, automatic synchronization, automatic repair.(similar fastdfs).

中文 English 愿景:为用户提供最简单、可靠、高效的分布式文件系统。 go-fastdfs是一个基于http协议的分布式文件系统,它基于大道至简的设计理念,一切从简设计,使得它的运维及扩展变得更加简单,它具有高性能、高可靠、无中心、免维护等优点。 大家担心的是这么简单的文件系统,靠不靠谱,可不

小张 3.2k Jun 30, 2022
LinDB is an open-source Time Series Database which provides high performance, high availability and horizontal scalability.

LinDB is an open-source Time Series Database which provides high performance, high availability and horizontal scalability. LinDB stores all monitoring data of ELEME Inc, there is 88TB incremental writes per day and 2.7PB total raw data.

LinDB 2.3k Jun 29, 2022
Gin is a HTTP web framework written in Go (Golang). It features a Martini-like API with much better performance -- up to 40 times faster. If you need smashing performance, get yourself some Gin.

Gin Web Framework Gin is a web framework written in Go (Golang). It features a martini-like API with performance that is up to 40 times faster thanks

Gin-Gonic 60.5k Jun 27, 2022
Package ring provides a high performance and thread safe Go implementation of a bloom filter.

ring - high performance bloom filter Package ring provides a high performance and thread safe Go implementation of a bloom filter. Usage Please see th

Tanner Ryan 126 May 10, 2022
High-performance minimalist queue implemented using a stripped-down lock-free ringbuffer, written in Go (golang.org)

This project is no longer maintained - feel free to fork the project! gringo A high-performance minimalist queue implemented using a stripped-down loc

Darren Elwood 126 Jun 8, 2022
A decentralized, trusted, high performance, SQL database with blockchain features

中文简介 CovenantSQL(CQL) is a Byzantine Fault Tolerant relational database built on SQLite: ServerLess: Free, High Availabile, Auto Sync Database Service

CovenantSQL 1.3k Jun 21, 2022
A high performance NoSQL Database Server powered by Go

LedisDB Ledisdb is a high-performance NoSQL database library and server written in Go. It's similar to Redis but store data in disk. It supports many

LedisDB 3.9k Jun 25, 2022
A high-performance MySQL proxy

kingshard 中文主页 Overview kingshard is a high-performance proxy for MySQL powered by Go. Just like other mysql proxies, you can use it to split the read

Fei Chen 6k Jun 28, 2022
pREST (PostgreSQL REST), simplify and accelerate development, ⚡ instant, realtime, high-performance on any Postgres application, existing or new

pREST pREST (PostgreSQL REST), simplify and accelerate development, instant, realtime, high-performance on any Postgres application, existing or new P

pREST 3.3k Jun 29, 2022
High-performance framework for building redis-protocol compatible TCP servers/services

Redeo The high-performance Swiss Army Knife for building redis-protocol compatible servers/services. Parts This repository is organised into multiple

Black Square Media 414 Jun 22, 2022
A feature complete and high performance multi-group Raft library in Go.

Dragonboat - A Multi-Group Raft library in Go / 中文版 News 2021-01-20 Dragonboat v3.3 has been released, please check CHANGELOG for all changes. 2020-03

lni 4.3k Jun 23, 2022
High performance, distributed and low latency publish-subscribe platform.

Emitter: Distributed Publish-Subscribe Platform Emitter is a distributed, scalable and fault-tolerant publish-subscribe platform built with MQTT proto

emitter 3.3k Jun 26, 2022
High-Performance server for NATS, the cloud native messaging system.

NATS is a simple, secure and performant communications system for digital systems, services and devices. NATS is part of the Cloud Native Computing Fo

NATS - The Cloud Native Messaging System 11.1k Jun 29, 2022
Lightweight, facility, high performance golang based game server framework

Nano Nano is an easy to use, fast, lightweight game server networking library for Go. It provides a core network architecture and a series of tools an

Lonng 2k Jun 21, 2022
efaceconv - Code generation tool for high performance conversion from interface{} to immutable type without allocations.

efaceconv High performance conversion from interface{} to immutable types without additional allocations This is tool for go generate and common lib (

Ivan 50 May 14, 2022
🐜🐜🐜 ants is a high-performance and low-cost goroutine pool in Go, inspired by fasthttp./ ants 是一个高性能且低损耗的 goroutine 池。

A goroutine pool for Go English | ???? 中文 ?? Introduction Library ants implements a goroutine pool with fixed capacity, managing and recycling a massi

Andy Pan 8.6k Jun 26, 2022
Minimalistic and High-performance goroutine worker pool written in Go

pond Minimalistic and High-performance goroutine worker pool written in Go Motivation This library is meant to provide a simple way to limit concurren

Alejandro Durante 557 Jun 22, 2022
beego is an open-source, high-performance web framework for the Go programming language.

Beego Beego is used for rapid development of enterprise application in Go, including RESTful APIs, web apps and backend services. It is inspired by To

astaxie 481 Jul 2, 2022
High performance, minimalist Go web framework

Supported Go versions As of version 4.0.0, Echo is available as a Go module. Therefore a Go version capable of understanding /vN suffixed imports is r

LabStack LLC 22.7k Jun 24, 2022
Gearbox :gear: is a web framework written in Go with a focus on high performance

gearbox ⚙️ is a web framework for building micro services written in Go with a focus on high performance. It's built on fasthttp which is up to 10x fa

Gearbox 645 Jun 21, 2022
hiboot is a high performance web and cli application framework with dependency injection support

Hiboot - web/cli application framework About Hiboot is a cloud native web and cli application framework written in Go. Hiboot is not trying to reinven

hidevops.io 175 Jun 12, 2022
Fast HTTP package for Go. Tuned for high performance. Zero memory allocations in hot paths. Up to 10x faster than net/http

fasthttp Fast HTTP implementation for Go. Currently fasthttp is successfully used by VertaMedia in a production serving up to 200K rps from more than

Aliaksandr Valialkin 17.9k Jun 23, 2022
High performance async-io(proactor) networking for Golang。golangのための高性能非同期io(proactor)ネットワーキング

gaio Introduction 中文介绍 For a typical golang network program, you would first conn := lis.Accept() to get a connection and go func(net.Conn) to start a

xtaci 440 Jun 28, 2022
🚀Gev is a lightweight, fast non-blocking TCP network library based on Reactor mode. Support custom protocols to quickly and easily build high-performance servers.

gev 中文 | English gev is a lightweight, fast non-blocking TCP network library based on Reactor mode. Support custom protocols to quickly and easily bui

徐旭 1.4k Jun 26, 2022
Gmqtt is a flexible, high-performance MQTT broker library that fully implements the MQTT protocol V3.1.1 and V5 in golang

中文文档 Gmqtt News: MQTT V5 is now supported. But due to those new features in v5, there area lots of breaking changes. If you have any migration problem

null 671 Jun 22, 2022