A native Go clean room implementation of the Porter Stemming algorithm.

Overview

Go Porter Stemmer

A native Go clean room implementation of the Porter Stemming Algorithm.

This algorithm is of interest to people doing Machine Learning or Natural Language Processing (NLP).

This is NOT a port. This is a native Go implementation from the human-readable description of the algorithm.

I've tried to make it (more) efficient by NOT internally using string's, but instead internally using []rune's and using the same (array) buffer used by the []rune slice (and sub-slices) at all steps of the algorithm.

For Porter Stemmer algorithm, see:

http://tartarus.org/martin/PorterStemmer/def.txt (URL #1)

http://tartarus.org/martin/PorterStemmer/ (URL #2)

Departures

Also, since when I initially implemented it, it failed the tests at...

http://tartarus.org/martin/PorterStemmer/voc.txt (URL #3)

http://tartarus.org/martin/PorterStemmer/output.txt (URL #4)

... after reading the human-readble text over and over again to try to figure out what the error I made was (and doing all sorts of things to debug it) I came to the conclusion that the some of these tests were wrong according to the human-readable description of the algorithm.

This led me to wonder if maybe other people's code that was passing these tests had rules that were not in the human-readable description. Which led me to look at the source code here...

http://tartarus.org/martin/PorterStemmer/c.txt (URL #5)

... When I looked there I noticed that there are some items marked as a "DEPARTURE", which differ from the original algorithm. (There are 2 of these.)

I implemented these departures, and the tests at URL #3 and URL #4 all passed.

Usage

To use this Golang library, use with something like:

package main

import (
  "fmt"
  "github.com/reiver/go-porterstemmer"
)

func main() {
  
  word := "Waxes"
  
  stem := porterstemmer.StemString(word)
  
  fmt.Printf("The word [%s] has the stem [%s].\n", word, stem)
}

Alternatively, if you want to be a bit more efficient, use []rune slices instead, with code like:

package main

import (
  "fmt"
  "github.com/reiver/go-porterstemmer"
)

func main() {
  
  word := []rune("Waxes")
  
  stem := porterstemmer.Stem(word)
  
  fmt.Printf("The word [%s] has the stem [%s].\n", string(word), string(stem))
}

Although NOTE that the above code may modify original slice (named "word" in the example) as a side effect, for efficiency reasons. And that the slice named "stem" in the example above may be a sub-slice of the slice named "word".

Also alternatively, if you already know that your word is already lowercase (and you don't need this library to lowercase your word for you) you can instead use code like:

package main

import (
  "fmt"
  "github.com/reiver/go-porterstemmer"
)

func main() {
  
  word := []rune("waxes")
  
  stem := porterstemmer.StemWithoutLowerCasing(word)
  
  fmt.Printf("The word [%s] has the stem [%s].\n", string(word), string(stem))
}

Again NOTE (like with the previous example) that the above code may modify original slice (named "word" in the example) as a side effect, for efficiency reasons. And that the slice named "stem" in the example above may be a sub-slice of the slice named "word".

Comments
  • Renamed test file to add _test

    Renamed test file to add _test

    Without this, any library that imports pkg porterstemmer will also import testing, meaning that any application that imports pkg porterstemmer will have all the testing flags, which certainly appears to not be desired behaviour.

    opened by cuevasclemente 1
  • Fix panics

    Fix panics

    This pull-request is related to issue #4

    Added a fuzz unit test to identify which inputs (less than 6 chars long) result in a panic. Moved check for measure = 1 before creating slice to avoid panic in rule 5a.

    opened by mschoch 1
  • removed unused import, fixes test compilation

    removed unused import, fixes test compilation

    prior to this I get:

    $ go test -v
    # github.com/reiver/go-porterstemmer
    ./porterstemmer_fixes_test.go:6: imported and not used: "fmt"
    FAIL    github.com/reiver/go-porterstemmer [build failed]
    
    opened by mschoch 1
  • StemString(

    StemString("ion") causes runtime exception

    Test Program

    package main
    
    import "fmt"
    import "github.com/reiver/go-porterstemmer"
    
    func main() {
        word := porterstemmer.StemString("ion")
        fmt.Println(word)
    }
    

    Output

    panic: runtime error: index out of range
    
    goroutine 1 [running]:
    github.com/reiver/go-porterstemmer.step4(0xc20003f2c0, 0x3, 0x3, 0xc20003f2c0, 0x3, ...)
        /home/oliver/go/src/github.com/reiver/go-porterstemmer/porterstemmer.go:691 +0x1550
    github.com/reiver/go-porterstemmer.StemWithoutLowerCasing(0xc20003f2c0, 0x3, 0x3, 0x4e0143, 0x4e0142, ...)
        /home/oliver/go/src/github.com/reiver/go-porterstemmer/porterstemmer.go:895 +0xfd
    github.com/reiver/go-porterstemmer.Stem(0xc20003f2c0, 0x3, 0x3, 0x3, 0x3, ...)
        /home/oliver/go/src/github.com/reiver/go-porterstemmer/porterstemmer.go:868 +0xc2
    github.com/reiver/go-porterstemmer.StemString(0x4e0140, 0x3, 0x45ebb5, 0x4e8070)
        /home/oliver/go/src/github.com/reiver/go-porterstemmer/porterstemmer.go:839 +0x51
    main.main()
        /home/oliver/go/src/thedarkesttimeline/test.go:7 +0x32
    
    goroutine 2 [runnable]:
    exit status 2
    
    opened by OliverCardoza 1
  • Bug crashes on certain inputs

    Bug crashes on certain inputs

    Ex: "eeb"

    Though .Stem checks for words shorter than 3 letters before steps 1-4 execute. By the time it reaches step5a eeb becomes e and there is an index out of bounds error.

    opened by katelee168 8
Fast (linear time) implementation of the Gaussian Blur algorithm in Go.

Song2 Fast (linear time) implementation of the Gaussian Blur algorithm in Go.

Masaya Watanabe 52 Oct 25, 2022
k-means clustering algorithm implementation written in Go

kmeans k-means clustering algorithm implementation written in Go What It Does k-means clustering partitions a multi-dimensional data set into k cluste

Christian Muehlhaeuser 414 Dec 6, 2022
Menggunakan clean architecture dengan domain driven design

Golang project Menggunakan clean architecture dengan domain driven design Domain - Entity Sebuah model untuk entitas Domain - Model Hampir sama dengan

Binsar Dwi Jasuma 0 Jan 4, 2022
Clean Architecture With Golang

Clean Architecture With Golang When init a new project go mod init github.com/samuelterra22/clean-architecture-go Run testes go test ./... Generate a

Samuel Terra 2 Aug 2, 2022
Genetic Algorithm and Particle Swarm Optimization

evoli Genetic Algorithm and Particle Swarm Optimization written in Go Example Problem Given f(x,y) = cos(x^2 * y^2) * 1/(x^2 * y^2 + 1) Find (x,y) suc

Guillaume Simonneau 26 Dec 22, 2022
Golang Genetic Algorithm

goga Golang implementation of a genetic algorithm. See ./examples for info on how to use the library. Overview Goga is a genetic algorithm solution wr

null 177 Dec 19, 2022
An iterative algorithm to generate high-quality triangulated images.

An iterative algorithm to generate high quality triangulated images. Introduction Triangula uses a modified genetic algorithm to triangulate images. I

null 3.8k Dec 29, 2022
a* pathfinding algorithm written in go

astar a* (a-star) pathfinding algorithm written in go Wikipedia: EN: A* search algorithm DE: A*-Algorithmus Install go get github.com/jpierer/[email protected]

Julian Pierer 26 Mar 21, 2022
A Kubernetes Native Batch System (Project under CNCF)

Volcano is a batch system built on Kubernetes. It provides a suite of mechanisms that are commonly required by many classes of batch & elastic workloa

Volcano 2.8k Jan 9, 2023
Katib is a Kubernetes-native project for automated machine learning (AutoML).

Katib is a Kubernetes-native project for automated machine learning (AutoML). Katib supports Hyperparameter Tuning, Early Stopping and Neural Architec

Kubeflow 1.3k Jan 2, 2023
k-modes and k-prototypes clustering algorithms implementation in Go

go-cluster GO implementation of clustering algorithms: k-modes and k-prototypes. K-modes algorithm is very similar to well-known clustering algorithm

e-Xpert Solutions 37 Nov 29, 2022
An implementation of Neural Turing Machines

Neural Turing Machines Package ntm implements the Neural Turing Machine architecture as described in A.Graves, G. Wayne, and I. Danihelka. arXiv prepr

Fumin 398 Sep 13, 2022
Implementation of E(n)-Equivariant Graph Neural Networks, in Pytorch

EGNN - Pytorch Implementation of E(n)-Equivariant Graph Neural Networks, in Pytorch. May be eventually used for Alphafold2 replication.

Phil Wang 257 Dec 23, 2022
A high performance go implementation of Wappalyzer Technology Detection Library

wappalyzergo A high performance port of the Wappalyzer Technology Detection Library to Go. Inspired by https://github.com/rverton/webanalyze. Features

ProjectDiscovery 378 Jan 8, 2023
Go implementation of the yolo v3 object detection system

Go YOLO V3 This repository provides a plug and play implementation of the Yolo V3 object detection system in Go, leveraging gocv. Prerequisites Since

Wim Spaargaren 58 Dec 14, 2022
Golang k-d tree implementation with duplicate coordinate support

Golang k-d tree implementation with duplicate coordinate support

DownFlux 46 Nov 9, 2022
A discord bot that mumbles the audio from another room into your room (WIP)

DiscordGo Voice Receive Example This example experiments with receiving voice data from Discord. It joins a specified voice channel, listens for 10 se

Eric Volpert 0 Dec 11, 2021
Golang implementation of the Paice/Husk Stemming Algorithm

##Golang Implementation of the Paice/Husk stemming algorithm This project was created for the QUT course INB344. Details on the algorithm can be found

Aaron Groves 29 Sep 27, 2022
Golang implementation of the Paice/Husk Stemming Algorithm

##Golang Implementation of the Paice/Husk stemming algorithm This project was created for the QUT course INB344. Details on the algorithm can be found

Aaron Groves 29 Sep 27, 2022
Eunomia is a distributed application framework that support Gossip protocol, QuorumNWR algorithm, PBFT algorithm, PoW algorithm, and ZAB protocol and so on.

Introduction Eunomia is a distributed application framework that facilitates developers to quickly develop distributed applications and supports distr

Cong 2 Sep 28, 2021