k-modes and k-prototypes clustering algorithms implementation in Go

Overview

go-cluster

GoDoc License GoReport Travis cover.run go

GO implementation of clustering algorithms: k-modes and k-prototypes.

K-modes algorithm is very similar to well-known clustering algorithm k-means. The difference is how the distance is computed. In k-means Euclidean distance between two vectors is most commonly used. While it works well for numerical, continuous data it is not suitable to use it with categorical data as it is impossible to compute the distance between values like ‘Europe’ and ‘Africa’. This is why in k-modes, the Hamming distance between vectors is used - it shows how many elements of two vectors is different. It is a good alternative for one-hot encoding while dealing with large number of categories for one feature. K-prototypes is used to cluster mixed data (both categorical and numerical).

Implementation of algorithms is based on papers: HUANG97, HUANG98, CAO09 and partially inspired by python implementation of same algorithms: KMODES.

Installation

go get -u gopkg.in/e-XpertSolutions/go-cluster.v1

Usage

This is basic configuration and usage of KModes and KPrototypes algorithms. For more information please refer to the documentation.

package main

import (
    "fmt"
    "github.com/e-XpertSolutions/go-cluster/cluster"
)

func main() {

    //input categorical data first must be dictionary-encoded to numbers - for example for values
    //"blue", "red", "green" it can be 1,2,3

    data := cluster.NewDenseMatrix(lineNumber, columnNumber, rawData)
    newData := cluster.NewDenseMatrix(newLineNumber, newColumnNumber, newRawData)


    //input parameters for the algorithm

    //distance and initialization functions may be chosen from the package or one may use 
    //custom functions with proper arguments
    distanceFunction := cluster.WeightedHammingDistance
    initializationFunction := cluster.InitCao

    //number of clusters and maximum number of iterations 
    clustersNumber := 5
    maxIteration := 20

    //weight vector - used to set importance of the features, bigger number means greater 
    //contribution to the cost function
    //vector must be of the same length as the number of features in dataset
    //it is not compulsory, if 'nil' then all features are treated equally (weight = 1)  
    weights := []float64{1,1,2}
    wvec := [][]float64{weights}

    //path to file where model will be saved or loaded from using LoadModel(), SaveModel()
    //if no need to load or save the model, can be set to empty string
    path = "km.txt"

    //KModes algorithm
    //initialization
    km := cluster.NewKModes(distanceFunction, initializationFunction, clustersNumber, 1, 
    maxIteration, wvec, "km.txt")


    //training
    //after training it is possible to access clusters centers vectors and computed labels
    //using km.ClusterCentroids and km.Labels
    err := km.FitModel(data)
    if err != nil {
        fmt.Println(err)
    }

    //predicting labels for new data
    newLabels, err := km.Predict(newData)
    if err != nil {
        fmt.Println(err)
    }


    //KPrototypes algorithm
    //it needs two more parameters than k-modes:
    //categorical - vector with numbers indicating columns with categorical features
    //gamma - float number, importance of cost contribution for numerical values
    categorical := []int{1} // means that only column number one contains categorical data
    gamma := 0.2 //cost from distance function for numerical data will be multiplied by 0.2

    //initialization
    kp := cluster.NewKPrototypes(distanceFunction, initializationFunction, categorical, 
    clustersNumber, 1, maxIteration, wvec, gamma, "km.txt")

    //training
    err := kp.FitModel(data)
    if err != nil {
        fmt.Println(err)
    }

    //predicting labels for new data
    newLabelsP, err := kp.Predict(newData)
    if err != nil {
        fmt.Println(err)
    }
}

Contributing

Contributions are greatly appreciated. The project follows the typical GitHub pull request model for contribution.

License

The sources are release under a BSD 3-Clause License. The full terms of that license can be found in LICENSE file of this repository.

References

[HUANG97]: Huang, Z.: Clustering large data sets with mixed numeric and categorical values, Proceedings of the First Pacific Asia Knowledge Discovery and Data Mining Conference, Singapore, pp. 21-34, 1997.

[HUANG98] Huang, Z.: Extensions to the k-modes algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery 2(3), pp. 283-304, 1998.

[CAO09] Cao, F., Liang, J, Bai, L.: A new initialization method for categorical data clustering, Expert Systems with Applications 36(7), pp. 10223-10228., 2009.

[KMODES] Python implementation of k-modes: https://github.com/nicodv/kmodes

Issues
  • Distance formula is wrong

    Distance formula is wrong

    https://github.com/e-XpertSolutions/go-cluster/blob/master/cluster/kprototypes.go#L276

    should be

    dist := km.Gamma*distCat + distNum

    as stated in Huang, Z. 1997

    γ is a weight for categorical attributes for cluster l.

    opened by jcaberio 1
Releases(v1.0.0)
Owner
e-Xpert Solutions
e-Xpert Solutions
a simple & tiny scrapy clustering solution, considered a drop-in replacement for scrapyd

scrapyr a very simple scrapy orchestrator engine that could be distributed among multiple machines to build a scrapy cluster, under-the-hood it uses r

Mohammed Al Ashaal 50 Nov 24, 2021
Genetic Algorithms library written in Go / golang

Description Genetic Algorithms for Go/Golang Install $ go install git://github.com/thoj/go-galib.git Compiling examples: $ git clone git://github.com

Thomas Jager 192 May 26, 2022
Collaborative Filtering (CF) Algorithms in Go!

Go Recommend Recommendation algorithms (Collaborative Filtering) in Go! Background Collaborative Filtering (CF) is oftentimes used for item recommenda

Tim Kaye 185 May 24, 2022
Evolutionary Algorithms in Go

Evo Evo is a framework for implementing evolutionary algorithms in Go. go get github.com/cbarrick/evo Documentation https://godoc.org/github.com/cbar

Chris Barrick 112 Apr 18, 2022
Genetic algorithms using Golang Generics

Package genetic Package genetic implements genetic algorithms using Golang's Gen

Konnor Klashinsky 2 Jun 18, 2022
A native Go clean room implementation of the Porter Stemming algorithm.

Go Porter Stemmer A native Go clean room implementation of the Porter Stemming Algorithm. This algorithm is of interest to people doing Machine Learni

Charles Iliya Krempeaux 179 Oct 15, 2021
An implementation of Neural Turing Machines

Neural Turing Machines Package ntm implements the Neural Turing Machine architecture as described in A.Graves, G. Wayne, and I. Danihelka. arXiv prepr

Fumin 397 Apr 4, 2022
Golang implementation of the Paice/Husk Stemming Algorithm

##Golang Implementation of the Paice/Husk stemming algorithm This project was created for the QUT course INB344. Details on the algorithm can be found

Aaron Groves 28 Jan 23, 2022
Fast (linear time) implementation of the Gaussian Blur algorithm in Go.

Song2 Fast (linear time) implementation of the Gaussian Blur algorithm in Go.

Masaya Watanabe 49 Apr 24, 2022
Implementation of E(n)-Equivariant Graph Neural Networks, in Pytorch

EGNN - Pytorch Implementation of E(n)-Equivariant Graph Neural Networks, in Pytorch. May be eventually used for Alphafold2 replication.

Phil Wang 219 Jun 25, 2022
A high performance go implementation of Wappalyzer Technology Detection Library

wappalyzergo A high performance port of the Wappalyzer Technology Detection Library to Go. Inspired by https://github.com/rverton/webanalyze. Features

ProjectDiscovery 266 Jun 23, 2022
Go implementation of the yolo v3 object detection system

Go YOLO V3 This repository provides a plug and play implementation of the Yolo V3 object detection system in Go, leveraging gocv. Prerequisites Since

Wim Spaargaren 53 May 4, 2022
Golang k-d tree implementation with duplicate coordinate support

Golang k-d tree implementation with duplicate coordinate support

DownFlux 45 Apr 13, 2022
The open source, end-to-end computer vision platform. Label, build, train, tune, deploy and automate in a unified platform that runs on any cloud and on-premises.

End-to-end computer vision platform Label, build, train, tune, deploy and automate in a unified platform that runs on any cloud and on-premises. onepa

Onepanel, Inc. 605 Jun 24, 2022
Go types, funcs, and utilities for working with cards, decks, and evaluating poker hands (Holdem, Omaha, Stud, more)

cardrank.io/cardrank Package cardrank.io/cardrank provides a library of types, funcs, and utilities for working with playing cards, decks, and evaluat

null 50 Jun 21, 2022
Genetic Algorithm and Particle Swarm Optimization

evoli Genetic Algorithm and Particle Swarm Optimization written in Go Example Problem Given f(x,y) = cos(x^2 * y^2) * 1/(x^2 * y^2 + 1) Find (x,y) suc

Guillaume Simonneau 23 Jun 12, 2022
Probability distributions and associated methods in Go

godist godist provides some Go implementations of useful continuous and discrete probability distributions, as well as some handy methods for working

Edd Robinson 33 Apr 1, 2022
On-line Machine Learning in Go (and so much more)

goml Golang Machine Learning, On The Wire goml is a machine learning library written entirely in Golang which lets the average developer include machi

Conner DiPaolo 1.3k Jun 19, 2022
Bayesian text classifier with flexible tokenizers and storage backends for Go

Shield is a bayesian text classifier with flexible tokenizer and backend store support Currently implemented: Redis backend English tokenizer Example

Erik Aigner 152 Jun 23, 2022