k-means clustering algorithm implementation written in Go

Overview

kmeans

Latest Release Build Status Coverage Status Go ReportCard GoDoc

k-means clustering algorithm implementation written in Go

What It Does

k-means clustering partitions a multi-dimensional data set into k clusters, where each data point belongs to the cluster with the nearest mean, serving as a prototype of the cluster.

kmeans animation

When Should I Use It?

  • When you have numeric, multi-dimensional data sets
  • You don't have labels for your data
  • You know exactly how many clusters you want to partition your data into

Example

import (
	"github.com/muesli/kmeans"
	"github.com/muesli/clusters"
)

// set up a random two-dimensional data set (float64 values between 0.0 and 1.0)
var d clusters.Observations
for x := 0; x < 1024; x++ {
	d = append(d, clusters.Coordinates{
		rand.Float64(),
		rand.Float64(),
	})
}

// Partition the data points into 16 clusters
km := kmeans.New()
clusters, err := km.Partition(d, 16)

for _, c := range clusters {
	fmt.Printf("Centered at x: %.2f y: %.2f\n", c.Center[0], c.Center[1])
	fmt.Printf("Matching data points: %+v\n\n", c.Observations)
}

Complexity

If k (the amount of clusters) and d (the dimensions) are fixed, the problem can be exactly solved in time O(ndk+1), where n is the number of entities to be clustered.

The running time of the algorithm is O(nkdi), where n is the number of d-dimensional vectors, k the number of clusters and i the number of iterations needed until convergence. On data that does have a clustering structure, the number of iterations until convergence is often small, and results only improve slightly after the first dozen iterations. The algorithm is therefore often considered to be of "linear" complexity in practice, although it is in the worst case superpolynomial when performed until convergence.

Options

You can greatly reduce the running time by adjusting the required delta threshold. With the following options the algorithm finishes when less than 5% of the data points shifted their cluster assignment in the last iteration:

km, err := kmeans.NewWithOptions(0.05, nil)

The default setting for the delta threshold is 0.01 (1%).

If you are working with two-dimensional data sets, kmeans can generate beautiful graphs (like the one above) for each iteration of the algorithm:

km, err := kmeans.NewWithOptions(0.01, kmeans.SimplePlotter{})

Careful: this will generate PNGs in your current working directory.

You can write your own plotters by implementing the kmeans.Plotter interface.

Comments
  • Evaluation and assessment

    Evaluation and assessment

    Nice work on the kmeans implementation! Do you plan on adding functions for evaluating the clusters? Here is a wikipedia article that has some metrics that may be worth adding.

    https://en.wikipedia.org/wiki/Cluster_analysis#Evaluation_and_assessment

    opened by vosmith 6
  • Add benchmarks for partitioning with different Cluster sizes

    Add benchmarks for partitioning with different Cluster sizes

    Based on issue #2.
    I'm only benchmarking Kmeans.Partition in this PR. I consider also adding cases for Clusters.Nearest and Points.Mean but Kmeans.Partition already exercises those code paths. The choice of cluster sizes in these benchmarks -- 32, 512, 4096, 65536 -- is also somewhat arbitrary but I'm open to changing it if you can think of some more representative values.

    opened by philangist 5
  • feat: less dependencies

    feat: less dependencies

    go-colorful is only used in the example packages, and its not needed anywhere else.

    previous versions of go-colorful were also bringing a sql mock package as a dependency, which has been fixed... this also upgrades that dep, fwiw.

    we can remove that dep by making the examples folder another go module, which is what this pr does.

    Signed-off-by: Carlos A Becker [email protected]

    opened by caarlos0 2
  • Fix indirect dependency go-sdk

    Fix indirect dependency go-sdk

    Currently (6 dec 2020) this module cannot be imported in a Go 1.15.5 based project using modules because the go-sdk indirect dependency (which I was not able to find as a dependency of this module's dependencies) does not appear to have a v2.0.0 version, making go mod and go get refuse to work properly, even when generting a new go.mod for my project (paletter), whith the following error:

    go: github.com/muesli/[email protected] requires
    	github.com/blend/[email protected]+incompatible: reading github.com/blend/go-sdk/go.mod at revision v2.0.0: unknown revision v2.0.0
    

    As a sidenote, go-chart v2 as a module has to imported as github.com/wcharczuk/go-chart/v2

    bug 
    opened by Baldomo 1
  • Add support to metadata

    Add support to metadata

    In certain cases, we want to apply the algorithm to get the clusters but, once we have the observations, we want other information associated with the observations of the clusters.

    Imagine that we have a lot of circles. Each circle has a unique colour and a radius. We want to apply clustering by the radius. In the end, we want to check in which cluster is the green circle.

    Is it possible?

    opened by luisfmelo 1
  • Issue in plotter.go file

    Issue in plotter.go file

    When I install lantern I get an error. I think in plotter.go, where the chart.Style struct is used, the first attribute must be changed because the "Show" attribute changed to "Hidden" in wcharczuk/go-chart.

    opened by PhilVince96 1
  • K means doesn't build. Seems that dependency has changed API

    K means doesn't build. Seems that dependency has changed API

    github.com/muesli/kmeans

    ../../../../muesli/kmeans/plotter.go:45:5: unknown field 'Show' in struct literal of type chart.Style ../../../../muesli/kmeans/plotter.go:58:4: unknown field 'Show' in struct literal of type chart.Style ../../../../muesli/kmeans/plotter.go:73:5: unknown field 'Show' in struct literal of type chart.Style ../../../../muesli/kmeans/plotter.go:78:5: unknown field 'Show' in struct literal of type chart.Style

    opened by evanoberholster 1
  • Random_state: Control of random state

    Random_state: Control of random state

    Thank you very much for your work, I would like to ask if there is a parameter in the algorithm similar to sklearn random_state in Python, which can control the random state, because my output results are always inconsistent due to the randomness of the initial cluster center, thank you for taking the time to see my question!

    opened by slomo28 1
  • breaking: add seed to NewWithOptions

    breaking: add seed to NewWithOptions

    hello @muesli, I tried many approaches but did not see an obvious way to keep backwards compatibility, so I am proposing this breaking change. I am open to other approaches.

    I was a bit torn on this one, because, although there is a constructor, the seed of the default PRNG is set each time that Partition is called (via clusters.new), so I wanted a way to keep the existing semantics. I came out with the idea of passing the seed to the NewWithOptions ctor and to use the zero value to mean default behavior.

    If a client explicitly sets the seed, then each time it calls Partition() on a given Kmean struct, the PRG will be reset to that seed.

    I have to say I am not completely convinced. Opinions?

    Together with https://github.com/muesli/clusters/pull/1, closes #19.

    opened by marco-m 0
  • Feature Request: Allow deterministic random seed value on initialization.

    Feature Request: Allow deterministic random seed value on initialization.

    Hi,

    Currently the initialization of Clusters slice randomly seeds their initial positions.

    I would like to request for a new initialization method with option to specify the seed value in order to obtain a deterministic result from the Partition function.

    Thank you.

    enhancement 
    opened by c2wei 3
  • Feature Request: Allow observations with gaps

    Feature Request: Allow observations with gaps

    Hi,

    do you mind if we add the option to your kmeans to ignore gaps in observations.

    The main issue we have is: We have observations with a few hundred dimensions. Unfortunately, some of the observations have gaps but we still want to consider those observations. Currently, we just fill gaps with 0 what is actually not the right way because it falsifies the result. What we actually would like to have is, to ignore gaps, so we don't consider the particular dimension for that particular observation at all.

    Does this make sense to you?

    enhancement 
    opened by guenhter 1
  • All observations the same point

    All observations the same point

    Hi, I don't know if this is a bug but I think that it should not be the expected outcome.

    If I have, for example, 10 unidimensional points with the same value (example: 5), and ask for 4 clusters (k = 4), the result is:

    Cluster 0 with 10 data points
    Cluster 1 with 1 data points
    Cluster 2 with 1 data points
    Cluster 3 with 1 data points
    

    So, We have in fact 4 clusters but there are 3 extra data points that may be part of more than 1 cluster.

    The code you can use to simulate:

    package main
    
    import (
    	"fmt"
    	"github.com/muesli/clusters"
    	"github.com/muesli/kmeans"
    )
    
    func main() {
    	const k = 4
    
    	km := kmeans.New()
    	var o clusters.Observations
    	for i := 0; i < 10; i++ {
    		o = append(o, clusters.Coordinates{5})
    	}
    	c, err := km.Partition(o, k)
    	if err != nil {
    		fmt.Printf("error: %s", err.Error)
    		return
    	}
    
    	for i, cluster := range c {
    		fmt.Printf("\nCluster %v with %v data points", i, len(cluster.Observations))
    	}
    }
    
    bug 
    opened by luisfmelo 1
Releases(v0.3.1)
  • v0.3.1(Jul 22, 2022)

    What's Changed

    • ci: update lint workflows by @muesli in https://github.com/muesli/kmeans/pull/23
    • feat: less dependencies by @caarlos0 in https://github.com/muesli/kmeans/pull/22

    New Contributors

    • @caarlos0 made their first contribution in https://github.com/muesli/kmeans/pull/22

    Full Changelog: https://github.com/muesli/kmeans/compare/v0.3.0...v0.3.1

    Source code(tar.gz)
    Source code(zip)
  • v0.3.0(Sep 13, 2021)

  • v0.2.1(Dec 7, 2020)

  • v0.2.0(Oct 27, 2020)

    Fixes

    • Ensure another iteration after randomly assigning a point to a cluster
    • Never panic in Plot, return an error instead
    • Handle errors calling Plot

    Changes

    • Prefix plot filename with current k
    Source code(tar.gz)
    Source code(zip)
  • v0.1.0(Jun 2, 2018)

Owner
Christian Muehlhaeuser
Geek, Gopher, Software Developer, Maker, Opensource Advocate, Tech Enthusiast, Photographer, Board and Card Gamer
Christian Muehlhaeuser
a simple & tiny scrapy clustering solution, considered a drop-in replacement for scrapyd

scrapyr a very simple scrapy orchestrator engine that could be distributed among multiple machines to build a scrapy cluster, under-the-hood it uses r

Mohammed Al Ashaal 50 Nov 24, 2021
A native Go clean room implementation of the Porter Stemming algorithm.

Go Porter Stemmer A native Go clean room implementation of the Porter Stemming Algorithm. This algorithm is of interest to people doing Machine Learni

Charles Iliya Krempeaux 183 Jan 3, 2023
Golang implementation of the Paice/Husk Stemming Algorithm

##Golang Implementation of the Paice/Husk stemming algorithm This project was created for the QUT course INB344. Details on the algorithm can be found

Aaron Groves 29 Sep 27, 2022
Fast (linear time) implementation of the Gaussian Blur algorithm in Go.

Song2 Fast (linear time) implementation of the Gaussian Blur algorithm in Go.

Masaya Watanabe 52 Oct 25, 2022
a* pathfinding algorithm written in go

astar a* (a-star) pathfinding algorithm written in go Wikipedia: EN: A* search algorithm DE: A*-Algorithmus Install go get github.com/jpierer/[email protected]

Julian Pierer 26 Mar 21, 2022
Genetic Algorithm and Particle Swarm Optimization

evoli Genetic Algorithm and Particle Swarm Optimization written in Go Example Problem Given f(x,y) = cos(x^2 * y^2) * 1/(x^2 * y^2 + 1) Find (x,y) suc

Guillaume Simonneau 26 Dec 22, 2022
Golang Genetic Algorithm

goga Golang implementation of a genetic algorithm. See ./examples for info on how to use the library. Overview Goga is a genetic algorithm solution wr

null 177 Dec 19, 2022
An iterative algorithm to generate high-quality triangulated images.

An iterative algorithm to generate high quality triangulated images. Introduction Triangula uses a modified genetic algorithm to triangulate images. I

null 3.8k Dec 29, 2022
An implementation of Neural Turing Machines

Neural Turing Machines Package ntm implements the Neural Turing Machine architecture as described in A.Graves, G. Wayne, and I. Danihelka. arXiv prepr

Fumin 398 Sep 13, 2022
Implementation of E(n)-Equivariant Graph Neural Networks, in Pytorch

EGNN - Pytorch Implementation of E(n)-Equivariant Graph Neural Networks, in Pytorch. May be eventually used for Alphafold2 replication.

Phil Wang 257 Dec 23, 2022
A high performance go implementation of Wappalyzer Technology Detection Library

wappalyzergo A high performance port of the Wappalyzer Technology Detection Library to Go. Inspired by https://github.com/rverton/webanalyze. Features

ProjectDiscovery 378 Jan 8, 2023
Go implementation of the yolo v3 object detection system

Go YOLO V3 This repository provides a plug and play implementation of the Yolo V3 object detection system in Go, leveraging gocv. Prerequisites Since

Wim Spaargaren 58 Dec 14, 2022
Golang k-d tree implementation with duplicate coordinate support

Golang k-d tree implementation with duplicate coordinate support

DownFlux 46 Nov 9, 2022
Genetic Algorithms library written in Go / golang

Description Genetic Algorithms for Go/Golang Install $ go install git://github.com/thoj/go-galib.git Compiling examples: $ git clone git://github.com

Thomas Jager 193 Sep 27, 2022
Neural Networks written in go

gobrain Neural Networks written in go Getting Started The version 1.0.0 includes just basic Neural Network functions such as Feed Forward and Elman Re

Go Machine Learning 531 Dec 20, 2022
A recommender system service based on collaborative filtering written in Go

Language: English | 中文 gorse: Go Recommender System Engine Build Coverage Report GoDoc RTD Demo gorse is an offline recommender system backend based o

Zhenghao Zhang 6.4k Dec 29, 2022
A Naive Bayes SMS spam classifier written in Go.

Ham (SMS spam classifier) Summary The purpose of this project is to demonstrate a simple probabilistic SMS spam classifier in Go. This supervised lear

Dan Wolf 13 Sep 9, 2022
A simple utility, written in Go, for interacting with Salesforce.

Salesforce CLI A simple utility, written in Go, for interacting with Salesforce. Currently only specific functionality is implemented, and the output

Darren Parkinson 0 Dec 14, 2021
A simple yet customisable program written in go to make hackerman-like terminal effects.

stuntman a simple program written in go to make you look like a hackerman Demo stuntman -binar -width 90 -color cyan stuntman -text -width 90 -vertgap

Solaris 10 Aug 4, 2022