Natural language detection library for Go

Overview

Whatlanggo

Natural language detection for Go.

Features

  • Supports 84 languages
  • 100% written in Go
  • No external dependencies
  • Fast
  • Recognizes not only a language, but also a script (Latin, Cyrillic, etc.)

Getting started

Installation:

    go get -u github.com/abadojack/whatlanggo

Simple usage example:

package main

import (
	"fmt"

	"github.com/abadojack/whatlanggo"
)

func main() {
	info := whatlanggo.Detect("Foje funkcias kaj foje ne funkcias")
	fmt.Println("Language:", info.Lang.String(), " Script:", whatlanggo.Scripts[info.Script], " Confidence: ", info.Confidence)
}

Blacklisting and whitelisting

package main

import (
	"fmt"

	"github.com/abadojack/whatlanggo"
)

func main() {
	//Blacklist
	options := whatlanggo.Options{
		Blacklist: map[whatlanggo.Lang]bool{
			whatlanggo.Ydd: true,
		},
	}

	info := whatlanggo.DetectWithOptions("האקדמיה ללשון העברית", options)

	fmt.Println("Language:", info.Lang.String(), "Script:", whatlanggo.Scripts[info.Script])

	//Whitelist
	options1 := whatlanggo.Options{
		Whitelist: map[whatlanggo.Lang]bool{
			whatlanggo.Epo: true,
			whatlanggo.Ukr: true,
		},
	}

	info = whatlanggo.DetectWithOptions("Mi ne scias", options1)
	fmt.Println("Language:", info.Lang.String(), " Script:", whatlanggo.Scripts[info.Script])
}

For more details, please check the documentation.

Requirements

Go 1.8 or higher

How does it work?

How does the language recognition work?

The algorithm is based on trigram language models, which are a particular case of n-grams. To understand the idea, see the original paper by Cavnar and Trenkle ('94), "N-Gram-Based Text Categorization".
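To make the trigram idea concrete, here is a minimal sketch using only the standard library. It is not whatlanggo's implementation: the function names (`trigramCounts`, `rankedTrigrams`, `outOfPlace`), the normalization rules, and the penalty of 300 are assumptions chosen for illustration. It builds a ranked trigram profile for a text and compares two profiles with the "out-of-place" distance described by Cavnar and Trenkle.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// trigramCounts returns how often each three-rune sequence occurs in text.
// Assumption of this sketch: text is lowercased and every run of
// non-letters is collapsed to one space, so word boundaries are kept.
func trigramCounts(text string) map[string]int {
	var b strings.Builder
	b.WriteRune(' ')
	prevSpace := true
	for _, r := range strings.ToLower(text) {
		if unicode.IsLetter(r) {
			b.WriteRune(r)
			prevSpace = false
		} else if !prevSpace {
			b.WriteRune(' ')
			prevSpace = true
		}
	}
	if !prevSpace {
		b.WriteRune(' ')
	}
	runes := []rune(b.String())
	counts := make(map[string]int)
	for i := 0; i+3 <= len(runes); i++ {
		counts[string(runes[i:i+3])]++
	}
	return counts
}

// rankedTrigrams orders trigrams by descending count; ties are broken
// alphabetically so the result is deterministic.
func rankedTrigrams(counts map[string]int) []string {
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if counts[keys[i]] != counts[keys[j]] {
			return counts[keys[i]] > counts[keys[j]]
		}
		return keys[i] < keys[j]
	})
	return keys
}

// outOfPlace sums, for every trigram of text, how far its rank is from its
// rank in the language profile. Trigrams missing from the profile cost
// maxPenalty. A smaller total means a closer match.
func outOfPlace(text, profile []string, maxPenalty int) int {
	pos := make(map[string]int, len(profile))
	for i, t := range profile {
		pos[t] = i
	}
	total := 0
	for i, t := range text {
		j, ok := pos[t]
		if !ok {
			total += maxPenalty
			continue
		}
		d := i - j
		if d < 0 {
			d = -d
		}
		total += d
	}
	return total
}

func main() {
	text := rankedTrigrams(trigramCounts("Mi ne scias"))
	// Toy stand-in for a language profile; real profiles come from large corpora.
	profile := rankedTrigrams(trigramCounts("Foje funkcias kaj foje ne funkcias"))
	fmt.Println("unique trigrams in text:", len(text))
	fmt.Println("distance to toy profile:", outOfPlace(text, profile, 300))
}
```

A detector built this way would compute the distance to every candidate language profile and pick the smallest one.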

How is IsReliable calculated?

It is based on the following factors:

  • How many unique trigrams are in the given text
  • How big the difference is between the first and the second (not returned) detected languages. This metric is called rate in the code base.

Therefore, it can be presented as a 2D space with a threshold function that splits it into "Reliable" and "Not reliable" areas. This function is a hyperbola, and it looks like the following:

[Figure: Language recognition (whatlang-rs)]
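The shape of such a threshold can be sketched as follows. This is only an illustration under stated assumptions: the function name `isReliable`, the constant `hyperbolaScale`, and the exact form `rate > scale / uniqueTrigrams` are hypothetical and are not taken from the library's code. The point is that a text with few unique trigrams needs a large rate to count as reliable, while a long text needs only a small one.

```go
package main

import "fmt"

// hyperbolaScale is a made-up constant for this sketch; the library's
// actual curve and constants are not reproduced here.
const hyperbolaScale = 12.0

// isReliable reports whether the point (uniqueTrigrams, rate) lies above
// the hyperbola rate = hyperbolaScale / uniqueTrigrams.
func isReliable(uniqueTrigrams int, rate float64) bool {
	if uniqueTrigrams <= 0 {
		return false // nothing to judge on
	}
	return rate > hyperbolaScale/float64(uniqueTrigrams)
}

func main() {
	fmt.Println(isReliable(10, 0.5))  // short text, same rate
	fmt.Println(isReliable(100, 0.5)) // longer text, same rate
}
```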

For more details, see the blog article "Introduction to Rust Whatlang Library and Natural Language Identification Algorithms".

License

MIT

Derivation

whatlanggo is a derivative of Franc (JavaScript, MIT) by Titus Wormer.

Acknowledgements

Thanks to greyblake (Potapov Sergey) for creating whatlang-rs, from which I got the idea and algorithms.

Issues
  • Required minimum version?

    I tried to build this on 1.6.2 and got this error:

    $ go get -u github.com/abadojack/whatlanggo
    # github.com/abadojack/whatlanggo
    gocode/src/github.com/abadojack/whatlanggo/detect.go:120: undefined: sort.SliceStable
    gocode/src/github.com/abadojack/whatlanggo/trigrams.go:31: undefined: sort.SliceStable
    
    $ go version
    go version go1.6.2 linux/amd64
    
    duplicate 
    opened by peterbe 2
  • Undefined Sort.SliceStable

    github.com/abadojack/whatlanggo

    ../../.gvm/pkgsets/go1.7.3/global/src/github.com/abadojack/whatlanggo/detect.go:120: undefined: sort.SliceStable
    ../../.gvm/pkgsets/go1.7.3/global/src/github.com/abadojack/whatlanggo/trigrams.go:31: undefined: sort.SliceStable

    opened by thiagozs 2
  • Improve Japanese judgment processing.

    Currently, Japanese text is sometimes judged as Mandarin. With this change, text that contains Hiragana or Katakana is regarded as Japanese even if it would otherwise be judged as Mandarin, because Japanese uses Kanji (unicode.Han) in addition to Hiragana and Katakana.
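The rule described above can be sketched with the standard library alone. This is an illustrative check, not the change proposed in the issue: the helper name `containsKana` is hypothetical, and it only answers whether a text contains at least one Hiragana or Katakana rune.

```go
package main

import (
	"fmt"
	"unicode"
)

// containsKana reports whether text has at least one Hiragana or Katakana
// rune. Han-only text (which may be Chinese or Japanese) returns false.
func containsKana(text string) bool {
	for _, r := range text {
		if unicode.In(r, unicode.Hiragana, unicode.Katakana) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(containsKana("漢字とかな")) // Kanji mixed with Hiragana
	fmt.Println(containsKana("汉字"))    // Han characters only
}
```

A caller could apply such a check after detection and treat a Mandarin result as Japanese when it returns true.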

    opened by WhiteRaven777 1
  • func GetListOfLangsBaseOnScript

    Thanks for a great lib. I've added a func to get a list of Langs by script. Maybe it will be useful for you too.

    func GetListOfLangsBaseOnScript(script *unicode.RangeTable) []Lang {
    	var res []Lang
    	switch script {
    	case unicode.Latin:
    		for k := range latinLangs {
    			res = append(res, k)
    		}
    		return res
    	case unicode.Cyrillic:
    		for k := range cyrillicLangs {
    			res = append(res, k)
    		}
    		return res
    
    	case unicode.Devanagari:
    		for k := range devanagariLangs {
    			res = append(res, k)
    		}
    		return res
    	case unicode.Hebrew:
    		for k := range hebrewLangs {
    			res = append(res, k)
    		}
    		return res
    	case unicode.Ethiopic:
    		for k := range ethiopicLangs {
    			res = append(res, k)
    		}
    		return res
    	case unicode.Arabic:
    		for k := range arabicLangs {
    			res = append(res, k)
    		}
    		return res
    	case unicode.Han:
    		res = append(res, Cmn)
    		return res
    	case unicode.Bengali:
    		res = append(res, Ben)
    		return res
    	case unicode.Hangul:
    		res = append(res, Kor)
    		return res
    	case unicode.Georgian:
    		res = append(res, Kat)
    		return res
    	case unicode.Greek:
    		res = append(res, Ell)
    		return res
    	case unicode.Kannada:
    		res = append(res, Kan)
    		return res
    	case unicode.Tamil:
    		res = append(res, Tam)
    		return res
    	case unicode.Thai:
    		res = append(res, Tha)
    		return res
    	case unicode.Gujarati:
    		res = append(res, Guj)
    		return res
    	case unicode.Gurmukhi:
    		res = append(res, Pan)
    		return res
    	case unicode.Telugu:
    		res = append(res, Tel)
    		return res
    	case unicode.Malayalam:
    		res = append(res, Mal)
    		return res
    	case unicode.Oriya:
    		res = append(res, Ori)
    		return res
    	case unicode.Myanmar:
    		res = append(res, Mya)
    		return res
    	case unicode.Sinhala:
    		res = append(res, Sin)
    		return res
    	case unicode.Khmer:
    		res = append(res, Khm)
    		return res
    	case unicode.Katakana:
    		res = append(res, Jpn)
    		return res
    	case unicode.Hiragana:
    		res = append(res, Jpn)
    		return res
    	}
    	return nil
    }
    
    enhancement 
    opened by khamamet 1
  • Fixed a few typos in the README.md

    None of these are really important and some may just depend on how you want to format it, but I thought I would suggest these changes for spelling and grammar.

    opened by flowonyx 0
  • Language detection issue

    When using the "Detect" function, Arabic and English are not detected properly.

    We expect the language to be "english" for "hi" and "hello".

    But for "hi" we get the following response: Language: Zulu Script: Latin Confidence: 0.005592493630771142

    and for "hello" we get: Language: Somali Script: Latin Confidence: 0.010694234025487925

    How can we provide two default languages, such as "arabic" and "english"? If "arabic" is not detected, the result should be "english" with a confidence value.

    If we try to detect a string that mixes two languages, "الأجهزة تحت testing type", we get neither English nor Arabic:

    Language: Uyghur Script: Latin Confidence: 0.06648113790970933

    Any idea how to handle this?
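One possible workaround for the two-language case, sketched with the standard library only: count Arabic-script and Latin-script letters and fall back to English unless Arabic script dominates. This is not part of whatlanggo; the function name `fallbackLanguage` is hypothetical, and script is only a rough proxy for language (Arabic script is also used for Persian and Urdu, for example). The Whitelist option shown in the README is another way to restrict the candidates to the languages you expect.

```go
package main

import (
	"fmt"
	"unicode"
)

// fallbackLanguage returns "arabic" when Arabic-script runes outnumber
// Latin-script runes, and "english" otherwise (including for empty text).
// It looks only at scripts and does not identify the actual language.
func fallbackLanguage(text string) string {
	var arabic, latin int
	for _, r := range text {
		switch {
		case unicode.Is(unicode.Arabic, r):
			arabic++
		case unicode.Is(unicode.Latin, r):
			latin++
		}
	}
	if arabic > latin {
		return "arabic"
	}
	return "english"
}

func main() {
	for _, s := range []string{"hi", "hello", "مرحبا", "الأجهزة تحت testing type"} {
		fmt.Printf("%q -> %s\n", s, fallbackLanguage(s))
	}
}
```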

    opened by jyothisjose 0
  • What does a strongly negative confidence rate mean?

    Hi,

    Hope you are all well!

    I get confidence rates such as -18.66532829205885 or -10.605926394815977. What does that mean?

    Language: Yoruba  Script: Latin  Confidence:  -8.652592309409306
    Language: Turkmen  Script: Latin  Confidence:  -5.528339197102301
    Language: Yoruba  Script: Latin  Confidence:  -8.163311123289779
    Language: Chewa  Script: Latin  Confidence:  -0.8738781333466048
    Language: Yoruba  Script: Latin  Confidence:  -7.287061394685147
    Language: Yoruba  Script: Latin  Confidence:  -9.46254452788719
    Language: Mandarin  Script: Han  Confidence:  1
    Language: English  Script: Latin  Confidence:  -18.66532829205885
    Language: Yoruba  Script: Latin  Confidence:  -10.605926394815977
    

    Cheers, X

    opened by x0rzkov 2
  • detection problem for short text / training option

    Hi,

    Hope you are all well!

    I have a problem detecting French in short sentences like the ones below.

    | Sentence | Language Detected | Real Language | Location |
    | -------------- | --------- | -------- | -------- |
    | Ras. | Esperanto | French | France |
    | RAS bon. | Esperanto | French | France |
    | PAS DE SOUCI. | Portuguese | French | France |
    | Bien. | Spanish | French | France |
    | RIEN A SIGNALER. | Spanish | French | France |
    | Nickel. | Polish | French | France |
    | Pas assez de recul. | Portuguese | French | France |
    | Je recommande. | Dutch | French | France |

    Is there a way to train the model with additional patterns or sentences in order to improve detection confidence?

    By the way, I know the location of these sentences (they are all from France). Is there a way to influence the score with an additional parameter such as the location?

    Thanks in advance for any insights or solutions !

    Cheers, X

    opened by x0rzkov 0
  • Not detecting language in large text

    Example https://play.golang.org/p/qupLXwVQc4m

    The first example is a large text in English. The library can't produce a confident result: the confidence is negative. The second example is a couple of sentences from the same text, and the result is correct. It doesn't matter which language the text is in. Past a certain threshold, detection always breaks.

    I checked https://github.com/kapsteur/franco that seems to be using the same model and trigrams. It works.

    opened by creker 1
Owner
Abado Jack Mtulla
Thunderbolt and lightning very very frightening
Self-contained Machine Learning and Natural Language Processing library in Go

If you like the project, please ★ star this repository to show your support! A Machine Learning library written in pure Go designed to support rele

NLP Odyssey 1.2k Aug 1, 2022
Selected Machine Learning algorithms for natural language processing and semantic analysis in Golang

Natural Language Processing Implementations of selected machine learning algorithms for natural language processing in golang. The primary focus for t

James Bowman 371 Aug 5, 2022
A natural language date/time parser with pluggable rules

when when is a natural language date/time parser with pluggable rules and merge strategies Examples tonight at 11:10 pm at Friday afternoon the deadli

Oleg Lebedev 1.2k Aug 1, 2022
Cross platform locale detection for Golang

go-locale go-locale is a Golang lib for cross platform locale detection. OS Support Support all OS that Golang supported, except android: aix: IBM AIX

Xuanwo 85 Jul 27, 2022
Stemmer packages for Go programming language. Includes English, German and Dutch stemmers.

Stemmer package for Go Stemmer package provides an interface for stemmers and includes English, German and Dutch stemmers as sub-packages: porter2 sub

Dmitry Chestnykh 51 Jan 23, 2022
Gopher-translator - A HTTP API that accepts english word or sentences and translates them to Gopher language

Gopher Translator Service An interview assignment project. To see the full assig

Teodor Draganov 0 Jan 25, 2022
Complete Translation - translate a document to another language

Complete Translation This project is to translate a document to another language. The initial target is English to Korean. Consider this project is no

코딩냄비 4 Feb 25, 2022
Go bindings for the snowball libstemmer library including porter 2

Go (golang) bindings for libstemmer This simple library provides Go (golang) bindings for the snowball libstemmer library including the popular porter

Richard Johnson 19 Feb 17, 2022
Cgo binding for icu4c library

About Cgo binding for icu4c C library detection and conversion functions. Guaranteed compatibility with version 50.1. Installation Installation consis

Dmitry Bondarenko 20 Jan 23, 2022
A Golang library for text processing, including tokenization, part-of-speech tagging, and named-entity extraction.

prose is a natural language processing library (English only, at the moment) in pure Go. It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction.

Joseph Kato 2.9k Aug 7, 2022
A Go library for performing Unicode Text Segmentation as described in Unicode Standard Annex #29

segment A Go library for performing Unicode Text Segmentation as described in Unicode Standard Annex #29 Features Currently only segmentation at Word

bleve 70 Apr 24, 2022
Cgo binding for Snowball C library

Description Snowball stemmer port (cgo wrapper) for Go. Provides word stem extraction functionality. For more detailed info see http://snowball.tartar

Dmitry Bondarenko 31 Feb 13, 2022
A go library for reading and creating ISO9660 images

iso9660 A package for reading and creating ISO9660 Joliet and Rock Ridge extensions are not supported. Examples Extracting an ISO package main import

Kamil Domański 212 Aug 3, 2022
👄 The most accurate natural language detection library in the Go ecosystem, suitable for long and short text alike

👄 The most accurate natural language detection library in the Go ecosystem, suitable for long and short text alike

Peter M. Stahl 688 Jul 28, 2022
👄 The most accurate natural language detection library in the Go ecosystem, suitable for long and short text alike

Its task is simple: It tells you which language some provided textual data is written in. This is very useful as a preprocessing step for linguistic data in natural language processing applications such as text classification and spell checking. Other use cases, for instance, might include routing e-mails to the right geographically located customer service department, based on the e-mails' languages.

Peter M. Stahl 688 Jul 28, 2022
Natural language detection package in pure Go

getlang getlang provides fast natural language detection in Go. Features Offline -- no internet connection required Supports 29 languages Provides ISO

Rylan 138 Jul 28, 2022
Natural-deploy - A natural and simple way to deploy workloads or anything on other machines.

Natural Deploy Its Go way of doing Ansibles: Motivation: Have you ever felt when using ansible or any declarative type of program that is used for dep

Akilan Selvacoumar 0 Jan 3, 2022
Fast face detection, pupil/eyes localization and facial landmark points detection library in pure Go.

Pigo is a pure Go face detection, pupil/eyes localization and facial landmark points detection library based on Pixel Intensity Comparison-based Objec

Endre Simo 3.8k Aug 12, 2022
Elkeid is a Cloud-Native Host-Based Intrusion Detection solution project to provide next-generation Threat Detection and Behavior Audition with modern architecture.

Elkeid is a Cloud-Native Host-Based Intrusion Detection solution project to provide next-generation Threat Detection and Behavior Audition with modern architecture.

Bytedance Inc. 1.3k Aug 14, 2022
Self-contained Machine Learning and Natural Language Processing library in Go

Self-contained Machine Learning and Natural Language Processing library in Go

NLP Odyssey 1.2k Aug 10, 2022
Guess the natural language of a text in Go

guesslanguage This is a Go version of python guess-language. guesslanguage provides a simple way to detect the natural language of unicode string and

Nikita Vershinin 55 Jul 22, 2022
A simple, efficient spring animation library for smooth, natural motion🎼

Harmonica A simple, efficient spring animation library for smooth, natural motion. It even works well on the command line.

Charm 547 Aug 6, 2022
Go web framework with a natural feel

Fireball Overview Fireball is a package for Go web applications. The primary goal of this package is to make routing, response writing, and error hand

Zack Patrick 58 Jul 22, 2022