A tokenizer for Go based on a dictionary and a Bigram language model. (Currently only Chinese segmentation is supported.)

Overview

gotokenizer

A tokenizer for Go based on a dictionary and a Bigram language model. (Currently only Chinese segmentation is supported.)

Motivation

I wanted a simple tokenizer with no unnecessary overhead that uses only the standard library, follows good practices, and has well-tested code.

Features

  • Support Maximum Matching Method
  • Support Minimum Matching Method
  • Support Reverse Maximum Matching
  • Support Reverse Minimum Matching
  • Support Bidirectional Maximum Matching
  • Support Bidirectional Minimum Matching
  • Support using Stop Tokens
  • Support Custom word Filter
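
To make the matching methods above more concrete, here is a minimal sketch of forward and reverse maximum matching over a tiny in-memory dictionary. It is an illustration of the general technique only, not gotokenizer's implementation: the function names `forwardMaxMatch` and `reverseMaxMatch`, the `maxLen` parameter, and the four-word dictionary are all assumptions made for this example.

```go
package main

import "fmt"

// forwardMaxMatch scans left to right and, at each position, takes the
// longest dictionary word of at most maxLen runes. If no dictionary word
// starts at that position, it emits a single rune.
// (Hypothetical helper for illustration; not part of gotokenizer.)
func forwardMaxMatch(text string, dict map[string]bool, maxLen int) []string {
	runes := []rune(text)
	var tokens []string
	for i := 0; i < len(runes); {
		end := i + maxLen
		if end > len(runes) {
			end = len(runes)
		}
		// Shrink the window from the right until it holds a dictionary
		// word or only one rune is left.
		for end > i+1 && !dict[string(runes[i:end])] {
			end--
		}
		tokens = append(tokens, string(runes[i:end]))
		i = end
	}
	return tokens
}

// reverseMaxMatch does the same scan from right to left, shrinking the
// window from the left instead.
// (Hypothetical helper for illustration; not part of gotokenizer.)
func reverseMaxMatch(text string, dict map[string]bool, maxLen int) []string {
	runes := []rune(text)
	var tokens []string
	for j := len(runes); j > 0; {
		start := j - maxLen
		if start < 0 {
			start = 0
		}
		for start < j-1 && !dict[string(runes[start:j])] {
			start++
		}
		// Prepend so the result stays in reading order.
		tokens = append([]string{string(runes[start:j])}, tokens...)
		j = start
	}
	return tokens
}

func main() {
	// Toy dictionary, assumed for this example only.
	dict := map[string]bool{"研究": true, "研究生": true, "生命": true, "起源": true}
	text := "研究生命起源"

	fmt.Println(forwardMaxMatch(text, dict, 3)) // [研究生 命 起源]
	fmt.Println(reverseMaxMatch(text, dict, 3)) // [研究 生命 起源]
}
```

The two directions can disagree on ambiguous input, as they do here. A bidirectional method runs both and then picks one result by some scoring rule; which rule gotokenizer applies is not stated in this README.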

Installation

go get -u github.com/xujiajun/gotokenizer

Usage

package main

import (
	"fmt"

	"github.com/xujiajun/gotokenizer"
)

func main() {
	text := "gotokenizer是一款基于字典和Bigram模型纯go语言编写的分词器,支持6种分词算法。支持stopToken过滤和自定义word过滤功能。"

	dictPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/dict.txt"
	// NewMaxMatch uses NumAndLetterWordFilter as its default wordFilter
	mm := gotokenizer.NewMaxMatch(dictPath)
	// Load the dictionary before segmenting
	mm.LoadDict()

	fmt.Println(mm.Get(text)) //[gotokenizer 是 一款 基于 字典 和 Bigram 模型 纯 go 语言 编写 的 分词器 , 支持 6 种 分词 算法 。 支持 stopToken 过滤 和 自定义 word 过滤 功能 。] <nil>

	// Enable stop-token filtering
	mm.EnabledFilterStopToken = true
	mm.StopTokens = gotokenizer.NewStopTokens()
	stopTokenDicPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/stop_tokens.txt"
	mm.StopTokens.Load(stopTokenDicPath)

	fmt.Println(mm.Get(text)) //[gotokenizer 一款 字典 Bigram 模型 go 语言 编写 分词器 支持 6 种 分词 算法 支持 stopToken 过滤 自定义 word 过滤 功能] <nil>
	fmt.Println(mm.GetFrequency(text)) //map[6:1 种:1 算法:1 过滤:2 支持:2 Bigram:1 模型:1 编写:1 gotokenizer:1 go:1 分词器:1 分词:1 word:1 功能:1 一款:1 语言:1 stopToken:1 自定义:1 字典:1] <nil>

}

For more examples, see the tests.
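
The stop-token filtering and frequency counting shown in the usage example can be pictured with the following standalone sketch. It only illustrates the idea on an already segmented token slice. The helpers `filterStopTokens` and `countFrequency` and the two-entry stop set are assumptions for this example and are not gotokenizer's API.

```go
package main

import "fmt"

// filterStopTokens returns the tokens that are not in the stop set,
// preserving their original order.
// (Hypothetical helper for illustration; not part of gotokenizer.)
func filterStopTokens(tokens []string, stop map[string]bool) []string {
	kept := make([]string, 0, len(tokens))
	for _, t := range tokens {
		if !stop[t] {
			kept = append(kept, t)
		}
	}
	return kept
}

// countFrequency counts how many times each token occurs.
// (Hypothetical helper for illustration; not part of gotokenizer.)
func countFrequency(tokens []string) map[string]int {
	freq := make(map[string]int, len(tokens))
	for _, t := range tokens {
		freq[t]++
	}
	return freq
}

func main() {
	// A fragment of the segmented sentence from the usage example.
	tokens := []string{"支持", "stopToken", "过滤", "和", "自定义", "word", "过滤", "功能", "。"}
	// Toy stop set, assumed for this example only.
	stop := map[string]bool{"和": true, "。": true}

	kept := filterStopTokens(tokens, stop)
	fmt.Println(kept) // [支持 stopToken 过滤 自定义 word 过滤 功能]

	freq := countFrequency(kept)
	fmt.Println(freq["过滤"], freq["支持"]) // 2 1
}
```

Filtering before counting keeps punctuation and function words out of the frequency map, which matches the order of operations in the usage example above.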

Contributing

If you'd like to help out with the project, you can open a pull request.

Author

License

gotokenizer is open-source software licensed under the Apache-2.0 license.

Acknowledgements

This package is inspired by the following:

https://github.com/ysc/word

Releases

  • v1.1.0 (Oct 17, 2018)
  • v1.0.0 (Oct 15, 2018)

    • Support Maximum Matching Method
    • Support Minimum Matching Method
    • Support Reverse Maximum Matching
    • Support Reverse Minimum Matching
    • Support Bidirectional Maximum Matching
    • Support Bidirectional Minimum Matching
    • Support using Stop Tokens
Owner
徐佳军