Efficient text segmentation and NLP in Go; supports English, Chinese, Japanese, and other languages. High-performance word segmentation for Go.

Overview

gse

Efficient text segmentation in Go; supports English, Chinese, Japanese, and other languages.

The dictionary is implemented with a double-array trie. The segmenter finds the shortest path based on word frequency using dynamic programming, and also supports DAG and HMM word segmentation.
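
To make the frequency-based shortest-path idea concrete, here is a minimal sketch. It uses a plain map in place of the double-array trie, and the dictionary entries, the frequencies, and the names toyDict and segment are illustrative assumptions, not gse's data or API.

```go
package main

import (
	"fmt"
	"math"
)

// toyDict maps words to made-up frequencies. A plain map stands in
// for the double-array trie that a real dictionary would use.
var toyDict = map[string]float64{
	"中": 50, "国": 50, "中国": 500, "人": 100, "人民": 400, "民": 30, "国人": 20,
}

// segment splits text so that the summed cost -log(freq/total) of the
// chosen words is minimal, i.e. the shortest path through the word graph.
func segment(text string) []string {
	runes := []rune(text)
	n := len(runes)

	total := 0.0
	for _, f := range toyDict {
		total += f
	}

	// cost[i] is the cheapest cost of segmenting runes[:i];
	// prev[i] is the start index of the last word on that path.
	cost := make([]float64, n+1)
	prev := make([]int, n+1)
	for i := 1; i <= n; i++ {
		cost[i] = math.Inf(1)
	}

	for i := 0; i < n; i++ {
		for j := i + 1; j <= n; j++ {
			f, ok := toyDict[string(runes[i:j])]
			if !ok {
				if j != i+1 {
					continue // unknown multi-character strings are not words
				}
				f = 1 // assumption: an unknown single character gets the lowest frequency
			}
			if c := cost[i] - math.Log(f/total); c < cost[j] {
				cost[j] = c
				prev[j] = i
			}
		}
	}

	// Walk the prev links backwards to recover the words.
	var words []string
	for j := n; j > 0; j = prev[j] {
		words = append([]string{string(runes[prev[j]:j])}, words...)
	}
	return words
}

func main() {
	fmt.Println(segment("中国人民"))
}
```

With these toy frequencies the two-word path 中国 / 人民 is cheaper than any path through single characters, so it is the one selected.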

Supports multiple word segmentation modes (common, search engine, full, precise, and HMM), user dictionaries, POS tagging, and running as a JSON RPC service.

HMM segmentation uses the Viterbi algorithm.
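
As an illustration of Viterbi decoding for character tagging, the sketch below assumes the common B/M/E/S state set (begin, middle, end, single). All probabilities are invented toy values, and the names viterbi, cut, start, trans, and emitProb are hypothetical; this is not gse's model or code.

```go
package main

import (
	"fmt"
	"math"
)

const (
	B = iota // first character of a word
	M        // middle character
	E        // last character
	S        // single-character word
	numStates
)

var ninf = math.Inf(-1)

// Toy log-probabilities, invented for illustration only.
var start = [numStates]float64{B: math.Log(0.6), M: ninf, E: ninf, S: math.Log(0.4)}

var trans = [numStates][numStates]float64{
	B: {B: ninf, M: math.Log(0.3), E: math.Log(0.7), S: ninf},
	M: {B: ninf, M: math.Log(0.4), E: math.Log(0.6), S: ninf},
	E: {B: math.Log(0.5), M: ninf, E: ninf, S: math.Log(0.5)},
	S: {B: math.Log(0.5), M: ninf, E: ninf, S: math.Log(0.5)},
}

var emitProb = [numStates]map[rune]float64{
	B: {'你': 0.6},
	E: {'好': 0.6, '吗': 0.05},
	S: {'你': 0.2, '好': 0.1, '吗': 0.7},
}

// emitLog returns the log emission probability, with a small floor
// for characters the toy model has never seen in that state.
func emitLog(s int, r rune) float64 {
	if p, ok := emitProb[s][r]; ok {
		return math.Log(p)
	}
	return math.Log(1e-3)
}

// viterbi returns the most likely state sequence for obs.
func viterbi(obs []rune) []int {
	n := len(obs)
	if n == 0 {
		return nil
	}
	score := make([][numStates]float64, n)
	back := make([][numStates]int, n)
	for s := 0; s < numStates; s++ {
		score[0][s] = start[s] + emitLog(s, obs[0])
	}
	for t := 1; t < n; t++ {
		for s := 0; s < numStates; s++ {
			best, arg := ninf, 0
			for p := 0; p < numStates; p++ {
				if v := score[t-1][p] + trans[p][s]; v > best {
					best, arg = v, p
				}
			}
			score[t][s] = best + emitLog(s, obs[t])
			back[t][s] = arg
		}
	}
	// The last character must close a word, so only E or S may end the path.
	last := E
	if score[n-1][S] > score[n-1][E] {
		last = S
	}
	path := make([]int, n)
	path[n-1] = last
	for t := n - 1; t > 0; t-- {
		path[t-1] = back[t][path[t]]
	}
	return path
}

// cut turns the tag sequence into words by closing a word after E or S.
func cut(text string) []string {
	runes := []rune(text)
	var words []string
	begin := 0
	for i, s := range viterbi(runes) {
		if s == E || s == S {
			words = append(words, string(runes[begin:i+1]))
			begin = i + 1
		}
	}
	return words
}

func main() {
	fmt.Println(cut("你好吗"))
}
```

Working in log space avoids underflow on long inputs, and the impossible transitions (for example B followed by B) are encoded as negative infinity so they can never win.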

Text segmentation speed: 9.2 MB/s single-threaded, 26.8 MB/s with concurrent goroutines. HMM text segmentation: 3.2 MB/s single-threaded. (Measured on a 2-core, 4-thread MacBook Pro.)
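
The concurrent figure comes from spreading work across goroutines. A minimal sketch of that pattern follows; it assumes the segmenter can be shared safely once its dictionary is loaded, and it uses strings.Fields as a stand-in for the real call. The names segmentFunc and segmentAll are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// segmentFunc is a stand-in for a real segmenter call.
type segmentFunc func(string) []string

// segmentAll segments every line using the given number of worker
// goroutines and returns the results in input order.
func segmentAll(lines []string, workers int, seg segmentFunc) [][]string {
	if workers < 1 {
		workers = 1
	}
	results := make([][]string, len(lines))
	jobs := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				// Each index is written by exactly one goroutine,
				// so no extra locking is needed.
				results[i] = seg(lines[i])
			}
		}()
	}
	for i := range lines {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	lines := []string{"Hello world", "Winter is coming"}
	for _, words := range segmentAll(lines, 4, strings.Fields) {
		fmt.Println(words)
	}
}
```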

Binding:

gse-bind provides bindings for JavaScript and other languages.

Install / update

go get -u github.com/go-ego/gse

Build-tools

go get -u github.com/go-ego/re

re gse

To create a new gse application

$ re gse my-gse

re run

To run the application we just created, you can navigate to the application folder and execute:

$ cd my-gse && re run

Use

package main

import (
	"fmt"

	"github.com/go-ego/gse"
	"github.com/go-ego/gse/hmm/pos"
)

var (
	text = "Hello world, Helloworld. Winter is coming! 你好世界."

	new, _ = gse.New("zh,testdata/test_dict3.txt", "alpha")

	seg gse.Segmenter
	posSeg pos.Segmenter
)

func cut() {
	hmm := new.Cut(text, true)
	fmt.Println("cut use hmm: ", hmm)

	hmm = new.CutSearch(text, true)
	fmt.Println("cut search use hmm: ", hmm)

	hmm = new.CutAll(text)
	fmt.Println("cut all: ", hmm)
}

func main() {
	cut()

	segCut()
}

func posAndTrim(cut []string) {
	cut = seg.Trim(cut)
	fmt.Println("cut all: ", cut)

	pos.WithGse(seg)
	po := posSeg.Cut(text, true)
	fmt.Println("pos: ", po)

	po = posSeg.TrimWithPos(po, "zg")
	fmt.Println("trim pos: ", po)
}

func cutPos() {
	fmt.Println(seg.String(text, true))
	fmt.Println(seg.Slice(text, true))

	po := seg.Pos(text, true)
	fmt.Println("pos: ", po)
	po = seg.TrimPos(po)
	fmt.Println("trim pos: ", po)
}

func segCut() {
	// Loading the default dictionary
	seg.LoadDict()
	// Load the dictionary
	// seg.LoadDict("your gopath"+"/src/github.com/go-ego/gse/data/dict/dictionary.txt")

	// Text Segmentation
	tb := []byte(text)
	fmt.Println(seg.String(text, true))

	segments := seg.Segment(tb)

	// Handle the word segmentation results.
	// Two output modes are supported, normal and search;
	// see the comments on the ToString function in the code.
	// Search mode is mainly used to provide search engines
	// with as many keywords as possible.
	fmt.Println(gse.ToString(segments, true))
}

Look at a custom dictionary example

package main

import (
	"fmt"

	"github.com/go-ego/gse"
)

func main() {
	var seg gse.Segmenter
	seg.LoadDict("zh,testdata/test_dict.txt,testdata/test_dict1.txt")

	text1 := "你好世界, Hello world"
	fmt.Println(seg.String(text1, true))

	segments := seg.Segment([]byte(text1))
	fmt.Println(gse.ToString(segments))
}

Look at a Chinese example

Look at a Japanese example

Authors

License

Gse is primarily distributed under the terms of both the MIT license and the Apache License (Version 2.0). Thanks to sego and jieba (jiebago).

Issues
  • Bro, the stop-word dictionary never takes effect, added

    package main

    import (
    	"fmt"

    	"github.com/go-ego/gse"
    )

    var (
    	text   = "第一次爱的人是谁演唱的"
    	new, _ = gse.New("dict.txt")

    	seg gse.Segmenter
    )

    func main() {
    	cut()
    }

    func cut() {
    	new.LoadStop("stop.txt")
    	new.AddStop("的")
    	new.AddStop("是") // adding this line does not help either
    	fmt.Println("cut: ", new.Cut(text, true))
    	fmt.Println("cut all: ", new.CutAll(text))
    	fmt.Println("cut for search: ", new.CutSearch(text, true))
    	fmt.Println(new.String(text, true))
    }

    // The console prints the following:
    // 2022/02/18 17:44:34 Dict files path: [dict.txt]
    // 2022/02/18 17:44:34 Load the gse dictionary: "dict.txt"
    // 2022/02/18 17:44:34 Gse dictionary loaded finished.
    // 2022/02/18 17:44:34 Load the stop word dictionary: "stop.txt"
    // cut: [第一次爱的人 是 谁 演唱 的]
    // cut all: [第一次爱的人 是 谁 演唱 的]
    // cut for search: [第一次爱的人 是 谁 演唱 的]
    // 第一次爱的人/n 是/x 谁/x 演唱/v 的/x
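
A library-independent workaround is to filter the returned slice against your own stop set. The sketch below assumes the tokens come from a call such as Cut; filterStop and the stop set are hypothetical names and are not part of gse.

```go
package main

import "fmt"

// filterStop returns the tokens that are not in the stop set.
// It allocates a new slice and leaves the input untouched.
func filterStop(tokens []string, stop map[string]bool) []string {
	kept := make([]string, 0, len(tokens))
	for _, tok := range tokens {
		if !stop[tok] {
			kept = append(kept, tok)
		}
	}
	return kept
}

func main() {
	// Tokens copied from the output reported above.
	tokens := []string{"第一次爱的人", "是", "谁", "演唱", "的"}
	stop := map[string]bool{"的": true, "是": true}
	fmt.Println(filterStop(tokens, stop))
}
```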

    question 
    opened by ColorfulDick 7
  • Question: Is there any way to get segment info(not only string but with start and end) in hmm and search mode?

    • Gse version (or commit ref): 0.60
    • Go version: 1.14
    • Operating system and bit: macOS 10.14

    Description

    In my case, I need the start and end info of each word after segmenting in HMM and search mode. Reading the APIs, I only found:

    • CutSearch(string, true), which only returns []string with no start and end info
    • Segment([]byte(text)), which returns segments with start and end info, but does not accept a parameter to choose search mode.

    Is there any way to do something like Segment([]byte(text), searchMode)?
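
One way to approximate this outside the library is to recover byte offsets from the []string result by searching the original text. The sketch below is a heuristic under stated assumptions: tokens appear verbatim in the text (a segmenter that lowercases its output would break this), and an overlapping search-mode token that starts before the current position is located by a fallback search from the beginning. The names span and locate are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// span records a token and its byte offsets in the original text.
// End is exclusive; both offsets are -1 when the token is not found.
type span struct {
	Text       string
	Start, End int
}

// locate maps each token back to byte offsets in text.
func locate(text string, tokens []string) []span {
	spans := make([]span, 0, len(tokens))
	cursor := 0 // end of the right-most match so far
	for _, tok := range tokens {
		from := cursor
		i := strings.Index(text[from:], tok)
		if i < 0 {
			// Search-mode output can contain overlapping tokens that
			// start before the cursor, so retry from the beginning.
			from = 0
			i = strings.Index(text, tok)
		}
		if i < 0 {
			spans = append(spans, span{tok, -1, -1})
			continue
		}
		start := from + i
		end := start + len(tok)
		spans = append(spans, span{tok, start, end})
		if end > cursor {
			cursor = end
		}
	}
	return spans
}

func main() {
	text := "摄影机拍摄的的"
	for _, s := range locate(text, []string{"摄影", "摄影机", "拍摄", "的", "的"}) {
		fmt.Println(s.Text, s.Start, s.End)
	}
}
```

Because the fallback takes the first occurrence, a word that also appears earlier in the text can be attributed to the wrong position, so treat the offsets as approximate.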

    enhancement 
    opened by leopku 7
  • Bleve

    Has anyone tried using this with bleve?

    Bleve does this plus a lot more, but lacks decent Chinese / Japanese stemmers.

    Using this with bleve would be a powerful stack.

    enhancement question 
    opened by ghost 6
  • Could not load dictionaries

    I pulled gse through go mod, but the dictionary data in gse was not pulled down, so I got a "Could not load dictionaries" error. I then copied the dictionary data into the gse package to get it running. So, I think:

    1. Can you remove the hard-coded dictionary location in gse, or make it configurable through parameters?
    2. If you want to load dictionary data, is it possible to convert the dictionary data into Go static data code through "go-bindata" or other
    question 
    opened by kooksee 5
  • How to build without embed dictionary on Go1.16 or above?

    Hi,

    I noticed that gse leads to a large binary size. After reviewing the code, I found the problem may lie here, caused by the embedded dictionary.

    The program binary may vary, but the dictionary rarely changes. So is there a way to build a gse project without the embedded dictionary?

    Thanks.

    • Gse version (or commit ref): 0.70.1
    • Go version: 1.17
    • Operating system and bit: Ubuntu 20.04 64bit
    • Can you reproduce the bug at Examples:
      • [x] Yes (provide example code)
      • [ ] No
      • [ ] Not relevant
    question 
    opened by AmazingRise 4
  • Is there any bug of seg.ModeSegment?

    The results of seg.Segment and seg.ModeSegment are the same. Is this a bug?

    I thought the result of ModeSegment should be like seg.CutSearch.

    test code:

    package main
    
    import (
    	"fmt"
    
    	"github.com/go-ego/gse"
    )
    
    var (
    	seg  gse.Segmenter
    	text = "《复仇者联盟3:无限战争》是全片使用IMAX摄影机拍摄制作的的科幻片."
    )
    
    func main() {
    	seg.LoadDict()
    	addToken()
    	cut()
    }
    
    func addToken() {
    	seg.AddToken("《复仇者联盟3:无限战争》", 100, "n")
    }
    
    // Segment text using DAG or HMM mode
    func cut() {
    	// "《复仇者联盟3:无限战争》是全片使用IMAX摄影机拍摄制作的的科幻片."
    
    	// use DAG and HMM
    	hmm := seg.Cut(text, true)
    	fmt.Println("cut use hmm: ", hmm)
    	// cut use hmm:  [《复仇者联盟3:无限战争》 是 全片 使用 imax 摄影机 拍摄 制作 的 的 科幻片 .]
    
    	cut := seg.Cut(text)
    	fmt.Println("cut: ", cut)
    	// cut:  [《 复仇者 联盟 3 : 无限 战争 》 是 全片 使用 imax 摄影机 拍摄 制作 的 的 科幻片 .]
    
    	hmm = seg.CutSearch(text, true)
    	fmt.Println("cut search use hmm: ", hmm)
    	//cut search use hmm:  [复仇 仇者 联盟 无限 战争 复仇者 《复仇者联盟3:无限战争》 是 全片 使用 imax 摄影 摄影机 拍摄 制作 的 的 科幻 科幻片 .]
    	fmt.Println("analyze: ", seg.Analyze(hmm, text))
    
    	cut = seg.CutSearch(text)
    	fmt.Println("cut search: ", cut)
    	// cut search:  [《 复仇 者 复仇者 联盟 3 : 无限 战争 》 是 全片 使用 imax 摄影 机 摄影机 拍摄 制作 的 的 科幻 片 科幻片 .]
    
    	segment1 := seg.Segment([]byte(text))
    	for i, token := range segment1 {
    		fmt.Println(i, token.Token().Text())
    	}
    	segment2 := seg.ModeSegment([]byte(text), true)
    	for i, token := range segment2 {
    		fmt.Println(i, token.Token().Text())
    	}
    }
    
    question 
    opened by hengfeiyang 3
  • "\001" in text gets error result

    • Gse version (or commit ref): 1fd1428e78fe
    • Go version: 1.14.2
    • Operating system and bit: any
    • Can you reproduce the bug at Examples:
      • [ ] No
      • [ ] Yes (provide example code)
      • [x] Not relevant
    • Provide example code:
    func TestSegment(t *testing.T) {
    	seg := &gse.Segmenter{}
    	err := seg.LoadDict("../data/dictionary.txt")
    	if err != nil {
    		t.Fatal(err)
    	}
    	data := []byte("\001你好吗", )
    	res := seg.Segment(data)
    	for _, re := range res {
    		t.Log(re.Token().Text())
    		t.Log(re.Start())
    		t.Log(re.End())
    	}
    }
    
    • Log gist:
    
        TestSegment: process_test.go:51: 你
        TestSegment: process_test.go:52: 0
        TestSegment: process_test.go:53: 3
        TestSegment: process_test.go:51: 你好
        TestSegment: process_test.go:52: 3
        TestSegment: process_test.go:53: 9
        TestSegment: process_test.go:51: 吗
        TestSegment: process_test.go:52: 9
        TestSegment: process_test.go:53: 12
    

    Description

    The first token should be "\001", but we get the second word instead. The start of the second token should be 1.

    question 
    opened by zhyon404 3
  • Bro, the stop-word dictionary method never takes effect, what is going on?

    package main

    import (
    	"fmt"

    	"github.com/go-ego/gse"
    )

    var (
    	text   = "第一次爱的人是谁演唱的"
    	new, _ = gse.New("dict.txt")

    	seg gse.Segmenter
    )

    func main() {
    	cut()
    }

    // loadDictEmbed supported from go1.16
    func loadDictEmbed() {
    	seg.LoadDictEmbed()
    	seg.LoadStopEmbed()
    }

    func cut() {
    	new.LoadStop("stop.txt")
    	new.IsStop("是") // after adding "是" to the stop dictionary, "是" still appears in the segmentation result
    	fmt.Println("cut: ", new.Cut(text, true))
    	fmt.Println("cut all: ", new.CutAll(text))
    	fmt.Println("cut for search: ", new.CutSearch(text, true))
    	fmt.Println(new.String(text, true))
    }

    // The output is as follows:
    // cut: [第一次爱的人 是 谁 演唱 的]
    // cut all: [第一次爱的人 是 谁 演 唱 的]
    // cut for search: [第一次爱的人 是 谁 演唱 的]
    // 第一次爱的人/n 是/x 谁/x 演/x 唱/x 的/x

    question 
    opened by ColorfulDick 2
  • Float should not be split

    Splitting " loss of 76.7" gives "loss / of / 76 / . / 7", but I want "loss / of / 76.7".

    What can I do?

    • Gse version (or commit ref): v0.69.3
    • Go version: 1.16
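
One possible workaround is a post-processing pass that rejoins digit, ".", digit runs in the token slice. This is a minimal sketch, not a gse feature; mergeFloats and isDigits are hypothetical helper names, and only ASCII digits are handled.

```go
package main

import "fmt"

// isDigits reports whether s is a non-empty run of ASCII digits.
func isDigits(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

// mergeFloats joins the three-token pattern digits, ".", digits
// into one token and copies every other token unchanged.
func mergeFloats(tokens []string) []string {
	out := make([]string, 0, len(tokens))
	for i := 0; i < len(tokens); i++ {
		if i+2 < len(tokens) && isDigits(tokens[i]) && tokens[i+1] == "." && isDigits(tokens[i+2]) {
			out = append(out, tokens[i]+"."+tokens[i+2])
			i += 2 // skip the "." and the fractional part
			continue
		}
		out = append(out, tokens[i])
	}
	return out
}

func main() {
	fmt.Println(mergeFloats([]string{"loss", "of", "76", ".", "7"}))
}
```

A sentence-final period after a number ("76", ".") is left alone because the pattern requires digits on both sides of the dot.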
    question 
    opened by mkdreams 2
  • - temporarily disable debug output until a better mechanism

    ... to turn them on

    Please provide Issues links to:

    • Issues: #1 https://github.com/go-ego/gpy/issues/16

    Provide test code:

    $ pinyin -p 银行
    

    Description

    The output from pinyin -p looks very much like debug output; disable it temporarily until there is a better way to turn it on.

    opened by suntong 2
  • Invalid ranges produced on bad inputs

    segments := segmenter.Segment([]byte(w))
    for i, seg := range segments {
    	if seg.End() > len(w) {
    		log.Println("bad split: ", seg.Start(), "/", seg.End(), " w '", w, "' len ", len(w), "hex ", hex.EncodeToString([]byte(w)), "i ", i)
    	} else {
    		log.Println(w[seg.Start():seg.End()])
    	}
    }
    

    The code above gives bad ranges on bad inputs. Examples:

    bad split: 3 / 6 w ' 缊 ' len 4 hex 01e7bc8a i 1
    bad split: 5 / 8 w ' Vm犹 ' len 6 hex 566d01e78ab9 i 2
    bad split: 5 / 8 w ' Vm犹 ' len 6 hex 566d01e78ab9 i 2
    bad split: 3 / 6 w ' 榬 ' len 4 hex 01e6a6ac i 1

    Expected behavior is that seg.End() should always be in range 0..len(w)

    1. ego commit is recent, bdc71ec04efb0af77cf548eda828c001a76f3ae1
    2. go version go1.9 windows/amd64
    package main
    
    import (
    	"encoding/hex"
    	"flag"
    	"fmt"
    	"log"

    	"github.com/go-ego/gse"
    )
    
    func main() {
    	flag.Parse()
    	var seg gse.Segmenter
    	seg.LoadDict()
    	text, _ := hex.DecodeString("01e7bc8a")
    	segments := seg.Segment([]byte(text))
    	fmt.Println(gse.ToString(segments, true))
    	for _, seg := range segments {
    		log.Println(text[seg.Start():seg.End()])
    	}
    }
    
    2017/11/12 10:41:25 载入 gse 词典 C:/Users/Valle/go/src/github.com/go-ego/gse/data/dict/dictionary.txt
    缊/zg 缊/zg 
    2017/11/12 10:41:27 gse 词典载入完毕
    2017/11/12 10:41:27 [1 231 188]
    
    panic: runtime error: slice bounds out of range
    
    goroutine 1 [running]:
    main.main()
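
    Until the offsets are reliable, callers can guard the slice expression themselves. A minimal sketch, where safeSlice is a hypothetical helper and the byte values and ranges are taken from the report above:

```go
package main

import "fmt"

// safeSlice returns text[start:end] only when the range is valid,
// so a bad segment offset cannot cause a slice-bounds panic.
func safeSlice(text []byte, start, end int) ([]byte, bool) {
	if start < 0 || end < start || end > len(text) {
		return nil, false
	}
	return text[start:end], true
}

func main() {
	text := []byte{0x01, 0xe7, 0xbc, 0x8a} // hex 01e7bc8a from the report

	// 3..6 is the reported bad range; 1..4 covers the three-byte character.
	for _, r := range [][2]int{{3, 6}, {1, 4}} {
		if b, ok := safeSlice(text, r[0], r[1]); ok {
			fmt.Printf("%d..%d ok: %q\n", r[0], r[1], b)
		} else {
			fmt.Printf("%d..%d out of range for len %d\n", r[0], r[1], len(text))
		}
	}
}
```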
    
    enhancement 
    opened by ValleZ 2
  • V1 Release?

    Hi, I was looking for a good Chinese/Japanese tokenizer in Go and stumbled across this one.

    Based on the release history, it looks like this library has been in use for quite a while, but it's still v0. Any reason not to issue an official v1 release?

    It would also be nice to see quality metrics on the readme, if you have any. E.g. comparison to data like https://universaldependencies.org/

    question 
    opened by dankinder 1
Releases(v0.70.2)
Utilities for working with discrete probability distributions and other tools useful for doing NLP work

GNLP A few structures for doing NLP analysis / experiments. Basics counter.Counter A map-like data structure for representing discrete probability dis

Matt Jones 92 Aug 9, 2022
[UNMAINTAINED] Extract values from strings and fill your structs with nlp.

nlp nlp is a general purpose any-lang Natural Language Processor that parses the data inside a text and returns a filled model Supported types int in

Juan Alvarez 379 Jul 28, 2022
Self-contained Japanese Morphological Analyzer written in pure Go

Kagome v2 Kagome is an open source Japanese morphological analyzer written in pure golang. The dictionary/statistical models such as MeCab-IPADIC, Uni

ikawaha 664 Aug 5, 2022
A Go package for n-gram based text categorization, with support for utf-8 and raw text

A Go package for n-gram based text categorization, with support for utf-8 and raw text. To do: write documentation make it faster Keywords: text categ

Peter Kleiweg 67 Feb 13, 2022
Chinese word splitting algorithm MMSEG in GO

MMSEGO This is a GO implementation of MMSEG which a Chinese word splitting algorithm. TO DO list Documentation/comments Benchmark Usage #Input Diction

Andy Song 61 Feb 21, 2022
Stemmer packages for Go programming language. Includes English, German and Dutch stemmers.

Stemmer package for Go Stemmer package provides an interface for stemmers and includes English, German and Dutch stemmers as sub-packages: porter2 sub

Dmitry Chestnykh 51 Jan 23, 2022
Gopher-translator - A HTTP API that accepts english word or sentences and translates them to Gopher language

Gopher Translator Service An interview assignment project. To see the full assig

Teodor Draganov 0 Jan 25, 2022
A Golang library for text processing, including tokenization, part-of-speech tagging, and named-entity extraction.

prose is a natural language processing library (English only, at the moment) in pure Go. It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction.

Joseph Kato 2.9k Aug 7, 2022
ASCII transliterations of Unicode text.

go-unidecode ASCII transliterations of Unicode text. Inspired by python-unidecode. Installation go get -u github.com/mozillazg/go-unidecode Install C

Huang Huang 93 Jul 30, 2022
A tool to find all duplicates in large sets of text documents.

⊧ dupi Dupi is an engine for identifying and exploring duplicative text in sets of documents. Status Dupi is in alpha/early beta development stage. Pl

go-air 13 Mar 3, 2022
i18n (Internationalization and localization) engine written in Go, used for translating locale strings.

go-localize Simple and easy to use i18n (Internationalization and localization) engine written in Go, used for translating locale strings. Use with go

Miles Croxford 38 Jul 17, 2022
Read and use word2vec vectors in Go

Introduction This is a package for reading word2vec vectors in Go and finding similar words and analogies. Installation This package can be installed

Daniël de Kok 47 Jul 3, 2022
Selected Machine Learning algorithms for natural language processing and semantic analysis in Golang

Natural Language Processing Implementations of selected machine learning algorithms for natural language processing in golang. The primary focus for t

James Bowman 371 Aug 5, 2022
Self-contained Machine Learning and Natural Language Processing library in Go

If you like the project, please ★ star this repository to show your support! A Machine Learning library written in pure Go designed to support rele

NLP Odyssey 1.2k Aug 1, 2022
A go library for reading and creating ISO9660 images

iso9660 A package for reading and creating ISO9660 Joliet and Rock Ridge extensions are not supported. Examples Extracting an ISO package main import

Kamil Domański 212 Aug 3, 2022
Package i18n provides internationalization and localization for your Go applications.

i18n Package i18n provides internationalization and localization for your Go applications. Installation The minimum requirement of Go is 1.16. go get

null 53 Aug 8, 2022
Produces a set of tags from given source. Source can be either an HTML page, Markdown document or a plain text. Supports English, Russian, Chinese, Hindi, Spanish, Arabic, Japanese, German, Hebrew, French and Korean languages.

Tagify Gets STDIN, file or HTTP address as an input and returns a list of most popular words ordered by popularity as an output. More info about what

ZoomIO 23 Jul 17, 2022
An easy-to-use OCR and Japanese to English translation tool

Manga Translator An easy-to-use application for translating text in images from Japanese to English. The GUI was created using Gio. Gio supports a var

Cameron Kinsella 40 Aug 11, 2022
A Go library for performing Unicode Text Segmentation as described in Unicode Standard Annex #29

segment A Go library for performing Unicode Text Segmentation as described in Unicode Standard Annex #29 Features Currently only segmentation at Word

bleve 70 Apr 24, 2022
Commonwords - Simple cli to find words in text that are not in the 1000 most common English words

Thousand common words Find words in a text that are not in the 1000 most common

Sakari Mursu 0 Feb 1, 2022
Tools to help with Japanese sentence mining

Tools to help with Japanese sentence mining

Anton Van Eechaute 2 Dec 10, 2021
qclean lets you to clean up search query in japanese.

qclean qclean lets you to clean up search query in japanese. This is mainly used to remove wasted space. Quick Start package main var cleaner *qclean

po3rin 0 Jan 4, 2022
Convert Arabic numeric amounts to Chinese character

Convert Arabic numeric amounts to Chinese character form. Installation and usage: requires Go 1.16 or later. go get -u github.com/aliliin/rmb-character import (

高永立 4 Sep 9, 2021