Chinese word segmentation algorithm MMSEG in Go

Overview

MMSEGO

This is a Go implementation of MMSEG, a Chinese word segmentation algorithm.

TO DO list

  • Documentation/comments
  • Benchmark

Usage

#Input Dictionary Format

Key\tFreq

Each key occupies one line, with the key and its frequency separated by a tab. The file should be UTF-8 encoded; refer to go-darts for details.
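For illustration, here is a minimal sketch of reading lines in the Key\tFreq layout described above. It is not part of mmsego or go-darts: the `entry` type and `parseDictLine` helper are hypothetical names, and treating the frequency as an integer is an assumption.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// entry is a hypothetical type holding one dictionary line.
type entry struct {
	Key  string
	Freq int
}

// parseDictLine splits one "Key\tFreq" line into its two fields.
// Assumption: the frequency is a base-10 integer.
func parseDictLine(line string) (entry, error) {
	line = strings.TrimRight(line, "\r\n")
	parts := strings.SplitN(line, "\t", 2)
	if len(parts) != 2 || parts[0] == "" {
		return entry{}, fmt.Errorf("malformed line %q: want Key<TAB>Freq", line)
	}
	n, err := strconv.Atoi(strings.TrimSpace(parts[1]))
	if err != nil {
		return entry{}, fmt.Errorf("bad frequency in %q: %v", line, err)
	}
	return entry{Key: parts[0], Freq: n}, nil
}

func main() {
	// Sample data stands in for a UTF-8 encoded dictionary file.
	sample := "中国\t100\n人民\t80\n"
	sc := bufio.NewScanner(strings.NewReader(sample))
	for sc.Scan() {
		e, err := parseDictLine(sc.Text())
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Printf("%s %d\n", e.Key, e.Freq)
	}
}
```

A check like this can be run over a dictionary file before building it with go-darts, so that malformed lines are caught early.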

#Code example

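package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
    "time"

    "mmsego"
)

func main() {
    // Load the dictionary. Init is assumed to return an error,
    // as the check below implies.
    var s = new(mmsego.Segmenter)
    err := s.Init("darts.lib")
    if err != nil {
        log.Fatal(err)
    }

    t := time.Now()
    offset := 0 // byte offset of the current line within the whole input

    unifile, err := os.Open("/tmp/a.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer unifile.Close()

    uniLineReader := bufio.NewReaderSize(unifile, 4000)
    line, bufErr := uniLineReader.ReadString('\n')
    for nil == bufErr {
        // To print each word, use this callback instead of the empty one:
        //takeWord := func(off int, length int){ fmt.Printf("%s ", string(line[off-offset:off-offset+length])) }
        takeWord := func(off, length int) {}
        s.Mmseg(line[:], offset, takeWord, nil, false)
        offset += len(line)
        line, bufErr = uniLineReader.ReadString('\n')
    }

    // Segment whatever remains after the last newline and print each word.
    // Only this final call passes true as the last argument.
    takeWord := func(off int, length int) {
        fmt.Printf("%s ", string(line[off-offset:off-offset+length]))
    }
    s.Mmseg(line, offset, takeWord, nil, true)

    fmt.Printf("Duration: %v\n", time.Since(t))
}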

LICENSE

Apache License 2.0

Owner
Andy Song