Self-contained Japanese Morphological Analyzer written in pure Go

Overview

GoDev Go Coverage Status Docker Images Docker Pulls deploy demo

Kagome v2

Kagome is an open source Japanese morphological analyzer written in pure golang. The dictionary/statistical models such as MeCab-IPADIC, UniDic (unidic-mecab) and so on, are able to be embedded in binaries.

Improvements from v1.

  • Dictionaries are maintained in a separate repository, and only the dictionaries you need are embedded in the binary.
  • Brushed up and added several APIs.

Dictionaries

dict source package
MeCab IPADIC mecab-ipadic-2.7.0-20070801 github.com/ikawaha/kagome-dict/ipa
UniDIC unidic-mecab-2.1.2_src github.com/ikawaha/kagome-dict/uni

Experimental Features

dict source package
mecab-ipadic-NEologd mecab-ipadic-neologd github.com/ikawaha/kagome-ipa-neologd
Korean MeCab mecab-ko-dic-2.1.1-20180720 github.com/ikawaha/kagome-dict-ko

Segmentation mode for search

Kagome has segmentation mode for search such as Kuromoji.

  • Normal: Regular segmentation
  • Search: Use a heuristic to do additional segmentation useful for search
  • Extended: Similar to search mode, but also uni-gram unknown words
Untokenized Normal Search Extended
関西国際空港 関西国際空港 関西 国際 空港 関西 国際 空港
日本経済新聞 日本経済新聞 日本 経済 新聞 日本 経済 新聞
シニアソフトウェアエンジニア シニアソフトウェアエンジニア シニア ソフトウェア エンジニア シニア ソフトウェア エンジニア
デジカメを買った デジカメ を 買っ た デジカメ を 買っ た デ ジ カ メ を 買っ た

Programming example

package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	// wakati
	fmt.Println("---wakati---")
	seg := t.Wakati("すもももももももものうち")
	fmt.Println(seg)

	// tokenize
	fmt.Println("---tokenize---")
	tokens := t.Tokenize("すもももももももものうち")
	for _, token := range tokens {
		features := strings.Join(token.Features(), ",")
		fmt.Printf("%s\t%v\n", token.Surface, features)
	}
}

output:

---wakati---
[すもも も もも も もも の うち]
---tokenize---
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ

Reference

実践:形態素解析 kagome v2

Commands

Install

Go

env GO111MODULE=on go get -u github.com/ikawaha/kagome/v2

Homebrew tap

brew install ikawaha/kagome/kagome

Usage

$ kagome -h
Japanese Morphological Analyzer -- github.com/ikawaha/kagome/v2
usage: kagome <command>
The commands are:
   [tokenize] - command line tokenize (*default)
   server - run tokenize server
   lattice - lattice viewer
   version - show version

tokenize [-file input_file] [-dict dic_file] [-userdict userdic_file] [-sysdict (ipa|uni)] [-simple false] [-mode (normal|search|extended)]
  -dict string
    	dict
  -file string
    	input file
  -mode string
    	tokenize mode (normal|search|extended) (default "normal")
  -simple
    	display abbreviated dictionary contents
  -sysdict string
    	system dict type (ipa|uni) (default "ipa")
  -udict string
    	user dict

Tokenize command

% kagome
すもももももももものうち
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS

Server command

API

Start a server and try to access the "/tokenize" endpoint.

% kagome server &
% curl -XPUT localhost:6060/tokenize -d'{"sentence":"すもももももももものうち", "mode":"normal"}' | jq . 

Web App

demo

Start a server and access http://localhost:6060. (To draw a lattice, demo application uses graphviz . You need graphviz installed.)

% kagome server &

Lattice command

A debug tool of tokenize process outputs a lattice in graphviz dot format.

% kagome lattice 私は鰻 | dot -Tpng -o lattice.png

lattice

Docker

Docker

Licence

MIT

Issues
  • There is nothing better than better documentation

    There is nothing better than better documentation

    KEINOS Thank you very much! Maybe this is obvious stuff and one is expected to know this, but I think it would be nice to include something like your comment in the README.

    Originally posted by @CaptainDario in https://github.com/ikawaha/kagome/issues/274#issuecomment-1198047786

    opened by ikawaha 3
Releases(v2.8.0)
Owner
ikawaha
gopher
ikawaha
Go efficient text segmentation and NLP; support english, chinese, japanese and other. Go 语言高性能分词

gse Go efficient text segmentation; support english, chinese, japanese and other. 简体中文 Dictionary with double array trie (Double-Array Trie) to achiev

ego 2k Jul 29, 2022
Natural language detection package in pure Go

getlang getlang provides fast natural language detection in Go. Features Offline -- no internet connection required Supports 29 languages Provides ISO

Rylan 138 Jul 28, 2022
i18n (Internationalization and localization) engine written in Go, used for translating locale strings.

go-localize Simple and easy to use i18n (Internationalization and localization) engine written in Go, used for translating locale strings. Use with go

Miles Croxford 38 Jul 17, 2022
:tophat: Small self-contained pure-Go web server with Lua, Markdown, HTTP/2, QUIC, Redis and PostgreSQL support

Web server with built-in support for QUIC, HTTP/2, Lua, Markdown, Pongo2, HyperApp, Amber, Sass(SCSS), GCSS, JSX, BoltDB (built-in, stores the databas

Alexander F. Rødseth 2k Jul 31, 2022
A fully self-contained Nmap like parallel port scanning module in pure Golang that supports SYN-ACK (Silent Scans)

gomap What is gomap? Gomap is a fully self-contained nmap like module for Golang. Unlike other projects which provide nmap C bindings or rely on other

jtimperio 69 Jul 13, 2022
Nginx-Log-Analyzer is a lightweight (simplistic) log analyzer for Nginx.

Nginx-Log-Analyzer is a lightweight (simplistic) log analyzer, used to analyze Nginx access logs for myself.

Mao Mao 22 Jul 23, 2022
Log-analyzer - Log analyzer with golang

Log Analyzer what do we have here? Objective Installation and Running Applicatio

Lawrence Agbani 0 Jan 27, 2022
Self-contained Machine Learning and Natural Language Processing library in Go

If you like the project, please ★ star this repository to show your support! ?? A Machine Learning library written in pure Go designed to support rele

NLP Odyssey 1.2k Aug 1, 2022
A tool for generating self-contained, type-safe test doubles in go

counterfeiter When writing unit-tests for an object, it is often useful to have fake implementations of the object's collaborators. In go, such fake i

Max Brunsfeld 707 Jul 30, 2022
Product Analytics, Business Intelligence, and Product Management in a fully self-contained box

Engauge Concept It's not pretty but it's functional. Track user interactions in your apps and products in real-time and see the corresponding stats in

Engauge 93 Nov 17, 2021
Nightly binary builds of Emacs for macOS as a self-contained Emacs.app, with native-compilation.

Emacs Builds Nightly binary builds of Emacs for macOS as a self-contained Emacs.app, with native-compilation. Features Self-contained Emacs.app applic

Jim Myhrberg 193 Jul 29, 2022
A tiny self-contained pasting service with a built-in database.

A tiny self-contained pasting service with a built-in database.

SWZ 2 Dec 18, 2021
Self-contained Machine Learning and Natural Language Processing library in Go

Self-contained Machine Learning and Natural Language Processing library in Go

NLP Odyssey 1.2k Aug 1, 2022
A simple single-file executable to pull a git-ssh repository and serve the web app found to a self-contained browser window

go-git-serve A simple single-file executable to pull a git-ssh repository (using go-git library) and serve the web app found to a self-contained brows

Justin Searle 0 Jan 19, 2022
Simple HTTP/HTTPS proxy - designed to be distributed as a self-contained binary that can be dropped in anywhere and run.

Simple Proxy This is a simple HTTP/HTTPS proxy - designed to be distributed as a self-contained binary that can be dropped in anywhere and run. Code b

Jamie Thompson 13 May 9, 2022
Go efficient text segmentation and NLP; support english, chinese, japanese and other. Go 语言高性能分词

gse Go efficient text segmentation; support english, chinese, japanese and other. 简体中文 Dictionary with double array trie (Double-Array Trie) to achiev

ego 2k Jul 29, 2022
Produces a set of tags from given source. Source can be either an HTML page, Markdown document or a plain text. Supports English, Russian, Chinese, Hindi, Spanish, Arabic, Japanese, German, Hebrew, French and Korean languages.

Tagify Gets STDIN, file or HTTP address as an input and returns a list of most popular words ordered by popularity as an output. More info about what

ZoomIO 23 Jul 17, 2022
An easy-to-use OCR and Japanese to English translation tool

Manga Translator An easy-to-use application for translating text in images from Japanese to English. The GUI was created using Gio. Gio supports a var

Cameron Kinsella 38 Aug 2, 2022
Tools to help with Japanese sentence mining

Tools to help with Japanese sentence mining

Anton Van Eechaute 2 Dec 10, 2021
qclean lets you to clean up search query in japanese.

qclean qclean lets you to clean up search query in japanese. This is mainly used to remove wasted space. Quick Start package main var cleaner *qclean

po3rin 0 Jan 4, 2022
Disk usage analyzer with console interface written in Go

Gdu is intended primarily for SSD disks where it can fully utilize parallel processing. However HDDs work as well, but the performance gain is not so huge.

Daniel Milde 1.9k Aug 8, 2022
The imghdr module determines the type of image contained in a file for go

goimghdr Inspired by Python's imghdr Installation go get github.com/corona10/goimghdr List of return value Value Image format "rgb" SGI ImgLib Files

Dong-hee Na 38 Jan 23, 2022
Sort the emails contained in a .csv file into a text file

Go convert csv to txt This snippet of code allows you to sort the emails contained in a .csv file into a text file.

Maxence Brochier 1 Nov 23, 2021
containedctx detects is a linter that detects struct contained context.Context field

containedctx containedctx detects is a linter that detects struct contained context.Context field Instruction go install github.com/sivchari/contained

sivchari 9 Jun 19, 2022
Groupie Trackers consists on receiving a given API and manipulate the data contained in it, in order to create a site, displaying the information.

groupie-tracker Objectives Groupie Trackers consists on receiving a given API and manipulate the data contained in it, in order to create a site, disp

Mustafa Ustaz 0 Jan 13, 2022
H265/HEVC Bitstream Analyzer in Go

GoVisa H.265/HEVC Bitstream Analyzer in Go /* The copyright in this software is being made available under the BSD License, included below. This softw

Rain Liu 28 May 28, 2022
Analyzer: helps uncover bugs by reporting a diagnostic for mistakes of *sql.Rows usage.

sqlrows sqlrows is a static code analyzer which helps uncover bugs by reporting a diagnostic for mistakes of sql.Rows usage. Install You can get sqlro

GoStaticAnalysis 86 Mar 24, 2022
A static code analyzer for annotated TODO comments

todocheck todocheck is a static code analyzer for annotated TODO comments. It let's you create actionable TODOs by annotating them with issues from an

Preslav Mihaylov 386 Aug 4, 2022