wcwidth for golang

Overview

go-runewidth

Provides functions to get the fixed width of a character or string.

Usage

runewidth.StringWidth("つのだ☆HIRO") == 12

Author

Yasuhiro Matsumoto

License

under the MIT License: http://mattn.mit-license.org/2013

Comments
  • Fixed StringWidth() implementation by using proper Unicode grapheme cluster segmentation. Fixes #28

    This introduces an implementation of StringWidth() using Unicode grapheme clusters which should be the correct way to split a string into its individual characters. The built-in assumption is that if we have combined runes (emojis, flags etc.), their width is the width of the first non-zero-width rune. Many of these combined runes were previously not handled correctly by this package.
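The "width of the first non-zero-width rune" rule described above can be sketched as follows. This is a minimal illustration, not the code from this change: `toyRuneWidth` and `clusterWidth` are hypothetical names, the width function is a tiny stand-in for a real width table, and the cluster is assumed to have been split already (the change itself relies on rivo/uniseg for that).

```go
package main

import "fmt"

// toyRuneWidth is a hypothetical stand-in for a real width table.
// It only knows the handful of rune ranges used in this example.
func toyRuneWidth(r rune) int {
	switch {
	case r == 0x200D, r == 0xFE0F: // zero-width joiner, variation selector-16
		return 0
	case r >= 0x0300 && r <= 0x036F: // combining diacritical marks
		return 0
	case r >= 0x1F300 && r <= 0x1FAFF: // a broad emoji range, treated as wide here
		return 2
	default:
		return 1
	}
}

// clusterWidth returns the width of the first rune in the cluster whose
// width is not zero, or 0 if every rune in the cluster is zero-width.
func clusterWidth(cluster []rune) int {
	for _, r := range cluster {
		if w := toyRuneWidth(r); w != 0 {
			return w
		}
	}
	return 0
}

func main() {
	family := []rune{0x1F468, 0x200D, 0x1F468, 0x200D, 0x1F467} // man, ZWJ, man, ZWJ, girl
	accented := []rune{'e', 0x0301}                              // e + combining acute accent
	fmt.Println(clusterWidth(family), clusterWidth(accented))
}
```

With this rule the whole family sequence counts once, using the width of its first visible rune, instead of summing the widths of all five runes.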

    Please note:

    • This introduces a dependency: https://github.com/rivo/uniseg/ (I don't see another way to make this work short of copying most code from rivo/uniseg over.)
    • No special zero-width joiner (ZWJ) handling is required anymore. Because code and especially configuration options related to ZWJ were thus removed, this branch is not backwards-compatible with the previous commit.
    • The changes result in a failure of the TestStringWidth test case but only the part where EastAsianWidth = true. I'm not very familiar with this flag so I don't know how to fix that. You may want to review this.
    opened by rivo 36
  • Use unicode9 character widths

    Update runewidth to use unicode9 character width tables. This is the default in vim and neovim now, so should be safe to use in any terminal.

    The Condition is no longer calculated using IsEastAsian() as terminals do not use locale to determine how wide to draw ambiguous width characters. Rather, they simply default to 1, and may offer an option to set to 2 (which is discouraged).

    All tests still pass. The API is very close to the way it was, but not identical due to the change in Condition.

    I recognize this is a fairly large diff, so I'd be happy to work with you in any way you feel best to get this merged.

    opened by joshuarubin 27
  • The width of Box-drawing characters

    Check this for the definition of box-drawing (BD below) characters.

    I found that these characters are defined to be of ambiguous width, so passing these to RuneWidth returns 2 in my environment. This is somewhat inconvenient since AFAIK, terminal fonts tend to render BD characters as half-width.

    Is it possible to remove these characters from the ambiguous table? I can make the PR if you think this sounds sane.

    Thanks.

    opened by NonerKao 14
  • Regional Indicators (Flags) and Grapheme Clusters

    Here's a short example that illustrates an issue with flags (or "regional indicators"):

    fmt.Println(runewidth.StringWidth("🇩🇪")) // Should be "2", outputs "4".
    

    The flag consists of two code points which are processed separately by runewidth. But most modern systems will combine them into one flag emoji.
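To illustrate how two code points form one flag, here is a minimal sketch that pairs up consecutive regional indicators (U+1F1E6 to U+1F1FF). The function names are hypothetical and this is not how runewidth works internally; full grapheme cluster segmentation covers many more rules than this one.

```go
package main

import "fmt"

// isRegionalIndicator reports whether r is one of the 26 regional
// indicator symbols U+1F1E6..U+1F1FF.
func isRegionalIndicator(r rune) bool {
	return r >= 0x1F1E6 && r <= 0x1F1FF
}

// countDisplayUnits counts runes, except that two consecutive regional
// indicators are counted as a single unit (one flag).
func countDisplayUnits(s string) int {
	runes := []rune(s)
	units := 0
	for i := 0; i < len(runes); i++ {
		if isRegionalIndicator(runes[i]) && i+1 < len(runes) && isRegionalIndicator(runes[i+1]) {
			i++ // consume the second half of the flag
		}
		units++
	}
	return units
}

func main() {
	de := "\U0001F1E9\U0001F1EA" // regional indicators D + E
	fmt.Println(len([]rune(de)), countDisplayUnits(de))
}
```

The string holds two runes but only one display unit, which is why processing the code points separately over-counts.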

    This is part of a larger topic which I describe in more detail here: gdamore/tcell#264. It doesn't just affect flags but also characters in e.g. Arabic and Korean where there are more sophisticated rules than "combining characters" and zero-width joiners (which you added with #20).

    I don't know exactly how you calculate the widths of characters. I'm also not sure how you would solve flags as well as some of the other rules described in the Unicode specification but it would sure be nice as printing these flags currently gives me trouble in tview. There have been multiple issues asking for better support for different languages and emojis so it seems that there are quite a few people who use the terminal with these characters.

    (Maybe my new package uniseg can help you here.)

    opened by rivo 13
  • Feature request: Add support for zero-width-joiners

    It would be great if you could add support for zero-width joiners (ZWJ). I have the following code example which doesn't work as expected:

    package main
    
    import (
    	"fmt"
    
    	runewidth "github.com/mattn/go-runewidth"
    )
    
    func main() {
    	e := "👨‍👨‍👧"
    	r := []rune(e)
    	var widths []int
    	for _, c := range r {
    		widths = append(widths, runewidth.RuneWidth(c))
    	}
    	fmt.Printf("%s : len=%d numrunes=%d width=%d widths=%v runes=%X\n", e, len(e), len(r), runewidth.StringWidth(e), widths, r)
    }
    

    The output is:

    👨‍👨‍👧 : len=18 numrunes=5 width=6 widths=[2 0 2 0 2] runes=[1F468 200D 1F468 200D 1F467]
    

    Specifically, width should be 2 instead of 6. I found this article which explains how they work. It does not only affect emojis but also characters in some languages.

    This came up in rivo/tview#161. It would be great if support for ZWJ could be added so I can implement support for these Unicode characters in tview. I understand that not all kinds of combinations are supported and it's probably difficult to figure out which ones are. But assuming these characters are supported will help a lot. I don't expect users to try to print ZWJ combinations which are not supported anyway.
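As a rough sketch of the requested behaviour, runes linked by a zero-width joiner (U+200D) can be grouped into one sequence before any width is computed. `splitZWJSequences` is a hypothetical helper, not part of go-runewidth, and it assumes every ZWJ combination is rendered as a single glyph.

```go
package main

import "fmt"

const zwj = '\u200D' // zero-width joiner

// splitZWJSequences groups runes so that anything linked by a zero-width
// joiner ends up in the same group.
func splitZWJSequences(s string) [][]rune {
	var groups [][]rune
	joinNext := false
	for _, r := range s {
		switch {
		case r == zwj && len(groups) > 0:
			// The joiner attaches to the current group and pulls in the next rune.
			groups[len(groups)-1] = append(groups[len(groups)-1], r)
			joinNext = true
		case joinNext:
			groups[len(groups)-1] = append(groups[len(groups)-1], r)
			joinNext = false
		default:
			groups = append(groups, []rune{r})
		}
	}
	return groups
}

func main() {
	e := "\U0001F468\u200D\U0001F468\u200D\U0001F467" // man, ZWJ, man, ZWJ, girl
	groups := splitZWJSequences(e)
	fmt.Println(len([]rune(e)), len(groups))
}
```

The five runes collapse into one group, so a caller could report the width of that group once rather than adding up 2 + 0 + 2 + 0 + 2.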

    Thanks!

    opened by rivo 8
  • Is width for EN DASH intended to be 2 instead of 1?

    Hi,

    Consider the following three similar unicode characters:

    '-' - Unicode Character 'HYPHEN-MINUS' (U+002D)
    '–' - Unicode Character 'EN DASH' (U+2013)
    '—' - Unicode Character 'EM DASH' (U+2014)
    

    From https://github.com/shurcooL/markdownfmt/issues/7#issuecomment-46792756, I've learned that go-runewidth considers the width of the first character to be 1, and the width of second and third characters to be 2.

    Is that intended?

    I'm not sure how to test this reliably, but in most environments it seems that EN DASH has width that's closer to 1 than 2.

    Any thoughts on this?

    opened by dmitshur 8
  • Add full lookup table for single rune width.

    Provides nearly an order of magnitude speedup depending on how quickly the checks are done.

    Data is packed at 4 bits/rune, since the max output value is 2.
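A minimal sketch of such a packed lookup table, assuming each width is stored in a 4-bit nibble (two runes per byte). `slowWidth`, `buildLUT` and `lutWidth` are hypothetical names, and the width function is a deliberately tiny stand-in for the real checks; the table here also covers only the first 0x10000 code points to keep the example small.

```go
package main

import "fmt"

// slowWidth is a hypothetical, deliberately tiny width function that
// stands in for the range checks a lookup table would replace.
func slowWidth(r rune) int {
	switch {
	case r >= 0x0300 && r <= 0x036F: // combining diacritical marks
		return 0
	case r >= 0x3040 && r <= 0x30FF: // hiragana and katakana
		return 2
	default:
		return 1
	}
}

// buildLUT stores the widths of runes [0, limit) in 4-bit nibbles:
// even runes use the low nibble, odd runes the high nibble.
func buildLUT(limit rune) []byte {
	lut := make([]byte, (limit+1)/2)
	for r := rune(0); r < limit; r++ {
		w := byte(slowWidth(r))
		if r&1 == 0 {
			lut[r/2] |= w
		} else {
			lut[r/2] |= w << 4
		}
	}
	return lut
}

// lutWidth reads the nibble that belongs to r. The caller must ensure
// 0 <= r < limit for the limit passed to buildLUT.
func lutWidth(lut []byte, r rune) int {
	b := lut[r/2]
	if r&1 == 1 {
		b >>= 4
	}
	return int(b & 0x0F)
}

func main() {
	lut := buildLUT(0x10000)
	fmt.Println(len(lut), lutWidth(lut, 'A'), lutWidth(lut, 'つ'), lutWidth(lut, 0x0301))
}
```

A lookup is then one slice index plus a shift and a mask, independent of how many ranges the original checks had to walk through.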

    cpu: AMD Ryzen 9 3950X 16-Core Processor
    BenchmarkRuneWidthAll/regular-32        	      51	  25539433 ns/op	       0 B/op	       0 allocs/op
    BenchmarkRuneWidthAll/lut-32            	     442	   2711694 ns/op	       0 B/op	       0 allocs/op
    BenchmarkRuneWidth768/regular-32        	  617528	      2109 ns/op	       0 B/op	       0 allocs/op
    BenchmarkRuneWidth768/lut-32            	  605570	      2038 ns/op	       0 B/op	       0 allocs/op
    BenchmarkRuneWidthAllEastAsian/regular-32         	      31	  36469868 ns/op	       0 B/op	       0 allocs/op
    BenchmarkRuneWidthAllEastAsian/lut-32             	     442	   2710229 ns/op	       0 B/op	       0 allocs/op
    BenchmarkRuneWidth768EastAsian/regular-32         	   73273	     16028 ns/op	       0 B/op	       0 allocs/op
    BenchmarkRuneWidth768EastAsian/lut-32             	  634987	      1871 ns/op	       0 B/op	       0 allocs/op
    PASS
    
    opened by klauspost 5
  • Rune width of certain CP437 chars like ♦ is 2 instead of 1

    I'm trying to port an old DOS program using tcell (which uses RuneWidth). My program has a table mapping CP437 char code to rune, and then I print that rune to the screen. I'm in the terminal with fixed width fonts, so I expect all chars to be the same width.

    The issue is that RuneWidth returns width 2 instead of 1 for '\u2666' and some other characters, which makes tcell allocate 2 cells for each of them and causes "gaps" in the rendering. Here's playground code showing which chars do this: https://play.golang.org/p/Hjq3GOC0Pcd -- output is:

    RuneWidth('☺') = 2
    RuneWidth('☻') = 2
    RuneWidth('♥') = 2
    RuneWidth('♦') = 2
    RuneWidth('♣') = 2
    RuneWidth('♠') = 2
    RuneWidth('♂') = 2
    RuneWidth('♀') = 2
    RuneWidth('♪') = 2
    RuneWidth('♫') = 2
    RuneWidth('☼') = 2
    RuneWidth('↕') = 2
    RuneWidth('‼') = 2
    RuneWidth('↔') = 2
    

    I believe it's happening because these are treated as Emoji characters. Is this behavior expected? If so, how do I work around this in tcell?

    opened by benhoyt 5
  • incorrect rune width for box drawing characters in east asian encoding

    When using an east asian encoding, the following runes are given a width of 2 but they should be 1: ─┌└┐┘│.

    To reproduce:

    export LC_CTYPE="ja_JP.UTF-8"
    (in go program)
    runewidth.RuneWidth('─') // returns 2
    

    Looking at the runewidth_table.go file, the culprit is {0x24EB, 0x254B} in the ambiguous table. I'm not sure how to update this; the file is auto-generated.
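For readers unfamiliar with this kind of table, the sketch below shows how a sorted list of rune ranges can be searched. Only the {0x24EB, 0x254B} entry comes from the report above; the other entries, the `interval` type and the `inTable` helper are illustrative assumptions and have not been checked against the generated file.

```go
package main

import (
	"fmt"
	"sort"
)

// interval is an inclusive range of code points.
type interval struct{ first, last rune }

// ambiguousSample is a hypothetical excerpt, sorted by first.
// Only {0x24EB, 0x254B} is taken from the report; the rest are placeholders.
var ambiguousSample = []interval{
	{0x00A1, 0x00A1},
	{0x24EB, 0x254B},
	{0x2660, 0x2661},
}

// inTable reports whether r falls inside any interval of the sorted,
// non-overlapping table t, using binary search.
func inTable(r rune, t []interval) bool {
	// Find the first interval that ends at or after r.
	i := sort.Search(len(t), func(i int) bool { return t[i].last >= r })
	return i < len(t) && t[i].first <= r
}

func main() {
	fmt.Println(inTable('─', ambiguousSample), inTable('A', ambiguousSample))
}
```

Because U+2500 lies between 0x24EB and 0x254B, every box-drawing character in that span is treated as ambiguous by such a table.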

    In terminal apps which render box characters this can lead to broken rendering: image

    Let me know if there's anything else I can add. Thanks :)

    opened by jesseduffield 4
  • Interpret "na" as narrow

    According to tr11, the "na" designation stands for "East Asian Narrow". This includes both ASCII characters as well as some non-ASCII characters like double angle brackets. Previous versions of go-runewidth (v0.0.4) assigned these a width of one, but more recent versions assign it a width of zero. I noticed this because tcell versions after this commit refused to display the double angle bracket characters.

    The fix proposed in this PR is to assign characters with the "na" designation a width of one.
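One way to picture the mapping from East Asian Width designations to cell widths is the sketch below. `widthForClass` is a hypothetical helper, not the generator's code, and it ignores zero-width cases such as combining marks, which are decided by other properties.

```go
package main

import "fmt"

// widthForClass maps an East_Asian_Width property value to a cell width.
// eastAsian selects how ambiguous ("A") characters are treated.
func widthForClass(class string, eastAsian bool) int {
	switch class {
	case "F", "W": // fullwidth and wide
		return 2
	case "A": // ambiguous: depends on the environment
		if eastAsian {
			return 2
		}
		return 1
	case "Na", "H", "N": // narrow, halfwidth, neutral
		return 1
	default:
		return 1 // unknown designations fall back to a single cell
	}
}

func main() {
	fmt.Println(widthForClass("Na", false), widthForClass("F", false),
		widthForClass("A", true), widthForClass("A", false))
}
```

Under this mapping a narrow character never ends up with width zero, which is the behaviour the PR restores.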

    opened by wedaly 4
  • EastAsianWidth.txt, use the record with "F"(FULLWIDTH) field as same as "W"

    This is for the issue "possible regression ?" (Issue #32, mattn/go-runewidth).

    When script/generate.go reads https://unicode.org/Public/12.1.0/ucd/EastAsianWidth.txt, records with the "F" (FULLWIDTH) field are not treated the same as "W". So I tried to fix that and updated the checksums (are they there as a reminder to run go generate?).

    Could you merge this if there are no problems?

    opened by hymkor 4
  • Upgraded rivo/uniseg to latest version, switched StringWidth/Truncate to speedier version

    The rivo/uniseg package has received a major update which also includes methods for grapheme cluster parsing that are much faster than the previously used Graphemes class.

    I've upgraded your package accordingly and updated the relevant code to use these faster methods. It would be great if you could merge these changes.

    Thank you!

    ps. I noticed that some automatic checks did not complete successfully because they are still running on Go 1.15. Would you like me to look into upgrading them to the current version (1.18)?

    opened by rivo 3
  • Width is 1 when it should be 2

    I stumbled over a character that, when output to the console directly, takes up two characters. But StringWidth() gives me 1. This is because the first rune of this character has a width of 1 and that's what's being used, see here. I know I wrote this code and I'm sure that you cannot simply add up the widths of individual runes ("🏳️‍🌈" would then have a width of 4 which is obviously wrong) and using the first rune's width worked fine so far. But it turns out that it fails in some cases.

    I'm not familiar with Indian characters but it seems to me that the second rune is a modifier that turns the character from a width of 1 into a width of 2. Are you aware of any logic that we could add to go-runewidth that makes this right?

    Here's example code that illustrates the issue:

    package main
    
    import (
    	"fmt"
    
    	runewidth "github.com/mattn/go-runewidth"
    )
    
    func main() {
    	s := "खा"
    	fmt.Println("0123456789")
    	fmt.Println(s + "<")
    	fmt.Printf("String width: %d\n", runewidth.StringWidth(s))
    	var i int
    	for _, r := range s {
    		fmt.Printf("Rune %s  (%d) width: %d\n", string(r), i, runewidth.RuneWidth(r))
    		i++
    	}
    }
    

    Output (on macOS with iTerm2):

    image
    opened by rivo 16
  • Semantic Versioning: `ZeroWidthJoiner` Removal

    ZeroWidthJoiner was removed after v0.0.9: https://github.com/mattn/go-runewidth/blob/v0.0.9/runewidth.go#L14

    The next version was v0.0.10, but this introduced a breaking API change.

    While being v0 means you can introduce breaking API changes, would it be possible to get a v1 release that can ensure API stability?

    It's fine to just keep cutting new versions when API changes happen, but right now it makes managing Go Module dependencies rather painful, since it just assumes patch versions don't introduce breaking changes.

    opened by trynity 3
  • RuneWidth does not equal StringWidth

    I stumbled over this while working on #47.

    It seems that RuneWidth is not always equal to the StringWidth of a single rune.

    This is quite unexpected, TBH.

    Please see https://github.com/markus-oberhumer/mattn--go-runewidth/commit/5da511d36b1ea1ad913590b7b27357e5fffd3512 for a test case.

    opened by markus-oberhumer 6
  • added power support arch ppc64le on yml file.

    Added ppc64le (Power) architecture support to the travis.yml file. ppc64le is part of the Ubuntu distribution. This helps us simplify testing later when distributions are re-building and re-releasing. For more info, tag @gerrith3.

    opened by srinivas32 0
Owner
mattn
Long-time Golang user and contributor, Google Dev Expert for Go, author of many Go tools and Vim plugins. Windows hacker. C#/Java/C/C++