text/tabwriter: character width #8273

Closed
rui314 opened this issue Jun 23, 2014 · 16 comments

Comments

@rui314
Member

rui314 commented Jun 23, 2014

What steps will reproduce the problem?
Issue:
gofmt, or text/tabwriter, assumes that all Unicode code points occupy exactly one column
in editors or on terminals. That assumption is not correct because most (but not all)
Chinese/Japanese/Korean characters, emojis, "fullwidth" Latin characters, etc,
occupy two columns. As a result gofmt formats Go code like this.

var Countries = map[string]string{
        "アメリカ合衆国": "United States of America",
        "日本":      "Japan",
        "ドイツ":     "Germany",
        "フランス":    "France",
        "ポーランド":   "Poland",
}

As you can see the column of the map value is misaligned. You cannot fix this by hand
because gofmt would reformat it for you in the wrong way if you do that. That's annoying.

In Unicode, there's a zero column character (ZERO WIDTH SPACE; U+200B). SOFT HYPHEN
(U+00AD) may be displayed as a hyphen at the end of a line but may be zero-width in
other places, depending on your display environment. These characters also affect the
column layout.

What is the expected output? What do you see instead?
Proposal:
Unicode Standard Annex #11 gives the definition of column width for characters in the
legacy East Asian character sets. I propose to add the East Asian Width property to the
unicode package, so that we can get the column width for a CJK character. East Asian
Fullwidth and East Asian Wide characters should be treated as two columns by tabwriter.

(Note: East Asian Ambiguous characters need to be treated as one column. They are
treated as two columns only in East Asian display environment. The character set
contains Cyrillic characters and others which we would never want to handle as two
columns.)

Because the Annex #11 does not say anything about characters that are not in the legacy
East Asian character sets, we need additional rules for characters not in CJK character
sets but in Unicode. I propose this simple rule:

 - ZERO WIDTH SPACE is 0 columns
 - Emojis are 2 columns
 - Other code points, including U+0000, SOFT HYPHEN, and all control characters, are 1 column

This additional rule will be implemented as an unexported function in text/tabwriter.
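
A rough sketch of what such a rule could look like, using only the standard library. The names `runeWidth`, `stringWidth` and `wideRanges` are hypothetical, and the range table is a small hand-picked approximation for illustration, not the Annex #11 data:

```go
package main

import "fmt"

// wideRanges is an illustrative, incomplete list of code point ranges
// treated as two columns. It is an assumption made for this sketch and
// is NOT generated from the East Asian Width property tables.
var wideRanges = [][2]rune{
	{0x1100, 0x115F},   // Hangul Jamo initial consonants
	{0x2E80, 0x303E},   // CJK radicals, CJK symbols and punctuation
	{0x3041, 0x33FF},   // Hiragana, Katakana, CJK compatibility
	{0x3400, 0x4DBF},   // CJK Unified Ideographs Extension A
	{0x4E00, 0x9FFF},   // CJK Unified Ideographs
	{0xAC00, 0xD7A3},   // Hangul syllables
	{0xF900, 0xFAFF},   // CJK Compatibility Ideographs
	{0xFF01, 0xFF60},   // Fullwidth forms
	{0xFFE0, 0xFFE6},   // Fullwidth signs
	{0x1F300, 0x1F64F}, // some emoji blocks
	{0x1F900, 0x1F9FF}, // more emoji
}

// runeWidth applies the proposed rule: ZERO WIDTH SPACE is 0 columns,
// wide characters and emoji are 2, everything else (including U+0000,
// SOFT HYPHEN and control characters) is 1.
func runeWidth(r rune) int {
	if r == 0x200B { // ZERO WIDTH SPACE
		return 0
	}
	for _, rg := range wideRanges {
		if r >= rg[0] && r <= rg[1] {
			return 2
		}
	}
	return 1
}

// stringWidth sums runeWidth over all runes of s.
func stringWidth(s string) int {
	n := 0
	for _, r := range s {
		n += runeWidth(r)
	}
	return n
}

func main() {
	for _, k := range []string{"アメリカ合衆国", "日本", "Japan"} {
		fmt.Printf("%s: %d columns\n", k, stringWidth(k))
	}
}
```

A real implementation would derive the table from the Unicode data files rather than listing ranges by hand.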

Caveats:
I deliberately avoid defining the generic "wcswidth" function to determine the
column width for a string in the standard library. That function can never be defined in
the right way because there's no standard for it. Also it'd be hard to get a reasonable
definition for characters with odd semantics, such as SOFT HYPHEN.
@ianlancetaylor
Contributor

Comment 1:

Labels changed: added repo-main, release-none.

@griesemer
Contributor

Comment 2:

This is a (relatively minor) change to the tabwriter so that it can handle single and
double-width characters based on the fixed (_font-independent_) Unicode Annex #11 width
information, and assuming that the layout is for fixed-width (and multiples of the
fixed-width) characters.
It is an explicit non-goal to make the tabwriter work for variable-width fonts at this
time (it is possible, but it only makes sense in context with an IDE which lays out code
depending on font size).

Owner changed to @griesemer.

Status changed to Thinking.

@clausecker

I see support for full-width characters as something integral to this package. It would be a bit sad if we left many millions of users in countries that use CJK characters without a usable text/tabwriter package.

@clausecker

For an example implementation of a function to figure out how many columns a character occupies, see https://github.com/mattn/go-runewidth.

@imuli

imuli commented May 25, 2015

For what it's worth, this is also a problem with combining characters (and not all meaningful combinations have canonical forms):

var test = map[string]int{
    "tes̪t":   0,
    "testing": 0,
}

@clausecker

@imuli Combining characters are handled as if their width is 0. This is fine if the code will never introduce a line-break before a combining character (which it doesn't). There's no need for canonical forms as this scheme works just fine.

@imuli

imuli commented May 26, 2015

@fuzxxl Yes, the package you linked to handles combining characters just fine. I meant that they are another side of this bug however, one that perhaps doesn't fall under "variable width font".

@XenoPhex

Any updates on this?

@griesemer
Contributor

PS: Even if the package were not frozen, we are not going to add specific character sets or tables to this package for special treatment. The only sensible approach would be to provide a function that given a Unicode char returns a width, leaving the actual width determination to a client. However, the only way we could add such a function is by extending the API; specifically it would probably require a new Init function.

This package is one of the earliest Go packages with some features (like HTML filtering) that are not needed/used anymore (at least by gofmt). We are not going to make further changes at this point.

If you need a special version, you can always vendor and adjust the code. A future gofmt might use a rewritten and trimmed version of this package. None of this is high-priority.

I will close this issue.

@mattn
Member

mattn commented Jan 31, 2017

FYI: another way to do it: https://github.com/olekukonko/tablewriter

@golang golang locked and limited conversation to collaborators Jan 31, 2018
@mpvl
Contributor

mpvl commented Apr 5, 2018

@griesemer thanks for pointing me to this issue. This comes up once in a while.

The width information you need is already in golang.org/x/text/width. An implementation is not straightforward, though, as width cannot be determined unambiguously:

  1. The width of fullwidth characters in a monospace font depends on the font. For East Asian fonts, a halfwidth character is indeed exactly half the width of a fullwidth character. In a monospace Latin font, however, the ratio is typically, but not always, 3:5.
  2. There are ambiguous characters for which it is unknown whether a font will typically render them as fullwidth or halfwidth.
  3. Some editors will render modifiers explicitly (although arguably increasingly rarely).
  4. Spacing may vary as Unicode gets updated.

Now arguably, the current implementation will never align properly for anybody if any non-halfwidth rune is used. One could at least:

  1. Render things properly for East Asian fonts if fullwidth characters are used.
  2. Render things properly for any font if modifiers are used.

I've implemented an algorithm to determine a string width based on some experience-based recommendations for the interpretation of ambiguous characters, for exactly this purpose. It was decided not to add this back then, but perhaps in light of Go 2 the willingness to change things has increased.

The main drawbacks of this approach:

  1. It makes gofmt depend on x/text. But so does core.
  2. Indentation may change as Unicode (and the go compiler) is updated.

The last one may be nasty if people are collaborating on the same project using different versions of gofmt. Gofmt would probably need some kind of logic to prevent flipping back and forth between two different interpretations and allow a flag to force an update.

Another complementary approach is to allow line breaks after table values so that values are indented and spaced independently of the keys.

Personally I think the best would be to rely on editors to do the outlining correctly, but have a best-effort implementation with some amount of stability guarantees that will render the alignment correctly in the majority of cases (albeit a small majority, I guess 66%). Note that even though no implementation will get the indentation right for everybody, it will at least do the right thing in many situations, whereas now it is guaranteed to never do the right thing.

@griesemer
Contributor

@mpvl Thanks for the info. I don't think we want to be dependent on x/text. I was hoping that there might be a small number of unicode code point ranges that we could trivially detect (and that are unlikely to change in the future) to identify full-width/wide chars and just give them the space of 2 characters. Of course this all depends on the actual font used during rendering and so this assumes that wide characters are taking the space of 2 regular characters in that font.

@mpvl
Contributor

mpvl commented Apr 5, 2018

@griesemer: I don't think that is a scalable approach, and it leaves out handling zero-width characters, which is actually easier. We could do something similar to what is done for core though: generate the tables in x/text and then copy them into gofmt. x/text has been set up to automate this.

@mpvl
Contributor

mpvl commented May 1, 2018

At my visit to Gopher China I did some polling and almost exclusively people were using the preinstalled fonts for their editor (VSCode etc.) or a variant that would result in a CJK to Latin ratio of 5:3. Only sporadically somebody reported indeed using the traditional 2:1 ratio.

IOW, it seems that adopting a 2:1 ratio will not fix the problem for the majority of the people. Conversely, adopting a 5:3 ratio would seem to do the trick, but it would also result in some peculiar artifacts in the gofmt rendering. I'm not sure that it is worth it. This doesn't preclude providing better handling for modifiers, of course. Emojis:Latin is typically also 5:3.
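
One way to see why a 5:3 ratio produces artifacts, as a worked sketch under the assumption that every wide character is exactly 5/3 of a Latin column and that padding is done with Latin-width spaces. The names are hypothetical. Measuring in thirds of a Latin column, a space is 3 units and a wide character is 5 units, so two keys can only be aligned exactly when their difference in wide characters is a multiple of 3:

```go
package main

import "fmt"

// Widths in thirds of a Latin column, assuming the 5:3 ratio.
const (
	latinUnits = 3
	wideUnits  = 5
)

// spacesToAlign reports how many whole spaces fit into the gap between a
// key with shortWide wide characters and one with longWide wide
// characters (longWide >= shortWide), and whether that padding closes
// the gap exactly.
func spacesToAlign(shortWide, longWide int) (spaces int, exact bool) {
	diff := (longWide - shortWide) * wideUnits
	return diff / latinUnits, diff%latinUnits == 0
}

func main() {
	// "フランス" (4 wide) against "アメリカ合衆国" (7 wide).
	fmt.Println(spacesToAlign(4, 7))
	// "日本" (2 wide) against "ドイツ" (3 wide).
	fmt.Println(spacesToAlign(2, 3))
}
```

In the first case five spaces line the values up exactly; in the second the gap is 5/3 of a column, so any whole number of spaces leaves the values offset by a fraction of a column.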

Admittedly, my sample size was small (about 20), so I can ask Asta to do a larger-scale poll, but I wouldn't hold my breath.

It seems that having editor plugins handle this is really the ideal approach.

@griesemer
Contributor

Thanks, @mpvl, that is useful additional input. It sounds like there's no simple solution to address this.
