go/doc: reconsider comment rewrites of '' to ” #54312

Open
kortschak opened this issue Aug 6, 2022 · 43 comments
Comments

@kortschak
Contributor

kortschak commented Aug 6, 2022

In ae3d890, as part of addressing #51082, a change was made to go/doc that rewrites all '' to ”. This makes semantic changes to comments where ' is used as a prime and '' is used as double prime, common in mathematical code.

This was raised in #51082 (comment) but essentially disregarded. A workaround was suggested:

as a workaround, you could replace those with U+2032 PRIME and U+2033 DOUBLE PRIME. That is, f' → f′ and f'' → f″. (Those may look the same depending on your font.)

However on investigation with relevant fonts (the font here and the font used by pkg.go.dev) at 100%, U+2033 DOUBLE PRIME is barely distinguishable from U+2032 PRIME and worse, also barely distinguishable from other commonly used marks in the same position such as '*' (Comparison: f′ f″ f* ).

The change has made it harder to read these comments, harder to write them in a way that doesn't get mutated and easier for incorrectly formatted comments to be committed (f'' getting mutated to f” which is essentially indistinguishable from f″ at normal font sizes).

@mvdan
Member

mvdan commented Aug 6, 2022

f'' getting mutated to f” which is essentially indistinguishable from f″ at normal font sizes

I had to put my glasses on to see the difference at my regular zoom level, so I think you're right :) And this is on a 25" monitor, let alone a tiny phone screen.

@seankhliao added the NeedsDecision label (Feedback is required from experts, contributors, and/or the community before a change can be made.) Aug 6, 2022
@seankhliao
Member

cc @rsc

@zikaeroh
Contributor

zikaeroh commented Aug 6, 2022

Another example is when docs refer to the empty string, specifically JS where strings aren't necessarily double quoted, leading to undesirable doc replacements (and apparently, inserting quotes at the start/end of paragraphs that are not quotes?) as in: evanw/esbuild@296870e

@ALTree
Member

ALTree commented Oct 22, 2022

An example from #56380 (unable to write '' in a comment, closed as a dup of this one):

Since Go 1.19 (commit ae3d890) '' is unconditionally transformed to “ or ”. This means my comment:

// lexMultilineRawString consumes a raw string. Nothing can be escaped in such
// a string. It assumes that the beginning ''' has already been consumed and
// ignored.

Now becomes:

// lexMultilineRawString consumes a raw string. Nothing can be escaped in such
// a string. It assumes that the beginning ”' has already been consumed and
// ignored.

There are some other cases, especially when parsing things, where one might reasonably want to type '' inside a regular (non-codeblock) comment.

Looking at go/doc/comment/parse.go, there is no way to escape this behaviour, and I don't really see an obvious non-ugly way to write that comment.

@evanw

evanw commented Dec 8, 2022

I recently encountered a particularly annoying case of this behavior. I wanted to write this:

quoteChar should be either '"' or '\''

But no matter what I do, it keeps being turned into this:

quoteChar should be either '"' or '\”

This replacement makes no sense, and is not helpful at all. Really hoping this behavior can be reverted.

@seebs
Contributor

seebs commented Jul 14, 2023

So, the more I look at this, the more I think that:

  1. It's just plain wrong to alter quotes like this in documentation comments, even just when rendering.
  2. It's extra special extremely wrong to do it to documentation comments and modify the source code.

I just don't think this actually makes things better. If I want Fancy Quotes, I can type them! We have full UTF-8 support. I do not think the assumption should be that, if someone writing documentation for code uses characters in it that have special significance in thousands of contexts across basically every widely-used programming language, the most likely interpretation is that they're wrong and we should fix their code for them.

I think the assumption should be that if someone wrote an apostrophe, they meant apostrophe, and if they wrote two apostrophes, they meant two apostrophes, and that if they wanted a fancy close-quote, they would have typed one.

There may have been a brief window during which everything could display fancy quotes, but nothing could type them or store them in files, and it thus made sense to try to "correct" them. It's not the case now. If I wanted to type those fancy quotes, I could! I could embed them in things, I could use an editor which inserted them where appropriate, I have lots of options.

What I don't have options for is writing documentation for my code and not having it corrupted by gofmt.

@kortschak
Contributor Author

Discussion of this problem has been going on since March 2022 (originally in other issues). It is marked as NeedsDecision and yet no decision maker has even commented here. My full expectation of how this will play out is that when they do turn up, the decision will be that it has been like this for too long to change.

@bcmills
Contributor

bcmills commented Jul 17, 2023

It is marked as NeedsDecision and yet no decision maker has even commented here.

The go fmt behavior was changed as part of proposal #51082 (see https://github.com/golang/proposal/blob/master/design/51082-godocfmt.md#reformatting-doc-comments).

Revisions to that behavior should also go through the proposal process, so I am marking this issue as a proposal now.

@bcmills
Contributor

bcmills commented Jul 17, 2023

You say:

This was raised in #51082 (comment) but essentially disregarded.

but I don't agree that it was “essentially disregarded” — you received multiple replies from @ianlancetaylor, who is one of the members of the proposal review group, explaining the rationale for the behavior implemented in that proposal.

@bcmills added the Proposal label and removed the NeedsDecision label Jul 17, 2023
@bcmills changed the title from "go/doc: reconsider comment rewrites of '' to ”" to "proposal: go/doc: reconsider comment rewrites of '' to ”" Jul 17, 2023
@bcmills modified the milestones: Unplanned, Proposal Jul 17, 2023
@kortschak
Contributor Author

you received multiple replies

I received two replies from Ian. The first did not address the issue by error and the second only barely touched on it. I'll have to disagree with the assessment here.

@rsc
Contributor

rsc commented Jul 18, 2023

As a point of clarification, the top comment suggests that the rewrite originated in #51082. This is not quite true. The rewrite of `` and '' to “ and ” has applied when converting documentation to HTML since before the public Go release. (Go adopted this common convention widely used in text markup, including in Troff and TeX and more recently in some versions of Markdown.)

Part of the work we did in #51082 was to have gofmt reformat doc comment text to look the same way it would in the HTML form. So the rewrite is better highlighting in text mode that these comments never had the meaning you thought they did. Overall I believe the text rewrite is a net win: it makes clear to people whether the syntax they've typed means what they think it does, which is to say whether the syntax they've typed will render to HTML as they intend. You've found out that writing f'' for f-double-prime does not render correctly to HTML and never has.

I agree with the quoted comment that the right answer for f-prime is to use the Unicode code points with that meaning. If f″ (Unicode double-prime) doesn't render well, then it would also work to double a single prime: f′′. I don't see this as a workaround at all. I see it as using the Unicode code points with the actual intended meaning.
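
A small sketch of the difference between the three spellings, assuming the go/doc/comment parser (Go 1.19 or later) is what decides the rewrite. The helper name asGofmtWouldWrite is hypothetical; only comment.Parser and comment.Printer are real API.

```go
package main

import (
	"fmt"
	"go/doc/comment"
	"strings"
)

// asGofmtWouldWrite parses doc comment text (markers already stripped) and
// prints it back in doc comment form, trimming surrounding whitespace.
// The function name is illustrative only.
func asGofmtWouldWrite(text string) string {
	var p comment.Parser
	var pr comment.Printer
	return strings.TrimSpace(string(pr.Comment(p.Parse(text))))
}

func main() {
	inputs := []string{
		"The second derivative is f''.", // two ASCII apostrophes (U+0027)
		"The second derivative is f″.",  // U+2033 DOUBLE PRIME
		"The second derivative is f′′.", // U+2032 PRIME, doubled
	}
	for _, in := range inputs {
		fmt.Printf("%q -> %q\n", in, asGofmtWouldWrite(in+"\n"))
	}
}
```

Only the ASCII spelling is altered; the two Unicode spellings pass through unchanged, which is the basis of the suggestion above.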

I also see the problem with writing '\'' and having it turn into '\”. But we also have 15 years of documentation that expects the existing quote conversions, and it seems like a mistake to change that rule after such a long time. What we usually do to avoid the '\'' problem is to write that differently, like saying “double-quote (") and single-quote (')” instead of “'"' and '\''”. The former turns out to be far more readable anyway.

@kortschak wrote:

Discussion of this problem has been going on since March 2022 (originally in other issues). It is marked as NeedsDecision and yet no decision maker has even commented here. My full expectation of how this will play out is that when they do turn up, the decision will be that it has been like this for too long to change.

There are 5k+ open issues as I write this. There are not enough people paid to work on Go to handle all of them. This should have been marked as a proposal, but we mistriaged it. My apologies - mistakes happen. Now it is marked as a proposal. That said, I fear your full expectation is not far off the mark: f'' has had this meaning for 15 years, which probably is too long to change, at least retroactively.

It may be possible to make the change in future versions of Go based on the Go version, so that docs written for Go 1.21 still get the rewrite but docs written using newer versions of Go do not. That would require passing a Go version through to gofmt and also go/doc/comment.Parser, which is non-trivial new API, not to mention a major conceptual change for gofmt. I am not convinced those changes are worth the effort when both of the solutions above lead to improved documentation.

@arp242

arp242 commented Jul 18, 2023

As a point of clarification, the top comment suggests that the rewrite originated in #51082. This is not quite true. The rewrite of `` and '' to “ and ” has applied when converting documentation to HTML since before the public Go release. (Go adopted this common convention widely used in text markup, including in Troff and TeX and more recently in some versions of Markdown.)

The current rewriting also applies to unexported comments that never end up in HTML render, such as the example I ran in to (reposted here by ALTree).

The difference between godoc and the other formats you mention is that the others all have a way to write an inline literal '' when need be, without starting a full code paragraph. Godoc currently lacks this. And to make matters worse it's not merely a display issue, but actually mangles the original code. This isn't an "it's a bit awkward to do this" kind of issue, it's a "it's literally impossible to do this, and my comment becomes corrupted if I try" kind of issue.

Simply not applying this for unexported godoc comments might be a simple change that would go some way towards alleviating the pain this causes, although it's not a full fix.

The convention of using ``quotes like this'' is an old-fashioned dying convention that only old unix greybeard-y types use and know about and would surprise many younger people, so I'd argue it's not all that "widely used". I say this as an old unix greybeard-y type that used ``quotes'' habitually for many years. I'm unconvinced it's behaviour worth keeping in the first place – especially since we're dealing with technical documentation, and I would say that accuracy trumps prettiness every time.

@kortschak
Contributor Author

As a point of clarification, the top comment suggests that the rewrite originated in #51082. This is not quite true. The rewrite of `` and '' to “ and ” has applied when converting documentation to HTML since before the public Go release. (Go adopted this common convention widely used in text markup, including in Troff and TeX and more recently in some versions of Markdown.)

This is a misreading of what I wrote. While I dislike smart quotes in webpages, I am significantly more concerned about source code rewrites. This is new (well, it was when I wrote the issue). I am concerned with how this impacts on ability to read source in editor.

@rsc
Contributor

rsc commented Jul 19, 2023

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@kortschak
Contributor Author

That said, I fear your full expectation is not far off the mark: f'' has had this meaning for 15 years, which probably is too long to change, at least retroactively.

It has had this meaning for a long time, yes, but the rewrite, even at the rendering level, has existed for a relatively short time. Prior to the changes to pkgsite, there was no tool that rewrote comment semantics here. On noticing that this had happened I raised #51807 and sent this change fixing breakage of <pre> block handling of text with ''. I later sent a change to revert, but this change stagnated without discussion from pkgsite owners. Later the discussion at #51082 happened, and then semantic rewrites (beyond just rewriting at the arguably not-completely-horrible level of an information interface, i.e. at rendering) were implemented in https://go-review.googlesource.com/c/go/+/397280. gofmt rewrites code, that's fine, but until this change it had never changed the semantics of the comments. This is something that it does now and is new, not 15 years old.

@rsc
Contributor

rsc commented Jul 20, 2023

I understand the points you are both making.

The point I am making is that doc comments have a certain semantic meaning as defined by how they render in HTML. The gofmt formatting changes for doc comments make the text form more closely reflect that semantic meaning, so that something that before was an only-in-HTML problem is now an in-the-source-code problem.

While I dislike smart quotes in webpages, I am significantly more concerned about source code rewrites. This is new (well, it was when I wrote the issue). I am concerned with how this impacts on ability to read source in editor.

When the doubled single quotes are intended as smart double quotes, the conversion should improve the ability to read the comments in the editor. The problem is when a doubled single quote is used to mean something different, like double-prime. For better or worse, doubled single quote does not have the semantic meaning double-prime in Go doc comments. There are other code point sequences with that meaning, as has been discussed at length.

To the extent that the suggestion is to stop changing comments to more closely reflect their semantic meaning, it's unlikely we would do that. To me, it's a good thing that the source code is kept aligned with what it actually says, and what f'' says in its official doc comment interpretation is f”. Same reason gofmt rewrites a+b * c into a + b*c.

That's why instead I shifted the conversation to what it would mean to change the semantic meaning of these comments to say that doubled ASCII single quotes no longer turn into double-quotes.

Prior to the changes to pkgsite, there was no tool that rewrote comment semantics here. On noticing that this had happened I raised #51807 and sent this change fixing breakage of <pre> block handling of text with ''

For unfortunate historical reasons, pkgsite uses a different renderer than the standard library. It's not the canonical renderer, it has ad-hoc adjustments that were not vetted by any kind of proposal process, and what it does is not authoritative on any of this. In the long term pkgsite should be converted to use the standard library renderer. I will note that the standard renderer has never had the bug you reported in #51807 of rewriting doubled single quotes in code blocks. That's certainly a real bug.

I can also say that rewriting of comments to match their semantic meaning is focused on #51082 and go/doc. What pkgsite does was not taken into account.

@kortschak
Contributor Author

To the extent that the suggestion is to stop changing comments to more closely reflect their semantic meaning, it's unlikely we would do that.

This, I think, is the central point of contention. I do not believe that anyone knows the semantic intention of the written text in a comment better than the author at the time of writing, much less a machine. My focus has been on mathematics symbols, and I think this has allowed wiggle room, but similar issues arise with literally significant uses of '' in documentation relating to inter-language packages. This has been raised by others. In the absence of a mechanism to force in-line <pre> blocks in godoc, this forces authors to use code blocks when referring to syntax that depends on this. Worse, the changes are made in internal comments that are not intended to be rendered to the user and may have significant semantics associated with the choice of code point.

@cespare
Contributor

cespare commented Jul 20, 2023

I don't think it has been the case that most, or even many, people writing Go code over the last 15 years have been aware of these semantics for ''. As one data point, when we reformatted our Go code at work for Go 1.19 gofmt we had zero instances of intentional use of '' to mean "smart quotes" but we did have a couple of instances where things had to be adjusted to avoid bad rewrites. It might be interesting to do similar analysis on larger corpora.

@kortschak
Contributor Author

Thanks, Ian, yes, I'm aware of that. That is the non-inline form I was referring to. This form disrupts reading.

@willfaught
Contributor

willfaught commented Jul 21, 2023

In my opinion, '' shouldn't be rewritten in plain text. As @rsc said, there are other code points for these things, if that is what you want. If you want fancy TeX syntax, go write TeX...away from Go documentation.

We shouldn't be encouraging non-ASCII punctuation for English writing in Go documentation, anyway. We're not writing fancy, typeset dialogue. If you mean to write a double quote, then write ". We don't want half of comments to use one syntax, and the other half to use another. We limited bulleted list syntax to hyphens only for this very reason.

There are ways to automatically render ' and " as matched curly versions, if we really want curly quotes. See how Hugo does it.

At the very least, if we're going to rewrite plain text, then we need to provide a way to denote inline preformatted formatting like in Markdown so people can keep their primes'''.

(Edited again)

@magical
Contributor

magical commented Jul 25, 2023

Here is an example in the wild of gofmt mangling the documentation in @robpike's project, ivy.

robpike/ivy@3d1d0ff#diff-a20b1b3b4b2bca5bef2b853a6e3f19def513381f9c7ee68d6979f0435c885dcfL210

 Syntactically, string literals are very similar to those in Go, with back-quoted
 raw strings and double-quoted interpreted strings. Unlike Go, single-quoted strings
 are equivalent to double-quoted, a nod to APL syntax. A string with a single char
-is just a singleton char value; all others are vectors. Thus ``, "", and '' are
+is just a singleton char value; all others are vectors. Thus “, "", and ” are
 empty vectors, `a`, "a", and 'a' are equivalent representations of a single char,
 and `ab`, `a` `b`, "ab", "a" "b", 'ab', and 'a' 'b' are equivalent representations
 of a two-char vector.

@thepudds
Contributor

FWIW, #61365 is a concrete example involving comments discussing SQL quoting syntax in a SQL parser.

we also have 15 years of documentation that expects the existing quote conversions, and it seems like a mistake to change that rule after such a long time

I haven't quite followed this issue, and I don't know if this is a valid way of thinking about it, but it seems that this transform was happening for many years, but it was happening only transiently at render time (e.g., when rendering to HTML on godoc.org).

And now as of Go 1.19, the transform started happening in two places:

  1. it still happens transiently at render time, and
  2. the transform now also happens persistently at the source code level, thanks to gofmt.

For actively maintained source files (files that have been edited with a modern toolchain), the render-time transform is now effectively a no-op because the source code already has the smart quotes.

Of course, an actively maintained project might not actively edit each source file (and not all projects do a project-wide gofmt), so some files might not get touched frequently, or ever.

I wonder if in the future it might be reasonable to turn off the transform for both the HTML rendering and for code formatting, because at some point enough of the existing actively maintained code that relies on the transform would still render as “expected” by that author because their source code was already persistently transformed. Of course, people would need to learn to stop typing '' and expecting smart quotes to appear as of a certain Go future release, but maybe that would be comparable to the level of change in human behavior needed due to some gofmt changes in the past?

If that happens at some point, then for a project or file that intended smart quotes but has not been actively edited since Go 1.19 (e.g., maybe a project is "done") or when looking at pre-1.19 versions of documentation for some projects on pkg.go.dev, the render-time view would match the source code view (e.g., both showing ''), which might still be understandable to a reader? Or at least, the render-time view might be no more confusing than the source code view.

@rsc
Contributor

rsc commented Aug 16, 2023

I'm not completely ignoring this - I have a program running to gather data we need and haven't gotten through the results yet.

@rsc
Contributor

rsc commented Dec 21, 2023

I extracted all the doc comments containing single forward or backward quotes from my corpus of code from proxy.golang.org, totaling 90GB. I've attached a shuffled random sample of 100k comments as docquote100k.txt.gz. The sampling is just from the full collection of comments with no regard to packages or modules, so if one package or module has 10X the comments of another, that package or module's comments will be sampled at 10X the rate of the other.

In 88% of those comments, the only single quotes are forward quotes between alphabetic ASCII characters, like in “don't communicate by sharing memory”. Those are served perfectly well by the existing rule but also easy to identify, so it's worth filtering them out. Let's call that the contraction form (it also includes possessives).
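
The contraction-form filter can be sketched as a simple byte scan. This is a guess at the rule as stated (every single quote is a forward quote sitting between two ASCII letters), not the program actually used for the analysis; the function names are hypothetical.

```go
package main

import "fmt"

// isASCIILetter reports whether b is in A-Z or a-z.
func isASCIILetter(b byte) bool {
	return ('a' <= b && b <= 'z') || ('A' <= b && b <= 'Z')
}

// onlyContractionQuotes reports whether s contains no backquotes and every
// ASCII apostrophe in s has an ASCII letter on both sides. A string with
// no quotes at all also returns true, so callers should pre-filter.
func onlyContractionQuotes(s string) bool {
	for i := 0; i < len(s); i++ {
		switch s[i] {
		case '`':
			return false
		case '\'':
			if i == 0 || i == len(s)-1 || !isASCIILetter(s[i-1]) || !isASCIILetter(s[i+1]) {
				return false
			}
		}
	}
	return true
}

func main() {
	for _, s := range []string{
		"don't communicate by sharing memory",
		"f' and f'' are derivatives",
		"use `x` here",
	} {
		fmt.Printf("%-40q contraction-only: %v\n", s, onlyContractionQuotes(s))
	}
}
```

Scanning bytes rather than runes is safe here because the bytes of multi-byte UTF-8 sequences never equal an ASCII quote or letter.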

Ignoring contraction quotes, docquote2-100k.txt.gz is a new shuffled random sample of 100k comments with non-contraction-form quotes. (The sampled comments all contain non-contraction-form quotes, but they may also contain contraction-form quotes.) Further downsampling, docquote2-1k.txt.gz is a sample of only 1k comments. Those 1000 comments break down as follows:

Backquote `x` only (783)

1 2 3 4 5 6 7 8 9 10 12 13 16 18 19 20 21 23 24 25 26 27 28 29 33 34 35 36 38 40 41 42 43 44 45 47 48 49 52 53 55 56 59 60 61 64 65 66 67 68 70 71 72 73 74 77 78 79 80 82 83 84 85 86 87 89 90 91 92 93 94 96 97 98 99 103 105 106 107 109 111 113 114 115 116 117 118 119 120 121 122 123 124 126 127 128 129 130 131 135 137 139 140 143 144 145 146 147 148 150 151 152 153 155 156 158 159 161 162 163 164 165 167 168 170 172 173 174 175 176 178 180 181 182 183 184 185 186 189 191 192 194 195 196 197 198 199 200 201 202 203 205 206 207 210 211 212 213 216 217 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 243 244 245 246 247 248 250 251 253 254 255 257 258 259 260 261 262 263 264 265 266 268 270 271 273 274 275 276 277 278 279 280 282 283 285 286 287 288 289 290 291 292 293 294 295 296 297 298 300 301 302 303 304 305 307 308 309 311 312 313 314 316 318 319 321 322 323 325 327 328 329 330 331 333 334 336 337 338 339 340 341 342 343 344 345 346 347 349 350 351 352 353 354 355 356 357 358 360 361 362 363 364 366 367 368 369 370 372 373 374 375 376 377 379 380 381 384 385 386 387 388 389 390 391 392 393 396 397 398 399 400 402 403 404 405 407 408 409 410 411 414 416 417 418 419 420 421 422 424 425 426 427 428 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 447 448 449 450 451 452 454 455 456 457 458 459 460 461 451 463 465 468 469 470 471 472 473 475 476 477 479 480 481 482 483 485 486 487 489 490 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 513 514 515 516 517 519 520 521 522 523 525 526 527 528 530 531 532 533 534 535 537 538 541 542 543 544 545 546 547 548 549 550 552 553 555 557 558 559 560 561 562 563 564 565 567 568 569 570 572 573 574 577 578 579 580 581 583 584 585 586 587 588 590 591 592 593 594 596 597 598 600 601 602 603 604 605 606 607 608 611 612 613 614 615 616 618 619 620 621 622 623 624 625 627 628 629 630 631 632 633 634 639 640 641 644 645 647 650 651 653 654 655 656 657 658 
660 662 665 666 667 668 670 672 673 674 676 677 678 679 680 683 685 686 687 688 689 690 691 692 693 695 696 697 698 699 700 701 702 703 704 705 706 707 710 712 713 714 716 717 719 720 722 723 724 725 726 727 729 731 732 733 734 735 737 738 740 744 746 747 748 749 750 751 752 753 755 756 757 758 759 760 762 764 765 766 767 768 769 770 771 772 773 774 776 777 778 779 781 782 783 784 785 787 788 789 791 792 793 794 795 796 799 801 803 804 806 807 808 809 810 811 813 817 818 819 820 821 823 824 827 828 829 831 832 833 834 835 836 838 839 840 841 843 846 847 848 849 850 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 872 873 874 875 877 878 879 881 883 885 886 887 889 890 891 893 894 896 897 899 900 901 902 903 904 905 906 907 908 909 910 911 912 914 915 917 918 920 922 923 924 925 926 928 929 930 933 934 935 936 937 940 941 942 943 944 945 946 947 949 951 952 954 956 957 958 959 960 962 963 964 965 966 968 969 970 971 972 973 974 975 976 977 979 980 981 984 985 987 988 989 990 991 992 993 994 996 998 999 1000

Some are inconsistent, only putting backquotes around some names and leaving others unquoted, like 40, 74, 129, among others.

Forward quote 'x' only (128)

11 14 15 17 22 30 46 50 51 57 58 63 69 75 76 81 88 95 100 101 108 110 112 125 133 136 141 142 154 157 160 166 169 187 190 193 204 208 215 218 219 220 242 249 256 269 272 284 299 306 315 317 320 324 332 348 382 383 394 395 401 412 413 423 453 462 466 478 488 518 524 536 540 551 566 575 576 582 589 595 599 609 610 617 626 635 637 642 643 646 648 649 659 663 669 682 708 709 745 754 786 790 798 800 802 805 812 814 815 816 822 837 842 844 876 880 882 884 892 895 898 919 932 938 939 950 961 982

Backquote `x` and double quote "x" (19)

138 371 378 464 467 491 652 671 681 728 739 761 775 825 826 927 983 986 997

Backquote `x` and forward quote 'x' (13)

39 62 104 134 149 359 365 711 715 736 743 830 955

Smart single quotes `x' (7)

102 177 310 675 741 763 931

Backquote and triple-backquote (6)

31 32 37 694 916 967

Forward quote 'x' and double quote "x" (6)

571 721 730 845 851 978

Triple backquote only (5)

171 406 539 554 797

Double backquote ``x`` (4)

281 638 913 953

Double forward quote ''x'' (2)

415 871

Backquote, triple-backquote, and double-quote (1)

995

Forward quote and triple-backquote (1)

214

Trifecta! Forward quote 'x', backquote `x` and double quote "x" (1)

921

Backquote `x` and HTML &quot;x&quot; (1)

179

Smart doubled single quotes ``x'' (1)

742

Escaped forward quotes \'x\' (1)

188

Forward-backward paired quotes `'x`' (1)

267

False positive due to struct tags in code display (2)

252 684

Bad shuffle splitting bug or other false positive (5)

54 132 484 664 718

Punctuation lists ((!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~)) (5)

209 429 636 661 780

Stray quote in code block or commented-out code (5)

326 512 529 556 948

Forward single quotes mid-syntax (1)

335

Backquote in +kubebuilder:validation:Pattern directive (2)

446 474

Literal backquote inside double-quoted string literal (1)

888

A few observations:

  • A few modules account for the lion's share of these non-contraction quotes:

    • 196 (19.6%) of the comments are from google.golang.org/genproto, repeating comments from .proto files, which use different comment conventions. It is possible that the protobuf compiler should rewrite those comments to Go syntax.

    • 183 came from modules containing the substring “pulumi”.

    • 70 came from modules containing the substring “azure” (but not “pulumi”)

    • 61 came from modules containing the substring “aws” (but not “pulumi”)

    In all, these account for more than half the comments.

  • Backquoted names are very common when non-contraction quotes are present (78% of comments sampled), due to the influence of Markdown, often in code generated from other languages or specification formats. But even in those comments, they are applied inconsistently. Despite the usage, adding those to Go comments is a non-starter, for two reasons: (1) it would lead to endless arguments about the style for what gets backquoted and doesn't in comments, which contradicts the goal of doc comments being lightweight to both read and write, and (2) Go already has backquoted string syntax that may need to appear in doc comments anyway.

  • Reading the comments, I think we could very accurately decide opening and closing quotes (opening quotes have spaces or start of line before them, closing have punctuation or spaces or end of line after them) and match them. Text like `abc` or 'xyz' or "αβ" could render in HTML in a code font as `abc` or 'xyz' or "αβ" (with the quotes), while `abc' could continue to render as quotation marks. This would both handle Go string and character literals reasonably well and highlight misuse of the quotes with other meanings. The comment text formatting would be unaffected. That said, it introduces a lot of complexity and only for people misusing Go, so on balance it's probably not worthwhile.

  • Doubled quotes are relatively rare, or at least outweighed by all this generated code echoing comments from other languages. We'll look at those next.

  • It makes sense that doubled quotes would be rare, because anyone using gofmt would have them replaced with “ and ”. At some level it's surprising how many triple-backquotes there are. Probably this generated code is not run through gofmt, or it is old and predates the gofmt rewrite.


In docquote2-100k.txt, using regular expressions to classify:

  • There are 1438 comments matching ```, or about 1.4%. This is pretty close to the 13/1000 we saw before. There's nothing to be done with those, since Go already has a code block syntax; we already avoid turning ``` into “`.

  • There are 661 comments matching `` but not ```, 524 comments matching '' but not '''; 1047 total (there is some overlap). docquote3.txt.gz

  • Of the 1047 comments, 433 are Markdown-style double-backquotes: ``foo``.

  • Of the 1047 comments, 204 are doubled single quotes on both sides: ''foo''. I am not sure why this is a common syntax.

  • Of the 1047 comments, 134 are Go-style doubled single quotes: ``foo''. As noted above, these must not have been gofmt'ed yet with a new gofmt. Of course, the others must not have been gofmt'ed yet either, but they may be mostly generated code that was not gofmt'ed, as opposed to native Go code.

Sampling 100 of the 1047 and inspecting by hand:

  • 42 ``x``
  • 25 ''x''
  • 11 ``x'' (Go syntax)
  • 5 mismatched, like ''listening_method'
  • 5 empty strings ''
  • 4 nested single quotes, like 'category eq '{value}''
  • 2 doc comment mentions in Go project code, like ‘turn `` into “ and '' into ”.’
  • 2 accidental doubled quotes due to markdown escaping, like `\`*\ represents a field named * ``.
  • 1 mixing ``x`` and ''x'' in the same comment
  • 1 doubled prime: f' and f''
  • 1 double-quoted doubled single quote: Add_quotes appends "''" to the content stream.
  • 1 doubled '' inside a '-quoted string, like in Pascal.

Based on all this we can very roughly estimate about 10 out of 100k comments meaning '' as a doubled prime, or about 1 per 10,000 of the comments containing non-contraction-form quotes. So it happens, but it's not terribly common. Real Go quoted comments outnumber doubled primes 10 to 1.
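The extrapolation is just scaling the hand-inspected sample counts up to the 1047 matching comments. A small sketch of that arithmetic follows; the function name is made up for illustration, and the inputs are the sample counts listed above (1 doubled prime and 11 Go-style quotes out of 100 sampled).

```go
package main

import "fmt"

// scale extrapolates a count seen in a sample to the full matching set.
func scale(hits, sampleSize, matching int) float64 {
	return float64(hits) / float64(sampleSize) * float64(matching)
}

func main() {
	primes := scale(1, 100, 1047)    // doubled primes per 100k comments
	goQuoted := scale(11, 100, 1047) // ``x'' comments per 100k comments
	fmt.Printf("doubled primes: about %.0f per 100k\n", primes)
	fmt.Printf("Go-style quotes: about %.0f per 100k\n", goQuoted)
	fmt.Printf("ratio: about %.0f to 1\n", goQuoted/primes)
}
```

With a single hit in the sample, the estimate is obviously very coarse.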

Grepping for comments with lines containing just a single doubled-quote in docquote2-100k (^[^`']*''[^`']*$) and then reading the results, there are:

  • 21 empty string
  • 1 in CJK text I can't read
  • 1 stray ''
  • 1 double-prime

Note that if multiple double-primes occurred on a line, they wouldn't get caught by that expression, so it may be undercounting, but probably not by a lot, and it estimates 1 in 100k (down from the previous 1 in 10k). Grepping for single ``x'' on a line in the same set finds 71, so by this estimate Go quoted comments outnumber doubled primes 70 to 1.
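A sketch of that grep in Go, applying the same expression line by line. The function name is hypothetical. Lines are split first because a negated character class in Go's regexp package also matches newlines, so the pattern is not applied to the whole comment at once.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Same expression as above: exactly one '' on the line and no other ` or '.
var oneDoubled = regexp.MustCompile("^[^`']*''[^`']*$")

// linesWithOneDoubledQuote returns the lines of comment that match.
func linesWithOneDoubledQuote(comment string) []string {
	var hits []string
	for _, line := range strings.Split(comment, "\n") {
		if oneDoubled.MatchString(line) {
			hits = append(hits, line)
		}
	}
	return hits
}

func main() {
	comment := "y = f''(x)\nf' and f''\nf'' and g''\nthe empty string ''"
	for _, l := range linesWithOneDoubledQuote(comment) {
		fmt.Printf("%q\n", l)
	}
}
```

Note that a line such as f' and f'' is skipped because of the lone single quote, and f'' and g'' is skipped because it has two doubled quotes, which is the undercounting mentioned above.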

So doubled primes are not a reason to turn off the Go quoted comment changes.

It's worth noting that Go project code mentioning the quote conversion was about 2% of the samples found, suggesting that there are just not many doubled quotes in the ecosystem overall (only 50X more than Go project code).

A more compelling reason might be all the misuse of ``x`` or ''x'', but again most of these seem to be in auto-generated APIs, so they are not terribly compelling. The most compelling reason is probably empty strings.

@rsc
Contributor

rsc commented Dec 21, 2023

All in all, I think this is a giant mess, mostly apparently caused by people shoehorning comment syntaxes from other languages into Go files. I don't think we have to complicate Go in response, any more than we do when people, say, translate Java literally to Go; those people are using Go wrong and should not do that.

That said, I also think we can tighten up the `` and '' conversion and fix most double-primes and empty strings, without losing the intentional quotation marks:

  1. A `` is only eligible to be converted to “ if it is preceded by a space or start of line and not followed by a space (using unicode.IsSpace).

  2. A '' is only eligible to be converted to ” if it is followed by a space, punctuation, or end of line, and not preceded by a space (using unicode.IsSpace and unicode.IsPunct).

  3. An eligible `` must be matched to the next eligible '' in the same paragraph (with no eligible `` or '' in the middle), and then both are converted.

This should correctly handle double-primes, unless they are inside `` and '' intended as quotes, but in that case just type “ and ” directly, or edit the gofmt'ed output to correct it, and it will stick. This should also correctly handle empty strings, even things like you can write both `` and '' or like both are fine (`` or '') (both of those would be left alone).
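To make the three rules concrete, here is a minimal sketch of how they might be applied to one paragraph of comment text. This is not the go/doc implementation. The function name is hypothetical, and two details are assumptions of the sketch rather than part of the rules as stated: only runs of exactly two quote characters are considered (so ``` and ''' are left alone), and a `` at the very end of the paragraph or a '' at the very start is treated as ineligible.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// mark records an eligible doubled quote at rune index pos.
type mark struct {
	pos  int
	open bool // true for ``, false for ''
}

// convertQuotes applies the three proposed rules to one paragraph.
func convertQuotes(para string) string {
	r := []rune(para)
	var marks []mark
	for i := 0; i < len(r); {
		if r[i] != '`' && r[i] != '\'' {
			i++
			continue
		}
		j := i // find the end of this run of identical quote characters
		for j < len(r) && r[j] == r[i] {
			j++
		}
		if j-i == 2 { // assumption: only exact doubles are candidates
			switch {
			// Rule 1: `` after space or start, not followed by space.
			case r[i] == '`' && (i == 0 || unicode.IsSpace(r[i-1])) &&
				j < len(r) && !unicode.IsSpace(r[j]):
				marks = append(marks, mark{i, true})
			// Rule 2: '' before space, punctuation, or end; not after space.
			case r[i] == '\'' && i > 0 && !unicode.IsSpace(r[i-1]) &&
				(j == len(r) || unicode.IsSpace(r[j]) || unicode.IsPunct(r[j])):
				marks = append(marks, mark{i, false})
			}
		}
		i = j
	}
	// Rule 3: convert only an eligible `` immediately followed by an eligible ''.
	repl := map[int]rune{}
	for k := 0; k+1 < len(marks); k++ {
		if marks[k].open && !marks[k+1].open {
			repl[marks[k].pos] = '“'
			repl[marks[k+1].pos] = '”'
			k++ // skip the closing mark we just consumed
		}
	}
	var b strings.Builder
	for i := 0; i < len(r); i++ {
		if q, ok := repl[i]; ok {
			b.WriteRune(q)
			i++ // skip the second character of the doubled quote
			continue
		}
		b.WriteRune(r[i])
	}
	return b.String()
}

func main() {
	for _, s := range []string{"``foo''", "f' and f''", "both are fine (`` or '')"} {
		fmt.Printf("%q -> %q\n", s, convertQuotes(s))
	}
}
```

Under these assumptions, a bare f'' has no eligible `` to pair with and is left untouched, as are the two empty-string examples above.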

I think we could reasonably make those changes for Go 1.23.

@kortschak
Contributor Author

@rsc thank you for doing that analysis. I still disagree with the notion that it is OK to rewrite comments given that the author of the comment knows the intended semantics better than anyone else (as stated in my comment here). The proposed change looks like it will reduce the harm done by these rewrites, and I think that it is possible to work around the harms that are done. Ideally, we would not be rewriting, but I think that despite the commentary here, I'm not going to convince you of that.

@gophun

gophun commented Dec 21, 2023

Forward quote 'x' only (128)

It's interesting that you consider ' as a forward quote. To me, it's a straight quote, and ´ would be a forward quote, the matching partner to the backquote `.

Smart single quotes `x' (7)

It's strange that anyone would do this. It seems typographically incorrect and unbalanced compared to `x´.

@zephyrtronium
Contributor

@gophun Most US keyboard layouts do not have a way to input ´ at all. `x' and ``x'' is a convention that arose decades ago as an approximation to proper styled quotes within ASCII (the American Standard Code for Information Interchange). Computing unfortunately has a long history of anglocentrism.

@bcmills
Contributor

bcmills commented Jan 5, 2024

It makes sense that doubled quotes would be rare, because anyone using gofmt would have them replaced with “ and ”. At some level it's surprising how many triple-backquotes there are. Probably this generated code is not run through gofmt, or it is old and predates the gofmt rewrite.

To me, that sounds like it points to an unfortunate flaw in the analysis: it is trading off two sources of survivorship bias.

  1. Code that is run through gofmt may have instances of the characters `` and '' already unintentionally converted to “ or ” without the author noticing.
  2. Users who expect '' to remain intact may have already noticed that gofmt rewrites that sequence, and may have already worked around it by rephrasing the comment.

We could perhaps measure (1) by looking for unexpected (unpaired?) occurrences of the characters “ and ” in comments in Go source files.

I don't see how we can measure (2) beyond upvotes on this issue, since that manifests as developer friction but often won't leave evidence in the source code.

@cespare
Contributor

cespare commented Jan 5, 2024

To call out what seems to me to be the elephant in the room: is there anyone who is not @rsc who affirmatively prefers the smart quote rewrites? I've followed this discussion pretty closely on this thread, on #51082, and on the many linked issues, and the sentiments I've seen from other folks range from non-committal to surprised to strongly averse. But I'm not aware of instances where someone said they like or want this behavior. Did I miss it? Is there some silent group of programmers out there who like the smart quote rewriting and will voice their displeasure if it goes away?

If the entire constituency for this feature is @rsc, then ISTM that what we are trying to figure out is the minimum amount of quote-rewriting that @rsc will accept.

I know this is a bit blunt, but I'm trying to clarify exactly what is going on here.

@rsc
Contributor

rsc commented Jan 10, 2024

We have a documented behavior that has existed for over a decade. I'm trying to respect the existing usage while avoiding the problem that @kortschak ran into. #54312 (comment) seems to do that.

Does anyone object to #54312 (comment) ?

@kortschak
Contributor Author

I do.

@rsc
Contributor

rsc commented Jan 24, 2024

Given that we rewrite `` and '' (doubled single quotes) in all cases now, #54312 (comment) seems like a strict improvement. I don't think removing this decade-old behavior entirely is on the table.

@kortschak
Contributor Author

kortschak commented Jan 24, 2024

Seems like this discussion was essentially pointless. I'm sorry I raised it.

At the very least I'd like to see the rewrites that touch comments that are never rendered into godoc reverted. There is zero reason to make those rewrites, and the tools (as repeatedly stated above) do not know the intention of the author better than the author does.

@rsc
Contributor

rsc commented Jan 24, 2024

At the very least I'd like to see the re-writes that touch comments that are never rendered into godoc being reverted.

I'm not sure what you mean here but the rewrites only affect doc comments. Non-doc comments are never modified. If you see any non-doc comments being modified, please post an example and we will fix it. Thanks!

@arp242

arp242 commented Jan 24, 2024

I don't think removing this decade-old behavior entirely is on the table.

Why not? Just because it's old doesn't mean it's good, or that it doesn't introduce more problems than it solves.

For a very long time this was more or less invisible behaviour most people weren't aware of. I think very few people would complain if it was removed.

Whether one likes it or not, using straight "quotes" is a standard way to write English, and that's been the case for quite a while now. It's fine to introduce typographical niceties, but doing that in a programming language context seems misplaced. I'm not writing for The New York Times, I'm writing Go code.

A complex rewriting scheme that takes several paragraphs to explain is not something I'd consider hugely desirable, and seems the sort of hidden complexity that should be avoided if at all possible.

I'm not sure what you mean here but the rewrites only affect doc comments.

It's still a "doc comment" if the function is unexported, as per my earlier example. But that "doc comment" never shows in godoc.

@kortschak
Contributor Author

kortschak commented Jan 25, 2024

I'm not sure what you mean here but the rewrites only affect doc comments. Non-doc comments are never modified.

Thank you for correcting me. I have confirmed that this is true mod the non-rendered unexported function comments (this is a fun one which I don't think would be handled correctly by the rules above). Apologies.

All up, I guess the proposed solution is acceptable. I can't say I like it, but the grosser parts of the rewrites can now be worked around with some effort. It strikes me that "but in that case just type “ and ” directly" would have been a good solution.

@rsc
Contributor

rsc commented Jan 31, 2024

@kortschak In your example https://go.dev/play/p/FLLDRxzddIW those quotes would be left alone in the new rules, since (1) the '' has spaces around it and (2) there is no `` to match it with.

@kortschak
Contributor Author

Thank you for clarifying that for me.

@rsc
Contributor

rsc commented Feb 8, 2024

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

Currently, in doc comments, doubled single quotes `` and '' are rewritten to Unicode double quotes “ and ”. (This has been the behavior of godoc -> html conversion since before the public Go release. The doc comment reformatter just makes that clearer.)

The proposal is to tighten the conversion rules, both in the doc comment reformatter and in the html conversion, to the following:

  1. A `` is only eligible to be converted to “ if it is preceded by a space or start of line and not followed by a space (using unicode.IsSpace).

  2. A '' is only eligible to be converted to ” if it is followed by a space, punctuation, or end of line, and not preceded by a space (using unicode.IsSpace and unicode.IsPunct).

  3. An eligible `` must be matched to the next eligible '' in the same paragraph (with no eligible `` or '' in the middle), and then both are converted.

@rsc
Contributor

rsc commented Feb 14, 2024

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

Currently, in doc comments, doubled single quotes `` and '' are rewritten to Unicode double quotes “ and ”. (This has been the behavior of godoc -> html conversion since before the public Go release. The doc comment reformatter just makes that clearer.)

The proposal is to tighten the conversion rules, both in the doc comment reformatter and in the html conversion, to the following:

  1. A `` is only eligible to be converted to “ if it is preceded by a space or start of line and not followed by a space (using unicode.IsSpace).

  2. A '' is only eligible to be converted to ” if it is followed by a space, punctuation, or end of line, and not preceded by a space (using unicode.IsSpace and unicode.IsPunct).

  3. An eligible `` must be matched to the next eligible '' in the same paragraph (with no eligible `` or '' in the middle), and then both are converted.

@rsc changed the title from "proposal: go/doc: reconsider comment rewrites of '' to ”" to "go/doc: reconsider comment rewrites of '' to ”" on Feb 14, 2024
@rsc rsc modified the milestones: Proposal, Backlog Feb 14, 2024
@rsc
Contributor

rsc commented Mar 8, 2024

A belated reply to #54312 (comment), after being pinged privately about it.

In general our approach is not to change documented behavior without a very good reason, in the spirit of both Chesterton's Fence and Go compatibility.

I did the work of reading through and presenting lots of existing usage in #54312 (comment). The key point is "Real Go quoted comments outnumber doubled primes 10 to 1." To discard existing usage, we need a much stronger reason than "people don't like it" or even "we all agree it was a mistake". Not breaking things that work is crucial to the overall approach we take in Go.

That's why I suggested the compromise of converting fewer but keeping all the real usage working.

Projects
Status: Accepted