proposal: unicode/utf8: rune count in a valid UTF-8 string #57896

Closed
wojciech-sneller opened this issue Jan 18, 2023 · 7 comments

@wojciech-sneller

I'd like to propose a function that returns the number of runes in a valid UTF-8 string. Such a function can be a few times faster than utf8.RuneCount; please see our results: SnellerInc/sneller@9ee35af.

There are use cases where we are sure that the input is valid. Also, the Go standard library already provides ToValidUTF8 (https://pkg.go.dev/strings#ToValidUTF8).

@gopherbot gopherbot added this to the Proposal milestone Jan 18, 2023
@seankhliao seankhliao changed the title proposal: unicode/utf8 - rune count in a valid UTF-8 string proposal: unicode/utf8:- rune count in a valid UTF-8 string Jan 18, 2023
@seankhliao seankhliao changed the title proposal: unicode/utf8:- rune count in a valid UTF-8 string proposal: unicode/utf8: rune count in a valid UTF-8 string Jan 18, 2023
@seankhliao
Member

Can you point to examples of places where this function would be used?

@ianlancetaylor
Contributor

What should the function return if the string is not a valid UTF-8 string after all?

There's no particular reason that this function has to be in the standard library. Would it make sense to make it available as a third-party library and see if it gets adoption? https://go.dev/doc/faq#x_in_std

@wojciech-sneller
Author

For an invalid UTF-8 string, the function would return garbage. The use case I have in mind is a system that accepts possibly broken input but validates it early, so that only valid input is passed down; the system's components receive trusted, valid strings.

This is the approach we also used in the simdutf library: the API contains fully validating converters, but there are also faster counterparts that assume valid input.

@martisch
Contributor

I think it's better for more optimised functions (e.g. assuming valid UTF-8, ASCII characters only, mostly ASCII, mostly non-ASCII, ...) to be exposed in special libraries. Otherwise we end up with lots of different functions in the utf8 package (perhaps without names that clearly convey their different assumptions), each performance-optimized for some case but also easily misused.

I think it would be better if the library itself can convey what it is optimized for. It does not seem necessary for such a library, similar to simdutf8, to be in the standard library.

That said, if the existing utf8 functions can be made faster for common cases without making them more "unsafe" or unnecessarily complex relative to the performance gain, that is a possibility.

@rsc
Contributor

rsc commented Mar 15, 2023

In general we work very hard to ensure that Go functions do not "return garbage". That is the C/C++ way, not the Go way. My rant about where that path leads is at https://research.swtch.com/plmm#ub.

@wojciech-sneller
Author

Thanks for the comments. After rethinking this, I see that the core library is not the proper place for such specialised procedures. My only excuse was the existing ToValidUTF8.

One thing I strongly disagree with is the claim that SWAR techniques are unsafe or have anything to do with the memory model. It's about unusual access to a well-defined data structure. A UTF-8 string is just a sequence of bytes, but it can be viewed as a sequence of uint64. BTW, Go allows modifying individual bytes of a string, thus users can do anything and produce garbage sequences.

@rsc
Contributor

rsc commented Apr 6, 2023

This proposal has been declined as retracted.
— rsc for the proposal review group

6 participants