Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: unicode/utf8: provide generic versions of all functions #56948

Open
dsnet opened this issue Nov 27, 2022 · 13 comments
Open

proposal: unicode/utf8: provide generic versions of all functions #56948

dsnet opened this issue Nov 27, 2022 · 13 comments

Comments

@dsnet
Copy link
Member

dsnet commented Nov 27, 2022

I'm trying to a write a generic function that operates on either string | []byte. However, I'm unable to do so since the implementation of that function needs to depend on either utf8.DecodeRune or utf8.DecodeRuneInString, but I'm unable to express that as a simple expression without using a type switch.

I propose we add generic versions of:

func DecodeRune[Bytes []byte | string](p Bytes) (r rune, size int)
func DecodeLastRune[Bytes []byte | string](p Bytes) (r rune, size int)
func FullRune[Bytes []byte | string](p Bytes) bool
func RuneCount[Bytes []byte | string](p Bytes) int
func Valid[Bytes []byte | string](p Bytes) bool

It is unclear what the name should be since the simpler names are already taken by the non-generic variants.
Perhaps, we should have a v2 variant of utf8 that operates on either type.

@dsnet dsnet added the Proposal label Nov 27, 2022
@gopherbot gopherbot added this to the Proposal milestone Nov 27, 2022
@bcmills
Copy link
Contributor

bcmills commented Nov 28, 2022

Given the number of existing functions that operate on []byte, and the fact that functions that accept a []byte generally cannot assume they can modify it (because that can in general cause data races), I wonder if it would be more productive to have a single generic function (perhaps in the bytes package?) like:

package bytes

func Of[Bytes ~[]byte | ~string](p Bytes) []byte

It seems to me that that would handle all of the cases where a function accepts a slice-or-string and returns only data parsed from it, including all of the utf8 functions you mentioned above. (However, it would not handle cases like TrimPrefix where the return type of the function depends on the argument type.)

bytes.Of could be implemented today using reflect, although if it were in the standard library it would be permissible for the compiler to recognize and optimize it in a different way.

@aarzilli
Copy link
Contributor

What's the difference between bytes.Of and a cast to []byte?

@bcmills
Copy link
Contributor

bcmills commented Nov 28, 2022

@aarzilli, []byte(s) for a string s makes a mutable copy. bytes.Of would not.

(Compare my unsafeslice.OfString, although I'm becoming increasingly convinced that that operation should not actually be considered “unsafe”...)

@aarzilli
Copy link
Contributor

What would happen if someone did try to mutate the return value of bytes.Of? The cast already doesn't make copies sometimes IIRC, wouldn't it be better to improve that?

@dsnet
Copy link
Member Author

dsnet commented Nov 28, 2022

@bcmills Being able to mutate a string feels pretty unsafe to me.

The motivation for this proposal comes from the fact that we can't efficiently call a non-mutating string-based argument with a []byte without allocation (or vice-versa). To some degree, this is a generalization of the type of problem in #42429. Alternatively, if the compiler allowed implicit unsafe conversions of []byte to string (or string to []byte) when calling pure (non-mutating) functions, that would obviate the need for this in many applications.

The only reason I'm trying to write a generic function that operates on both string and []byte is because of performance.

@DeedleFake
Copy link

You can almost write a function yourself to convert to a string regardless of which you started with using the unreleased Go 1.20, but if you don't want to use a type switch than it still depends on implementation details that you're not supposed to:

// toString returns a string with the same backing data as v. It is
// not safe to hold the returned string if the original was a []byte
// and might be modified.
func toString[T ~string | ~[]byte](v T) string {
	// This assumes that the first word of both the string and slice
	// header structs is a pointer to the data. Unfortunately, there's
	// no way to just get the data from either in an opaque way because
	// indexes into strings aren't addressable, for good reason.
	return unsafe.String(*(**byte)(unsafe.Pointer(&v)), len(v))
}

@dsnet
Copy link
Member Author

dsnet commented Dec 4, 2022

I just filed #57072 as a compiler optimization to somewhat obviate the need for this.

@gopherbot
Copy link

Change https://go.dev/cl/469556 mentions this issue: encoding/json: unify encodeState.string and encodeState.stringBytes

gopherbot pushed a commit that referenced this issue Feb 24, 2023
This is part of the effort to reduce direct reliance on bytes.Buffer
so that we can use a buffer with better pooling characteristics.

Unify these two methods as a single version that uses generics
to reduce duplicated logic. Unfortunately, we lack a generic
version of utf8.DecodeRune (see #56948), so we cast []byte to string.
The []byte variant is slightly slower for multi-byte unicode since
casting results in a stack-allocated copy operation.
Fortunately, this code path is used only for TextMarshalers.
We can also delete TestStringBytes, which exists to ensure
that the two duplicate implementations remain in sync.

Performance:

    name              old time/op    new time/op    delta
    CodeEncoder          399µs ± 2%     409µs ± 2%   +2.59%  (p=0.000 n=9+9)
    CodeEncoderError     450µs ± 1%     451µs ± 2%     ~     (p=0.684 n=10+10)
    CodeMarshal          553µs ± 2%     562µs ± 3%     ~     (p=0.075 n=10+10)
    CodeMarshalError     733µs ± 3%     737µs ± 2%     ~     (p=0.400 n=9+10)
    EncodeMarshaler     24.9ns ±12%    24.1ns ±13%     ~     (p=0.190 n=10+10)
    EncoderEncode       12.3ns ± 3%    14.7ns ±20%     ~     (p=0.315 n=8+10)

    name              old speed      new speed      delta
    CodeEncoder       4.87GB/s ± 2%  4.74GB/s ± 2%   -2.53%  (p=0.000 n=9+9)
    CodeEncoderError  4.31GB/s ± 1%  4.30GB/s ± 2%     ~     (p=0.684 n=10+10)
    CodeMarshal       3.51GB/s ± 2%  3.46GB/s ± 3%     ~     (p=0.075 n=10+10)
    CodeMarshalError  2.65GB/s ± 3%  2.63GB/s ± 2%     ~     (p=0.400 n=9+10)

    name              old alloc/op   new alloc/op   delta
    CodeEncoder          327B ±347%     447B ±232%  +36.93%  (p=0.034 n=9+10)
    CodeEncoderError      142B ± 1%      143B ± 0%     ~     (p=1.000 n=8+7)
    CodeMarshal         1.96MB ± 2%    1.96MB ± 2%     ~     (p=0.468 n=10+10)
    CodeMarshalError    2.04MB ± 3%    2.03MB ± 1%     ~     (p=0.971 n=10+10)
    EncodeMarshaler      4.00B ± 0%     4.00B ± 0%     ~     (all equal)
    EncoderEncode        0.00B          0.00B          ~     (all equal)

    name              old allocs/op  new allocs/op  delta
    CodeEncoder           0.00           0.00          ~     (all equal)
    CodeEncoderError      4.00 ± 0%      4.00 ± 0%     ~     (all equal)
    CodeMarshal           1.00 ± 0%      1.00 ± 0%     ~     (all equal)
    CodeMarshalError      6.00 ± 0%      6.00 ± 0%     ~     (all equal)
    EncodeMarshaler       1.00 ± 0%      1.00 ± 0%     ~     (all equal)
    EncoderEncode         0.00           0.00          ~     (all equal)

There is a very slight performance degradation for CodeEncoder
due to an increase in allocation sizes. However, the number of allocations
did not change. This is likely due to remote effects of the growth rate
differences between bytes.Buffer and the builtin append function.
We shouldn't overly rely on the growth rate of bytes.Buffer anyways
since that is subject to possibly change in #51462.
As the benchtime increases, the alloc/op goes down indicating
that the amortized memory cost is fixed.

Updates #27735

Change-Id: Ie35e480e292fe082d7986e0a4d81212c1d4202b3
Reviewed-on: https://go-review.googlesource.com/c/go/+/469556
Run-TryBot: Joseph Tsai <joetsai@digital-static.net>
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
Auto-Submit: Joseph Tsai <joetsai@digital-static.net>
@rsc
Copy link
Contributor

rsc commented Mar 15, 2023

Sounds like this is on hold for better compiler optimizations before we can even consider whether this is a good API.

@dsnet
Copy link
Member Author

dsnet commented Mar 15, 2023

#20881 is also another compiler optimization that would address the need for this.
If that was fixed, then we could always do:

utf8.DecodeRune([]byte(in))

regardless of whether in was already a []byte or a string.

@rsc
Copy link
Contributor

rsc commented Apr 6, 2023

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc
Copy link
Contributor

rsc commented Apr 12, 2023

Re utf8.DecodeRune([]byte(in)), sure but it would be even nicer to write utf8.DecodeRune(in) and not worry about whether the conversion allocates.

@rsc
Copy link
Contributor

rsc commented Apr 12, 2023

Placed on hold.
— rsc for the proposal review group

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Hold
Development

No branches or pull requests

6 participants