proposal: bufio: add Reader.DiscardRunes(n int) and utf8: add RuneByteLen #47621
Please post code as ordinary text or a link to the Go playground, not as an image. Images are hard to read. Thanks.
To clarify, I think you are suggesting something like this:

// DiscardRune skips the next n UTF-8 encoded runes, returning the number of bytes discarded.
//
// If DiscardRune skips fewer than n runes, it also returns an error.
func (b *Reader) DiscardRune(n int) (discardedBytes int, err error)

Should this be
Definitely. Since
The code has been re-posted as ordinary text, and the method has been renamed to DiscardRunes.
func BenchmarkDiscardVsRead(b *testing.B) {
	b.Run("discardRunes", func(b *testing.B) {
		data := strings.Repeat("中", 4097)
		for i := 0; i < b.N; i++ {
			buf := bytes.NewBufferString(data)
			r := NewReader(buf)
			r.DiscardRunes(4097)
		}
	})
	b.Run("readRuneForDiscard", func(b *testing.B) {
		data := strings.Repeat("中", 4097)
		for i := 0; i < b.N; i++ {
			buf := bytes.NewBufferString(data)
			r := NewReader(buf)
			for j := 0; j < 4097; j++ {
				r.ReadRune()
			}
		}
	})
}
As in the benchmark above, although there is a per-rune operation,
Change https://golang.org/cl/341390 mentions this issue:
Cool, I have the same scenario. Good job!
If this does get supported: instead of GetRuneBytes, can the implementation reuse the existing DecodeRune? It seems a lot of the logic would otherwise be duplicated.
GetRuneBytes has been deleted.
This proposal has been added to the active column of the proposals project |
It seems like a mistake for the protocol to use rune-length-prefixed strings, but we can't do anything about that. Is it possible to speed up DecodeRune instead? It seems like GetRuneBytes in the code above is exactly utf8.DecodeRune, just without the shifts and ors. It's hard to believe that that's 30% of the time. Why is this special case so much faster, and can we make the general case faster?
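For comparison, the byte width that GetRuneBytes/RuneByteLen computes is already available from the standard utf8.DecodeRune, which returns the size alongside the decoded rune. The sketch below shows that equivalence; runeLen is a hypothetical helper name, not anything from the proposal or the standard library:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeLen reports the byte length of the first rune in p using only
// the exported unicode/utf8 API, with no duplicated first-byte table.
// It returns 0 for empty input and -1 for a truncated encoding,
// mirroring the RuneByteLen convention discussed in this thread.
func runeLen(p []byte) int {
	if len(p) == 0 {
		return 0
	}
	if !utf8.FullRune(p) {
		return -1 // incomplete rune at the end of the buffer
	}
	_, size := utf8.DecodeRune(p) // size is 1 for invalid encodings
	return size
}

func main() {
	fmt.Println(runeLen([]byte("中")))  // 3: a 3-byte rune
	fmt.Println(runeLen([]byte("a")))  // 1: ASCII
	fmt.Println(runeLen([]byte{0xE4})) // -1: truncated 3-byte sequence
}
```

The only work DecodeRune does beyond this is assembling the rune value from the payload bits, which is why a size-only special case can be slightly faster.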
Change https://golang.org/cl/349773 mentions this issue:
This seems too special-purpose and poorly defined to be added to the standard library. If it's truly a Java service you're talking to, you might not want rune at all but something to do with UTF-16, which is a subtly different thing. Also, what happens with surrogate pairs and other horrors? I believe the logic is specific to the protocol, and should be implemented there. In the meantime, I agree that if we can speed up some of the operations in the library to speed up your solution without adding new functionality, that would be a good way to respond. If the proposal does go forward, it will need to follow Go's conventions about invalid UTF-8, which are not applicable to Java anyway, another vote against.
Maybe what you said makes sense, but we need a solution. Would it be okay to keep just the DiscardRunes proposal and abandon the utf8 proposal?
Since it's a very special use case: could this be implemented as a new standalone Reader outside the standard library, or is there a special reason the standard library bufio.Reader needs to be extended?
If there are reasonable performance optimizations possible in utf8.DecodeRune, let's apply them. Today we already have r.Peek(r.Buffered()) to see the entire buffered data. It seems like we should probably decline to add this to the standard library.
Based on the discussion above, this proposal seems like a likely decline.
Thanks to the Go team for the positive response and help. The current proposal can be closed, and the existing methods can be used to build the required functionality, since the relevant scenarios are indeed relatively rare. The code is below; hopefully it helps others with the same need.
package sofahessian

import (
	"bufio"
	"errors"
	"unicode/utf8"
)

const (
	t1 = 0b00000000
	tx = 0b10000000
	t2 = 0b11000000
	t3 = 0b11100000
	t4 = 0b11110000
	t5 = 0b11111000

	maskx = 0b00111111
	mask2 = 0b00011111
	mask3 = 0b00001111
	mask4 = 0b00000111

	rune1Max = 1<<7 - 1
	rune2Max = 1<<11 - 1
	rune3Max = 1<<16 - 1

	// The default lowest and highest continuation byte.
	locb = 0b10000000
	hicb = 0b10111111

	// The names of these constants are chosen to give nice alignment in the
	// table below. The first nibble is an index into acceptRanges or F for
	// special one-byte cases. The second nibble is the Rune length or the
	// Status for the special one-byte case.
	xx = 0xF1 // invalid: size 1
	as = 0xF0 // ASCII: size 1
	s1 = 0x02 // accept 0, size 2
	s2 = 0x13 // accept 1, size 3
	s3 = 0x03 // accept 0, size 3
	s4 = 0x23 // accept 2, size 3
	s5 = 0x34 // accept 3, size 4
	s6 = 0x04 // accept 0, size 4
	s7 = 0x44 // accept 4, size 4
)

var (
	ErrReaderBufIsEmpty = errors.New("sofa-hessian: reader.buf is empty")
)

// first is information about the first byte in a UTF-8 sequence.
var first = [256]uint8{
	//   1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
	as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, // 0x00-0x0F
	as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, // 0x10-0x1F
	as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, // 0x20-0x2F
	as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, // 0x30-0x3F
	as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, // 0x40-0x4F
	as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, // 0x50-0x5F
	as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, // 0x60-0x6F
	as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, // 0x70-0x7F
	//   1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
	xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, // 0x80-0x8F
	xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, // 0x90-0x9F
	xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, // 0xA0-0xAF
	xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, // 0xB0-0xBF
	xx, xx, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, // 0xC0-0xCF
	s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, // 0xD0-0xDF
	s2, s3, s3, s3, s3, s3, s3, s3, s3, s3, s3, s3, s3, s4, s3, s3, // 0xE0-0xEF
	s5, s6, s6, s6, s7, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, // 0xF0-0xFF
}

// acceptRange gives the range of valid values for the second byte in a UTF-8
// sequence.
type acceptRange struct {
	lo uint8 // lowest value for second byte.
	hi uint8 // highest value for second byte.
}

// acceptRanges has size 16 to avoid bounds checks in the code that uses it.
var acceptRanges = [16]acceptRange{
	0: {locb, hicb},
	1: {0xA0, hicb},
	2: {locb, 0x9F},
	3: {0x90, hicb},
	4: {locb, 0x8F},
}

// DiscardRunes skips the next n runes, returning the number of bytes discarded.
//
// If DiscardRunes skips fewer than n runes, it also returns an error.
// If 0 <= n <= reader.Buffered(), DiscardRunes is guaranteed to succeed
// without reading from the underlying io.Reader.
func DiscardRunes(reader *bufio.Reader, n int) (int, error) {
	if n < 0 {
		return 0, bufio.ErrNegativeCount
	}
	if n == 0 {
		return 0, nil
	}
	totalDiscardBytes := 0
	runeCount := 0
	for runeCount < n {
		if reader.Buffered() == 0 {
			// The buffer is empty (initial state or fully consumed):
			// Peek(1) triggers a fill.
			_, err := reader.Peek(1)
			if err != nil {
				// No more data, e.g. io.EOF before n runes were skipped.
				return totalDiscardBytes, err
			}
		}
		bufLen := reader.Buffered()
		if bufLen == 0 {
			// Defensive check to avoid an infinite loop.
			return totalDiscardBytes, ErrReaderBufIsEmpty
		}
		// Peek returns a slice of the internal buffer without allocating,
		// so peeking the whole buffer is cheap even when n is small.
		buf, err := reader.Peek(bufLen)
		if err != nil {
			return totalDiscardBytes, err
		}
		byteCount := 0
		fullRuneFlag := true
		for byteCount < len(buf) {
			r, byteSize := rune(buf[byteCount]), 1
			if r >= utf8.RuneSelf {
				byteSize = RuneByteLen(buf[byteCount:])
				if byteSize == -1 {
					// The last rune is split across the buffer boundary.
					fullRuneFlag = false
					break
				}
			}
			byteCount += byteSize
			runeCount++
			if runeCount >= n {
				break
			}
		}
		discardBytes, err := reader.Discard(byteCount)
		totalDiscardBytes += discardBytes
		if err != nil {
			return totalDiscardBytes, err
		}
		if !fullRuneFlag {
			err = peekForFill(reader)
			if err != nil {
				return totalDiscardBytes, err
			}
		}
	}
	return totalDiscardBytes, nil
}

const UTF8MaxByte = 4

// peekForFill triggers a fill of the reader's internal buffer when the
// buffered bytes end in an incomplete rune. The number of extra bytes
// needed is unknown in advance, so it peeks one more byte at a time, up
// to the maximum UTF-8 encoding length.
func peekForFill(reader *bufio.Reader) error {
	for i := 0; i < UTF8MaxByte; i++ {
		bytes, err := reader.Peek(i + 1)
		if err != nil {
			return err
		}
		byteSize := RuneByteLen(bytes)
		if byteSize != -1 {
			return nil
		}
	}
	return nil
}

// RuneByteLen returns the number of bytes of the first UTF-8 encoding in p.
// If p is empty it returns 0. If p does not begin with a complete rune
// (the encoding is truncated), it returns -1. If the encoding is invalid,
// encodes a rune that is out of range, or is not the shortest possible
// encoding for the value, it returns 1, matching utf8.DecodeRune's
// treatment of invalid input. No other validation is performed.
func RuneByteLen(p []byte) (size int) {
	n := len(p)
	if n < 1 {
		return 0
	}
	// Handle the case where the buffer ends in an incomplete rune.
	if !utf8.FullRune(p) {
		return -1
	}
	p0 := p[0]
	x := first[p0]
	if x >= as {
		return 1
	}
	sz := int(x & 7)
	accept := acceptRanges[x>>4]
	if n < sz {
		return 1
	}
	b1 := p[1]
	if b1 < accept.lo || accept.hi < b1 {
		return 1
	}
	if sz <= 2 { // <= instead of == to help the compiler eliminate some bounds checks
		return 2
	}
	b2 := p[2]
	if b2 < locb || hicb < b2 {
		return 1
	}
	if sz <= 3 {
		return 3
	}
	b3 := p[3]
	if b3 < locb || hicb < b3 {
		return 1
	}
	return 4
}
This proposal has been declined as retracted. |
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

What did you do?
In mesh service traffic management, services often need to parse request messages, and protocol parsing performance is the bottleneck. Therefore, a feature was added to the decoder of the hessian-go library to extract only the values matching given rules, avoiding parsing the entire request message. In the hessian protocol, the length recorded for each data type is the number of characters (runes), not the number of bytes. For data that does not match the rules, bufio.DiscardRunes is used to skip useless data and improve parsing performance. Discard can only skip a number of bytes, which is a problem for UTF-8 data, so bufio.DiscardRunes and utf8.RuneByteLen are proposed to meet the demand.

What did you expect to see?

bufio.Reader gains DiscardRunes(n int) for skipping data whose length is given in characters, and utf8 gains RuneByteLen for getting the number of bytes of the first UTF-8 encoding in a byte slice.

What did you see instead?
DiscardRunes alone gives a 12% performance improvement; DiscardRunes plus RuneByteLen gives a 31% performance improvement.