Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bytes: appending to a single slice from Split output can affect other slices of the output #21149

Closed
JelteF opened this issue Jul 24, 2017 · 30 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@JelteF
Copy link
Contributor

JelteF commented Jul 24, 2017

What version of Go are you using (go version)?

Go 1.8.3

What operating system and processor architecture are you using (go env)?

Linux

What did you do?

https://play.golang.org/p/_6HotFbp1V

What did you expect to see?

To see that variable b remained unchanged and was still a slice containing the b character.

What did you see instead?

The slice in variable b has changed to contain the character a.

Proposed solution

A core solution to this problem would be setting the capacity on the slices coming out of bytes.Split, such that the resulting capacites:

https://play.golang.org/p/VonjdLUymU

This was proposed by https://www.reddit.com/r/golang/comments/6p88m9/golang_slices_gotcha/dknvsxp/

@cznic
Copy link
Contributor

cznic commented Jul 24, 2017

A core solution to this problem would be setting the capacity on the slices coming out of bytes.Split, such that the resulting capacites:

That'd make the result "append-safe", but at the cost of worse performance for code that does the same while aware of/insensitive to the behind-the-scenes append semantics.

Modifying the backing array through a view of it must be always well though out when that backing array is shared. It is an easy trap to fall for in the beginning, admitted, but hopefully the first lesson is enough to learn the proper technique.

@JelteF
Copy link
Contributor Author

JelteF commented Jul 24, 2017

Yes, it is true it would result in worse performance in the cases you mention. However, in this case I would say it's a decent trade of. Actually using the current behaviour consciously, seems almost impossible to me, especially because a new slice will be allocated anyway if too much is appended. Cases where it's insensitive are definitely possible, when only using one of the slices for instance. However, these are very vulnerable to the same issue when in future iterations other slices are also used.

Furthermore, it makes sense for regular slice views to have the full capacity for performance reasons, because there's no reason to believe that there will be other living slices to the same buffer (expecting not to reuse the original slice). For bytes.Split this does not hold, as it returns multiple views referencing the same buffer. It doesn't make sense to assume that the capacity of all of those views is until the end of the original slice. It makes much more sense to say the length is their capacity (possibly including the seperator).

If eventually it is decided that the current behaviour is the desired behaviour (which I hope won't be the case). I would still consider this a documentation bug in the bytes.Split documentation. Nowhere it says that appending to one of the slices can have effect on the other slices:

Split slices s into all subslices separated by sep and returns a slice of the subslices between those separators. If sep is empty, Split splits after each UTF-8 sequence. It is equivalent to SplitN with a count of -1.
You could say this is implicit, but I for one would definitely have benefited from making it explicit.

@josharian josharian added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Jul 24, 2017
@josharian josharian changed the title Appending to a single slice from bytes.Split output can affect other slices of the output bytes: appending to a single slice from Split output can affect other slices of the output Jul 24, 2017
@djadala
Copy link
Contributor

djadala commented Jul 25, 2017

IMO this works as expected, if you want copy of slice, then copy it explicitly (or via setting capacity as second example)

@cznic
Copy link
Contributor

cznic commented Jul 25, 2017

Wrt: 'NeedsDecision'. The proposed "fix" is not Go1 guarantee compatible and the current behavior is not a bug.

I would still consider this a documentation bug in the bytes.Split documentation. Nowhere it says that appending to one of the slices can have effect on the other slices.

Nor it says it can lead to nuking an allied country. It's not possible to exhaustively document what will not happen. The behavior is common to all slices unless explicitly treated as in s[:len(a):len(s)]. IMO definitely working as intended. It's a programmers mistake to mutate the shared backing array.

@JelteF
Copy link
Contributor Author

JelteF commented Jul 25, 2017

@djadala, the proposed solution doesn't make copies it still make slices. The only difference is that it would make them safe to append to.

@cznic From what I understand it's compatible with the Go1 guarantee as this is an implementation detail:

Unspecified behavior. The Go specification tries to be explicit about most properties of the language, but there are some aspects that are undefined. Programs that depend on such unspecified behavior may break in future releases.
...
Use of package unsafe. Packages that import unsafe may depend on internal properties of the Go implementation. We reserve the right to make changes to the implementation that may break such programs.

Otherwise it also wouldn't be possible for go to change the algorithm with which the growth factor of new slices is determined.

It's a programmers mistake to mutate the shared backing array.

With bytes.Split you would always get a shared backing array. If you're saying it's a mistake to mutate it (which I agree with), why not defend the programmer a bit to make sure you can only mutate it inside each slice it's own section of the shared backing array. There's no performance and behaviour difference when not appending. When appending it only changes performance a bit and the original behaviour is so hard to predict that I cannot imagine it's used for anything useful.

@ghost
Copy link

ghost commented Jul 25, 2017

@cznic

That'd make the result "append-safe", but at the cost of worse performance for code that does the same while aware of/insensitive to the behind-the-scenes append semantics.

Have you any examples of such code in mind? Did you benchmark the append-safe version against the append-unsafe one?

The proposed "fix" is not Go1 guarantee compatible

Setting the capacities inside bytes.Split is Go1 compatible (it falls under the "Unspecified behavior" category).

@cznic
Copy link
Contributor

cznic commented Jul 25, 2017

@JelteF

From what I understand it's compatible with the Go1 guarantee as this is an implementation detail:

Implementation detail of what? Your quote mentions UB and unsafe.Pointer, neither seems relevant here. What I'm talking about is, I believe, specified behavior, particularly

If the capacity of s is not large enough to fit the additional values, append allocates a new, sufficiently large underlying array that fits both the existing slice elements and the additional values. Otherwise, append re-uses the underlying array. (src).

  1. Let me assume it's not reasonable to expect bytes.Split to really split its argument into slices with separate backing arrays. That really would be an implementation detail, but it would make the function performance-wise disqualifying for stdlib.

  2. Go programs are allowed to expect the the bytes.Split result items to not be limited by the three index slicing operation, simply because the Go1 guarantee predates the three index slicing introduced few year later.

The proposal can break existing Go programs and above I tried to explain why I think this does fall under the compatibility guarantee.

@opennota

Have you any examples of such code in mind?

I don't. Of course that does not mean no such code exists in the wild.

Did you benchmark the append-safe version against the append-unsafe one?

No need to benchmark. Appending one byte to a byte slice is O(1) if len(s) < cap(s), O(n) otherwise. The propsal can change the class of the append operation on the result element, and n can be arbitrary large.

@JelteF
Copy link
Contributor Author

JelteF commented Jul 25, 2017

Implementation detail of what?

I meant that it is unspecified behaviour of bytes.Split, the docs do not mention this.

  1. Go programs are allowed to expect the the bytes.Split result items to not be limited by the three index slicing operation, simply because the Go1 guarantee predates the three index slicing introduced few year later.

I think this reasoning is flawed. Even without the three slicing operation the stdlib could do this, by modifying the capacity in the slice header using unsafe.

The proposal can break existing Go programs and above I tried to explain why I think this does fall under the compatibility guarantee.

I don't. Of course that does not mean no such code exists in the wild.

What programs would it break? Ones doing this on purpose? I cannot imagine what these would be. These theoretical programs would be super error prone anyway because depending on how much you append you get different behaviour. If you append too much it doesn't overwrite the other the following slice: https://play.golang.org/p/EoSShvpBgr

No need to benchmark. Appending one byte to a byte slice is O(1) if len(s) < cap(s), O(n) otherwise. The propsal can change the class of the append operation on the result element, and n can be arbitrary large.

This is true, but I believe this is a fair tradeof. Because fast but wrong behaviour is worse than slow but correct behaviour.

@cznic
Copy link
Contributor

cznic commented Jul 25, 2017

What programs would it break? Ones doing this on purpose? ...

The compatibility guarantee is about correct, valid programs continue to do what they did before unless -
a list of exceptions follows. Doing something on purpose or classifying a program as theoretical is not on that list, is it?

Anyway, code like

for _, v := range bytes.Split(s, []byte{foo}) {
        if condition {
                return append(v, bar...)
        }
}

seems perfectly normal to me.

This is true, but I believe this is a fair tradeof. Because fast but wrong behaviour is worse than slow but correct behaviour.

I concur, but making all correct code, like above, slower to bypass a bug in someone else's code is IMO not reasonable.

@JelteF
Copy link
Contributor Author

JelteF commented Jul 25, 2017

The compatibility guarantee is about correct, valid programs continue to do what they did before unless -
a list of exceptions follows. Doing something on purpose or classifying a program as theoretical is not on that list, is it?

I think you are right about this. (although maybe you could put it as a security issue, because it could cause buffer overflows in user buffers)

However, the code you posted would not be broken. It would still return the same thing, it would just be a bit slower. Performance is not part of the of the Go1 guarantee:

Finally, although it is not a correctness issue, it is possible that the performance of a program may be affected by changes in the implementation of the compilers or libraries upon which it depends. No guarantee can be made about the performance of a given program between releases.

I concur, but making all correct code, like above, slower to bypass a bug in someone else's code is IMO not reasonable.

I think this final statement is a matter of opinion, and I do not agree with yours in this case. I would love to hear some other opinions.

Finally if it's not deemed possible or desired to change the Split function. I would definitely like a warning in the docs though (also with SplitN). I think it's good to warn people in cases like this where mistakes are easy to make. The same is done for the append builtin saying you should always save the result:

It is therefore necessary to store the result of append, often in the variable holding the slice itself

Maybe a new function could even be added where the output is append safe, named SplitAppendSafe(After)(N) or something similar. Because I think I might not be the only person that would like to have this behaviour.

@cznic
Copy link
Contributor

cznic commented Jul 25, 2017

I would definitely like a warning in the docs though (also with SplitN).

It's not specific to the results of Split/SplitN. In the general case, every append statement in a program can cause a change in some other slice viewing the underlying backing array of append's first argument. Documenting it wherever a slice appears in the stdlib is probably a non-starter.

Append has side effects. Side effect can be welcome when intended and evil when unconsidered. I'm afraid the only help is to know and understand the side effects well. For sure, I too was bitten by append's semantics. I guess more than once ;-)

@james-antill
Copy link

james-antill commented Jul 25, 2017

It's not specific to the results of Split/SplitN. In the general case, every append statement in a program can cause a change in some other slice viewing the underlying backing array of append's first argument

This is misleading, at best. Yes, Join() (and other generic functions) can't assume that the slice views it gets are non-overlapping and/or alterable. However Split/SplitN/Fields create all of the sliced views, so why not control them so altering doesn't break?

Realistically the standard recommendation will be to create code like Join() does where you never alter the data returned by Split(), because it can't be trusted ... so I don't see how fixing this is a performance regression.

Documenting it wherever a slice appears in the stdlib is probably a non-starter.

Nobody is asking for that.
Almost no stdlib APIs return a [][], regexp.FindSubMatch()/etc. looks like it has a similar problem to Split() and those APIs should have the same warning for the same reasons.

The only other ones I can find are csv/suffixarray and both are safe to alter.

@ianlancetaylor
Copy link
Contributor

I think it would be very useful to see examples of real programs where the current behavior of bytes.Split caused a bug during development. It would also be very useful to see examples that depend on the current behavior.

I'm somewhat sympathetic to the change; we might have designed bytes.Split to work that way if we had had three-index slices at the time that bytes.Split was written. But changing it today would clearly break compatibility in a way that seems to me to be hard to detect. It's a subtle change, and we could perhaps consider making it if the current behavior is causing bugs in real programs, and if nobody is relying on the current behavior. But in the absence of any examples I think my vote would have to be to, unfortunately, simply better document the current behavior.

@JelteF
Copy link
Contributor Author

JelteF commented Jul 25, 2017

@ianlancetaylor The case where I ran into it went something like this:

  1. Store a value in RocksDB formatted prop1:unusedprop:prop2 (props are known not to have :)
  2. Retrieve key and split on : store the prop1 and prop2 in a struct (even copying the input to Split for safety)
  3. Everything is fine.
  4. Couple of weeks later decide there should be a prop3 field with prop1 plus some extra data (in a much later stage than the original Splitting).
  5. Replace prop1 with a copy of prop1.
  6. Save append(prop1, extradata...) in prop3
  7. When extradata is of a certain length (longer than unusedprop) it suddenly starts appearing in prop2.

@cznic
Copy link
Contributor

cznic commented Jul 25, 2017

Save append(prop1, extradata...) in prop3

Yep, it should be append(prop1[:len(prop1):len(prop1)], extradata...).

@JelteF
Copy link
Contributor Author

JelteF commented Jul 25, 2017

@cznic There's like a 10 different solutions. That wasn't really the point. The point was to show what kind of bug this caused for me.

@cznic
Copy link
Contributor

cznic commented Jul 25, 2017

There's like a 10 different solutions.

That was just a "for the record" for future readers.

BTW: Wrt appending to a slice which backing array must not be mutated. I know of only one reasonable solution. Can you please share some others?

@JelteF
Copy link
Contributor Author

JelteF commented Jul 25, 2017

Ok, 10 was a bit exagarated but:

  1. I could have changed on which slice I did the append, if I did it on the copied one all would be fine
  2. You could set the capacity when getting the slices back from Split
  3. You can copy all slices coming from the Split function.
  4. Use string instead of []byte

@james-antill
Copy link

I think it would be very useful to see examples of real programs where the current behavior of bytes.Split caused a bug during development

This is quite hard to search for, you need usage of bytes.Split() followed by using append() on the return value ... then it needs to be live (or you find the commit of when it was fixed).

I found:

#19693

https://stackoverflow.com/questions/37281626/strange-behavior-when-appending-to-a-2d-slice

This looks bad, but might be currently safe:

https://github.com/gitchain/gitchain/blob/2baefefd1795b358c98335f120738b60966fa09d/git/objects.go#L221

This fails, sometimes, if line 34 is removed (is that an accident or is it known to have a dual purpose):

https://github.com/cpuguy83/go-md2man/blob/23709d0847197db6021a51fdb193e66e9222d4e7/md2man/roff.go#L24

This works because they do it on the last element:

https://github.com/kyoh86/richgo/blob/b5351b59507f266319400d92db9ed195c6aec9b3/editor/editor.go#L42

It would also be very useful to see examples that depend on the current behavior.

This is a subset of the first, where it's an intentional use of the feature. For what it's worth I couldn't find anything, although lots of code doesn't hit the bug by converting to strings, using an empty slice as an initial result or calling something like bytes.Replace() ... all of which hint that very few people are micro-optimizing in this way.

@ianlancetaylor ianlancetaylor added this to the Go1.10 milestone Jul 25, 2017
@ianlancetaylor
Copy link
Contributor

@james-antill Thanks very much for doing this research.

@JelteF
Copy link
Contributor Author

JelteF commented Jul 25, 2017

@james-antill good finds! Do I understand correctly that you also find commits that fix bugs related to this, because those might still be relevant. Since even though they're fixed now people still had bugs because of the current behaviour.

@james-antill
Copy link

james-antill commented Jul 26, 2017

Do I understand correctly that you also find commits that fix bugs related to this

No, I tried to search for commits that fixed things ... but had a hard time even getting just commits in the results from google. codesearch doesn't do it AFAIK. github makes that easy but the language selector doesn't work for commits and it has the terrible feature of showing you the same commit for each fork/branch ... which doesn't help when golang itself has 4k forks.

@christophberger
Copy link

May I chime in with a silly suggestion.

The bug report makes a valid point. It is the nature of bytes.Split() to return slices that are adjacent to each other in memory, often separated by only a single byte. Having append() overwriting subsequent slices can indeed be a source of subtle errors, and I have yet to see a convincing use case for appending into a subsequent slice.

Yet the counter arguments are equally valid. Performance has always been a primary goal in Go, and so has the compatibility promise.

So as a compromise, how about introducing a new set of functions, SafeSplit() and SafeSplitN() that return slices with cap set to len. Stdlib users then can decide which behavior they prefer for their particular scenario.

@JelteF
Copy link
Contributor Author

JelteF commented Sep 16, 2017

@christophberger this solution would be fine by me as well. I also like it more than just putting the current behaviour in the documentation, since now the documentation could also point to an easy to use solution.

@everyone This is my first issue at the Go repo, so what is the correct way to preceed now? It doesn't seem there are any other arguments to be made. So I think a decision should be made between the 4 choices (I put emojis behind them so it's possible to vote on them):

  1. Change the existing functions to return slices with non overlapping capacity 👍
  2. Create Safe versions of all the functions, which wil return slices with non overlapping capacity 😄
  3. Document the current behaviour, and state that appending is not safe to do 😕
  4. Keep everything like it is 👎

@rsc
Copy link
Contributor

rsc commented Oct 30, 2017

OK, let's try this and roll it back if people start reporting ways that their code depends on the old behavior that we didn't expect.

@rsc rsc added the NeedsFix The path to resolution is known, but the work has not been done. label Oct 30, 2017
@gopherbot gopherbot removed the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Oct 30, 2017
@rsc
Copy link
Contributor

rsc commented Oct 30, 2017

CL welcome but note that we're up against the Go 1.10 freeze deadline.

@JelteF
Copy link
Contributor Author

JelteF commented Oct 30, 2017

I'm up for providing a CL, but I don't think it will be in time for the 1.10 freeze.

@gopherbot
Copy link

Change https://golang.org/cl/74530 mentions this issue: bytes: return slices where capacities don't overlap

@JelteF
Copy link
Contributor Author

JelteF commented Oct 30, 2017 via email

@gopherbot
Copy link

Change https://golang.org/cl/74510 mentions this issue: bytes: set cap of slices returned by Split

@golang golang locked and limited conversation to collaborators Nov 3, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

9 participants