proposal: io: flatten nested SectionReaders #63673

danderson · 2023-10-22T19:58:21Z

What

Make io.NewSectionReader check if the provided ReaderAt is a SectionReader, and if so return a SectionReader that operates directly on the underlying ReaderAt. In other words, this code would result in a single SectionReader over bytes 120-130 of f, rather than 3 nested SectionReaders:

var f *os.File

sr1 := io.NewSectionReader(f, 100, 100)
sr2 := io.NewSectionReader(sr1, 0, 50)
sr3 := io.NewSectionReader(sr2, 20, 10)

// equivalent to:
sr3 := io.NewSectionReader(f, 120, 10)

Why

io.SectionReader is a way to carve up an io.ReaderAt into subsections. This is very handy when parsing file formats like Matroska, which heavily incentivize random seeking and contain a tree of size-delimited elements. A reasonable incremental parsing API for Matroska's basic framing looks like:

type Element struct {
  ID uint64
  Size uint64
  Data io.ReaderAt
}

func NextElement(r io.ReaderAt) (elem *Element, remaining io.ReaderAt, err error)

Leaving aside the gubbins of parsing Matroska framing, NextElement slices the provided ReaderAt into a pair of io.SectionReaders, one with the element's bytes and one for the remainder of the file. Each of these SectionReaders can be fed back into NextElement to either parse the children of the returned Element, or the Element's next sibling.

However, if you do this, currently you end up with a deep tree of nested SectionReaders, as each returned SectionReader get wrapped in further SectionReaders by subsequent calls. Each layer of the tree adds a small amount of overhead as it does offset and EOF math on progressively smaller chunks of the original ReaderAt.

Compatibility

This may be too trivial a change to warrant a proposal, since it doesn't make an obvious API change. However, it does interact with #61870 : once SectionReader.Outer is released to stable, flattening out intermediate SectionReaders is a semantic change:

sr := io.NewSectionReader(r, 42, 100)
sr2 := io.NewSectionReader(sr, 50, 10)

Without this proposal, sr.Outer() returns r, whereas sr2.Outer() returns sr. With this proposal, both return r (with appropriately adjusted offset/limit, of course). Further, in 1ce8a3f, Outer explicitly documents that it always returns the ReaderAt that was passed to NewSectionReader.

So, if this flattening change seems worth doing, it should happen before the next stable release, while Outer's semantics aren't yet set in stone by the compatibility promise.

Arguing against myself, I will also note that SectionReader.Outer combined with type assertions is sufficient for the caller (e.g. my Matroska parser) to do this un-nesting itself, with no stdlib changes. I'm making this proposal despite that, because it feels preferable to me for the stdlib to do this simplification by default, rather than require callers to watch out for pathological nesting of SectionReaders. There is precedent for this in bufio.NewReader, which has special handling for bufio.Reader inputs to avoid similar inefficient nesting.

The text was updated successfully, but these errors were encountered:

Jorropo · 2023-10-23T12:03:46Z

See race condition issues too: https://go-review.googlesource.com/c/go/+/500476 ✨

I think the simplest way to do that is to always return a new section reader and try the unwrapping in .Read, however this might be expensive or require extra new synchronisation, making the sectionreader object bigger.

danderson · 2023-10-23T18:43:42Z

Can you explain where the race is in your change? I can't see the semantic subtlety, it seems to me that if the original nested SectionReader logic is race-free, then the change should be as well? It only touches r, base and limit from the original SectionReader, which all seem to be constant after construction. What am I missing?

Jorropo · 2023-10-23T22:06:17Z

https://go-review.googlesource.com/c/go/+/500476/comments/aef8d7b0_b285c620

However now that I checked, .Read and .Seek writes to .off, all the methods read from .base and .limit, the unwrapping also read .base and .limit so actually it looks fine.

Jorropo · 2023-10-23T22:15:22Z

I don't think this is possible due to this:

const s = "hello world"
b := strings.NewReader(s)
s1 := io.NewSectionReader(b, 3, len(s)-3)
s2 := io.NewSectionReader(b, 1, len(s)-4)

s1.Seek(2, io.SeekStart)
s2.Read(buf) // reads "world" because s1.base (3) + s1.off (2) + s2.base (1) = 6, so "hello " is chopped off.

danderson · 2023-10-23T22:50:54Z

I don't think that's right. SectionReader only ever calls ReadAt on the underlying r, and Read and Seek are implemented in terms of ReadAt + locally tracking s.off. If you construct two nested SectionReaders today, their Seek/Read cursors are unrelated and are free to diverge. Decoupling the two SectionReaders from each other in NewSectionReader doesn't seem like it changes those semantics at all, from the user's POV both SectionReaders still behave identically.

In particular, in the example you gave, s2's read is not influenced by s1.off, because s2.Read invokes s1.ReadAt with an absolute position, and s1.ReadAt doesn't incorporate s1.off into its calculations.

CAFxX · 2023-10-24T05:28:40Z

However, it does interact with #61870 : once SectionReader.Outer is released to stable, flattening out intermediate SectionReaders is a semantic change:
sr := io.NewSectionReader(r, 42, 100)
sr2 := io.NewSectionReader(sr, 50, 10)
Without this proposal, sr.Outer() returns r, whereas sr2.Outer() returns sr. With this proposal, both return r (with appropriately adjusted offset/limit, of course). Further, in 1ce8a3f, Outer explicitly documents that it always returns the ReaderAt that was passed to NewSectionReader.

I don't think the two necessarily interact. Isn't it enough to just add an unexported field to store the original/non-flattened reader, and use that in Outer? We already added another field for that very reason in #61870.

Basically, the way it could work is that in NewSectionReader we store all the original values (for use by Outer) and then we attempt to flatten the readers and store the flattened result in a second set of fields, used by all the other methods.

danderson · 2023-10-25T17:59:07Z

We could do that, with the obvious cost of making the structure fatter for a relatively niche semantic difference. If we're forced to maintain compatibility, that's certainly an option, but I think I would advocate for adjusting the semantics of Outer to make the more efficient internals possible.

(That is, assuming I'm right that Outer hasn't shipped in a stable release yet. If it has, then we have to do what you propose)

For my particular use case in Matroska parsing, I think the two options are broadly equivalent because I expect the intermediate SectionReaders to have approximately the same lifetime as the child SectionReaders, so it's all about reducing processing cost rather than memory cost. That said, for other parsing use cases, it may be nice to not force the pinning of intermediate readers, so that the programmer is free to optimize for readability, even if it means throwing around a few more SectionReaders during the parse.

danderson added the Proposal label Oct 22, 2023

gopherbot added this to the Proposal milestone Oct 22, 2023

seankhliao changed the title ~~proposal: affected/package:~~ proposal: io: flatten nested SectionReaders Oct 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

proposal: io: flatten nested SectionReaders #63673

proposal: io: flatten nested SectionReaders #63673

danderson commented Oct 22, 2023 •

edited

Jorropo commented Oct 23, 2023

danderson commented Oct 23, 2023

Jorropo commented Oct 23, 2023

Jorropo commented Oct 23, 2023

danderson commented Oct 23, 2023

CAFxX commented Oct 24, 2023 •

edited

danderson commented Oct 25, 2023 •

edited

proposal: io: flatten nested SectionReaders #63673

proposal: io: flatten nested SectionReaders #63673

Comments

danderson commented Oct 22, 2023 • edited

What

Why

Compatibility

Jorropo commented Oct 23, 2023

danderson commented Oct 23, 2023

Jorropo commented Oct 23, 2023

Jorropo commented Oct 23, 2023

danderson commented Oct 23, 2023

CAFxX commented Oct 24, 2023 • edited

danderson commented Oct 25, 2023 • edited

danderson commented Oct 22, 2023 •

edited

CAFxX commented Oct 24, 2023 •

edited

danderson commented Oct 25, 2023 •

edited