Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: io: flatten nested SectionReaders #63673

Open
danderson opened this issue Oct 22, 2023 · 7 comments
Open

proposal: io: flatten nested SectionReaders #63673

danderson opened this issue Oct 22, 2023 · 7 comments
Labels
Milestone

Comments

@danderson
Copy link
Contributor

danderson commented Oct 22, 2023

What

Make io.NewSectionReader check if the provided ReaderAt is a SectionReader, and if so return a SectionReader that operates directly on the underlying ReaderAt. In other words, this code would result in a single SectionReader over bytes 120-130 of f, rather than 3 nested SectionReaders:

var f *os.File

sr1 := io.NewSectionReader(f, 100, 100)
sr2 := io.NewSectionReader(sr1, 0, 50)
sr3 := io.NewSectionReader(sr2, 20, 10)

// equivalent to:
sr3 := io.NewSectionReader(f, 120, 10)

Why

io.SectionReader is a way to carve up an io.ReaderAt into subsections. This is very handy when parsing file formats like Matroska, which heavily incentivize random seeking and contain a tree of size-delimited elements. A reasonable incremental parsing API for Matroska's basic framing looks like:

type Element struct {
  ID uint64
  Size uint64
  Data io.ReaderAt
}

func NextElement(r io.ReaderAt) (elem *Element, remaining io.ReaderAt, err error)

Leaving aside the gubbins of parsing Matroska framing, NextElement slices the provided ReaderAt into a pair of io.SectionReaders, one with the element's bytes and one for the remainder of the file. Each of these SectionReaders can be fed back into NextElement to either parse the children of the returned Element, or the Element's next sibling.

However, if you do this, currently you end up with a deep tree of nested SectionReaders, as each returned SectionReader get wrapped in further SectionReaders by subsequent calls. Each layer of the tree adds a small amount of overhead as it does offset and EOF math on progressively smaller chunks of the original ReaderAt.

Compatibility

This may be too trivial a change to warrant a proposal, since it doesn't make an obvious API change. However, it does interact with #61870 : once SectionReader.Outer is released to stable, flattening out intermediate SectionReaders is a semantic change:

sr := io.NewSectionReader(r, 42, 100)
sr2 := io.NewSectionReader(sr, 50, 10)

Without this proposal, sr.Outer() returns r, whereas sr2.Outer() returns sr. With this proposal, both return r (with appropriately adjusted offset/limit, of course). Further, in 1ce8a3f, Outer explicitly documents that it always returns the ReaderAt that was passed to NewSectionReader.

So, if this flattening change seems worth doing, it should happen before the next stable release, while Outer's semantics aren't yet set in stone by the compatibility promise.

Arguing against myself, I will also note that SectionReader.Outer combined with type assertions is sufficient for the caller (e.g. my Matroska parser) to do this un-nesting itself, with no stdlib changes. I'm making this proposal despite that, because it feels preferable to me for the stdlib to do this simplification by default, rather than require callers to watch out for pathological nesting of SectionReaders. There is precedent for this in bufio.NewReader, which has special handling for bufio.Reader inputs to avoid similar inefficient nesting.

@gopherbot gopherbot added this to the Proposal milestone Oct 22, 2023
@seankhliao seankhliao changed the title proposal: affected/package: proposal: io: flatten nested SectionReaders Oct 22, 2023
@Jorropo
Copy link
Member

Jorropo commented Oct 23, 2023

See race condition issues too: https://go-review.googlesource.com/c/go/+/500476

I think the simplest way to do that is to always return a new section reader and try the unwrapping in .Read, however this might be expensive or require extra new synchronisation, making the sectionreader object bigger.

@danderson
Copy link
Contributor Author

Can you explain where the race is in your change? I can't see the semantic subtlety, it seems to me that if the original nested SectionReader logic is race-free, then the change should be as well? It only touches r, base and limit from the original SectionReader, which all seem to be constant after construction. What am I missing?

@Jorropo
Copy link
Member

Jorropo commented Oct 23, 2023

https://go-review.googlesource.com/c/go/+/500476/comments/aef8d7b0_b285c620

However now that I checked, .Read and .Seek writes to .off, all the methods read from .base and .limit, the unwrapping also read .base and .limit so actually it looks fine.

@Jorropo
Copy link
Member

Jorropo commented Oct 23, 2023

I don't think this is possible due to this:

const s = "hello world"
b := strings.NewReader(s)
s1 := io.NewSectionReader(b, 3, len(s)-3)
s2 := io.NewSectionReader(b, 1, len(s)-4)

s1.Seek(2, io.SeekStart)
s2.Read(buf) // reads "world" because s1.base (3) + s1.off (2) + s2.base (1) = 6, so "hello " is chopped off.

@danderson
Copy link
Contributor Author

I don't think that's right. SectionReader only ever calls ReadAt on the underlying r, and Read and Seek are implemented in terms of ReadAt + locally tracking s.off. If you construct two nested SectionReaders today, their Seek/Read cursors are unrelated and are free to diverge. Decoupling the two SectionReaders from each other in NewSectionReader doesn't seem like it changes those semantics at all, from the user's POV both SectionReaders still behave identically.

In particular, in the example you gave, s2's read is not influenced by s1.off, because s2.Read invokes s1.ReadAt with an absolute position, and s1.ReadAt doesn't incorporate s1.off into its calculations.

@CAFxX
Copy link
Contributor

CAFxX commented Oct 24, 2023

However, it does interact with #61870 : once SectionReader.Outer is released to stable, flattening out intermediate SectionReaders is a semantic change:

sr := io.NewSectionReader(r, 42, 100)
sr2 := io.NewSectionReader(sr, 50, 10)

Without this proposal, sr.Outer() returns r, whereas sr2.Outer() returns sr. With this proposal, both return r (with appropriately adjusted offset/limit, of course). Further, in 1ce8a3f, Outer explicitly documents that it always returns the ReaderAt that was passed to NewSectionReader.

I don't think the two necessarily interact. Isn't it enough to just add an unexported field to store the original/non-flattened reader, and use that in Outer? We already added another field for that very reason in #61870.

Basically, the way it could work is that in NewSectionReader we store all the original values (for use by Outer) and then we attempt to flatten the readers and store the flattened result in a second set of fields, used by all the other methods.

@danderson
Copy link
Contributor Author

danderson commented Oct 25, 2023

We could do that, with the obvious cost of making the structure fatter for a relatively niche semantic difference. If we're forced to maintain compatibility, that's certainly an option, but I think I would advocate for adjusting the semantics of Outer to make the more efficient internals possible.

(That is, assuming I'm right that Outer hasn't shipped in a stable release yet. If it has, then we have to do what you propose)

For my particular use case in Matroska parsing, I think the two options are broadly equivalent because I expect the intermediate SectionReaders to have approximately the same lifetime as the child SectionReaders, so it's all about reducing processing cost rather than memory cost. That said, for other parsing use cases, it may be nice to not force the pinning of intermediate readers, so that the programmer is free to optimize for readability, even if it means throwing around a few more SectionReaders during the parse.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Incoming
Development

No branches or pull requests

4 participants