Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: io/fs, filepath: add more efficient Walk alternative #41974

Closed
rsc opened this issue Oct 14, 2020 · 21 comments
Closed

proposal: io/fs, filepath: add more efficient Walk alternative #41974

rsc opened this issue Oct 14, 2020 · 21 comments

Comments

@rsc
Copy link
Contributor

rsc commented Oct 14, 2020

There are at least three problems with filepath.Walk:

  • The os.FileInfo passed to the callback keeps Walk from using the new, more efficient ReadDir API (os: add ReadDir method for lightweight directory reading #41467).
  • The function signature of the callback is a bit hard to remember - I always have to look it up.
  • The error handling, and in particular SkipDir, is a bit unusual and error-prone.

@kr's Walker API takes a different approach that solves all these, by providing an explicit iterator object instead of a callback-based model. I propose to adapt this API to provide an alternative to filepath.Walk and then also adopt this for io/fs instead of filepath.Walk.

The new API is:

// A Walker iterates over the directory entries of a file system subtree.
// The entries are walked in lexical order, which makes the traversal
// predictable but means that very large directories must be read
// in their entirety before any entries can be processed.
// (Note that Walker is itself an implementation of os.DirEntry.)
//
// Each directory is visited twice in the traversal:
// once before processing and once after processing.
// The two cases are distinguished by the Exit method.
type Walker struct {
	...
}

// NewWalker
func NewWalker(root string) (*Walker, error)

// Next advances the Walker to visit the next result in the traversal.
// It returns true on success, or false if the entire tree has been visited.
// Next must be called for each step in the traversal, including the first.
func (w *Walker) Next() bool

// Path returns the path the walker is currently visiting.
func (w *Walker) Path() string

// Rel returns the relative path from the root of the tree being walked
// to the entry being visited. The path is either "." or else a sequence of
// Separator-separated path elements that does not contain "." or "..".
func (w *Walker) Rel() string

// Name returns the final path element of the path being visited.
func (w *Walker) Name() string

// IsDir reports whether the path being visited is a directory.
func (w *Walker) IsDir() bool

// Type reports the type of the path being visited.
func (w *Walker) Type() os.FileMode

// Info returns the file information for the path being visited.
// It may return information obtained when the path's parent
// directory was read or (when visiting the root) when NewWalker was called,
// or it may return information obtained during the call to Info.
func (w *Walker) Info() (os.FileInfo, error)

// Exiting reports whether this step in the traversal marks the
// exiting of a directory.
// Each directory is visited twice by the Walker:
// before and after visiting the directory's children.
// During the first visit, Exiting returns false; during the second, it returns true.
// If the path being visited is not a directory, Exiting always returns false.
func (w *Walker) Exiting() bool

// Err returns any error encountered trying to visit the current path.
// Err can be non-nil before the initial call to Next,
// indicating a problem with the argument passed to NewWalker.
// In this case, the initial call Next will return false.
// Otherwise, Err can be non-nil when Exiting returns true,
// indicating a problem reading the directory being exited.
func (w *Walker) Err() error

// SkipDir instructs the walker to skip the remainder of the current directory.
// The subsequent call to Next will advance the walker to the next step
// in the traversal for which Exiting is true.
// That is:
//  - If IsDir() == true and Exiting() == false (entering a directory),
//     calling SkipDir skips over the children of that directory entirely.
//  - If IsDir() == false (visiting a file inside a directory),
//     calling SkipDir skips over the remaining siblings of the current file.
//  - If IsDir() == true and Exiting() == true (exiting a directory),
//     calling SkipDir skips over the remaining siblings of the current directory,
//     so that Next advances to exiting the parent of the current directory.
func (w *Walker) SkipDir()

An example:

w := NewWalker("/does/not/exist")
if err := w.Err(); err != nil {
	log.Print(err)
}
for w.Next() {
	if err := w.Err(); err != nil {
		log.Print(err)
	}
	if !w.Exiting() {
		fmt.Println(w.Rel())
	}
}

This API differs from github.com/kr/fs.Walker in that it reports both entry and exit from every directory. In contrast, kr/fs.Walker only reports an exit-directory step in case of a ReadDir error, and it provides no clear way to distinguish the second result (except implicitly that Info().IsDir() == false && Err() != nil indicates an “exiting” event, but that fact is undocumented).

The API also differs in that it provides all the os.DirEntry methods, where kr/fs.Walker provides only Info() os.FileInfo. A possible simplification of the API would be to provide DirEntry() os.DirEntry instead of Name, IsDir, Type, Info, but those are likely to be very commonly called, and forcing the user to write (say) w.DirEntry().IsDir() seems unnecessarily pedantic.

Rel is not strictly needed (nor included in kr/fs.Walk), but I've needed that result in almost every filepath.Walk call I've ever written. It can be derived from Path, but doing so is tricky.

And of course kr/fs.Walker's constructor is named Walk. We must use NewWalker instead because filepath.Walk is taken. For io/fs we may be able to call it Walk.

@rsc rsc added the Proposal label Oct 14, 2020
@rsc rsc added this to the Proposal milestone Oct 14, 2020
@rsc rsc added this to Active in Proposals (old) Oct 14, 2020
@jimmyfrasche
Copy link
Member

NewWalker should return an error instead of overloading the Err method: less to explain in the docs, harder to miss.

@bcmills
Copy link
Contributor

bcmills commented Oct 14, 2020

I think the Err and Exiting methods of this API will be too easy for the caller to either omit or invoke at the wrong time. (For example, I expect that many callers will forget the Exiting method and end up with unexpected duplicates at run-time, which may or may not be detected by the surrounding code.)

Unfortunately, I don't have the bandwidth to look into less error-prone alternatives to suggest at the moment.

@bcmills
Copy link
Contributor

bcmills commented Oct 14, 2020

I also think it's unfortunate that this Walker implements the DirEntry interface from #41467.

The DirEntry implementations returned by the proposed ReadDir are, from what I understand, immutable and durable, whereas the DirEntry implemented by Walker changes meaning every time Next is called. It will be easy for someone to accidentally assign the Walker to a DirEntry variable — or pass it to another function that expects the DirEntry to be durable — without noticing the mismatch.

@ianlancetaylor
Copy link
Contributor

As far as I can tell the Exiting method exists only to clearly report an error from ReadDir on a directory. Since errors are rare, I agree that this seems like an undesirable API complication. Personally I think it would be OK for the error case to return the directory a second name--same Name, same IsDir--but with a non-nil Err.

@ianlancetaylor
Copy link
Contributor

I'm not clear on why we need all of Path, Name, and Rel. Is Path an absolute path where Rel is a relative path?

@jimmyfrasche
Copy link
Member

I've needed the depth before. It would be pretty easy to wrap the walker to collect and report that. The same could be said for Path and Rel, though.

@bcmills
Copy link
Contributor

bcmills commented Oct 15, 2020

Giving this a bit more thought, I think the API I would prefer is something like:

// NewWalker returns a Walker that walks the subtree of fsys rooted at path root.
func NewWalker(fsys FS, root string) *Walker

// A Walker iterates over the entries in a filesystem tree.
type Walker struct {
	…
}

// Done reports whether the last valid entry returned was for
// the root of the walk.
func (w *Walker) Done() bool

// Next advances the walker to the next entry within the current directory
// and returns that entry.
// If no entries remain, Next returns a nil *WalkEntry and io.EOF.
// Otherwise, Next always returns a non-nil *WalkEntry.
//
// A new Walker includes only one entry, for the root of the walk itself.
// The first call to Next on a new Walker returns the root of the walk.
// If the root is not entered, the next call to Done will return true.
//
// If the root does not exist, the first call to Next returns a non-nil
// WalkEntry and an error for which os.IsNotExist returns true.
func (w *Walker) Next() (*WalkEntry, error)

// EnterDir causes the Walker to begin traversing the directory
// corresponding to the last entry returned by Next.
//
// EnterDir returns a non-nil error (and leaves the walker
// unchanged) if:
// the entries for the directory could not be identified,
// Next has not yet been called for the current directory,
// the last call to Next did not return a directory,
// or ExitDir was called after the last call to Next.
func (w *Walker) EnterDir() error

// ExitDir returns the WalkEntry for the current directory, and
// causes the walker to move back to its parent directory.
// The subsequent call to Next will return the next sibling (if any)
// of the returned entry.
//
// If the current directory was already the root of the walk,
// ExitDir returns a nil *WalkEntry with ok=false.
func (w *Walker) ExitDir() (parent *WalkEntry, ok bool)


// A WalkEntry is an entry within a directory encountered during a Walk.
type WalkEntry struct {
	…
}

// IsDir reports whether the entry describes a subdirectory.
func (e *WalkEntry) IsDir() bool

// Name returns the name of the entry within its parent directory.
func (e *WalkEntry) Name() string

// Path returns the path by which the walker reached this entry.
func (e *WalkEntry) Path() string

// Rel returns the relative path from the root of the tree being walked
// to the entry. The path is either "." or else a sequence of
// Separator-separated path elements that does not contain "." or "..".
func (e *WalkEntry) Rel() string

// Info returns the FileInfo for the file or subdirectory described by the entry.
func (e *WalkEntry) Info() (FileInfo, error)

The corresponding implementation of the example might look like:

	w, err := fs.Walk(os.FS(), "/does/not/exist")
	for !w.Done() {
		e, err := w.Next()
		if err == io.EOF {
			w.ExitDir()
			continue
		}

		if err != nil {
			log.Print(err)
		}
		log.Print(e.Rel())
		if e.IsDir() {
			if err := w.EnterDir(); err != nil {
				log.Print(err)
			}
		}
	}

That API addresses the same issues as the original proposal, but in my opinion provides more robust error-handling (since the entry cannot be accessed without also binding the error to either a variable or _), less risk of aliasing bugs (because the returned WalkEntry instances are not mutated in place), and a clearer overall traversal (because the directory transitions are explicit rather than implicit).

@rsc
Copy link
Contributor Author

rsc commented Oct 16, 2020

FWIW, the walker API I wrote originally has the benefit of matching existing patterns for iterators, specifically sql.Rows and reflect.MapIter.

@bcmills
Copy link
Contributor

bcmills commented Oct 16, 2020

That's true, but experiences with iterators like sql.Rows and bufio.Scanner are exactly why I'm concerned about error-handling mistakes and aliasing bugs.

Also, sql.Rows and reflect.MapIter both iterate over flat data structures, not trees, so they don't really set much precedent one way or the other for SkipDir/Exiting vs. EnterDir/ExitDir.

@bcmills
Copy link
Contributor

bcmills commented Oct 16, 2020

Thinking about the error-handling a bit more: probably Next doesn't need to return an error, because the only times it can fail are when entering a directory (covered by the error returned from EnterDir, although that one is unfortunately easy to accidentally drop) and when entering the root (which we could cover by returning an error from NewWalker).

Dropping the error from Next does mean that we can't read the directory entries lazily (because we have no way to report when reading fails), but perhaps that's fine because the ReadDir method returns a complete slice anyway.

@bcmills
Copy link
Contributor

bcmills commented Oct 16, 2020

Actually, I think that reinforces my concern about error-handling with the Err method as originally proposed.

The current proposal says:

// Otherwise, Err can be non-nil when Exiting returns true,
// indicating a problem reading the directory being exited.

But that seems surprising to me: I would expect errors to be possible only when entering directories, not when exiting them.

@bcmills
Copy link
Contributor

bcmills commented Oct 16, 2020

So, perhaps:

// Walk begins walking the subtree of fsys rooted at path root.
//
// The returned Walker initially contains only the root.
// If the root does not exist, Walk returns an empty, non-nil Walker
// and an error for which os.IsNotExist returns true.
func Walk(fsys FS, root string) (*Walker, error)

// Next advances the walker to the next entry within the current directory
// and returns that entry.
// If no entries remain, Next returns a nil *WalkEntry and ok=false.
//
// A new Walker includes only one entry, for the root of the walk itself.
// The first call to Next on a new Walker returns the root of the walk.
// If the root is not entered, the next call to Done will return true.
func (w *Walker) Next() (*WalkEntry, bool)

(with the remaining API per #41974 (comment).)

I'm still not quite happy about EnterDir() returning only an error because it's too easy to miss (#20803), but that's also true of a great many established Go APIs (such as io.Closer).

@jimmyfrasche
Copy link
Member

I had assumed that the error exiting the directory was only if the directory that was being read from no longer existed and that if it tried to open a directory that didn't exist it would just skip it

@bcmills
Copy link
Contributor

bcmills commented Oct 16, 2020

@jimmyfrasche, I would expect an error entering a directory to be possible due to a number of reasons, such as:

  • a permission error preventing the current user from listing the directory,
  • the entry not actually referring to a directory,
  • the entry being a directory on an unmounted filesystem, or on a networked filesystem for which a network error has occurred,
  • the entry being located in a corrupted sector of the disk

@jimmyfrasche
Copy link
Member

I guess in those cases the original api would enter the directory and on the next iteration immediately exit it and report the error?

@rsc
Copy link
Contributor Author

rsc commented Oct 16, 2020

Or we could just keep the callback and do a more surgical fix that only replaces FileInfo with DirEntry; see #42027 for that.

@gopherbot
Copy link

Change https://golang.org/cl/243916 mentions this issue: io/fs: add Walk

@earthboundkid
Copy link
Contributor

Is Exiting() bool necessary given that the old API didn't have it? Couldn't users who care just say if oldpath != w.Rel() { ... }?

@kr
Copy link
Contributor

kr commented Oct 20, 2020

Supposing for the moment that Exiting or something like it is worth having:

Looking at the docs for SkipDir as described, there are three cases, and two of those cases have the same behavior (skip subsequent siblings of the current entry). Maybe it would be better to have Entering instead of Exiting (and Entering would return false for files) so that there would only be two cases to describe for SkipDir. The return value of Entering would be sufficient to know what it will do (skip siblings vs don't descend).

And I think this would not make anything else already in this proposal impossible.

Just an idea.

@earthboundkid
Copy link
Contributor

I agree that this snippet is more intuitive than the alternative:

if w.Entering() {
   if matchesExcludePattern(w.Path()) {
       w.SkipDir()
       continue
   }
}
// ... it's not a new directory, do normal processing below

@rsc
Copy link
Contributor Author

rsc commented Oct 21, 2020

This is turning out a lot more complex than I thought it would - I thought it would be a simple matter of copying the kr/fs API.

And the primary goal was to allow use of the lazy ReadDir, but that can be done more directly by adding a filepath.WalkDir and fs.WalkDir that pass a DirEntry to the WalkFunc (#42027). That approach also has the benefit that code will be able to more easily update from filepath.Walk to filepath.WalkDir - the body of the WalkFunc may not need to change at all.

I'm retracting this proposal in favor of #42027. Thanks for all the feedback - it was very helpful.

@rsc rsc closed this as completed Oct 21, 2020
@rsc rsc moved this from Active to Declined in Proposals (old) Oct 21, 2020
@golang golang locked and limited conversation to collaborators Oct 21, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
No open projects
Development

No branches or pull requests

7 participants