Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: path/filepath: Add SameFile and InTree #42202

Open
rasky opened this issue Oct 25, 2020 · 4 comments
Open

proposal: path/filepath: Add SameFile and InTree #42202

rasky opened this issue Oct 25, 2020 · 4 comments

Comments

@rasky
Copy link
Member

rasky commented Oct 25, 2020

This proposal is alternative to #42201.

When working with pathnames on filesystems, it is common to inspect the relations between two different paths. In prototypes, it is common to assume that pathnames are uniques and thus two pathnames refer to the same file iff the pathnames are equal. Since many applications operate on tree of files, it is also common to check whether a file is part of a specified tree given its root; this can naively be done with strings.HasPrefix(path, root).

In real world, filesystems offer a variety of different types of links which can make things harder. On top of this, Windows and UNIX have different filesystems with different concepts and incompatible APIs. So finding a common semantic (which is one of the goals of path/filepath) is complicated, and Go does not provide useful primitives for this.

In other languages such as C, it became common under UNIX to use a function such as realpath(3) to obtain a so-called "canonical name" of a path. By using such primitive, it is technically possible to implement equality and containment tests in the following way:

  • Equality: realpath(path1) == realpath(path2)
  • Containment: realpath(leaf) starts with realpath(root)

(Notice that there is a different kind of containment test, in which we don't want to follow leaf links, but this can be done with something like realpath(dir(leaf)) starts with realpath(root) so it's not really worth defining a different operation).

There are a few problems with canonicalization:

  • It's not really clear what a canonical pathname is. There is no official definition for it. For instance, realpath does not follow bind mounts. On Windows, there is a syscall called GetFinalPathNameByHandle which provides 4 different canonicalization modes, and it's not immediately clear which one should be the one to use (moreover, to make things more complicated, not all modes work on all different kind of file objects).
  • In general, if a user provides a pathname, it expects the application to use that specific way of naming files or directories in prints, logs and error messages. An API to canonicalize filenames encourages an engineering approach in which applications might store canonicalized pathnames, that would provide surprising results if provided back to the users. Again, this is probably less important on UNIX systems, but on Windows it is common for a sysadmin to map network drives such as K:, while users have absolute no idea what \\server\share means.

My proposal is to avoid the temptation to define a canonicalization function, and rather provide helpers to implement pathname tests for applications, to cover their needs. Thus, I propose to add the following functions:

package filepath

// SameFile checks whether path1 and path2 refer to the same underlying file, after all links
// have been resolved. It is a thin wrapper over os.SameFile, that provides convenience.
// On UNIX, SameFile works across symlinks, hardlinks and bind mounts.
// On Windows, SameFile works across symlinks, hardlinks, junctions, drive mappings.
func SameFile(path1, path2 string) bool

// InTree checks whether the file or directory referenced by leaf is contained within the
// directory tree pointed by root. All links (in either leaf or root) are resolved before checking
// for containment.
// One typical case in which this function is useful is to implement a jail in a virtual filesystem,
// to make sure that no I/O is performed outside of the specified root.
// On UNIX, InTree works across symlinks, hardlinks, and bind mounts.
// On Windows, InTree works across symlinks, hardlinks, junctions, drive mappings.
func InTree(leaf, root string) bool

It seems to me that these two functions should provide two primitives that can be used to solve canonicalization problems. I would be interested in hearing use cases which are not covered by these two functions but are solved by realpath or GetFinalPathNameByHandle.

@gopherbot gopherbot added this to the Proposal milestone Oct 25, 2020
@ericwj
Copy link

ericwj commented Oct 25, 2020

Go imho still has use for a (set of) function(s) that resolve links and a proper, safe and recommendable way when this is accepted. I hence don't see this as an alternative to #42201, but as a very useful addition.

Also I think this API will perform poorly, which is fine if used sparingly, but will add up checking lots of files or directories in a very deep directory structure. For the high volume calling scenario, I would add the fundamentals on top of which these functions are built to the public API for that reason, such that the application has the perhaps at least twice faster alternative available if it prefers, although it would make the application code slightly more complex.

@ianlancetaylor ianlancetaylor added this to Incoming in Proposals (old) Oct 25, 2020
@ianlancetaylor
Copy link
Contributor

It seems to me that SameFile is the same as calling os.Stat twice followed by os.SameFile. Is there a more efficient way that it can be done in path/filepath?

@ericwj
Copy link

ericwj commented Oct 26, 2020

Indeed. InTree however will swamp Stat, potentially actually hitting the disk for it and that is totally not needed with some kind of context object that keeps track of a number of os.FileInfos for parent directories, or even just their file and volume ID. Using GetFileInformationByHandleEx with FileIdInfo instead of Stat on recent versions of Windows that have it will be even more efficient - and even required to automagically support ReFS volumes.

@rasky
Copy link
Member Author

rasky commented Oct 26, 2020

It seems to me that SameFile is the same as calling os.Stat twice followed by os.SameFile. Is there a more efficient way that it can be done in path/filepath?

Even if it was just a thin wrapper (which I'm not 100% sure yet), I think it's important to have it as part of path/filepath because equality of paths is a common scenario for which users will look for a solution there. I myself wasn't aware of the existence of os.SameFile until two weeks ago, and I've been importing third-party modules to have a realpath alternative to implement equality (even in a UNIX-only scenario) because I didn't know about it.

filepath.ToSlash and filepath.FromSlash are even thinner wrappers over strings.ReplaceAll so I think there's precedence there in wanting to provide common path manipulation functions irrespective of the actual complexity of building them on top of the rest of std.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Incoming
Development

No branches or pull requests

4 participants