Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

go/token: add FileSet.RemoveFile method #53200

Closed
adonovan opened this issue Jun 2, 2022 · 12 comments
Closed

go/token: add FileSet.RemoveFile method #53200

adonovan opened this issue Jun 2, 2022 · 12 comments

Comments

@adonovan
Copy link
Member

adonovan commented Jun 2, 2022

Background: A token.FileSet holds a collection of non-overlapping token.Files and provides a mapping from token.Pos integers to position information (file/line/column/offset). This is analogous to the way an address space containing several non-overlapping file mappings defines the meaning of a numeric pointer value. Applications that parse Go files are expected to use a single FileSet to provide meaning to all the token.Pos values in all the ASTs; in this way, the AST nodes themselves needn't include a pointer to the mapping information, as it can be implicitly supplied by the contextual FileSet.

The problem: this design leads applications to create a single FileSet that is effectively a ubiquitous global variable. A long-running application may encounter and parse an unbounded stream of files, calling AddFile for each one, causing its set of Files to grow without bound. The FileSet retains each File indefinitely, even after the application no longer cares about it. The memory footprint of a File is approximately one int per line of the file. Go source files in the standard library have an average of 371 lines, leading to 3KB of wasted space per file. Our application (gopls) parses a file at least once per keystroke of a Go developer's editor.

I can think of a few ways the memory usage of FileSet could be reduced.

The first approach is to move away from the concept of a single central FileSet in an application. Instead one would create a new FileSet for every parse, so that FileSets and Files are typically in 1:1 correspondence. An application could, as needed, create transient FileSets containing only the small subset of existing Files necessary for a given operation, such as invoking the type checker. However, this would still require all those Files to be non-overlapping, which requires central management of the "next free address" variable (like sbrk(2) in the mmap analogy) that should be provided in each call to FileSet.AddFile(base). It would also require that every call to Parse go through this central management, which might be practical within an application but doesn't seem feasible if libraries with published APIs are considered.

A second approach would be for the parser to save the token.File in every node of the AST so that there is never any need to maintain or consult an external FileSet data structure: each AST node's (*File, Pos) information would be complete. This approach however requires that every place that today uses a token.Pos to refer to a file position (relative to some implied FileSet) would need to be changed to a (*File, Pos) pair. For example, in errors from the type checker, and in diagnostics from the analysis framework. This seems like a major incompatible change to existing libraries.

The third, and simplest, approach, is to add a (*FileSet).RemoveFile(*File) method that removes a File from the set so that subsequent queries such as Pos, File, and Iterate return a negative result---like munmap in the mmap analogy. The liberated portion of the address space would not be re-used, but the File's memory could be reclaimed. As with munmap, an application would be responsible for remembering to call RemoveFile at an appropriate time, and not too soon. I propose we take this approach. The change to FileSet is very simple and low-risk.

@findleyr @griesemer

@gopherbot
Copy link

Change https://go.dev/cl/410114 mentions this issue: go/token: add (*FileSet).RemoveFile(*File) method

@adonovan
Copy link
Member Author

adonovan commented Jun 2, 2022

A fourth approach (a variant on the first): FileSet.base is replaced by an actual global variable in the token package, shared by all FileSets; the base value provided to FileSet.AddFile is ignored. This ensures that all token.Files are inherently non-overlapping, permitting any subset of them to be collected in a transient FileSet. In other words, any collection of FileSets would be consistent in their (successful) Pos->Position mappings, even if they represent different subsets of files. This solves the "central management" problem.

In theory, a downside of this approach is that the global counter would grow faster than each individual FileSet.base, but (a) in practice today there is only one FileSet in the application (see above!), and (b) even if that were not the case, we would still be in no danger of overflowing an int---not on a 64-bit machine at least.

@timothy-king
Copy link
Contributor

Another variant of the first could be to explicitly share a base value between FileSets. The FileSet could carry this around. The existing FileSet APIs could start from an unshared base. This is essentially just the fourth option but pass around a [thread safe?] *int on FileSet creation instead of using a global. Like the fourth option the really safe way to share a base with multiple competing parallel FileSets is to pass in a negative base to AddFile.

Not sure either this or the fourth option would be okay on 32-bit machine for long running processes like gopls.

@adonovan
Copy link
Member Author

adonovan commented Jun 2, 2022

Another variant of the first could be to explicitly share a base value between FileSets. The FileSet could carry this around. The existing FileSet APIs could start from an unshared base. This is essentially just the fourth option but pass around a [thread safe?] *int on FileSet creation instead of using a global. Like the fourth option the really safe way to share a base with multiple competing parallel FileSets is to pass in a negative base to AddFile.

The value of approach 4 is it solves the "central management" problem without needing to update every caller of Parse, whereas the explicit-shared-based approach would still require each caller of Parse to participate.

Not sure either this or the fourth option would be okay on 32-bit machine for long running processes like gopls.

On a 32-bit machine, gopls would run out of memory long before it could even attempt to parse 2GiB worth of Go source files. ;-)

@timothy-king
Copy link
Contributor

My understanding is that we are worried about 1 long running process (gopls) creating a File per keystroke and exhausting the available Pos values. If this understanding is wrong, the rest of this comment won't mean much.

I did a back of the envelope estimate, which came out to about 16 hours of work to overflow a global base on a 32-bit machine. 1<<31 Positions * 1/(11074 Positions/AddFile)* 1 keystroke/AddFile * 1/(200 keystroke/minute) which is roughly 16 hours. 11074 comes from the average file size in the Go standard library. 1 keystroke/AddFile comes from my understanding of "parses a file at least once per keystroke of a Go developer's editor". 200 keystrokes/minute is a guess from CPM type speed estimates (https://www.toptenreviews.com/what-is-a-good-typing-speed). Unless the assumptions are +1 order of magnitude off, seems like not enough time for a gopls process.

@adonovan
Copy link
Member Author

adonovan commented Jun 2, 2022

My understanding is that we are worried ... exhausting the available Pos values.

Your analysis is correct but, no, I'm not at all worried about running out of address space. No-one has ever worked at 200 keystrokes/minute for 16 hours straight, and I'm pretty sure gopls would crash at least once if they tried. It also needs more than 2GB of RAM for most workloads, and no laptop is a 32-bit machine.

What I'm worried about is actual memory consumed by the newline index tables.

@findleyr
Copy link
Contributor

findleyr commented Jun 2, 2022

For arguments sake, keep in mind that it's not the average file size that matters, but the size of the file that is changing. I am sure we could find a real world example of a very large file in an otherwise small workspace, that exhausts the 32 bit address space in under an hour (or perhaps a few minutes!). But as Alan suggests, this development scenario is unlikely for a variety of reasons, and in practice such a workflow would encounter other problems.

@ianlancetaylor ianlancetaylor changed the title go/token: FileSet grows without bound proposal: go/token: add FileSet.RemoveFile method Jun 8, 2022
@gopherbot gopherbot added this to the Proposal milestone Jun 8, 2022
@ianlancetaylor ianlancetaylor added this to Incoming in Proposals (old) Jun 8, 2022
@rsc
Copy link
Contributor

rsc commented Jun 15, 2022

Adding to minutes. It does seem like there are many many parts of the go/* packages that assume batch mode and are stressed near breaking by interactive use. I wonder if we should be looking at all of them together instead of just one API at a time.

@rsc rsc moved this from Incoming to Active in Proposals (old) Jun 15, 2022
@rsc
Copy link
Contributor

rsc commented Jun 15, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@findleyr
Copy link
Contributor

It does seem like there are many many parts of the go/* packages that assume batch mode and are stressed near breaking by interactive use.

That's definitely true, but I think we should decouple this proposal from larger changes. It will be a long time before we can stop using token.FileSet in gopls, and yet this unavoidable memory leak has very real consequences for our users today. (Consider that, while working on a large file with ~10k lines, gopls can leak 100kb per keystroke).

@rsc
Copy link
Contributor

rsc commented Jul 1, 2022

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

@rsc rsc moved this from Active to Likely Accept in Proposals (old) Jul 1, 2022
@rsc rsc moved this from Likely Accept to Accepted in Proposals (old) Jul 13, 2022
@rsc
Copy link
Contributor

rsc commented Jul 13, 2022

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

@rsc rsc changed the title proposal: go/token: add FileSet.RemoveFile method go/token: add FileSet.RemoveFile method Jul 13, 2022
@rsc rsc modified the milestones: Proposal, Backlog Jul 13, 2022
@adonovan adonovan self-assigned this Aug 15, 2022
@golang golang locked and limited conversation to collaborators Aug 16, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
No open projects
Development

No branches or pull requests

6 participants