x/tools/go/analysis: facts should be shareable between analyses #29616

dominikh · 2019-01-08T17:51:18Z

go/analysis compares facts to export data in traditional compilation. However, it declares facts as private to an analysis, meaning a single analysis must produce and consume facts, and no other analyses can consume these facts.

I find this very limiting. Many facts are useful to more than one analysis. For example, staticcheck computes these "facts" (albeit in a different framework) for all functions: purity and whether a function
is supposed to terminate. These facts don't serve one specific analysis, but are rather used by many analyses to either find bugs, or eliminate false positives. Similarly, the "printf wrapper" fact by vet
could prove useful to other analyses.

I believe that go/analysis currently offers no workaround for this. The only API for sharing data between analyses are results. Results are intended for sharing information concerning a single package
between analyses. One could imagine (ab)using results for sharing facts (breaking the assumption that results only concern a single package). However, this is not currently feasible. For one, no
function for enumerating all facts known to an analysis exists. Secondly, even if such a function did exist, making all facts available via a result (likely a map) would be a costly
materialization of said facts (which may be cached on disk, and of which only a small subset may be accessed by any analysis).

Furthermore, the current API discourages people from making facts reusable in the first place. I claim that the writers of the printf check haven't thought about making their facts available via results, strictly considering them an implementation detail.

I propose changing the semantics of facts. They should be shared between analyses, and people should be encouraged to write stand-alone fact-finding analyses. People who do not want to make their facts publicly available, or only use them as an artifact in a single analysis, can continue not exporting the types.

I do not believe that expanding the available scope of facts will have a negative effect on performance. Facts can still be shared between address spaces and remain serializable.

/cc @ianthehat @matloob @alandonovan @mvdan @myitcv

alandonovan · 2019-01-08T18:42:08Z

The only API for sharing data between analyses are results. Results are intended for sharing information concerning a single package between analyses. One could imagine (ab)using results for sharing facts (breaking the assumption that results only concern a single package). However, this is not currently feasible. For one, no function for enumerating all facts known to an analysis exists. Secondly, even if such a function did exist, making all facts available via a result (likely a map) would be a costly
materialization of said facts (which may be cached on disk, and of which only a small subset may be accessed by any analysis).

Sharing results derived from facts is not an abuse, it's the intended way to do what you want in the current design, and this was deliberate. Results are the API that one analyzer presents to another. Results are limited to "information about a single package", but that also includes the public API of each package that it directly imports, plus parts of many other packages indirectly accessible through fields and methods. It should be enough to support most type-based analyses.

That there is no way to enumerate facts does seem like a real limitation, so perhaps we should add an iterator for them.

I appreciate and shared your concern about the cost of materializing all facts, but I could not find a way to implement fact serialization and merging that did not materialize all facts, so I had to let it go. At least today, facts are so few and so small that this has yet to be a noticeable cost, though of course that could change as they are adopted.

dominikh · 2019-01-08T19:56:25Z

Wouldn't merging only require a list of keys, as opposed to all gob-decoded values?

alandonovan · 2019-01-08T20:36:11Z

Yes, I suppose it should. But the expensive part of merging, at least today when facts are few but packages are potentially quite large, is enumerating the set of keys (objects) that are part of the API of the current package; see golang.org/x/tools/go/analysis/internal/facts/imports.go.

dominikh · 2019-01-08T20:58:56Z

As a point of reference, for staticcheck std, staticcheck computes 733 facts (94 pure functions, 380 stubs, 43 functions that don't terminate, 216 deprecated objects) More facts will be added in the future.

I suppose thousands of facts, most of which are boolean, aren't a lot.

As for having to enumerate the set of keys, however: Why is this necessary? I get why it'll be necessary if we add a way of iterating over all facts. However, lacking this, all that is needed is a file format with individually addressable facts to get and set individual facts on demand?

dominikh · 2019-01-23T13:28:26Z

@alandonovan Is there any chance of me convincing you that facts should not be private, or should we move forward with adding an API for iterating over facts?

alandonovan · 2019-01-23T14:26:26Z

Sorry for the delay. I meant to think on this, but got distracted.

Making facts public does not increase the expressiveness of the API: one can accomplish the same things by exposing a result whose implementation relies on facts. So I have yet to hear a compelling reason to expose facts directly, but perhaps I misunderstand what you have in mind as an alternative.

all that is needed is a file format with individually addressable facts to get and set individual facts on demand

Not every driver implementation today can be assumed to use a serial file format. The single/multichecker implementations use an in-memory fact store, for example. Perhaps it would be no big loss if a change to the API forced all drivers to use an indexable file format as an intermediary.

An indexable file format might make random access cheaper for the analysis, especially if it uses only a few facts, but the driver will still need to visit all the facts from package P and each direct import of P in order to emit the resulting fact set for P; see https://github.com/golang/tools/blob/master/go/analysis/internal/facts/facts.go#L209. So I don't see how it makes things asymptotically cheaper.

Could you elaborate on the API and file format you have in mind?

I'm also happy to move forward with an API for iterating over facts. Do you want to sketch an implementation?

dominikh · 2019-01-23T16:06:57Z

Making facts public does not increase the expressiveness of the API: one can accomplish the same things by exposing a result whose implementation relies on facts. So I have yet to hear a compelling reason to expose facts directly, but perhaps I misunderstand what you have in mind as an alternative.

It would not make the API more expressive, however it would make it simpler to use, and reduce the API surface of analyses. Instead of having to populate a Result, in addition to collecting facts, a user would only need to export the fact type. The decision between public and private would be a single character, much like in Go's type system, instead of having to write a loop and document a result type for each fact that one wishes to make public. Furthermore, if a single check sets multiple facts, the Result type (again likely to be a map) would require a bit field or struct type that wouldn't be necessary if facts were public.

To summarize, I think it should be much simpler to switch between private and public facts.

but the driver will still need to visit all the facts from package P and each direct import of P in order to emit the resulting fact set for P

I completely glossed over the fact that each package's export data would fully describe its dependencies (like in the compiler), hence my confusion as to why all facts had to be collected. Of course we are assuming that every implementation of fact stores follows our idea of self-contained export data. Other implementations could opt to use linked lists, or actual databases (relational or otherwise), in case people want to optimize for random access.

Not every driver implementation today can be assumed to use a serial file format. The single/multichecker implementations use an in-memory fact store, for example. Perhaps it would be no big loss if a change to the API forced all drivers to use an indexable file format as an intermediary.

The specific file format (if any) isn't important, only that it supports certain operations: random access, possibly iteration, a way to remove all facts associated with an object or package (probably a subset of iteration). The in-memory store trivially supports this, as does self-contained export data that is loaded into memory.

I'm also happy to move forward with an API for iterating over facts. Do you want to sketch an implementation?

The specifics of that API depend on the outcome of my push for random access, so let's hold off on that for now.

matloob · 2019-03-15T21:46:06Z

Hello!

I'm going to try to get a better idea of the pros/cons of this next week.

If I understand correctly, it seems to me that the main "con" are that each of the {package x analysis} is now responsible for propagating all the facts in its transitive set of analysis dependencies for the types in the package's export data and the package itself, and analyses no longer have control over what they propagate downstream.

I'm a little concerned about that, though I'm not sure how much of an issue it can be. But I'd rather lean on the side of being conservative about changing the API.

matloob · 2019-03-19T18:46:16Z

@dominikh If I understand correctly, the main reason to make facts sharable between analyses is as a convenience: Pass already contains a mapping from objects or facts, so sharing facts between analyses means there wouldn't be a need to create a new result type and map facts to analyses. So facts can be set and read more easily, and exported more easily.

Would you agree with that?

I think we might be able to keep the distinction between Facts and Results while still making it more convenient for users to export facts, maybe by creating a "common" Result type that makes facts available or something else like that. Maybe there wouldn't be a one character change to make facts visible to other analyses, but it could be a couple of lines.

Given that, I'd prefer to err on the side of not "expanding" the analysis API by making facts sharable if we don't have a stronger reason to do so.

One idea suggested by Alan is that if we could see what analyses would look like side-by-side written using the current Fact/Result mechanism, and written with "sharable" facts, we might be able to suggest helpers or improvements to make the Fact/Result mechanism easier to use.

dominikh · 2019-03-19T21:01:02Z

Would you agree with that?

Yes. This is about convenience, and a specific mental model. Any analysis using my model (shareable facts) can also be implemented in terms of results. This does, however, still require expanding the API to add a missing part: an analysis being able to enumerate all its facts, in order to construct the result type.

And yes, we can absolutely build helpers to make the process of turning facts into results less painful. If that's the way the majority wants to go, I will concede. I just prefer my approach :-)

matloob · 2019-03-20T21:56:06Z

Yes, we'd have to be able to enumerate all facts to make the result type. We should definitely do that.

I'm not sure which way the majority wants to go. I'd be interested in hearing others' opinions about this. But I agree with Alan that we'd need more evidence to justify sharing facts.

Edit: this might be worth bringing up at the tools sync...

gopherbot · 2019-04-29T20:35:41Z

Change https://golang.org/cl/174379 mentions this issue: go/analysis: proposed fact iteration API

Updates golang/go#29616 Change-Id: Ibaf10526ea35f06b853ad55451a50750c764fab4 Reviewed-on: https://go-review.googlesource.com/c/tools/+/174379 Run-TryBot: Michael Matloob <matloob@golang.org> Reviewed-by: Daniel Martí <mvdan@mvdan.cc> Reviewed-by: Ian Cottrell <iancottrell@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>

aaronbee · 2019-06-18T18:23:00Z

My experience report: I'm trying to write an analysis tool that applies an additional check on all Print/Printf-like functions. This check is unique to my organization's code base. The printf analyzer does the work of marking all the functions I would want to validate with the printf.isWrapper fact. Since this is unexported my only option was to copy the code that applies this fact into my tool.

If the printf.isWrapper type were exported then I could just build my tool using the printf analyzer as a Requires and use pass.ImportObjectFact to discover which calls were print wrappers.

I believe https://golang.org/cl/174379 gets me what I need somewhat awkwardly. Presumably I would iterate over all the object facts, and take action when the fact implements fmt.Stringer and Fact.String() returns "printfWrapper" or "printWrapper" (this is what isWrapper.String() returns).

alandonovan · 2019-06-18T19:27:50Z

According to the design, the intended way to reuse the results computed by the printf analyzer is not to depend on its facts, but to make it return a result that is an opaque object with an IsWrapper(f *types.Func) method. The implementation of that method would test whether f is a member of a set that the analyzer populates at the same moment that it exports a fact about f:

	// printf.go:262
	fn := w.obj
	var fact isWrapper
	if !pass.ImportObjectFact(fn, &fact) {
		fact.Printf = kind == kindPrintf
		pass.ExportObjectFact(fn, &fact) // here
		for _, caller := range w.callers {
			checkPrintfFwd(pass, caller.w, caller.call, kind)
		}

Facts are how one printf analysis serially communicates with another. Results are how a printf analysis exposes its deductions to other analyzers. It may seem redundant the same information is in both, but that's just this case; in general the structures of the two pieces of information may be quite different.

@alandonovan

printf.Result has IsPrint function telling the caller if a function is a Print/Printf function or a wrapper of one. This aids in developing Ananlyzer's applying checks on Print/Printf functions. Implements @alandonovan suggestion from golang/go#29616 (comment) Change-Id: I28219337235acb025c50609bed9e1709ad68f5f1

@alandonovan

printf.Result has IsPrint function telling the caller if a function is a Print/Printf function or a wrapper of one. This aids in developing Ananlyzer's applying checks on Print/Printf functions. Implements @alandonovan suggestion from golang/go#29616 (comment) Change-Id: I57833d5c65c417f2e4a761d9187fc133937f39fb

@alandonovan

printf.Result has IsPrint function telling the caller if a function is a Print/Printf function or a wrapper of one. This aids in developing Ananlyzer's applying checks on Print/Printf functions. Implements @alandonovan suggestion from golang/go#29616 (comment) Change-Id: I57833d5c65c417f2e4a761d9187fc133937f39fb

@alandonovan