Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/tools/gopls: test failures with invalid character '\x00' #59289

Closed
gopherbot opened this issue Mar 28, 2023 · 17 comments
Closed

x/tools/gopls: test failures with invalid character '\x00' #59289

gopherbot opened this issue Mar 28, 2023 · 17 comments
Assignees
Labels
gopls/incremental gopls Issues related to the Go language server, gopls. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Tools This label describes issues relating to any tools in the x/tools repository.
Milestone

Comments

@gopherbot
Copy link

gopherbot commented Mar 28, 2023

#!watchflakes
post <- pkg ~ `^golang\.org/x/tools` && `invalid character '.x00'`

Issue created automatically to collect these failures.

Example (log):

--- FAIL: TestLinks (24.26s)
    integration_test.go:444: gopls links -json a.go: exited with code 1, want success: true (gopls links -json a.go: exit=1 stdout=<<[
        	{
        		"range": {
        			"start": {
        				"line": 0,
        				"character": 24
        			},
        			"end": {
        				"line": 0,
...
        			"end": {
        				"line": 5,
        				"character": 41
        			}
        		},
        		"target": "https://blog.go.dev/"
        	}
        ]
        >> stderr=<<2023/03/28 04:05:52 internal error unmarshalling *[]cache.gobDiagnostic: invalid character '\x00' looking for beginning of value
        >>)

watchflakes

@gopherbot gopherbot added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 28, 2023
@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "golang.org/x/tools/gopls/internal/lsp/cmd/test" && test == "TestLinks"
2023-03-27 15:26 darwin-amd64-12_0 tools@e0ecd1bc go@a0c9d153 x/tools/gopls/internal/lsp/cmd/test.TestLinks (log)
--- FAIL: TestLinks (24.26s)
    integration_test.go:444: gopls links -json a.go: exited with code 1, want success: true (gopls links -json a.go: exit=1 stdout=<<[
        	{
        		"range": {
        			"start": {
        				"line": 0,
        				"character": 24
        			},
        			"end": {
        				"line": 0,
...
        			"end": {
        				"line": 5,
        				"character": 41
        			}
        		},
        		"target": "https://blog.go.dev/"
        	}
        ]
        >> stderr=<<2023/03/28 04:05:52 internal error unmarshalling *[]cache.gobDiagnostic: invalid character '\x00' looking for beginning of value
        >>)

watchflakes

@gopherbot gopherbot added Tools This label describes issues relating to any tools in the x/tools repository. gopls Issues related to the Go language server, gopls. labels Mar 28, 2023
@gopherbot gopherbot added this to the Unreleased milestone Mar 28, 2023
@findleyr
Copy link
Contributor

CC @adonovan, who might have ideas if this could be related to the filecache, but I suspect it may be my bug.

There have been a few of these causing test failures. I need to greplogs to confirm, but I think they're all related to diagnostics, in which case it suggests I missed handling an error somewhere (since this looks like corrupt data in the filecache).

@findleyr findleyr modified the milestones: Unreleased, gopls/v0.12.0 Mar 28, 2023
@adonovan adonovan self-assigned this Mar 30, 2023
@bcmills bcmills changed the title x/tools/gopls/internal/lsp/cmd/test: TestLinks failures x/tools/gopls: test failures with invalid character '\x00' Apr 3, 2023
@gopherbot
Copy link
Author

Found new dashboard test flakes for:

#!watchflakes
post <- pkg ~ `^golang\.org/x/tools` && `invalid character '.x00'`
2023-03-31 19:13 darwin-amd64-nocgo tools@65208701 go@99276aeb x/tools/gopls/internal/lsp/cmd/test.TestPrepareRename (log)
--- FAIL: TestPrepareRename (28.89s)
    integration_test.go:556: gopls prepare_rename a.go:1:9: stderr does not match [can't rename package]; got <<2023/03/31 23:30:28 internal error unmarshalling *[]cache.gobDiagnostic: invalid character '\x00' looking for beginning of value
        >>
    integration_test.go:561: gopls prepare_rename a.go:2:9: exited with code 1, want success: true (gopls prepare_rename a.go:2:9: exit=1 stdout=<<./a.go:2:6-13
        >> stderr=<<2023/03/31 23:30:34 internal error unmarshalling *[]cache.gobDiagnostic: invalid character '\x00' looking for beginning of value
        >>)

watchflakes

@adonovan
Copy link
Member

adonovan commented Apr 3, 2023

Strange. What we know:

  • this error message comes from cache.mustJSONDecode of []gobDiagnostic;
  • which is called only from decodeDiagnostics;
  • which is called (other than in tests) only from (*snapshot).PackageDiagnostics, with the result of filecache.Get(diagnosticsKind);
  • whose corresponding call to filecache.Set is in typeCheckBatch.checkPackage;
  • whose payload comes from encodeDiagnostics;
  • whose result comes from mustJSONEncode of a []gobDiagnostic.

The error message corresponds to a decode of a non-empty message starting with a NUL byte: https://go.dev/play/p/RQxCWnWZY90. But I can't see how the value in the cache can be anything but the successful result of encoding the correct data type. The filecache.Set and Get operations should be robust against incomplete writes.

@findleyr
Copy link
Contributor

findleyr commented Apr 5, 2023

Ok, so some more thoughts on this:

  • the two failures thus far have been in command tests, which run a short-lived gopls process
  • the two failures were on Darwin
  • diagnostics run asynchronously in the lsp session. These tests are not running anything related to diagnostics, but the diagnostics pass will run anyway, and may be terminated in the middle of running

All of these suggest to me that the fast termination of the cmd tests leads to filecache corruption when the process exits during a call to filecache.Set (or perhaps Gets occurring during shutdown get corrupted somehow?!). Since this has only occurred on Darwin thus far, maybe the filecache operations are not robust on Darwin.

@findleyr
Copy link
Contributor

findleyr commented Apr 5, 2023

Ok, I think I understand this.

Looking at the code, I don't think the lockedfile logic is actually robust to interrupted writes. It could certainly be terminated without finishing the write.

It looks like the Go command has handling for this, by looking specifically for null bytes resulting from the truncation.
e.g.
https://cs.opensource.google/go/go/+/master:src/cmd/go/internal/modfetch/fetch.go;l=743;drc=2f2c5e41e7728cae99abd14e8bd5ffcfebf9df8e

So we need a similar form of validation.

@bcmills @adonovan does that sound right?

@adonovan
Copy link
Member

adonovan commented Apr 6, 2023

So we need a similar form of validation.

That's the purpose of writing the length before the payload in Set:

https://cs.opensource.google/go/x/tools/+/master:gopls/internal/lsp/filecache/filecache.go;l=101-108?q=filecache&ss=go%2Fx%2Ftools

and checking that the read is exactly as long as the written length in Get:

https://cs.opensource.google/go/x/tools/+/master:gopls/internal/lsp/filecache/filecache.go;l=68-70?q=filecache&ss=go%2Fx%2Ftools

Are you suggesting that the write in fact writes the length prefix and expands the file size to the correct length but somehow fails to write the complete payload into it?

@findleyr
Copy link
Contributor

findleyr commented Apr 6, 2023

Are you suggesting that the write in fact writes the length prefix and somehow expands the file size to the correct length but somehow fails to write the complete payload into it?

I was deducing from the presence of logic to handle this in the Go command (which also uses lockedfile) that it may be possible for a lockedfile write to result in a corrupted file that contains null bytes. That would mean we have a file of length N, where some byte in [0, N) is null. I don't a priori know why we could conclude that N != len(data).

It may also be worth noting that almost all of these payloads are JSON "null", a byte slice of length 4. Given their short length, perhaps it is more likely that a corrupted write actually has the correct size.

@findleyr
Copy link
Contributor

findleyr commented Apr 6, 2023

Note also that when using io.MultiReader, the header and payload will be written in separate writes.

Furthermore, there could be many concurrent processes writing this file, so it seems likely that we are overwriting an existing copy with identical content.

So I think the necessary conditions could be that the first write succeeds to write the length (4, in the common case), and the second write increases the file size but fails to write the data. I didn't know that could happen, but "once you eliminate the impossible...", etc. Either that, or the truncate partially succeeds?!

@adonovan
Copy link
Member

adonovan commented Apr 6, 2023

I was deducing from the presence of logic to handle this in the Go command (which also uses lockedfile) that it may be possible for a lockedfile write to result in a corrupted file that contains null bytes.

The linked comment:

// isValidSum returns true if data is the valid contents of a zip hash file.
// Certain critical files are written to disk by first truncating
// then writing the actual bytes, so that if the write fails
// the corrupt file should contain at least one of the null
// bytes written by the truncate operation.

talks about null bytes written by the truncate operation, but the argument to ftruncate from the call in lockedfile is zero, meaning that it never extends a file with zeroes. I do not expect ftruncate(0) to cause reads through an existing descriptor to return zeros (they should return EOF). Nor would I expect there to be concurrent reads through existing descriptors since the whole point of filelock is that the operations should be mutually exclusive.

I don't think the lockedfile logic is actually robust to interrupted writes. It could certainly be terminated without finishing the write.

Yes, I think this issue is evidence of a bug in lockedfile, but I still don't understand it. Even in the presence of concurrent access I would not expect a file that has been truncated to length zero and then written with N bytes to be observable in a middle-state in which only M < N of the bytes have been written and the remaining bytes are zero. That would be a patent kernel bug.

So we need a similar form of validation.

We could easily implement checksums but I'd much rather understand and fix the bug in lockedfile than add yet another layer of workarounds.

@adonovan
Copy link
Member

adonovan commented Apr 6, 2023

Note also that when using io.MultiReader, the header and payload will be written in separate writes.

Yes, but that's ok. The integrity invariant is preserved at all times.

Furthermore, there could be many concurrent processes writing this file, so it seems likely that we are overwriting an existing copy with identical content.

No, filelock should exclude such concurrency. That is its one job.

So I think the necessary conditions could be that the first write succeeds to write the length (4, in the common case), and the second write increases the file size but fails to write the data. I didn't know that could happen, but "once you eliminate the impossible...", etc.

My point is: that should never happen.

Perhaps @bcmills has more ideas.

@findleyr
Copy link
Contributor

findleyr commented Apr 6, 2023

I would not expect a file that has been truncated to length zero and then written with N bytes to be observable in a middle-state in which only M < N of the bytes have been written and the remaining bytes are zero. That would be a patent kernel bug.

Neither would I.

No, filelock should exclude such concurrency. That is its one job.

I didn't mean to say that the writes are concurrent. I mean that the data previously stored would have had the same length, if that makes any difference.

My point is: that should never happen.

But it does happen!

@bcmills
Copy link
Contributor

bcmills commented Apr 6, 2023

I would not expect a file that has been truncated to length zero and then written with N bytes to be observable in a middle-state in which only M < N of the bytes have been written and the remaining bytes are zero. That would be a patent kernel bug.

Filesystem bugs on darwin in particular are unfortunately not uncommon. See previously:

Actually, come to think of it #52493 looks like an almost exact match for this failure mode, except that it's a different OS version. 🤔

@bcmills
Copy link
Contributor

bcmills commented Apr 6, 2023

Perhaps @bcmills has more ideas.

I would suggest two mitigations:

  1. Write the length stamp at the end of the file instead of (or in addition to) the beginning, or perhaps also write a checksum for the contents. That way, if the end of a cache file is corrupted with NUL bytes, we stand a better chance of detecting and rejecting it.
  2. Ensure that the gopls and go processes throughout the tests (and throughout the real LSP deployments!) are always terminated gracefully (by sending SIGTERM or SIGINT or closing the JSON-RPC2 I/O streams) rather than killed unrecoverably with SIGKILL. (Notably, that probably means using the new exec.Cmd.Cancel field where available.)

@adonovan
Copy link
Member

adonovan commented Apr 6, 2023

Write the length stamp at the end of the file instead of (or in addition to) the beginning, or perhaps also write a checksum for the contents. That way, if the end of a cache file is corrupted with NUL bytes, we stand a better chance of detecting and rejecting it.

Thanks, will do.

Ensure that the gopls and go processes throughout the tests (and throughout the real LSP deployments!) are always terminated gracefully (by sending SIGTERM or SIGINT or closing the JSON-RPC2 I/O streams) rather than killed unrecoverably with SIGKILL. (Notably, that probably means using the new exec.Cmd.Cancel field where available.)

Not sure I agree with this one: our programs must be prepared for sudden death (i.e., be idempotent and transactional in their updates of persistent state) since they will have to deal with this in actual use. It's easier to detect bugs--such as this one--when we don't build in layers of imperfect mitigation such as "graceful shutdown". kill -KILL should be an acceptable way to terminate the process.

@gopherbot
Copy link
Author

Change https://go.dev/cl/482818 mentions this issue: gopls/internal/lsp/filecache: add integrity check

gopherbot pushed a commit to golang/tools that referenced this issue Apr 6, 2023
This change appends a 32-bit CRC checksum to the end of
each cache file in Set, and checks it in Get, and dumps
the file content if it doesn't match the content.

This is intended to help investigate golang/go#59289.

Updates golang/go#59289

Change-Id: Ia439d915c37d33b0af5d0da11a8db63762cd80a7
Reviewed-on: https://go-review.googlesource.com/c/tools/+/482818
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Robert Findley <rfindley@google.com>
Auto-Submit: Alan Donovan <adonovan@google.com>
Run-TryBot: Alan Donovan <adonovan@google.com>
@findleyr
Copy link
Contributor

findleyr commented May 3, 2023

Fixed by https://go.dev/cl/492035. Not sure what else we can do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
gopls/incremental gopls Issues related to the Go language server, gopls. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Tools This label describes issues relating to any tools in the x/tools repository.
Projects
Status: Done
Development

No branches or pull requests

4 participants