x/tools/gopls: test failures with `invalid character '\x00'` #59289

gopherbot · 2023-03-28T11:20:14Z

#!watchflakes
post <- pkg ~ `^golang\.org/x/tools` && `invalid character '.x00'`

Issue created automatically to collect these failures.

Example (log):

--- FAIL: TestLinks (24.26s)
    integration_test.go:444: gopls links -json a.go: exited with code 1, want success: true (gopls links -json a.go: exit=1 stdout=<<[
        	{
        		"range": {
        			"start": {
        				"line": 0,
        				"character": 24
        			},
        			"end": {
        				"line": 0,
...
        			"end": {
        				"line": 5,
        				"character": 41
        			}
        		},
        		"target": "https://blog.go.dev/"
        	}
        ]
        >> stderr=<<2023/03/28 04:05:52 internal error unmarshalling *[]cache.gobDiagnostic: invalid character '\x00' looking for beginning of value
        >>)

— watchflakes

The text was updated successfully, but these errors were encountered:

gopherbot · 2023-03-28T11:20:15Z

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "golang.org/x/tools/gopls/internal/lsp/cmd/test" && test == "TestLinks"

2023-03-27 15:26 darwin-amd64-12_0 tools@e0ecd1bc go@a0c9d153 x/tools/gopls/internal/lsp/cmd/test.TestLinks (log)

--- FAIL: TestLinks (24.26s)
    integration_test.go:444: gopls links -json a.go: exited with code 1, want success: true (gopls links -json a.go: exit=1 stdout=<<[
        	{
        		"range": {
        			"start": {
        				"line": 0,
        				"character": 24
        			},
        			"end": {
        				"line": 0,
...
        			"end": {
        				"line": 5,
        				"character": 41
        			}
        		},
        		"target": "https://blog.go.dev/"
        	}
        ]
        >> stderr=<<2023/03/28 04:05:52 internal error unmarshalling *[]cache.gobDiagnostic: invalid character '\x00' looking for beginning of value
        >>)

— watchflakes

findleyr · 2023-03-28T14:31:14Z

CC @adonovan, who might have ideas if this could be related to the filecache, but I suspect it may be my bug.

There have been a few of these causing test failures. I need to greplogs to confirm, but I think they're all related to diagnostics, in which case it suggests I missed handling an error somewhere (since this looks like corrupt data in the filecache).

gopherbot · 2023-04-03T14:24:53Z

Found new dashboard test flakes for:

#!watchflakes
post <- pkg ~ `^golang\.org/x/tools` && `invalid character '.x00'`

2023-03-31 19:13 darwin-amd64-nocgo tools@65208701 go@99276aeb x/tools/gopls/internal/lsp/cmd/test.TestPrepareRename (log)

--- FAIL: TestPrepareRename (28.89s)
    integration_test.go:556: gopls prepare_rename a.go:1:9: stderr does not match [can't rename package]; got <<2023/03/31 23:30:28 internal error unmarshalling *[]cache.gobDiagnostic: invalid character '\x00' looking for beginning of value
        >>
    integration_test.go:561: gopls prepare_rename a.go:2:9: exited with code 1, want success: true (gopls prepare_rename a.go:2:9: exit=1 stdout=<<./a.go:2:6-13
        >> stderr=<<2023/03/31 23:30:34 internal error unmarshalling *[]cache.gobDiagnostic: invalid character '\x00' looking for beginning of value
        >>)

— watchflakes

adonovan · 2023-04-03T15:22:13Z

Strange. What we know:

this error message comes from cache.mustJSONDecode of []gobDiagnostic;
which is called only from decodeDiagnostics;
which is called (other than in tests) only from (*snapshot).PackageDiagnostics, with the result of filecache.Get(diagnosticsKind);
whose corresponding call to filecache.Set is in typeCheckBatch.checkPackage;
whose payload comes from encodeDiagnostics;
whose result comes from mustJSONEncode of a []gobDiagnostic.

The error message corresponds to a decode of a non-empty message starting with a NUL byte: https://go.dev/play/p/RQxCWnWZY90. But I can't see how the value in the cache can be anything but the successful result of encoding the correct data type. The filecache.Set and Get operations should be robust against incomplete writes.

findleyr · 2023-04-05T22:19:16Z

Ok, so some more thoughts on this:

the two failures thus far have been in command tests, which run a short-lived gopls process
the two failures were on Darwin
diagnostics run asynchronously in the lsp session. These tests are not running anything related to diagnostics, but the diagnostics pass will run anyway, and may be terminated in the middle of running

All of these suggest to me that the fast termination of the cmd tests leads to filecache corruption when the process exits during a call to filecache.Set (or perhaps Gets occurring during shutdown get corrupted somehow?!). Since this has only occurred on Darwin thus far, maybe the filecache operations are not robust on Darwin.

findleyr · 2023-04-05T22:35:02Z

Ok, I think I understand this.

Looking at the code, I don't think the lockedfile logic is actually robust to interrupted writes. It could certainly be terminated without finishing the write.

It looks like the Go command has handling for this, by looking specifically for null bytes resulting from the truncation.
e.g.
https://cs.opensource.google/go/go/+/master:src/cmd/go/internal/modfetch/fetch.go;l=743;drc=2f2c5e41e7728cae99abd14e8bd5ffcfebf9df8e

So we need a similar form of validation.

@bcmills @adonovan does that sound right?

adonovan · 2023-04-06T01:35:36Z

So we need a similar form of validation.

That's the purpose of writing the length before the payload in Set:

https://cs.opensource.google/go/x/tools/+/master:gopls/internal/lsp/filecache/filecache.go;l=101-108?q=filecache&ss=go%2Fx%2Ftools

and checking that the read is exactly as long as the written length in Get:

https://cs.opensource.google/go/x/tools/+/master:gopls/internal/lsp/filecache/filecache.go;l=68-70?q=filecache&ss=go%2Fx%2Ftools

Are you suggesting that the write in fact writes the length prefix and expands the file size to the correct length but somehow fails to write the complete payload into it?

findleyr · 2023-04-06T12:13:12Z

Are you suggesting that the write in fact writes the length prefix and somehow expands the file size to the correct length but somehow fails to write the complete payload into it?

I was deducing from the presence of logic to handle this in the Go command (which also uses lockedfile) that it may be possible for a lockedfile write to result in a corrupted file that contains null bytes. That would mean we have a file of length N, where some byte in [0, N) is null. I don't a priori know why we could conclude that N != len(data).

It may also be worth noting that almost all of these payloads are JSON "null", a byte slice of length 4. Given their short length, perhaps it is more likely that a corrupted write actually has the correct size.

findleyr · 2023-04-06T13:07:35Z

Note also that when using io.MultiReader, the header and payload will be written in separate writes.

Furthermore, there could be many concurrent processes writing this file, so it seems likely that we are overwriting an existing copy with identical content.

So I think the necessary conditions could be that the first write succeeds to write the length (4, in the common case), and the second write increases the file size but fails to write the data. I didn't know that could happen, but "once you eliminate the impossible...", etc. Either that, or the truncate partially succeeds?!

adonovan · 2023-04-06T13:15:12Z

I was deducing from the presence of logic to handle this in the Go command (which also uses lockedfile) that it may be possible for a lockedfile write to result in a corrupted file that contains null bytes.

The linked comment:

// isValidSum returns true if data is the valid contents of a zip hash file.
// Certain critical files are written to disk by first truncating
// then writing the actual bytes, so that if the write fails
// the corrupt file should contain at least one of the null
// bytes written by the truncate operation.

talks about null bytes written by the truncate operation, but the argument to ftruncate from the call in lockedfile is zero, meaning that it never extends a file with zeroes. I do not expect ftruncate(0) to cause reads through an existing descriptor to return zeros (they should return EOF). Nor would I expect there to be concurrent reads through existing descriptors since the whole point of filelock is that the operations should be mutually exclusive.

I don't think the lockedfile logic is actually robust to interrupted writes. It could certainly be terminated without finishing the write.

Yes, I think this issue is evidence of a bug in lockedfile, but I still don't understand it. Even in the presence of concurrent access I would not expect a file that has been truncated to length zero and then written with N bytes to be observable in a middle-state in which only M < N of the bytes have been written and the remaining bytes are zero. That would be a patent kernel bug.

So we need a similar form of validation.

We could easily implement checksums but I'd much rather understand and fix the bug in lockedfile than add yet another layer of workarounds.

adonovan · 2023-04-06T13:19:17Z

Note also that when using io.MultiReader, the header and payload will be written in separate writes.

Yes, but that's ok. The integrity invariant is preserved at all times.

Furthermore, there could be many concurrent processes writing this file, so it seems likely that we are overwriting an existing copy with identical content.

No, filelock should exclude such concurrency. That is its one job.

So I think the necessary conditions could be that the first write succeeds to write the length (4, in the common case), and the second write increases the file size but fails to write the data. I didn't know that could happen, but "once you eliminate the impossible...", etc.

My point is: that should never happen.

Perhaps @bcmills has more ideas.

findleyr · 2023-04-06T13:27:58Z

I would not expect a file that has been truncated to length zero and then written with N bytes to be observable in a middle-state in which only M < N of the bytes have been written and the remaining bytes are zero. That would be a patent kernel bug.

Neither would I.

No, filelock should exclude such concurrency. That is its one job.

I didn't mean to say that the writes are concurrent. I mean that the data previously stored would have had the same length, if that makes any difference.

My point is: that should never happen.

But it does happen!

bcmills · 2023-04-06T14:48:00Z

I would not expect a file that has been truncated to length zero and then written with N bytes to be observable in a middle-state in which only M < N of the bytes have been written and the remaining bytes are zero. That would be a patent kernel bug.

Filesystem bugs on darwin in particular are unfortunately not uncommon. See previously:

Actually, come to think of it #52493 looks like an almost exact match for this failure mode, except that it's a different OS version. 🤔

bcmills · 2023-04-06T15:05:41Z

Perhaps @bcmills has more ideas.

I would suggest two mitigations:

Write the length stamp at the end of the file instead of (or in addition to) the beginning, or perhaps also write a checksum for the contents. That way, if the end of a cache file is corrupted with NUL bytes, we stand a better chance of detecting and rejecting it.
Ensure that the gopls and go processes throughout the tests (and throughout the real LSP deployments!) are always terminated gracefully (by sending SIGTERM or SIGINT or closing the JSON-RPC2 I/O streams) rather than killed unrecoverably with SIGKILL. (Notably, that probably means using the new exec.Cmd.Cancel field where available.)

adonovan · 2023-04-06T17:28:23Z

Write the length stamp at the end of the file instead of (or in addition to) the beginning, or perhaps also write a checksum for the contents. That way, if the end of a cache file is corrupted with NUL bytes, we stand a better chance of detecting and rejecting it.

Thanks, will do.

Ensure that the gopls and go processes throughout the tests (and throughout the real LSP deployments!) are always terminated gracefully (by sending SIGTERM or SIGINT or closing the JSON-RPC2 I/O streams) rather than killed unrecoverably with SIGKILL. (Notably, that probably means using the new exec.Cmd.Cancel field where available.)

Not sure I agree with this one: our programs must be prepared for sudden death (i.e., be idempotent and transactional in their updates of persistent state) since they will have to deal with this in actual use. It's easier to detect bugs--such as this one--when we don't build in layers of imperfect mitigation such as "graceful shutdown". kill -KILL should be an acceptable way to terminate the process.

gopherbot · 2023-04-06T18:32:04Z

Change https://go.dev/cl/482818 mentions this issue: gopls/internal/lsp/filecache: add integrity check

This change appends a 32-bit CRC checksum to the end of each cache file in Set, and checks it in Get, and dumps the file content if it doesn't match the content. This is intended to help investigate golang/go#59289. Updates golang/go#59289 Change-Id: Ia439d915c37d33b0af5d0da11a8db63762cd80a7 Reviewed-on: https://go-review.googlesource.com/c/tools/+/482818 TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Robert Findley <rfindley@google.com> Auto-Submit: Alan Donovan <adonovan@google.com> Run-TryBot: Alan Donovan <adonovan@google.com>

findleyr · 2023-05-03T17:34:26Z

Fixed by https://go.dev/cl/492035. Not sure what else we can do.

gopherbot added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 28, 2023

gopherbot added Tools This label describes issues relating to any tools in the x/tools repository. gopls Issues related to the Go language server, gopls. labels Mar 28, 2023

gopherbot added this to the Unreleased milestone Mar 28, 2023

findleyr modified the milestones: Unreleased, gopls/v0.12.0 Mar 28, 2023

adonovan self-assigned this Mar 30, 2023

bcmills changed the title ~~x/tools/gopls/internal/lsp/cmd/test: TestLinks failures~~ x/tools/gopls: test failures with invalid character '\x00' Apr 3, 2023

bcmills mentioned this issue Apr 3, 2023

x/tools/gopls/internal/lsp/cmd/test: TestPrepareRename failures #59358

Closed

adonovan added the gopls/incremental label Apr 4, 2023

This was referenced Apr 7, 2023

x/tools/gopls/internal/lsp/cmd/test: TestImports failures #58152

Closed

x/tools/gopls: don't discard errors computing diagnostics #59487

Open

findleyr closed this as completed May 3, 2023

findleyr mentioned this issue May 3, 2023

x/tools/gopls/internal/lsp/cmd/test: TestImplementations failures #59105

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

x/tools/gopls: test failures with `invalid character '\x00'` #59289

x/tools/gopls: test failures with `invalid character '\x00'` #59289

gopherbot commented Mar 28, 2023 •

edited by bcmills

gopherbot commented Mar 28, 2023

findleyr commented Mar 28, 2023

gopherbot commented Apr 3, 2023

adonovan commented Apr 3, 2023

findleyr commented Apr 5, 2023

findleyr commented Apr 5, 2023

adonovan commented Apr 6, 2023 •

edited

findleyr commented Apr 6, 2023 •

edited

findleyr commented Apr 6, 2023

adonovan commented Apr 6, 2023 •

edited

adonovan commented Apr 6, 2023

findleyr commented Apr 6, 2023

bcmills commented Apr 6, 2023 •

edited

bcmills commented Apr 6, 2023

adonovan commented Apr 6, 2023

gopherbot commented Apr 6, 2023

findleyr commented May 3, 2023

Navigation Menu

x/tools/gopls: test failures with invalid character '\x00' #59289

x/tools/gopls: test failures with invalid character '\x00' #59289

Comments

gopherbot commented Mar 28, 2023 • edited by bcmills

gopherbot commented Mar 28, 2023

findleyr commented Mar 28, 2023

gopherbot commented Apr 3, 2023

adonovan commented Apr 3, 2023

findleyr commented Apr 5, 2023

findleyr commented Apr 5, 2023

adonovan commented Apr 6, 2023 • edited

findleyr commented Apr 6, 2023 • edited

findleyr commented Apr 6, 2023

adonovan commented Apr 6, 2023 • edited

adonovan commented Apr 6, 2023

findleyr commented Apr 6, 2023

bcmills commented Apr 6, 2023 • edited

bcmills commented Apr 6, 2023

adonovan commented Apr 6, 2023

gopherbot commented Apr 6, 2023

findleyr commented May 3, 2023

x/tools/gopls: test failures with `invalid character '\x00'` #59289

x/tools/gopls: test failures with `invalid character '\x00'` #59289

gopherbot commented Mar 28, 2023 •

edited by bcmills

adonovan commented Apr 6, 2023 •

edited

findleyr commented Apr 6, 2023 •

edited

adonovan commented Apr 6, 2023 •

edited

bcmills commented Apr 6, 2023 •

edited