Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: unexpected nil dereference on s390x #38195

Closed
mundaym opened this issue Apr 1, 2020 · 10 comments
Closed

cmd/compile: unexpected nil dereference on s390x #38195

mundaym opened this issue Apr 1, 2020 · 10 comments
Labels
arch-s390x Issues solely affecting the s390x architecture. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@mundaym
Copy link
Member

mundaym commented Apr 1, 2020

This is a follow up issue to mikefarah/yq#394 which only appears to affect s390x. That issue indicates a bug in the compiler that affects Go 1.13.x and Go 1.14.x. Git bisect points to CL 173323 and if that CL is applied to Go 1.12.x then that release is also affected. I am fairly certain that CL 173323 isn't the root cause however. It appears to have revealed a latent bug.

To reproduce:

git clone https://gopkg.in/yaml.v3
cd yaml.v3
go test

You will see unexpected nil dereference panics in most of the tests. For example:

--- FAIL: ExampleUnmarshal_embedded (0.00s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x16a06c]

goroutine 1 [running]:
testing.(*InternalExample).processRunResult(0xc001477d08, 0x0, 0x0, 0x11961, 0x235940, 0x425c20, 0x0)
        /home/mndy/go/master/src/testing/example.go:89 +0x52a
testing.runExample.func2(0xbf99381e5b419297, 0x5bbc282, 0x43e0c0, 0xc00160e6e0, 0xc000010018, 0xc0000785a0, 0xc0005b5d08, 0xc001477d38)
        /home/mndy/go/master/src/testing/run_example.go:58 +0xfc
panic(0x235940, 0x425c20)
        /home/mndy/go/master/src/runtime/panic.go:969 +0x132
gopkg.in/yaml%2ev3.handleErr(0xc0005b5bd0)
        /home/mndy/go/yaml.v3/yaml.go:276 +0xae
panic(0x235940, 0x425c20)
        /home/mndy/go/master/src/runtime/panic.go:969 +0x132
gopkg.in/yaml%2ev3.yaml_parser_parse_node(0xc001437000, 0xc0014372b0, 0x100000000000080, 0xf)
        /home/mndy/go/yaml.v3/parserc.go:568 +0x49c
gopkg.in/yaml%2ev3.yaml_parser_state_machine(0xc001437000, 0xc0014372b0, 0x0)
        /home/mndy/go/yaml.v3/parserc.go:170 +0x1a4
gopkg.in/yaml%2ev3.yaml_parser_parse(0xc001437000, 0xc0014372b0, 0x3ff8e0bb560)
        /home/mndy/go/yaml.v3/parserc.go:129 +0x82
gopkg.in/yaml%2ev3.(*parser).peek(0xc001437000, 0x0)
        /home/mndy/go/yaml.v3/decode.go:102 +0x3c
gopkg.in/yaml%2ev3.(*parser).parse(0xc001437000, 0x0)
        /home/mndy/go/yaml.v3/decode.go:141 +0x36
gopkg.in/yaml%2ev3.(*parser).parseChild(0xc001437000, 0xc0001208c0, 0x0)
        /home/mndy/go/yaml.v3/decode.go:186 +0x2a
gopkg.in/yaml%2ev3.(*parser).document(0xc001437000, 0x300000000000030)
        /home/mndy/go/yaml.v3/decode.go:195 +0xa6
gopkg.in/yaml%2ev3.(*parser).parse(0xc001437000, 0x0)
        /home/mndy/go/yaml.v3/decode.go:151 +0x144
gopkg.in/yaml%2ev3.unmarshal(0xc0014145c0, 0x35, 0x40, 0x216740, 0xc00059c740, 0x40, 0x0, 0x0)
        /home/mndy/go/yaml.v3/yaml.go:161 +0x224
gopkg.in/yaml%2ev3.Unmarshal(...)
        /home/mndy/go/yaml.v3/yaml.go:89
gopkg.in/yaml%2ev3_test.ExampleUnmarshal_embedded()
        /home/mndy/go/yaml.v3/example_embedded_test.go:47 +0x88
testing.runExample(0x27966a, 0x19, 0x2842e0, 0x2806ca, 0x2e, 0x0, 0x0)
        /home/mndy/go/master/src/testing/run_example.go:62 +0x1a6
testing.runExamples(0xc001477ed8, 0x426de0, 0x1, 0x1, 0x0)
        /home/mndy/go/master/src/testing/example.go:44 +0x13c
testing.(*M).Run(0xc000136000, 0x0)
        /home/mndy/go/master/src/testing/testing.go:1221 +0x244
main.main()
        _testmain.go:71 +0x142

Things that 'fix' the issue:

  • Reverting CL 173323
  • Removing the 64-bit store multiple rules:
    (MOVDstore [i] {s} p w1 x:(MOVDstore [i-8] {s} p w0 mem))
    && p.Op != OpSB
    && x.Uses == 1
    && is20Bit(i-8)
    && clobber(x)
    -> (STMG2 [i-8] {s} p w0 w1 mem)
    (MOVDstore [i] {s} p w2 x:(STMG2 [i-16] {s} p w0 w1 mem))
    && x.Uses == 1
    && is20Bit(i-16)
    && clobber(x)
    -> (STMG3 [i-16] {s} p w0 w1 w2 mem)
    (MOVDstore [i] {s} p w3 x:(STMG3 [i-24] {s} p w0 w1 w2 mem))
    && x.Uses == 1
    && is20Bit(i-24)
    && clobber(x)
    -> (STMG4 [i-24] {s} p w0 w1 w2 w3 mem)
    (STMG2 [i] {s} p w2 w3 x:(STMG2 [i-16] {s} p w0 w1 mem))
    && x.Uses == 1
    && is20Bit(i-16)
    && clobber(x)
    -> (STMG4 [i-16] {s} p w0 w1 w2 w3 mem)
  • Using the -N (no opt) or -l=4 (max inlining) flags

None of these are the cause I think. I suspect an op is missing a clobberFlags mark or something like that. The store multiple ops should have the clobberFlags mark and currently don't but that doesn't appear to affect this issue (store multiple will only clobber flags if the displacement does not fit inside a int20 and this can only happen with autos - which would need a 512K stack frame).

@mundaym mundaym added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. arch-s390x Issues solely affecting the s390x architecture. labels Apr 1, 2020
@mundaym mundaym added this to the Go1.15 milestone Apr 1, 2020
@mundaym mundaym self-assigned this Apr 1, 2020
@gopherbot
Copy link

Change https://golang.org/cl/226759 mentions this issue: cmd/compile: mark 'store multiple' as clobbering flags on s390x

gopherbot pushed a commit that referenced this issue Apr 2, 2020
Store multiple instructions can clobber flags on s390x when the
offset passed into the assembler is outside the range representable
with a signed 20 bit integer. This is because the assembler uses
the agfi instruction to implement the large offset. The assembler
could use a different sequence of instructions, but for now just
mark the instruction as 'clobberFlags' since this is risk free.

Noticed while investigating #38195.

No test yet since I'm not sure how to get this bug to trigger and
I haven't seen it affect real code.

Change-Id: I4a6ab96455a3ef8ffacb76ef0166b97eb40ff925
Reviewed-on: https://go-review.googlesource.com/c/go/+/226759
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
@mundaym
Copy link
Member Author

mundaym commented Apr 6, 2020

This looks like a bug in the register allocation for the function. v1909 is allocated to R5, then R5 is taken by v1408 but v1909 (still in R5) is then used as a source to v1368.

b1:
    v9 (455) = Arg <*yaml_parser_t> {parser} : parser[*yaml_parser_t] (parser[*yaml_parser_t], parser[*yaml_parser_t], parser[*yaml_parser_t], parser[*yaml_parser_t], parser[*yaml_parser_t], parser[*yaml_parser_t], parser[*yaml_parser_t], parser[*yaml_parser_t], parser[*yaml_parser_t], parser[*yaml_parser_t], parser[*yaml_parser_t], parser[*yaml_parser_t], parser[*yaml_parser_t])
    v10 (455) = Arg <*yaml_event_t> {event} : event[*yaml_event_t] (event[*yaml_event_t], event[*yaml_event_t], event[*yaml_event_t], event[*yaml_event_t], event[*yaml_event_t])
    v11 (455) = Arg <bool> {block} : block[bool] (block[bool])
    v12 (455) = Arg <bool> {indentless_sequence} : indentless_sequence[bool] (indentless_sequence[bool])
    v1 (?) = InitMem <mem>
    v2 (?) = SP <uintptr> : SP
    v1910 (458) = LoadReg <*yaml_parser_t> v9 : R0
    v15 (458) = MOVDstore <mem> [8] v2 v1910 v1
    v16 (+458) = CALLstatic <mem> {"".peek_token} [16] v15
    v18 (458) = MOVDload <*yaml_token_t> [16] v2 v16 : R0 (token[*yaml_token_t])
CGIJ {LessOrGreater} v18 → b2 b3 (likely) (+459)

b2: ← b1
    v3 (?) = SB <uintptr> : SB
    v1909 (463) = Copy <*yaml_token_t> v18 : R5               <=== R5 = v1909
    v28 (+463) = MOVDload <yaml_token_type_t> v1909 v16 : R6
CGIJ {Equal} v28 → b5 b4 (unlikely) (463)

b4: ← b2
    v1055 (477) = MOVDload <int> [24] v1909 v16 : R3
    v846 (477) = StoreReg <int> v1055 : .autotmp_110[int]
    v1070 (+477) = MOVDload <int> [16] v1909 v16 : R2
    v1905 (+477) = StoreReg <int> v1070 : .autotmp_111[int]
    v1049 (+478) = MOVDload <int> [8] v1909 v16 : R1
    v1903 (+478) = StoreReg <int> v1049 : .autotmp_112[int]
CGIJ {Equal} v28 → b17 b18 (unlikely) (+483)

b18: ← b4
CGIJ {Equal} v28 → b32 b88 (unlikely) (+504)

b88: ← b18
    v1472 (477) = Copy <int> v1055 : R4
    v1480 (477) = Copy <int> v1070 : R5                      <=== R5 = v1480
    v1237 (478) = Copy <int> v1049 : R6
    v1578 (?) = MOVDconst <bool> [0] : R7
    v1334 (?) = MOVDconst <int> [0] : R8
    v1412 (?) = MOVDconst <*byte> [0] : R9
    v1948 (?) = MOVDconst <int> [0] : R11
    v1407 (?) = MOVDconst <int> [0] : R12
    v1327 (?) = MOVDconst <*byte> [0] : R14
    v1326 (541) = MOVDconst <int> [0] : R0
    v1396 (541) = StoreReg <int> v1326 : tag_mark.column[int]
    v1323 (541) = MOVDconst <int> [0] : R1
    v1632 (541) = StoreReg <int> v1323 : tag_mark.line[int]
    v1627 (541) = MOVDconst <int> [0] : R2
    v1623 (541) = StoreReg <int> v1627 : tag_mark.index[int]
    v1368 (551) = Copy <*yaml_token_t> v1909 : R0            <=== R0 = v1909 !!!
    v675 (646) = MOVDconst <int> [0] : R3
    v1217 (646) = StoreReg <int> v675 : anchor.cap[int]
    v1216 (646) = MOVDconst <int> [0] : R4
    v873 (646) = StoreReg <int> v1216 : anchor.len[int]
    v954 (646) = MOVDconst <*byte> [0] : R5
    v1213 (646) = StoreReg <*byte> v954 : anchor.ptr[*byte]
    v1212 (660) = LoadReg <int> v846 : R3
    v964 (660) = LoadReg <int> v1905 : R2
    v974 (660) = Copy <int> v1237 : R1
    v1209 (556) = Copy <int> v1212 : R4
    v1208 (556) = Copy <int> v964 : R5
Plain → b31 (+504)

Any thoughts on what might be the issue here @randall77 @dr2chase @cherrymui?

@mundaym
Copy link
Member Author

mundaym commented Apr 6, 2020

v1909 is created because v18 has been allocated to R0, which cannot be used as an address for loads or stores (it is always evaluated as 0 in address calculations).

This means that the value of v1909 (R5) and v18 (R0) both contain the same value. So I wonder if this is a subtle bug in the way we handle multiple registers containing the same value.

@mundaym
Copy link
Member Author

mundaym commented Apr 6, 2020

To generate the ssa.html for the bad code you can use master at 815509a:

git clone https://gopkg.in/yaml.v3
cd yaml.v3
git checkout 9f266ea
GOSSAFUNC=yaml_parser_parse_node GOOS=linux GOARCH=s390x go test -c

The value numbers will be different but the pattern is the same:

b1:
   ...
   v18 (458) = MOVDload <*yaml_token_t> [16] v2 v16 : R0 (token[*yaml_token_t]) <== set R0=v18
CGIJ {LessOrGreater} v18 → b2 b3 (likely) (+459)

b2: ← b1
   ...
   v1476 (463) = Copy <*yaml_token_t> v18 : R5 <== set R5=v1476
   ...
CGIJ {Equal} v28 → b5 b4 (unlikely) (463)

b4: ← b2
   ...
CGIJ {Equal} v28 → b17 b18 (unlikely) (+483)

b18: ← b4
CGIJ {Equal} v28 → b32 b90 (unlikely) (+504)

b90: ← b18
   ...
   v1241 (477) = Copy <int> v1074 : R5 <== clobber v1476, set R5=v1241
   ...
   v1631 (541) = MOVDconst <int> [0] : R0 <== clobber v18 (might not be relevant), set R0=v1631
   ...
   v877 (551) = Copy <*yaml_token_t> v1476 : R0 <== use v1476 but instead get v1241
   ...
Plain → b31 (+504)

@gopherbot
Copy link

Change https://golang.org/cl/228060 mentions this issue: cmd/compile: error if register is reused when setting edge state

@mundaym
Copy link
Member Author

mundaym commented Apr 14, 2020

I think the problem is this code - specifically the call to s.freeRegs:

if !regValLiveSet.contains(a.ID) {
// Input is dead beyond the phi, deallocate
// anywhere else it might live.
s.freeRegs(s.values[a.ID].regs)
} else {

That call frees all the copies of the value that is the input to the phi - some of which might actually be live and used by other values (as in this case). If the free'd value is subsequently used to copy a different phi input then the append in this code (as opposed to a replace function) leads to two copies of the same register in the predecessors endRegs slice, which is an invalid state and causes weird problems:

s.endRegs[p.ID] = append(s.endRegs[p.ID], endReg{r2, a, c})

CL 228060 adds an ICE if this invalid state is seen.

I think the solution to this specific bug is just to deallocate the register holding the input to the phi and leave any copies untouched.

In general though this code is really hairy and I'm still not really sure that even with this change it is correct. It might be worth spending some time simplifying how phi allocations work so there are fewer data structures to mentally track.

@gopherbot
Copy link

Change https://golang.org/cl/228061 mentions this issue: cmd/compile: fix deallocation of live value copies in regalloc

@mundaym
Copy link
Member Author

mundaym commented Apr 14, 2020

Keith put a good summary of this issue in the CL for anyone following along:

Here's what I think the bug in the old code is:

s.freeRegs(s.values[a.ID].regs)

This frees those registers in the new block (the one with the phi), but it does not free those registers in the predecessor block. In particular, they are still in s.endRegs[p.ID].
But it does remove those registers from s.used, allowing them to be reused at the end of the predecessor block to make a copy. If, for a later phi, we choose one of those deallocated-in-the-current-block-but-not-in-the-predecessor-block registers, we get two things in the same register.

The fix looks correct. If we only deallocate registers that are in phiUsed, then those registers would never be reused during later phis. (We allocate from all - s.used - phiUsed.)

gopherbot pushed a commit that referenced this issue Apr 14, 2020
When setting the edge state in register allocation we should only
be setting each register once. It is not possible for a register
to hold multiple values at once.

This CL converts the runtime error seen in #38195 into an internal
compiler error (ICE). It is better for the compiler to fail than
generate an incorrect program.

The bug reported in #38195 is now exposed as:

./parserc.go:459:11: internal compiler error: 'yaml_parser_parse_node': R5 is already set (v1074/v1241)

[stack trace]

Updates #38195.

Change-Id: Id95842fd850b95494cbd472b6fd5a55513ecacec
Reviewed-on: https://go-review.googlesource.com/c/go/+/228060
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@mundaym
Copy link
Member Author

mundaym commented Apr 14, 2020

@gopherbot please consider this for a backport

This is a bug in the register allocator that results in the compiled program behaving incorrectly. There is currently no workaround and it results in very hard to debug issues. It has only been seen to cause problems on s390x but theoretically it affects all platforms.

@gopherbot
Copy link

Backport issue(s) opened: #38442 (for 1.13), #38443 (for 1.14).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-s390x Issues solely affecting the s390x architecture. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

2 participants