cmd/compile: optimize repeated getg() calls #29671

CAFxX · 2019-01-11T05:37:24Z

While playing around with inlining the fast paths of runtime.lock/unlock I noticed that repeated getg() calls don't get optimized away. This is especially visible with midstack inlining, because it is more likely that a function calling getg() is inlined in another function calling getg(). This results in something like:

TEXT runtime.chanrecv(SB) /home/cafxx/dev/go/src/runtime/chan.go
  chan.go:421           0x407700                64488b0c25f8ffffff      MOVQ FS:0xfffffff8, CX                  
  chan.go:421           0x407709                488d4424f8              LEAQ -0x8(SP), AX                       
  chan.go:421           0x40770e                483b4110                CMPQ 0x10(CX), AX                       
  chan.go:421           0x407712                0f8652070000            JBE 0x407e6a                            
  chan.go:421           0x407718                4881ec88000000          SUBQ $0x88, SP                          
  chan.go:421           0x40771f                4889ac2480000000        MOVQ BP, 0x80(SP)                       
  chan.go:421           0x407727                488dac2480000000        LEAQ 0x80(SP), BP                       
  chan.go:429           0x40772f                488b842490000000        MOVQ 0x90(SP), AX                       
  chan.go:429           0x407737                4885c0                  TESTQ AX, AX                            
  chan.go:429           0x40773a                0f84ac060000            JE 0x407dec                             
  chan.go:421           0x407740                0fb68c24a0000000        MOVZX 0xa0(SP), CX                      
  chan.go:421           0x407748                84c9                    TESTL CL, CL                            
  chan.go:449           0x40774a                0f844c060000            JE 0x407d9c                             
  chan.go:456           0x407750                48833d8837560000        CMPQ $0x0, runtime.blockprofilerate(SB) 
  chan.go:456           0x407758                0f8720060000            JA 0x407d7e                             
  chan.go:456           0x40775e                31d2                    XORL DX, DX                             
  chan.go:511           0x407760                4889542428              MOVQ DX, 0x28(SP)                       
  lock_futex.go:47      0x407765                64488b1c25f8ffffff      MOVQ FS:0xfffffff8, BX                  
  lock_futex.go:51      0x40776e                488b5b30                MOVQ 0x30(BX), BX                       
  chan.go:460           0x407772                488d7058                LEAQ 0x58(AX), SI                       
  chan.go:460           0x407776                4889742448              MOVQ SI, 0x48(SP)                       
  lock_futex.go:51      0x40777b                8bbb08010000            MOVL 0x108(BX), DI                      
  lock_futex.go:51      0x407781                85ff                    TESTL DI, DI                            
  lock_futex.go:51      0x407783                0f8cf1050000            JL 0x407d7a                             
  lock_futex.go:52      0x407789                ffc7                    INCL DI                                 
  lock_futex.go:52      0x40778b                89bb08010000            MOVL DI, 0x108(BX)                      
  lock_futex.go:55      0x407791                bb01000000              MOVL $0x1, BX                           
  lock_futex.go:55      0x407796                875858                  XCHGL BX, 0x58(AX)                      
  lock_futex.go:56      0x407799                85db                    TESTL BX, BX                            
  lock_futex.go:56      0x40779b                0f85ad050000            JNE 0x407d4e                            
  chan.go:462           0x4077a1                83781c00                CMPL $0x0, 0x1c(AX)                     
  chan.go:462           0x4077a5                0f8495000000            JE 0x407840                             
  chan.go:462           0x4077ab                48833800                CMPQ $0x0, 0(AX)                        
  chan.go:462           0x4077af                0f858b000000            JNE 0x407840                            
  lock_futex.go:120     0x4077b5                31c9                    XORL CX, CX                             
  lock_futex.go:120     0x4077b7                874858                  XCHGL CX, 0x58(AX)                      
  lock_futex.go:122     0x4077ba                83f901                  CMPL $0x1, CX                           
  lock_futex.go:122     0x4077bd                7520                    JNE 0x4077df                            
  lock_futex.go:123     0x4077bf                64488b1425f8ffffff      MOVQ FS:0xfffffff8, DX                  
  lock_futex.go:125     0x4077c8                488b5a30                MOVQ 0x30(DX), BX                       
  lock_futex.go:125     0x4077cc                8bbb08010000            MOVL 0x108(BX), DI                      
  lock_futex.go:125     0x4077d2                85ff                    TESTL DI, DI                            
  lock_futex.go:125     0x4077d4                7e09                    JLE 0x4077df                            
  lock_futex.go:125     0x4077d6                80bab100000000          CMPB $0x0, 0xb1(DX)                     
  lock_futex.go:125     0x4077dd                7456                    JE 0x407835                             
  lock_futex.go:132     0x4077df                48893424                MOVQ SI, 0(SP)                          
  lock_futex.go:132     0x4077e3                894c2408                MOVL CX, 0x8(SP)                        
  lock_futex.go:132     0x4077e7                e8e4560000              CALL runtime.unlockSlow(SB)

Notice the three loads from FS:-8, two of which are part of the repeated sequence

MOVQ FS:0xfffffff8, <R1>                  
MOVQ 0x30(<R1>), <R2>                       
MOVL 0x108(<R2>), <R3>

IIUC it's true that the value of <R1> could change if the G is migrated to a different thread, but the address of the G itself (<R2>) should not change even if this happens.

The text was updated successfully, but these errors were encountered:

randall77 · 2019-01-11T05:53:32Z

I originally made the implementation of getg (the GetG SSA op) not take a memory, so they would all get CSE'd. It works for most code, but there's code in the runtime where it doesn't work. That code does:

    g1 := getg()
    setg(...)
    g2 := getg()

And in that situation, you can't CSE them.
At that point I dropped the idea and gave GetG a memory argument, which prevents CSEing across calls. Unfortunately, it also prevents most other CSEing also.

I think it would be safe to allow CSEing of getg anywhere but the runtime. It's just an extra complication that didn't seem worth it at the time (2 getg ops, one with a memory arg and one without).

Do you have a benchmark that demonstrates any slowdown?

CAFxX · 2019-01-11T06:03:10Z

I think it would be safe to allow CSEing of getg anywhere but the runtime.

Isn't the runtime the only place where you can call getg() from anyway? If so, something more granular may be required (like the runtime write barrier annotations? or even better detecting if the current function recursively calls setg()?).

edit: ah, maybe you were referring to getg() called by procPin/procUnpin exported to sync

Do you have a benchmark that demonstrates any slowdown?

Not really, and I am having a hard time imagining how to come up with one. My main motivator for filing this issue was to counteract the text size/icache pollution increase due to inlining of those fast paths.

randall77 · 2019-01-11T07:36:32Z

Isn't the runtime the only place where you can call getg() from anyway?

Indeed, it looks like that is the case at the moment.
Of course inlining makes that a bit fuzzy; a getg could be inlined into non-runtime code. Not that I see any obvious examples where that might happen in practice.

cherrymui · 2019-01-11T16:53:56Z

The new internal ABI could potentially make it possible to reserve a G register on AMD64. Then getg generates no instruction, like on non-x86 architectures.

cherrymui · 2021-09-08T18:00:10Z

With the new ABI we now have a reserved G register (R14). I think we can close this. Feel free to reopen if there is more to address.

mvdan added the Performance label Jan 11, 2019

julieqiu added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label May 28, 2019

cherrymui closed this as completed Sep 8, 2021

golang locked and limited conversation to collaborators Sep 8, 2022

gopherbot added the FrozenDueToAge label Sep 8, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: optimize repeated getg() calls #29671

cmd/compile: optimize repeated getg() calls #29671

CAFxX commented Jan 11, 2019 •

edited

randall77 commented Jan 11, 2019

CAFxX commented Jan 11, 2019 •

edited

randall77 commented Jan 11, 2019

cherrymui commented Jan 11, 2019

cherrymui commented Sep 8, 2021

cmd/compile: optimize repeated getg() calls #29671

cmd/compile: optimize repeated getg() calls #29671

Comments

CAFxX commented Jan 11, 2019 • edited

randall77 commented Jan 11, 2019

CAFxX commented Jan 11, 2019 • edited

randall77 commented Jan 11, 2019

cherrymui commented Jan 11, 2019

cherrymui commented Sep 8, 2021

CAFxX commented Jan 11, 2019 •

edited

CAFxX commented Jan 11, 2019 •

edited