Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: performance of go wasm is very poor #65440

Open
liutao-liu opened this issue Feb 2, 2024 · 13 comments
Open

cmd/compile: performance of go wasm is very poor #65440

liutao-liu opened this issue Feb 2, 2024 · 13 comments
Labels
arch-wasm WebAssembly issues compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@liutao-liu
Copy link

liutao-liu commented Feb 2, 2024

Go version

go version go1.21.6 linux/arm64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='arm64'
GOBIN=''
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='linux'
GOINSECURE=''
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPRIVATE=''
GOSUMDB='off'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOVCS=''
GOVERSION='go1.21.6'
GCCGO='gccgo'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build3484327849=/tmp/go-build -gno-record-gcc-switches'

What did you do?

As shown in the following example,test.go is compiled into wasm:

test.go :

//go:noinline
func testFor() int {
	sum := 0
	for i := 0; i < 200; i++ {
		for j := 0; j < 10000000; j++ {
			sum += j
		}
	}
	return sum
}

func main() {
	startTime := time.Now()

	res := testFor()
	elapsed := time.Since(startTime)
	fmt.Println("Done. Cost: ", elapsed, res)
}

go to wasm compile command:
GOOS=wasip1 GOARCH=wasm go build -o test.wasm test.go

What did you see happen?

As you can see in the wat code, the for loop is expressed using the br_table operation.

wat code:

  (func $main.testFor (type 0) (param i32) (result i32)
    (local i32 i64 i64 i64 i64)
    global.get 0
    local.set 1
    loop  ;; label = @1
      block  ;; label = @2
        block  ;; label = @3
          block  ;; label = @4
            block  ;; label = @5
              block  ;; label = @6
                block  ;; label = @7
                  block  ;; label = @8
                    block  ;; label = @9
                      local.get 0
                      br_table 0 (;@9;) 1 (;@8;) 2 (;@7;) 3 (;@6;) 4 (;@5;) 5 (;@4;) 6 (;@3;) 7 (;@2;)
                    end
                    i64.const 0
                    local.set 2
                    i64.const 0
                    local.set 3
                    i32.const 2
                    local.set 0
                    br 7 (;@1;)
                  end
                  local.get 2
                  i64.const 1
                  i64.add
                  local.set 2
                end
                local.get 2
                i64.const 200
                i64.lt_s
                i32.eqz
                if  ;; label = @7
                  i32.const 4
                  local.set 0
                  br 6 (;@1;)
                end
              end
              i64.const 0
              local.set 4
              i32.const 6
              local.set 0
              br 4 (;@1;)
            end
            local.get 1
            i64.extend_i32_u
            i64.const 8
            i64.add
            i32.wrap_i64
            local.get 3
            i64.store
            local.get 1
            i32.const 8
            i32.add
            local.tee 1
            global.set 0
            i32.const 0
            return
          end
          local.get 4
          i64.const 1
          i64.add
          local.set 5
          local.get 3
          local.get 4
          i64.add
          local.set 3
          local.get 5
          local.set 4
        end
        local.get 4
        i64.const 10000000
        i64.lt_s
        if  ;; label = @3
          i32.const 5
          local.set 0
          br 2 (;@1;)
        end
        i32.const 1
        local.set 0
        br 1 (;@1;)
      end
    end
    unreachable)

What did you expect to see?

When the aot compiler of wasm runtime performs backend optimization, it is difficult to identify the br_table as a for loop. So, during the backend optimization, this for loop was not optimized.

I tested several of the most popular wasm runtimes, such as wasmtime, wamr, and wasmer, and I found that the performance of go wasm after aot compilation was very poor, and the runtime performance was only 20% of go native in the best case.
Why use so many br_table operation instead of loop operation? Will the performance of go wasm be optimized in the future?

Also, I found that the wat code of the go runtime functions uses br_table a lot,the craziest function has 417 hops in the br_table.

image

@mauri870 mauri870 changed the title Performance of go wasm is very poor cmd/go: performance of go wasm is very poor Feb 2, 2024
@mauri870 mauri870 added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. arch-wasm WebAssembly issues labels Feb 2, 2024
@mauri870
Copy link
Member

mauri870 commented Feb 2, 2024

cc @golang/wasm @golang/compiler

@dmitshur dmitshur changed the title cmd/go: performance of go wasm is very poor cmd/compile: performance of go wasm is very poor Feb 2, 2024
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Feb 2, 2024
@liutao-liu
Copy link
Author

liutao-liu commented Feb 2, 2024

I manually modified the wat code to express the inner for loop using the loop operation, and the performance after aot compiled increased by 500%. (Added extra tests to make sure the function is working correctly.)

(func $main.testFor (type 0) (param i32) (result i32)
    (local i32 i64 i64 i64 i64)
    global.get 0
    local.set 1
    loop  ;; label = @1
      block  ;; label = @2
        block  ;; label = @3
          block  ;; label = @4
            block  ;; label = @5
              block  ;; label = @6
                block  ;; label = @7
                  block  ;; label = @8
                    block  ;; label = @9
                      local.get 0
                      br_table 0 (;@9;) 1 (;@8;) 2 (;@7;) 3 (;@6;) 4 (;@5;) 5 (;@4;) 6 (;@3;) 7 (;@2;)
                    end
                    i64.const 0
                    local.set 2 ;; i = 0
                    i64.const 0 
                    local.set 3 ;; sum = 0
                    i32.const 2 
                    local.set 0 
                    br 7 (;@1;)
                  end
                  local.get 2
                  i64.const 1
                  i64.add      ;; i++
                  local.set 2
                end
                local.get 2   
                i64.const 200
                i64.lt_s  
                i32.eqz
                if  ;; label = @7  
                  i32.const 4
                  local.set 0
                  br 6 (;@1;)  
                end
              end
              i64.const 0
              local.set 4 
              i32.const 6
              local.set 0
              br 4 (;@1;) 
            end
            local.get 1    
            i64.extend_i32_u
            i64.const 8
            i64.add
            i32.wrap_i64
            local.get 3  
            i64.store    
            local.get 1
            i32.const 8
            i32.add		
            local.tee 1
            global.set 0
            i32.const 0
            return
          end
            loop $myloop ;; inner for loop start
            local.get 4  ;; get j
            i64.const 1
            i64.add     
            local.set 5 
            local.get 3 
            local.get 4 
            i64.add	 
            local.set 3 
            local.get 5
            local.set 4 
            local.get 4   
            i64.const 10000000
            i64.lt_s     	
            br_if $myloop ;; inner for loop end
          end
        end
        local.get 4   ;; j
        i64.const 10000000
        i64.lt_s     		;; j < 10000000
        if  ;; label = @3
          i32.const 5
          local.set 0
          br 2 (;@1;)  
        end
        i32.const 1
        local.set 0
        br 1 (;@1;)
      end
    end
    unreachable)

@evanphx
Copy link
Contributor

evanphx commented Feb 2, 2024

The reason for the use of br_table is to support goroutines in WebAssembly's single threaded environment. The generated code is able to unwind and rewind the WebAssembly stack when goroutines are switched, and so each block of code is accessible via br_table.

@liutao-liu
Copy link
Author

liutao-liu commented Feb 2, 2024

The reason for the use of br_table is to support goroutines in WebAssembly's single threaded environment. The generated code is able to unwind and rewind the WebAssembly stack when goroutines are switched, and so each block of code is accessible via br_table.

Thank you for your reply.

Does that mean that the current solution of using br_table is a temporary solution to WebAssembly's single-threaded environment? Currently, the performance of go wasm is poor, is there any optimization plan for the use of br_table?

@neelance
Copy link
Member

neelance commented Feb 2, 2024

See huge discussion here on why WebAssembly is designed in such a way: WebAssembly/design#796 (comment)

So far this design flaw (my opinion) has not been resolved.

@cherrymui
Copy link
Member

It might be possible to compile the innermost loop with the loop instruction if it is not preemptible (contains no calls). On the other hand, nonpreemptible loop is not a good thing in general. Maybe it is fine on Wasm as there is no preemption anyway?

@evanphx
Copy link
Contributor

evanphx commented Feb 2, 2024 via email

@Jorropo
Copy link
Member

Jorropo commented Feb 4, 2024

Even if we had threads I don't think this would allow us to rely on the host for scheduling, we need GC and resizable stacks which the host is unlikely to implement like we want it to. (wasm gc still has no support for inner pointers AFAIK)

@Jorropo
Copy link
Member

Jorropo commented Feb 4, 2024

After looking at WebAssembly/design#796 (really good link btw) I think we could emulate goto using the newly introduced Tail Call extension, each basic block would be it's own function and we would use tail calls to jump between them. It might still be interesting to use
I don't know details of how wasm JITs are implemented, if this compiles to efficient amd64 code this should still be pretty fast, code locality is likely to be bad however.

Would need to be benchmarked before trying out an implementation.
Support is terrible right now, only Firefox and Chrome ship it by default, most other applications support it with experimental flags, wasmer and webkit don't support it at all.

Edit turns out this was already proposed in WebAssembly/design#796 (comment):

If we're still stuck there, and tail calls are moving forward, maybe it'd be worth asking language compilers to still translate to gotos, but as a final step before WebAssembly output, split up the "label blocks" into functions, and convert the gotos into tail calls.

@cherrymui
Copy link
Member

I'm not sure we can use tail calls to emulate gotos. With goto (or branch table), it has the same set of local variables before and after the jump. With tail call, the local variables are all dead at the call. So we'd have to promote the local variables to global variables or something alike, which is certainly way more expensive, or pass them all as parameters, which complicates the code generation (and with unclear performance overhead).

@Jorropo
Copy link
Member

Jorropo commented Feb 5, 2024

@cherrymui I thought we didn't / rarely used local variables.
Looking at the code it seems to me we use linear memory as the "stack" and rarely use the wasm stack. As long as we pass the "SP" value through the tail call it should be fine.

@mknyszek
Copy link
Contributor

mknyszek commented Feb 5, 2024

I suspect something like https://github.com/WebAssembly/stack-switching would make a lot of these problems go away. In this model, the Go runtime's scheduler would be the parent fiber of all other fibers, and all other fibers would be goroutines. Calling into the scheduler or onto the systemstack then just involves switching to the parent fiber.

We'll still probably need to use linear memory as the data stack, but at least we'll have a much better way to change what is executing.

I do not know what the status of Wasm stack switching is at this point in time.

@cherrymui
Copy link
Member

@Jorropo we use local variables for "registers". The "shadow" stack in linear memory is used for non-registerized local variables like all other architectures. If we don't use "local variables", that essentially behaves like a non-registerized build, which would probably not be very performant.

I agree with @mknyszek that with the stack switching support, we probably don't need to use the top level loop and branch tables.

@mknyszek mknyszek added this to the Backlog milestone Feb 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arch-wasm WebAssembly issues compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
Development

No branches or pull requests

8 participants