Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os: increase os.ReadDir() dirent buffer from 8KiB to 32KiB #64597

Open
xutao323 opened this issue Dec 7, 2023 · 2 comments
Open

os: increase os.ReadDir() dirent buffer from 8KiB to 32KiB #64597

xutao323 opened this issue Dec 7, 2023 · 2 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@xutao323
Copy link

xutao323 commented Dec 7, 2023

Proposal Details

Propose to increase the dirent buffer blockSize from 8KiB to 32KiB in src/os/dir_unix.go for os.ReadDir() to align with glibc [1] [2], as they both wrap getdents64() syscall in Linux systems.

Background

Machine: 4 core 8 GiB VM via UTM on MacBook Pro, Apple M2 Pro arm64 CPU, native virtualization
OS: Fedora 38 (tried CentOS 9, Ubuntu 22), Linux 6.5.10 (tried V1 patch in [3]), 4 KiB page size (tried 16 KiB)
Go: go version go1.20.10 linux/arm64

I was trying out continuous profiling tools (pyroscope, inspektor-gadget) and found os.ReadDir("/proc") sometimes dominate the CPU time while CPU util% is low like <=5%. Here is the flamegraph.

Observation

os.ReadDir() is a wrapper of getdents64() like glibc, confirm it by tracing (2nd column is latency):

# perf trace -e getdents64 -p `pidof gadgettracermanager` sleep 3
     0.000 ( 0.335 ms): gadgettracerma/8544 getdents64(fd: 28, dirent: 0x40031a4000, count: 8192) = 8184
     0.397 ( 0.093 ms): gadgettracerma/8544 getdents64(fd: 28, dirent: 0x40031a4000, count: 8192) = 688
     0.665 ( 0.187 ms): gadgettracerma/8544 getdents64(fd: 28, dirent: 0x40031a4000, count: 8192) = 0
  1000.545 ( 1.035 ms): gadgettracerma/8544 getdents64(fd: 28, dirent: 0x4000af2000, count: 8192) = 8184
  1001.785 ( 0.258 ms): gadgettracerma/8544 getdents64(fd: 28, dirent: 0x4000af2000, count: 8192) = 688
  1002.070 ( 0.058 ms): gadgettracerma/8544 getdents64(fd: 28, dirent: 0x4000af2000, count: 8192) = 0
  2000.946 ( 1.208 ms): gadgettracerma/8544 getdents64(fd: 28, dirent: 0x40031a4000, count: 8192) = 8184
  2002.478 ( 0.199 ms): gadgettracerma/8544 getdents64(fd: 28, dirent: 0x40031a4000, count: 8192) = 688
  2002.820 ( 0.087 ms): gadgettracerma/8544 getdents64(fd: 28, dirent: 0x40031a4000, count: 8192) = 0

The monitoring daemon calls os.ReadDir("/proc") every 1s to get all process IDs, and latency spikes (1ms to 2ms) can be observed. The dirent buffer size is 8KiB, one os.ReadDir() results in 3 getdents64() to get all the PIDs (roughly 300).

Trace ps to see how glibc behaves:

# perf trace -e getdents64 ps -eo pid > /dev/null
     0.000 ( 0.206 ms): ps/74672 getdents64(fd: 5, dirent: 0xaaaad37a9080, count: 32768) = 9552
     3.328 ( 0.023 ms): ps/74672 getdents64(fd: 5, dirent: 0xaaaad37a9080, count: 32768) = 0

glibc uses 32KiB dirent buffer [1] [2], so result in 2 getdents64(), less syscalls. Latency is lower probably because of directory cache (dcache) hit warmed up by the monitoring daemon every 1s, see the reproduce program traces below.

Reproduce

This os.ReadDir() hotspot can be reproduced without the monitoring system:

# cat readdir.go
package main

import (
    "fmt"
    "os"
    "strconv"
    "time"
)

func repeat_routine(dir string, loop int) {
    var output []int
    for i := 0; i < loop; i++ {
        dirEntries, err := os.ReadDir(dir)
        check(err)
        output = nil
        for _, entry := range dirEntries {
            pid, err := strconv.Atoi(entry.Name())
            if err != nil {
                // entry is not a process directory. Ignore.
                continue
            }
            output = append(output, pid)
        }
    }
    fmt.Println(output)
}

func check(e error) {
    if e != nil {
        panic(e)
    }
}

func main() {
    if len(os.Args) < 4 {
        fmt.Println("Need input: dir loop runtime");
        os.Exit(-1)
    }

    dir := os.Args[1]
    count, err := strconv.Atoi(os.Args[2])
    check(err)
    runtime, err := strconv.Atoi(os.Args[3])
    check(err)

    for i := 0; i < runtime; i++ {
        repeat_routine(dir, count)
        time.Sleep(time.Second)
    }
}

Run above reproduce program and monitoring daemons at the same time:

# perf trace -e getdents64 ./readdir_go /proc 1 3 > /dev/null
     0.000 ( 0.272 ms): readdir_go/65676 getdents64(fd: 3, dirent: 0x40000b2000, count: 8192) = 8184
     0.314 ( 0.054 ms): readdir_go/65676 getdents64(fd: 3, dirent: 0x40000b2000, count: 8192) = 688
     0.384 ( 0.016 ms): readdir_go/65676 getdents64(fd: 3, dirent: 0x40000b2000, count: 8192) = 0
  1000.748 ( 0.368 ms): readdir_go/65676 getdents64(fd: 3, dirent: 0x4000108000, count: 8192) = 8184
  1001.295 ( 0.233 ms): readdir_go/65676 getdents64(fd: 3, dirent: 0x4000108000, count: 8192) = 688
  1001.591 ( 0.044 ms): readdir_go/65676 getdents64(fd: 3, dirent: 0x4000108000, count: 8192) = 0
  2003.251 ( 0.329 ms): readdir_go/65676 getdents64(fd: 3, dirent: 0x40000b2000, count: 8192) = 8184
  2003.802 ( 0.117 ms): readdir_go/65676 getdents64(fd: 3, dirent: 0x40000b2000, count: 8192) = 688
  2004.061 ( 0.070 ms): readdir_go/65676 getdents64(fd: 3, dirent: 0x40000b2000, count: 8192) = 0

Stop the monitoring daemons and run the program alone (8KiB buffer is sufficient when less threads):

# perf trace -e getdents64 ./readdir_go /proc 1 3 > /dev/null
     0.000 ( 0.271 ms): readdir_go/66797 getdents64(fd: 3, dirent: 0x4000120000, count: 8192) = 7920
     0.319 ( 0.018 ms): readdir_go/66797 getdents64(fd: 3, dirent: 0x4000120000, count: 8192) = 0
  1005.351 ( 1.412 ms): readdir_go/66797 getdents64(fd: 3, dirent: 0x40000a4000, count: 8192) = 7920
  1007.017 ( 0.200 ms): readdir_go/66797 getdents64(fd: 3, dirent: 0x40000a4000, count: 8192) = 0
  2012.598 ( 1.014 ms): readdir_go/66797 getdents64(fd: 3, dirent: 0x4000120000, count: 8192) = 7920
  2013.985 ( 0.126 ms): readdir_go/66797 getdents64(fd: 3, dirent: 0x4000120000, count: 8192) = 0

Bottom-up analysis

System level

Some search results [3] [4] show Linux procfs has scalability issue in proc_pid_readdir() when many threads. Here it's a 4 core 8 GiB arm64 VM with 300 threads, and system is idle without any workloads yet. I did some verifications:

  1. Guest VM is running 4KiB page size kernel while host M2 processor's native page size is 16KiB. Memory traversal may be less efficient in this manner. I tried 16KiB page guest kernel, but not helpful.
  2. I tried VMs in the cloud with same 4 core 8 GiB configuration and same software, os.ReadDir("/proc")'s max latency is lower than desktop VM: AMD Epyc 7T83 VM is < 0.28ms, ARM Neoverse N2 VM is < 0.36ms.
  3. I applied the Linux kernel patch V1 in [3], it's helpful to eliminate radix-tree hotspot, but remaining part __d_lookup() becomes most significant in proc_pid_readdir(). Here is the flamegraph for it.
  4. torvalds/linux@3ba4bce mentioned proc_pid_readdir() can take more than 50ms and added cond_resched() in it.

So os.ReadDir("/proc") can be slow in Linux systems, sometimes because of the processor, sometimes system loads.

Standard library

I found #24015 increased the blockSize to 8KiB to fix missing CIFS file bug. It's somewhat similar to the comment in glibc [2], so increase to the same 32KiB lower limit as glibc may be good for both compatibility and performance described above.

Checked solaris/illumos libc as a reference, filesystem independent dirent buffer lower limit is 8KiB, and upper limit is 64KiB.

Application level

Monitoring software checks PIDs often, some may change to use sysfs instead of procfs, like the cgroup.procs file for container process where linux sub-system stores the PIDs in one memory place without frequent memory traversal in /proc.

Suggestion

Align with glibc implementation to allocate 32KiB buffer for dirent buffer for better performance and compatiblity.

diff --git a/src/os/dir_unix.go b/src/os/dir_unix.go
index 266a78acaf..6e8c821c21 100644
--- a/src/os/dir_unix.go
+++ b/src/os/dir_unix.go
@@ -23,7 +23,7 @@ type dirInfo struct {
 
 const (
        // More than 5760 to work around https://golang.org/issue/24015.
-       blockSize = 8192
+       blockSize = 32768
 )

[1] https://sourceware.org/git/?p=glibc.git;a=commit;h=4b962c9e859de23b461d61f860dbd3f21311e83a
[2] https://inbox.sourceware.org/libc-alpha/87lfml9gl7.fsf@mid.deneb.enyo.de/
[3] https://lore.kernel.org/linux-fsdevel/20220614180949.102914-1-bfoster@redhat.com/
[4] https://www.usenix.org/system/files/srecon22emea_slides_liku.pdf

@gopherbot gopherbot added this to the Proposal milestone Dec 7, 2023
@mauri870
Copy link
Member

mauri870 commented Dec 7, 2023

Thanks for the thorough investigation on the matter and the nice write-up of the findings!

I don't see any issues with raising the buffer to 32K, specially since glibc already did it.

Taking this out of the proposal process as it doesn't involve API changes.

@mauri870 mauri870 removed this from the Proposal milestone Dec 7, 2023
@mauri870 mauri870 removed the Proposal label Dec 7, 2023
@mauri870 mauri870 changed the title proposal: os: increase os.ReadDir() dirent buffer from 8KiB to 32KiB os: increase os.ReadDir() dirent buffer from 8KiB to 32KiB Dec 7, 2023
@cagedmantis cagedmantis added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 8, 2023
@cagedmantis cagedmantis added this to the Backlog milestone Dec 8, 2023
@cagedmantis
Copy link
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
None yet
Development

No branches or pull requests

5 participants