proposal: time: POSIX style TZ strings on Unix and timezone handling optimisations #64659

Open
unixdj opened this issue Dec 11, 2023 · 1 comment

Dear Gophers,

This proposal is about local timezone initialisation on Unix and other improvements to timezone handling. I already implemented most of the proposed features, but wanted to discuss it before submitting patches.

Related proposals:

CC: @rsc

References:

  • tzfile(5): compiled Zoneinfo file format
  • newtzset(3): TZ environment variable, POSIX style TZ strings
  • initLocal in src/time/zoneinfo_unix.go: local timezone initialisation
  • tzset in src/time/zoneinfo.go: POSIX style TZ string parser
  • lookup in src/time/zoneinfo.go: UTC offset lookup

tzcode is the code part of Zoneinfo, dealing with Zoneinfo files and timezone conversions. It's used in glibc and other Unix libc implementations.

A compiled Zoneinfo file contains zero or more static transitions and a TZ string that applies after the last static transition. The TZ string describes either a static zone or a pair of rules describing yearly transition times and target zones.

Introduction: TZ environment variable on Unix (libc/tzcode)

On Unix, the time package reads the local timezone information from a Zoneinfo file according to the value of the TZ environment variable: if it's unset, from /etc/localtime; if it's <file> or :<file>, from <file>. In case of any failure, UTC is used.

libc behaves similarly, but if the named file cannot be read and the value does not start with ":", the value is parsed as a POSIX style TZ string. E.g., TZ=JST-9 date will display the date in a timezone named "JST" at UTC+9, and TZ=CET-1CEST,M3.5.0,M10.5.0/3 date will display it in CET (UTC+1) or CEST (UTC+2, DST), the latter between the last Sunday of March 02:00 CET and the last Sunday of October 03:00 CEST.

POSIX style TZ strings

It would be nice to add support for such TZ settings to Go, to bring it in line with the rest of the system. The time package already has a parser for such strings, as they are used in compiled Zoneinfo files for timestamps after the last static transition.

The implementation requires a new error type for unknown timezones to be returned from loadLocation, so that initLocal can check the error and call tzset only when the zone is not found, and not on other errors.
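The control flow described above can be sketched as follows. This is only an illustration under stated assumptions: errZoneNotFound, loadZone, looksLikePOSIX and resolveLocal are hypothetical names, the file set is faked with a map, and the real parser (tzset) is replaced by a placeholder.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errZoneNotFound is a hypothetical sentinel error; the time package does not
// export such an error today.
var errZoneNotFound = errors.New("time zone not found")

// zoneFiles stands in for the Zoneinfo files present on the system.
var zoneFiles = map[string]bool{"/etc/localtime": true, "Europe/Berlin": true}

// loadZone is a stand-in for loadLocation. It distinguishes "not found" from
// other failures so that the caller can decide whether to try a TZ string.
func loadZone(name string) error {
	if name == "unreadable" {
		return errors.New("permission denied")
	}
	if !zoneFiles[name] {
		return fmt.Errorf("%w: %s", errZoneNotFound, name)
	}
	return nil
}

// looksLikePOSIX is a placeholder for the real TZ string parser.
func looksLikePOSIX(s string) bool { return s == "JST-9" }

// resolveLocal reports where the local zone would come from.
func resolveLocal(tz string, isSet bool) string {
	if !isSet {
		tz = "/etc/localtime"
	}
	name := strings.TrimPrefix(tz, ":")
	err := loadZone(name)
	switch {
	case err == nil:
		return "file " + name
	case isSet && errors.Is(err, errZoneNotFound) &&
		!strings.HasPrefix(tz, ":") && looksLikePOSIX(tz):
		// Only a missing zone triggers the TZ string fallback.
		return "posix " + tz
	default:
		return "UTC"
	}
}

func main() {
	for _, tz := range []string{"Europe/Berlin", "JST-9", ":JST-9", "unreadable"} {
		fmt.Printf("TZ=%q -> %s\n", tz, resolveLocal(tz, true))
	}
}
```

Wrapping the sentinel with %w keeps the zone name in the message while still letting errors.Is recognise the "not found" case.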

Questions:

  • The parser in Go tzset is strict, failing on any syntax error. The tzcode parser is best-effort, accepting as many fields as it can parse and discarding the rest. Should the Go parser be changed accordingly?

    • Argument against: Garbage in, garbage out.
    • Argument for: Compatibility with the rest of the system. Also, currently LoadLocationFromTZData fails on any error except TZ string errors, even when it calls tzset (to populate the cache).
  • Is it relevant to other OSes?

  • Should time.LoadLocation be changed to accept POSIX style TZ strings, in addition to timezone names? If yes, only on Unix or on other OSes as well?

    • Or should another API function be added?

FWIW, there's a comment near LoadLocation:

// NOTE(rsc): Eventually we will need to accept the POSIX TZ environment
// syntax too, but I don't feel like implementing it today.

TZ string: limits

tzcode allows absolute UTC offsets less than 25 hours (up to 24:59:59), and time in rules less than 168 hours (7 days). The former is a POSIX requirement, the latter a Zoneinfo extension. Go currently allows <168 hours for both. I propose limiting allowed UTC offsets to match those of tzcode.
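To make the two limits concrete, here is a minimal sketch of an [+-]hh[:mm[:ss]] parser with a configurable hour limit. It is not the parser in src/time/zoneinfo.go; parseHMS, parseField and the two constants are hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	maxOffsetHours   = 24  // UTC offsets: less than 25 hours
	maxRuleTimeHours = 167 // times in rules: less than 168 hours
)

// parseField parses up to three decimal digits and checks the upper bound.
func parseField(p string, max int) (int, bool) {
	if p == "" || len(p) > 3 {
		return 0, false
	}
	n := 0
	for i := 0; i < len(p); i++ {
		if p[i] < '0' || p[i] > '9' {
			return 0, false
		}
		n = n*10 + int(p[i]-'0')
	}
	return n, n <= max
}

// parseHMS parses [+-]hh[:mm[:ss]] into seconds, rejecting hours above maxHours.
func parseHMS(s string, maxHours int) (int, bool) {
	sign := 1
	if strings.HasPrefix(s, "-") {
		sign, s = -1, s[1:]
	} else if strings.HasPrefix(s, "+") {
		s = s[1:]
	}
	parts := strings.Split(s, ":")
	if len(parts) > 3 {
		return 0, false
	}
	limits := [3]int{maxHours, 59, 59}
	units := [3]int{3600, 60, 1}
	total := 0
	for i, p := range parts {
		n, ok := parseField(p, limits[i])
		if !ok {
			return 0, false
		}
		total += n * units[i]
	}
	return sign * total, true
}

func main() {
	fmt.Println(parseHMS("24:59:59", maxOffsetHours)) // accepted as an offset
	fmt.Println(parseHMS("25", maxOffsetHours))       // rejected as an offset
	fmt.Println(parseHMS("25", maxRuleTimeHours))     // accepted as a rule time
}
```

Note that the value returned here is the literal number in the string; in a POSIX TZ string the offset sign is inverted relative to the usual convention, so "JST-9" means UTC+9.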

Optimisation: rules

Rationale: The current caching approach is based on the assumption that most timezone lookups will be for timestamps around the present. In all but two Zoneinfo timezones the TZ string applies in the present (late 2023). Most suggestions here are either pure optimisation or moving calculations from lookup time to be done once at load time.

TZ string parsing

After loading a zoneinfo file, the TZ string is kept in the Location struct and is parsed on every non-cached lookup after the last static transition, whether it describes rules or a static zone. Currently, TZ strings in over 2/3 of all unique Zoneinfo locations, including the two most populated ones ("Asia/Shanghai" and "Asia/Kolkata"), specify static zones.

My proposal is:

  • Parse the TZ string at load time.

  • If it describes transition rules, store []rule in the Location structure. Add a *zone pointing to the transition target to the rule structure.

  • If it specifies a static zone, discard it. The last static transition specifies the same zone.

    • When parsing the TZ environment variable, use it to create a fixed location.
  • Detect Zoneinfo version 3 permanent DST zones and treat them as static zones.

Day of week calculation

The only rule kind used in practice is the "M" rule, containing month, week and day of week of the transition. These are used to compute the day of year.

  • Simplify the calculation of day of year by treating week 5 as starting 7 days before the next month instead of looping.

  • Calculate day of year first and add it to the day of week of 1 January in that year instead of using Zeller's congruence.

  • Fix handling of negative years and use simplified Tomohiko Sakamoto's algorithm to calculate the day of week of 1 Jan. Better yet, use absolute day as shown below.
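As an illustration of the last point, Sakamoto's formula collapses to a single expression when the date is always 1 January. The sketch below is my reading of "simplified", not code from the patch; jan1Weekday and agreesWithTime are hypothetical names, and the function is only valid for years >= 1 because Go's integer division truncates toward zero (negative years would need floored division).

```go
package main

import (
	"fmt"
	"time"
)

// jan1Weekday returns the weekday of 1 January of the given year in the
// proleptic Gregorian calendar, with 0 = Sunday. Valid for year >= 1 only.
func jan1Weekday(year int) int {
	y := year - 1 // Sakamoto's method uses the previous year for Jan and Feb
	return (y + y/4 - y/100 + y/400 + 1) % 7
}

// agreesWithTime cross-checks the formula against the time package.
func agreesWithTime(from, to int) bool {
	for year := from; year <= to; year++ {
		want := time.Date(year, time.January, 1, 0, 0, 0, 0, time.UTC).Weekday()
		if time.Weekday(jan1Weekday(year)) != want {
			return false
		}
	}
	return true
}

func main() {
	for _, year := range []int{1970, 2004, 2023} {
		fmt.Println(year, time.Weekday(jan1Weekday(year)))
	}
	fmt.Println("matches time package:", agreesWithTime(1, 3000))
}
```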

Simplifying the rule structure

  • Remove month and week. At load time, convert month and week to day of year. Add a separate day of week field.

  • Remove rule kind. Use a sentinel day of week value (-1) or a flag for other rule kinds.

  • Add a flag to indicate whether a day should be added during leap years; this needs to be explicit to distinguish between "Sunday in week 4 of February" and "last Sunday of February", and between "J" and DOY rule kinds for day>=59.

  • Convert the time of day to UTC, to avoid subtracting the offset each time.

  • Reorder the rules if DST ends earlier in a year than it begins.
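A rough sketch of the month/week conversion and the leap day flag follows. The type and function names (simpleRule, fromMonthWeek, transitionDate) are hypothetical, time of day and UTC conversion are left out, and the weekday is taken from the time package rather than from the absolute-day arithmetic the proposal uses.

```go
package main

import (
	"fmt"
	"time"
)

// simpleRule is a hypothetical simplified "M" rule.
type simpleRule struct {
	day        int  // zero-based year day of the earliest possible date, non-leap year
	dow        int  // target day of week, 0 = Sunday
	addLeapDay bool // add one day in leap years
}

// daysBefore[m] is the number of days before month m+1 in a non-leap year.
var daysBefore = [13]int{0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334, 365}

// fromMonthWeek converts month (1-12), week (1-5) and day of week to a simpleRule.
// Week 5 ("last") is treated as starting 7 days before the next month.
func fromMonthWeek(month, week, dow int) simpleRule {
	if week == 5 {
		// The last week of February moves with the leap day, hence month >= 2.
		return simpleRule{day: daysBefore[month] - 7, dow: dow, addLeapDay: month >= 2}
	}
	return simpleRule{day: daysBefore[month-1] + (week-1)*7, dow: dow, addLeapDay: month > 2}
}

func isLeap(year int) bool { return year%4 == 0 && (year%100 != 0 || year%400 != 0 == false || year%400 == 0) }

// transitionDate returns the date on which the rule fires in the given year.
func transitionDate(year int, r simpleRule) time.Time {
	d := r.day
	if r.addLeapDay && isLeap(year) {
		d++
	}
	t := time.Date(year, time.January, 1+d, 0, 0, 0, 0, time.UTC)
	// Move forward 0 to 6 days to reach the target day of week.
	return t.AddDate(0, 0, (r.dow-int(t.Weekday())+7)%7)
}

func main() {
	fmt.Println(transitionDate(2023, fromMonthWeek(3, 5, 0)).Format("2006-01-02"))
	fmt.Println(transitionDate(2004, fromMonthWeek(2, 4, 0)).Format("2006-01-02"))
	fmt.Println(transitionDate(2004, fromMonthWeek(2, 5, 0)).Format("2006-01-02"))
}
```

February 2004 has five Sundays, which is the case where "Sunday in week 4" and "last Sunday" differ and the explicit flag matters.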

Rule normalisation

  • Normalise rules so that time of day is always non-negative and less than secondsPerDay, and day is always non-negative and, if possible, less than 365.
    • The latter is not possible with DOY rules whose day, after adjustment, is >=365.
    • Additionally, with "M" rules whose adjusted day is 26 to 31 December, the transition will sometimes happen in the next year.
    • Rules resulting in transitions in another year do not occur in Zoneinfo, and other implementations, including tzcode, don't handle them correctly, but we can (see below).

With normalised rules the transition happens between year days day and day + 7, inclusive (adding 0-1 days for leap years and 0-6 days for day of week). Without normalisation, it happens between day - 14 and day + 21 (also adding -14 to 14 days for UTC offset and transition time).
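The time-of-day part of normalisation could look roughly like this. It is a sketch under assumptions: normaliseTime is a hypothetical name, dow = -1 marks a rule without a day of week, and wrapping the day into the range 0 to 364 (the harder cases listed above) is deliberately not handled.

```go
package main

import "fmt"

const secondsPerDay = 24 * 60 * 60

// normaliseTime folds a time of day outside [0, secondsPerDay) into the day
// and day of week fields. A dow of -1 means the rule has no day of week.
func normaliseTime(day, dow, sec int) (int, int, int) {
	shift := sec / secondsPerDay
	sec %= secondsPerDay
	if sec < 0 { // Go's % keeps the sign of the dividend
		sec += secondsPerDay
		shift--
	}
	day += shift
	if dow >= 0 {
		// The weekday moves with the day: Sunday 00:30 minus one hour
		// is Saturday 23:30.
		dow = ((dow+shift)%7 + 7) % 7
	}
	return day, dow, sec
}

func main() {
	// Sunday, -00:30 after UTC conversion becomes Saturday 23:30 on the previous day.
	fmt.Println(normaliseTime(83, 0, -1800))
	// 25:00 on a Wednesday becomes 01:00 on Thursday.
	fmt.Println(normaliseTime(10, 3, 90000))
}
```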

Code

After implementing all of the above, and changing tzruleTime to accept the return values of dayOfEpoch(year) and isLeap(year) instead of year and return Unix time, it looks like this (with comments stripped):

func tzruleTime(yearStartDay uint64, r rule, leapYear bool) int64 {
	d := int(yearStartDay) + r.day
	if leapYear && r.addLeapDay {
		d++
	}
	if r.dow >= 0 {
		delta := (d - r.dow) % 7
		d += 6 - delta
	}
	return int64(d*secondsPerDay+r.time) + absoluteToInternal + internalToUnix
}

Zone boundaries

lookup returns the timespan when the zone applies (start and end), used:

  • to populate the lookup cache while creating a Location;
  • in Date, to avoid the second lookup in most cases;
  • in Time.ZoneBounds, essentially as return values.

Currently, if the zone spans a new year, tzset returns the new year instead of one of the values, to limit the number of transition time calculations to two. This only affects efficiency in the first two cases, but in the last case it affects correctness.

If the optimisations above are applied, the following algorithm results in two transition time computations, except when the second transition in the previous year occurs past the end of the year and past the target time, in which case (that never happens in Zoneinfo) it's three computations:

  • Use the yday result from the call to absDate (the year day of sec, the target time). If it's before the day of the second rule, compute the time of the first transition, otherwise of the second.

  • If sec is before the result, compute the time of the previous transition. Repeat while sec is before the result (i.e., possibly once more).

  • Otherwise, compute the time of the next transition.
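The three steps can be sketched as below. To keep the example self-contained it hard-codes EU-style transitions (last Sunday of March and October at 01:00 UTC) and derives the year day from the time package instead of absDate; transition, zoneBounds, secondRuleDay and the helpers are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// lastSundayUTC returns 01:00 UTC on the last Sunday of the month (example data).
func lastSundayUTC(year int, month time.Month) int64 {
	t := time.Date(year, month+1, 0, 1, 0, 0, 0, time.UTC) // day 0 = last day of month
	return t.AddDate(0, 0, -int(t.Weekday())).Unix()
}

// transition returns the Unix time of transition idx (0 = first, 1 = second) in year.
func transition(year, idx int) int64 {
	if idx == 0 {
		return lastSundayUTC(year, time.March)
	}
	return lastSundayUTC(year, time.October)
}

// secondRuleDay is the zero-based year day of the second rule (25 October, non-leap).
const secondRuleDay = 297

func next(year, idx int) (int, int) {
	if idx == 1 {
		return year + 1, 0
	}
	return year, 1
}

func prev(year, idx int) (int, int) {
	if idx == 0 {
		return year - 1, 1
	}
	return year, 0
}

// zoneBounds returns the transitions surrounding sec.
func zoneBounds(sec int64) (start, end int64) {
	t := time.Unix(sec, 0).UTC()
	year, idx := t.Year(), 0
	if t.YearDay()-1 >= secondRuleDay {
		idx = 1
	}
	guess := transition(year, idx)
	if sec >= guess {
		year, idx = next(year, idx)
		return guess, transition(year, idx)
	}
	end = guess
	for {
		year, idx = prev(year, idx)
		start = transition(year, idx)
		if sec >= start {
			return start, end
		}
		end = start
	}
}

// unix is a small helper for building example timestamps.
func unix(y, m, d, h, min int) int64 {
	return time.Date(y, time.Month(m), d, h, min, 0, 0, time.UTC).Unix()
}

func main() {
	s, e := zoneBounds(unix(2020, 10, 29, 15, 30))
	fmt.Println(time.Unix(s, 0).UTC(), time.Unix(e, 0).UTC())
}
```

Because the bounds are real transition times rather than a year boundary, a zone that spans the new year gets its true start and end.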

Optimisation: lookup

  • Most lookups are for times after the last static transition. Check it before searching.

  • For locations without static transitions a fake transition is created at the beginning of time. Do it for all locations to eliminate a rarely occurring special case during lookup. Call (*Location).lookupFirstZone from LoadLocationFromTZData to determine the transition target.

    • Alternatively: do it only for locations with static transitions. Fully static locations (like "Etc/GMT-1") will have the only zone cached anyway, for others use rules.
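A minimal sketch of both ideas, assuming a simplified transition table: the last transition is checked before searching, and a fake first transition at the minimum int64 value means the search can never land before the first entry. The transition type, the example table and lookupZone are hypothetical and much simpler than the structures in src/time/zoneinfo.go.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// transition is a simplified stand-in for a static transition.
type transition struct {
	when int64  // Unix time at which the zone starts to apply
	zone string // name of the target zone
}

// exampleTransitions uses made-up times; the first entry is the fake
// transition at the beginning of time.
var exampleTransitions = []transition{
	{math.MinInt64, "A"},
	{100, "B"},
	{200, "C"},
	{300, "D"},
}

// lookupZone returns the zone in effect at sec. tx must be non-empty, sorted,
// and start with a transition at math.MinInt64.
func lookupZone(tx []transition, sec int64) string {
	// Fast path: most lookups are after the last static transition.
	if last := tx[len(tx)-1]; sec >= last.when {
		return last.zone
	}
	// First transition strictly after sec; i >= 1 because tx[0] is at MinInt64.
	i := sort.Search(len(tx), func(i int) bool { return tx[i].when > sec })
	return tx[i-1].zone
}

func main() {
	for _, sec := range []int64{-5, 100, 250, 1000} {
		fmt.Println(sec, lookupZone(exampleTransitions, sec))
	}
}
```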

Avoid code duplication

  • Unify the code in LoadLocationFromTZData that fills the cache with lookup.
    • Possibly: change lookup to return *zone instead of name, offset and isDST. This would not make sense with the existing TZ string handling code, but does with proposed changes.

Limitations

The proposed implementations of tzruleTime and lookup may return incorrect results in the following cases:

  • Calculation may overflow in the last year before Unix time math.MaxInt64 (existing limitation).

  • Wrong results may be returned for years below absoluteZeroYear (existing limitation).

  • Result may be one week off in absoluteZeroYear for "M" rules whose adjusted day is before 7 January (does not occur in Zoneinfo).

  • Results will be unpredictable if the transitions occur in different order in different years or simultaneously, e.g., 4 April 2:00 UTC and first Sunday of April 2:00 UTC (existing limitation but different failures; does not occur in Zoneinfo).

Resulting speed-up

I wrote benchmarks that load testdata/2020b_Europe_Berlin, create a Time value and run Hour in a loop. The Time is one of:

  • 2020-10-29 15:30 (cache miss, TZ string / rules)
  • 1980-10-29 15:30 (cache miss, searching 60 static transitions)
  • Now (cache hit)

With optimisations above applied to master (commit 505dff4), the results are:

  • Lookup using rules is 4 to 4.5 times faster than in master.
  • It is over 2 times slower than searching static transitions, and about 9 times slower than hitting the cache.
  • Other kinds of lookups stayed about as fast as in master.

The benchmarks were run in an uncontrolled environment, so I can't give you more precise results.

Timezone abbreviations allocation

Change LoadLocationFromTZData and abbrevChars to allocate one string for all the chars except trailing NUL and cut abbrevs from it, instead of many strings of 3 to 6 bytes. Especially useful with locations having several zones with the same name (e.g., Europe/Dublin has three zones named "IST") and America/Adak that has "HST" encoded as a substring of "AHST".
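The idea can be sketched as follows: convert the abbreviation bytes to one string and slice it, so that every abbreviation shares the same backing storage. abbrevTable and abbrevAt are hypothetical names, and the byte sequence in main is illustrative rather than copied from a real Zoneinfo file.

```go
package main

import (
	"fmt"
	"strings"
)

// abbrevTable converts the raw abbreviation bytes to a single string,
// dropping the trailing NUL.
func abbrevTable(raw []byte) string {
	return strings.TrimSuffix(string(raw), "\x00")
}

// abbrevAt cuts the NUL-terminated abbreviation starting at idx out of the
// table. The result is a substring, so no new string data is allocated.
func abbrevAt(table string, idx int) (string, bool) {
	if idx < 0 || idx >= len(table) {
		return "", false
	}
	if end := strings.IndexByte(table[idx:], 0); end >= 0 {
		return table[idx : idx+end], true
	}
	return table[idx:], true // last abbreviation: its NUL was trimmed
}

func main() {
	table := abbrevTable([]byte("LMT\x00AHST\x00HDT\x00"))
	for _, idx := range []int{0, 4, 5, 9} {
		s, _ := abbrevAt(table, idx)
		fmt.Printf("%d: %q\n", idx, s)
	}
}
```

Index 5 points into the middle of "AHST" and yields "HST", which is the substring encoding mentioned above.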

ZONEINFO environment variable

If ZONEINFO is set, LoadLocation tries to load the named zoneinfo file from the path specified by it. This should probably be added to initLocal in src/time/zoneinfo_unix.go for consistency. tzcode does not use this variable.

(Tentative) Optimisation: caching

When a location is loaded, the zone valid now is cached in the Location structure to be used in subsequent lookups. This is good for most uses, but in long running processes (such as servers) lookups will slow down after the next transition.

Tentative proposal:

Cache the last 1 or 2 lookup results as well. Alternatively, only cache last lookup results. Caching 2 last lookups is useful for conversions to UTC (e.g., in Date) around a transition; caching more will add too much overhead.

Downside: this will require a sync.RWMutex and taking a read lock on every lookup that misses the "now" cache. A compromise would be calling TryRLock and only writing back the result if locking succeeded.

This is useful in particular scenarios, such as a mail server serialising "now" for "Received:" headers, but detrimental in others.

unixdj added a commit to unixdj/go that referenced this issue Dec 19, 2023
This commit implements some of the changes outlined in proposal golang#64659.

To optimise subsequent rule time computations, the rule structure is
converted to: year day, week day, time of day and a flag indicating
whether to add a day during leap years.

At load time:

- In "M" rules, month and week are converted to year day and leap day
  flag.
- Transition time is converted to UTC.
- Rule is normalised so that time of day is non-negative and less than
  secondsPerDay, and year day is non-negative and, if possible, less
  than 365.
- Rules in the Location structure are reordered in the order they occur
  in a year.

The internal API of tzruleTime is changed to speed up week day
calculation.

Additionally, tzrule (and thus Time.ZoneBounds) now returns correct zone
bounds for zones spanning a new year.
unixdj commented Dec 19, 2023