log/slog: JSONHandler should deduplicate keys #59365
Whatever solution is chosen here, I would argue that it should work even if the user explicitly provides duplicate names:

logger.LogAttrs(
	context.Background(),
	slog.LevelWarn,
	"fancy message",
	slog.String("duplicate", "alpha"),
	slog.String("duplicate", "bravo"),
	slog.String("duplicate", "charlie"),
	slog.String("duplicate", "delta"),
	slog.String("duplicate", "echo"),
)

One possible solution is to use a "#%d" scheme, which would produce:

{
"time": "2023-04-01T04:27:46.0959443-06:00",
"level": "WARN",
"msg": "fancy message",
"duplicate": "alpha",
"duplicate#01": "bravo",
"duplicate#02": "charlie",
"duplicate#03": "delta",
"duplicate#04": "echo",
} |
I think the current situation is fine; duplicate keys aren't invalid, just less interoperable. As for the original issue, I'll note that option 3 was why WithGroup was added (though this doesn't affect deduplication in general). |
Technically you are correct. Under RFC 8259, section 4, duplicate names are essentially treated as undefined behavior. In a possible future major version of "encoding/json", we will likely make duplicate keys an explicit error (which is still compliant behavior with RFC 8259), as this lack of checking has been exploited, resulting in severe security vulnerabilities (e.g., CVE-2017-12635). It's unlikely that duplicate names will have such drastic security implications in logging, but strict checking of duplicate names is likely the overall direction that JSON will gradually move towards. As an aside, Tim Bray, the author of RFC 8259, actually recommends against naïve compliance with RFC 8259, but rather that implementations adhere to RFC 7493, which makes strict decisions on issues that RFC 8259 leaves as undefined. |
I think it's important to understand whether encoding/json v2 with its currently proposed semantics is likely to be the accepted version. We should probably align slog with those semantics, even if slog appears in Go before encoding/json v2. We're already doing that for formatting NaNs, infinities and, probably, durations (#59345). For this issue, if we think no dups will be the default for encoding/json v2, then I wouldn't be opposed to making it the default in slog and adding an |
That I can't really answer unfortunately. We haven't even started the GitHub discussion on anything related to v2 "encoding/json" and may not be ready to do so for some months. A significant piece of work that remains is figuring out the inter-compatibility story between v1 and a hypothetical v2. That's perhaps the hardest part and requires quite a lot of head-scratching. That said, if there is a v2 "encoding/json", I feel like the security arguments for rejecting duplicate names by default are pretty compelling. The (very) tentative plan for v2 is that v1 "encoding/json" will actually be implemented in terms of v2 with the appropriate number of options specified that makes it behave semantically identically to what v1 "json" does today. Essentially, with the appropriate set of options, you can configure the semantics to be any hybrid mix that is in-between the default behavior of v1 and the default behavior of v2. Since "log/slog" already depends on the v1 semantics of "encoding/json", you could imagine doing nothing to address this issue for now. I don't think we should block adoption of "slog" on a hypothetical v2 "json". Nor should we overly massage the output of "slog" to match a hypothetical v2 "json". Supposing v2 "json" one day lands, you could imagine that v2 "json" be integrated into "slog" by exposing something like:

package slog
type HandlerOptions struct {
// JSONOptions is a list of Option
// that configures how JSON is marshaled.
JSONOptions []json.Option
}

To log each message, "slog" would do something like:

// Start with DefaultV1Options since "slog" has historically
// marshaled with mostly v1 semantics by default.
opts := []json.Option{json.DefaultV1Options}
// We have historically specified Encoder.SetEscapeHTML(false).
// By passing a nil escape function, we achieve the same behavior.
opts = append(opts, json.EscapeRune(nil))
// Append any user-specified options,
// which will override any previously set options.
opts = append(opts, h.opts.JSONOptions...)
return json.Marshal(v, opts...)

(You'll note that the options API for "json" above is different than what currently exists in "go-json-experiment/json". I mentioned elsewhere in #59339 that we're considering switching to variadic options, as we see it as one of the ways to resolve the v1-to-v2 compatibility tensions.)

Thus, for someone to use v2 "json" with "slog", they would do something like:

slog.HandlerOptions{
JSONOptions: []json.Option{json.DefaultV2Options},
}.NewJSONHandler(...)

I should also note that … You could also simplify the above by providing … One could imagine that "slog" inspects …
|
Thanks, @dsnet, that makes a lot of sense. |
It would be very unexpected if repeated calls to Log with the same key … Structured logging defines a concept of a log event which is conceptually above any specific encoding like JSON or logfmt. That concept of a log event is well-defined by convention, and that definition doesn't accommodate repeated keys. |
Possibly, but there are no good answers when it comes to duplicate names. It would be nice if the API made it statically impossible to have duplicates, but it can't. Given the possibility of runtime duplicates, your options are:

1. Output the duplicate names as-is.
2. Drop all but one value for each duplicated name.
3. Rename the duplicates.
4. Report an error to the caller, either (a) while still logging the message, or (b) while dropping it.
With option 1, you risk the log parser later not being able to read the message entirely or accurately. Duplicate names result in undefined behavior where the parser might ignore duplicates, only use the first one, only use the last one, or use some merged form of all duplicates. With option 2, you avoid any ambiguities when reading, but are subtly dropping information when writing. With option 3, you avoid any ambiguities as to the meaning when reading, but are changing the meaning when writing. At least the meaning when reading is consistent across all parsers. Option 4a is decent as it allows the caller to at least realize that they're possibly holding the API wrong, but I suspect most people will not check the error of logging calls. Of all the bad options, option 4a with either 1 or 3 is probably the best approach. |
Option 2 is pretty clearly the correct choice.
I'm not sure why anyone would expect a structured logging library to never discard information. Emitting multiple structured log events with the same key is pretty clearly going to resolve to a single "winner" for that key, most likely the last event, but really any event would be acceptably valid. |
That's a matter of opinion. Just as how you find option 3 an odd position, I find option 2 an odd position. Every option has pros and cons.
Any value passed to |
Values passed to a structured logging library necessarily include a key which is, by definition, understood as unique within the target log event. So, no: it's definitely not the case that calling a structured logging method like LogAttrs can be assumed to produce any specific log output. Structured logging is, hopefully obviously, not equivalent to log.Printf, or fmt.Printf, or anything like that. |
@dsnet For option 4, how would an error be reported? A separate log line at warning level, or some out-of-band channel or something? Personally I would have probably expected option 2, where the latest value wins (except that the Message, Time, Level, and Source fields would always win over any Attribute values). My preferences would probably be: … (option 1 would break a lot of log aggregation systems) |
By technical definition, I agree. However, in practice, duplicates often occur due to human error. If I make a mistake, I would rather still have all duplicates logged (whether with option 1 or 3). As an aside, we should definitely create a vet check that detects accidental duplicate names.
I was thinking of an explicit error returned by |
I'm not sure how to evaluate whether or not a particular mistake made by consumers of an API is common enough to warrant changing the API to accommodate that mistake. Especially so if that change would violate the intuitive expectations of the API, as is the case here. |
I agree, I think a map is at least very close to the ideal data structure here. Regarding performance, the current implementation pools a buffer:

type Buffer []byte

It's possible to also pool a richer buffer that carries the deduplication state:

type Buffer struct {
	buf   []byte
	keys  map[string]bool
	group []string
}

On freeing the Buffer back to the pool, the map and slices can be reset for reuse.
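A rough sketch of the pooled richer buffer under discussion, assuming hypothetical names (the real slog internal buffer is just a []byte):

```go
package main

import (
	"fmt"
	"sync"
)

// Buffer pools a byte buffer together with the bookkeeping needed for
// deduplication. Names are illustrative, not the actual slog internals.
type Buffer struct {
	buf   []byte
	keys  map[string]bool
	group []string
}

var bufPool = sync.Pool{
	New: func() any {
		return &Buffer{keys: make(map[string]bool)}
	},
}

func getBuffer() *Buffer { return bufPool.Get().(*Buffer) }

// free resets the buffer before returning it to the pool. Clearing the map
// (rather than reallocating) keeps its backing storage warm across log calls.
func (b *Buffer) free() {
	b.buf = b.buf[:0]
	b.group = b.group[:0]
	clear(b.keys) // Go 1.21+: empties the map, keeping its capacity
	bufPool.Put(b)
}

func main() {
	b := getBuffer()
	b.buf = append(b.buf, `{"msg":"hi"}`...)
	b.keys["msg"] = true
	b.free()

	b2 := getBuffer()
	fmt.Println(len(b2.buf), len(b2.keys)) // 0 0: fully reset for reuse
}
```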
A contrasting approach: I think a …
(I really don't have any opinion on what's necessary here - I don't see much need for deduplication personally, prior art in structured logging in Go doesn't deduplicate AFAIK, and yet the arguments for deduplication seem pretty reasonable) |
Would that be able to detect duplicates in sub-packages though? I feel like the most common occurrence of duplicates will be adding attributes and telemetry before passing the logger down to libraries or other sub-packages.
Right now, errors coming from Handler are discarded by the most common slog methods (i.e. Info, WarnCtx, etc.). I certainly wouldn't want to change the signature of Info, WarnCtx, etc., to return the error. I suppose it is possible to use a middleware-type handler that captures and does something with errors, but that wouldn't be something people would use by default. |
Depends on the complexity of call-tracking. This is a similar problem as vet checks for |
One option might be to rename duplicated keys to |
Another argument against option 2 (which discards data) is security. If untrusted input is able to affect what names are present, then that can be used as a mechanism to hide malicious activity, knowing that it will not be logged. IMO, it seems better for readers to possibly choke on the malformed messages than for the malicious activity to not be properly logged. A human investigator looking at the malformed message can see the duplicate shenanigans at play. |
There are many additional (and unavoidable) security concerns in a structured logging API beyond this one, often related to how values are rendered. It is always an error to pass untrusted input to a structured logger directly. |
Like what? The other security issue I can imagine is a DoS attack where you're tricked into logging something massive, but that's a distinctly different class of security issue from being able to hide activity or perform remote code execution.
I'm not sure I understand. Suppose we used a Go map such that you did something like:

slog.Info(map[string]any{
"msg": "my wonderful message",
"other": "something else",
})

Yes, you get the benefit of static compilation preventing duplicate map keys (although it can't check the implicit "time" and "level" keys). However, I'm missing the part where the pool helps out. When … |
I don't think we need to input maps to achieve option (2) or (3); I should clarify. This technique doesn't involve any change that isn't internal to the handler. This map can be populated with keys visited when the handler traverses the attributes of the record, writing to output. A couple of quirks: the traversal is made more interesting and dynamic by things like group Attrs. If this map is populated with visited keys, we can detect whether some key has been committed previously in the log line, and trigger behavior option (2) or (3) as a result. Big picture, I'm not enthusiastic about this as a solution, but I was surprised that it seemed more performant than I would've guessed. |
Consider a type which renders one way in some version X, and a different (and richer) way in a subsequent version Y. Consider the case where that change results in sensitive information from that type, previously not logged, becoming logged. This happens easily if the type grows a new sensitive field in a new version. Such changes are in practice never considered breaking in terms of versioning. A downstream application which logs a value of that type will have one behavior when it includes version X of the relevant package, and a different behavior when it includes version Y, and that version upgrade is in practice never going to be noticed, or tested, by the downstream application. Printf-based logging does not generally have this problem, because it is not conventional to pass values directly as Printf log arguments. |
fwiw, we once used a database driver that logged our database password in plain text when in debug mode, using regular old Printf (it was patched right away). I don't really see any security issues with structured logging that don't already exist with regular logging. And it's not like this is log4j and we are executing the statements in the logs or trying to resolve remote XML or something. If the data is strings, the worst case I can think of is accidental logging of sensitive data. |
A few points: @veqryn's specific problem, duplicate "source" keys, will be easy to resolve with a ReplaceAttr function once #59280 is approved. You can then distinguish a "source" attribute that comes from the handler by seeing if the value is a … In general, if you don't like duplicate keys, you can write a wrapping Handler. That can return an error if you like, but it's worth remembering that logging is often the last resort for debugging in the wild. If you've ever been in a position where you can't debug because you have a picky logger that refuses to output everything it has, you know how frustrating that can be, and how happily you'd jettison your principles of careful error detection and failing fast for just a few more characters on the screen. This isn't a proposal, but if it were I'd vote to decline. We can always add dedup functionality later (though not as the default, it's true). As for json v2 compatibility, @dsnet has provided a good blueprint for how to achieve that when the time comes. But you go into the standard library with the packages you have, not the packages you wish you had. |
I arrived here after seeing weird stuff in the logs. I thought that by doing something like this I would be able to override the previously set attribute:

package main
import (
"log/slog"
"os"
)
func main() {
first := slog.New(slog.NewJSONHandler(os.Stdout, nil)).With("tag", "fooo")
first.Info("hello")
second := first.With("tag", "bar")
second.Info("bye")
}
// {"time":"2009-11-10T23:00:00Z","level":"INFO","msg":"hello","tag":"fooo"}
// {"time":"2009-11-10T23:00:00Z","level":"INFO","msg":"bye","tag":"fooo","tag":"bar"}

In our logs, it looks like the fields are being "merged", with parts of words from different attributes, e.g. … Is attribute overriding supported? I wanted to use it to change the "component" name in the logs. Something like "foo-handler", "foo-service", etc. |
Scrambled text could result from racing writes to os.Stdout. IIRC this shouldn’t happen with the code you’ve listed(?) but just to toss out an idea quickly. edit: As far as attribute overriding, there's no deduplication in the provided handlers. I think @jba was writing a |
@AndrewHarrisSPU indeed, I was creating multiple handlers in a middleware, so each request had its own mutex to |
Thanks @AndrewHarrisSPU for that quick and accurate diagnosis. @dbofmmbt, there's nothing we can do about that multiple-mutex issue, unfortunately. In fact the original … About deduping or overriding, I don't think the default handlers will add it. It's expensive, especially combined with pre-formatting. You can easily write your own wrapping Handler whose WithAttrs method keeps a map, so calls to Logger.With will override previous values. You'll give up pre-formatting (since you can't format anything in that map until …) |
@codeliger, I'll let others speak to the utility of the built-in JSON handler. But duplicate keys are valid JSON, according to the RFC. |
@jba would you mind pointing me to the section where the RFC specifies duplicated fields? I only found this:

This piece is not encouraging with regard to generating duplicated fields in the stdlib's JSONHandler, IMO, since the behavior looks unspecified. |
@dbofmmbt "unpredictable" is still standardised. It's the same as, say, a 50-digit integer: it's standard-compliant, yet not every parser will be able to parse it, and even when they do, they might interpret it differently. Also this: https://github.com/nst/JSONTestSuite |
That's exactly it: it's unspecified by JSON. The RFC qualifies that:
with SHOULD more precisely defined:
Relative to other Go logging libraries: duplicate keys are possible with |
Hey all, I've created a logging handler middleware that will deduplicate attribute keys, before they go to the json logger.
package main
import (
"log/slog"
"os"
slogdedup "github.com/veqryn/slog-dedup"
)
func main() {
// OverwriteHandler
overwriter := slogdedup.NewOverwriteHandler(slog.NewJSONHandler(os.Stdout, nil), nil)
slog.SetDefault(slog.New(overwriter))
/*
{
"time": "2023-10-03T01:30:00Z",
"level": "INFO",
"msg": "this is the dedup overwrite handler",
"duplicated": "two"
}
*/
slog.Info("this is the dedup overwrite handler",
slog.String("duplicated", "zero"),
slog.String("duplicated", "one"),
slog.String("duplicated", "two"),
)
// IgnoreHandler
ignorer := slogdedup.NewIgnoreHandler(slog.NewJSONHandler(os.Stdout, nil), nil)
slog.SetDefault(slog.New(ignorer))
/*
{
"time": "2023-10-03T01:30:00Z",
"level": "INFO",
"msg": "this is the dedup ignore handler",
"duplicated": "zero"
}
*/
slog.Info("this is the dedup ignore handler",
slog.String("duplicated", "zero"),
slog.String("duplicated", "one"),
slog.String("duplicated", "two"),
)
// IncrementHandler
incrementer := slogdedup.NewIncrementHandler(slog.NewJSONHandler(os.Stdout, nil), nil)
slog.SetDefault(slog.New(incrementer))
/*
{
"time": "2023-10-03T01:30:00Z",
"level": "INFO",
"msg": "this is the dedup incrementer handler",
"duplicated": "zero",
"duplicated#01": "one",
"duplicated#02": "two"
}
*/
slog.Info("this is the dedup incrementer handler",
slog.String("duplicated", "zero"),
slog.String("duplicated", "one"),
slog.String("duplicated", "two"),
)
// AppendHandler
appender := slogdedup.NewAppendHandler(slog.NewJSONHandler(os.Stdout, nil), nil)
slog.SetDefault(slog.New(appender))
/*
{
"time": "2023-10-03T01:30:00Z",
"level": "INFO",
"msg": "this is the dedup appender handler",
"duplicated": [
"zero",
"one",
"two"
]
}
*/
slog.Info("this is the dedup appender handler",
slog.String("duplicated", "zero"),
slog.String("duplicated", "one"),
slog.String("duplicated", "two"),
)
}

@jba can I get this added to the slog wiki please? https://github.com/golang/go/wiki/Resources-for-slog |
IMO it should be one of the … Because the ship has sailed and |
When/if |
@jba Sorry, I wasn't clear. Most probably it will be written in C/C++/Rust/Go, but you know, it's just an example. And I'm also not very aware of the JSON lib situation in those ecosystems. With the logger chaining pattern, like this go.dev/play example, it's easy to screw up. I'm migrating our codebase from logrus to slog, and I'm really not sure if our codebase is "correct", so I'm adding the … I'm not worried about the logs not reaching stdout, but our Grafana. |
Yes, that's definitely what I'm doing at this moment, with tuning HandlerOptions.ReplaceAttr to catch some other quirks. I just thought it would be worth having in the std lib. It's a good safety net, if people understand the performance implications. |
Duplicate keys can be lost if the processor doesn't accept duplicates, but they can also be lost if the processor expects … |
@carlmjohnson Not just the keys: the whole log line can be dropped as invalid JSON, if the JSON parser considers duplicate keys an invalid JSON object. Edit: … |
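To illustrate how a strict parser could reject such a line, here is a hypothetical top-level duplicate-key check built on encoding/json's token stream (hasDupKeys is an invented helper; nested objects are not inspected, for brevity):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// hasDupKeys walks the top-level object's tokens and reports whether any
// key repeats, mimicking what a parser that rejects duplicate names would
// flag before dropping the whole log line.
func hasDupKeys(line string) bool {
	dec := json.NewDecoder(strings.NewReader(line))
	dec.Token() // consume the opening '{'
	seen := map[string]bool{}
	for dec.More() {
		tok, _ := dec.Token() // the next object key
		key, _ := tok.(string)
		if seen[key] {
			return true
		}
		seen[key] = true
		var v json.RawMessage
		dec.Decode(&v) // skip over the value, whatever its shape
	}
	return false
}

func main() {
	fmt.Println(hasDupKeys(`{"msg":"hi","tag":"a","tag":"b"}`)) // true
	fmt.Println(hasDupKeys(`{"msg":"hi","tag":"a"}`))           // false
}
```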
@proachac. There are no good answers when it comes to duplicate names. You mentioned that you view your logs through Grafana. What is the underlying data store that you're using with Grafana, and does that at least store JSON objects with duplicate names? In particular, I'm trying to understand whether this is a storage problem or a reading problem:
|
@dsnet I don't have a problem in particular; what I have is a concern about an imaginary scenario. ECMA says that the latest wins
|
Are there any other commonly-used JSON parsers that will fail to parse |
JavaScript, the father of JSON, is definitely not. Maybe it's not a good strategy to be more Catholic than the Pope. |
In "encoding/json/v2", we are aiming to reject duplicate names by default, but make it fairly easy to opt out of the strict check. The decision to reject duplicate names is driven by various security exploits that have occurred over the years that depended on abusing the unspecified behavior that occurs with duplicate names (across a variety of JSON implementations in various languages). See CVE-2017-12635 as one such example. This particular security stance doesn't benefit logging as much, since the content of a log message is rarely used to alter some critical piece of control flow. However, JSON is used for so much more than just logging (e.g., RPCs), that it is safer to be secure by default even if it is the more annoying default for logs analysis. Also, even if various libraries allow for duplicate names, the behavior of how duplicates are handled varies widely. It is not enough to just ask the question of whether it is permissible. In the CVE mentioned above, Erlang and JavaScript disagreed about the semantic meaning of a JSON object with duplicate names. In a similar way, permitting duplicate names means that you may end up with logs analysis results that differ (perhaps wildly) depending on which JSON implementation you happened to use. |
I think this is not just a JSON topic. The log/slog package should be explicit about the semantics of duplicate keys provided via log.Log(…) or created by chaining log.With(…).Log(…). With well-defined semantics, each Handler must find its own strategy to implement them. Otherwise, the information in a log would depend on the choice of log Handler, not on what was provided by the application that uses the logger. |
I stumbled onto this problem while I was trying to add some … My personal feeling is that duplicate fields in JSON are bad (many arguments above) and should be documented as such. The greatest advantage of … The simplest solution is #2 (latest wins), though there may be unpredictability about what "latest" means in some situations. Or perhaps, if field "alpha" repeats, turn it into an array of all the values. In any event, as mentioned earlier, the |
What version of Go are you using (go version)?

8edcddd

Does this issue reproduce with the latest release?

yes

What did you do?

What did you expect to see?

JSONHandler should be fixed by deduplicating those special keys (time, message, level, and source).
I use "source" as a key in a lot of my code right now, so I'd run into this problem.
Either …
Or …
Or …
Or …

What did you see instead?

Invalid JSON. It would break our log aggregation. Default/builtin handlers should never create invalid output.
@jba
EDIT:
Hey all, I've created a logging handler middleware that will deduplicate attribute keys, before they go to the json logger.
Please check it out: https://github.com/veqryn/slog-dedup
go get github.com/veqryn/slog-dedup