
log/slog: JSONHandler should deduplicate keys #59365

Open
veqryn opened this issue Apr 1, 2023 · 49 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

@veqryn
Contributor

veqryn commented Apr 1, 2023

What version of Go are you using (go version)?

8edcddd

Does this issue reproduce with the latest release?

yes

What did you do?

	logger := slog.New(slog.NewJSONHandler(os.Stdout))
	logger.LogAttrs(
		context.Background(),
		slog.LevelWarn,
		"fancy message",
		slog.Bool("msg", false),
		slog.String("my attr", "something"),
	)

What did you expect to see?

JSONHandler should be fixed by deduplicating those special keys (time, message, level, and source).
I use "source" as a key in a lot of my code right now, so I'd run into this problem.

Either

{
	"time": "2023-04-01T04:27:46.0959443-06:00",
	"level": "WARN",
	"msg": "fancy message",
	"my attr": "something"
}

Or

{
	"time": "2023-04-01T04:27:46.0959443-06:00",
	"level": "WARN",
	"msg": "fancy message",
	"msg.msg": false,
	"my attr": "something"
}

Or

{
	"time": "2023-04-01T04:27:46.0959443-06:00",
	"level": "WARN",
	"msg": "fancy message",
	"fields": {
		"msg": false,
		"my attr": "something"
	}
}

Or

{
	"time": "2023-04-01T04:27:46.0959443-06:00",
	"level": "WARN",
	"msg": "fancy message",
	"msg#01": false,
	"my attr": "something"
}

What did you see instead?

Invalid JSON. It would break our log aggregation. Default/builtin handlers should never create invalid output.

{
	"time": "2023-04-01T04:27:46.0959443-06:00",
	"level": "WARN",
	"msg": "fancy message",
	"msg": false,
	"my attr": "something"
}

@jba

EDIT:
Hey all, I've created a logging handler middleware that will deduplicate attribute keys, before they go to the json logger.
Please check it out: https://github.com/veqryn/slog-dedup

(Full example usage is in my Oct 3, 2023 comment below.)
@seankhliao seankhliao changed the title from "affected/package: log/slog: JSONHandler easily creates invalid json" to "log/slog: JSONHandler should deduplicate keys" Apr 1, 2023
@seankhliao seankhliao added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Apr 1, 2023
@seankhliao seankhliao added this to the Go1.21 milestone Apr 1, 2023
@dsnet
Member

dsnet commented Apr 3, 2023

Whatever solution is chosen here, I would argue that it should work even if the user explicitly provides duplicate names:

logger.LogAttrs(
	context.Background(),
	slog.LevelWarn,
	"fancy message",
	slog.String("duplicate", "alpha"),
	slog.String("duplicate", "bravo"),
	slog.String("duplicate", "charlie"),
	slog.String("duplicate", "delta"),
	slog.String("duplicate", "echo"),
)

One possible solution is to use a "#%d" scheme similar to how testing.T.Run deduplicates conflicting test names.
For example:

{
	"time": "2023-04-01T04:27:46.0959443-06:00",
	"level": "WARN",
	"msg": "fancy message",
	"duplicate": "alpha",
	"duplicate#01": "bravo",
	"duplicate#02": "charlie",
	"duplicate#03": "delta",
	"duplicate#04": "echo",
}
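
A minimal sketch of that rename scheme (hypothetical helper, not part of log/slog; among other simplifications, it doesn't guard against a literal "duplicate#01" key already being present):

package main

import "fmt"

// uniquify returns a collision-free key, appending "#01", "#02", ...
// for repeats, in the style of testing.T.Run.
func uniquify(seen map[string]int, key string) string {
	n := seen[key]
	seen[key] = n + 1
	if n == 0 {
		return key
	}
	return fmt.Sprintf("%s#%02d", key, n)
}

func main() {
	seen := map[string]int{}
	for _, v := range []string{"alpha", "bravo", "charlie"} {
		fmt.Printf("%s: %s\n", uniquify(seen, "duplicate"), v)
	}
	// duplicate: alpha
	// duplicate#01: bravo
	// duplicate#02: charlie
}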

@seankhliao
Member

I think the current situation is fine: duplicate keys aren't invalid, just less interoperable.
Changing the key name to deduplicate it seems worse.

As for the original issue, I'll note that option 3 was why WithGroup was added (though this doesn't affect deduplication in general).

@dsnet
Member

dsnet commented Apr 3, 2023

duplicate keys aren't invalid, just less interoperable.

Technically you are correct. Under RFC 8259, section 4, duplicate names are essentially treated as undefined behavior.

In a possible future major version of "encoding/json", we will likely make duplicate keys an explicit error (which is still compliant behavior with RFC 8259), as this lack of checking has been exploited, resulting in severe security vulnerabilities (e.g., CVE-2017-12635). It's unlikely that duplicate names will have such drastic security implications in logging, but strict checking of duplicate names is likely the overall direction that JSON will gradually move towards.

As an aside, Tim Bray, the author of RFC 8259, actually recommends against naïve compliance with RFC 8259, suggesting instead that implementations adhere to RFC 7493, which makes strict decisions on issues that RFC 8259 leaves undefined.

@jba
Contributor

jba commented Apr 3, 2023

I think it's important to understand whether encoding/json v2 with its currently proposed semantics is likely to be the accepted version. We should probably align slog with those semantics, even if slog appears in Go before encoding/json v2. We're already doing that for formatting NaNs, infinities and, probably, durations (#59345).

For this issue, if we think no dups will be the default for encoding/json v2, then I wouldn't be opposed to making it the default in slog and adding an AllowDuplicateNames option. And we should do the same for TextHandler for consistency.

@dsnet
Member

dsnet commented Apr 3, 2023

whether encoding/json v2 with its currently proposed semantics is likely to be the accepted version

That I can't really answer unfortunately.

We haven't even started the GitHub discussion on anything related to v2 "encoding/json" and may not be ready to do so for some months. A significant piece of work that remains is figuring out the inter-compatibility story between v1 and a hypothetical v2. That's perhaps the hardest part and requires quite a lot of head-scratching.

That said, if there is a v2 "encoding/json", I feel like the security arguments for rejecting duplicate names by default are pretty compelling.


The (very) tentative plan for v2 is that v1 "encoding/json" will actually be implemented in terms of v2, with the appropriate set of options specified so that it behaves semantically identically to what v1 "json" does today. Essentially, with the appropriate set of options, you can configure the semantics to be any hybrid mix in between the default behavior of v1 and the default behavior of v2.

Since "log/slog" already depends on the v1 semantics of "encoding/json", you could imagine doing nothing to address this issue for now. I don't think we should block adoption of "slog" on a hypothetical v2 "json". Nor should we overly massage the output of "slog" to match a hypothetical v2 "json".

Supposing v2 "json" one day lands, you could imagine that v2 "json" be integrated into "slog" by exposing something like:

package slog

type HandlerOptions struct {
    // JSONOptions is a list of Option
    // that configures how JSON is marshaled.
    JSONOptions []json.Option
}

To log each message, "slog" would do something like:

// Start with DefaultV1Options since "slog" has historically
// marshaled with mostly v1 semantics by default.
opts := []json.Option{json.DefaultV1Options}
// We have historically specified Encoder.SetEscapeHTML(false).
// By passing a nil escape function, we achieve the same behavior.
opts = append(opts, json.EscapeRune(nil))
// Append any user-specified options,
// which will override any previously set options.
opts = append(opts, h.opts.JSONOptions...)
return json.Marshal(v, opts...)

(You'll note that the options API for "json" above is different from what currently exists in "go-json-experiment/json". I mentioned elsewhere in #59339 that we're considering switching to variadic options, as we see it as one of the ways to resolve the v1-to-v2 compatibility tensions.)
(Also ignore the performance implications of appending a bunch of options right before calling json.Marshal; we can find ways to optimize this. Let's focus on semantics.)

Thus, for someone to use v2 "json" with "slog", they would do something like:

slog.HandlerOptions{
    JSONOptions: []json.Option{json.DefaultV2Options},
}.NewJSONHandler(...)

I should also note that DefaultV1Options and DefaultV2Options represent a composite set of options. For example DefaultV1Options includes AllowDuplicateNames(true) while DefaultV2Options includes AllowDuplicateNames(false).

You could also simplify the above by providing NewJSONv2Handler that implicitly inserts json.DefaultV2Options before user-specified options.

One could also imagine that "slog" inspects HandlerOptions.JSONOptions and checks AllowDuplicateNames.

@jba
Contributor

jba commented Apr 4, 2023

Thanks, @dsnet, that makes a lot of sense.

@08d2

08d2 commented Apr 5, 2023

It would be very unexpected if repeated calls to Log with the same key "duplicate" were to produce output like {"duplicate": "alpha", "duplicate#01": "bravo", "duplicate#02": "charlie"}. Among many other reasons, this behavior would make it impossible to reliably assert that a log event emitted with some key K would actually be rendered with that key K in the ultimate output format.

Structured logging defines a concept of a log event which is conceptually above any specific encoding like JSON or logfmt or et cetera. That concept of a log event is well-defined by convention, and that definition doesn't accommodate repeated keys.

@dsnet
Member

dsnet commented Apr 5, 2023

It would be very unexpected if repeated calls to Log with the same key "duplicate" were to produce output like {"duplicate": "alpha", "duplicate#01": "bravo", "duplicate#02": "charlie"}.

Possibly, but there are no good answers when it comes to duplicate names.

It would be nice if the API made it statically impossible to have duplicates,
but I don't see how that can be done without taking in a map[string]any as the input,
which would not be performant.

Given the possibility of runtime duplicates, your options are:

  1. Emit the object with all duplicate fields
  2. Emit the object, but only keep one of the duplicate fields (e.g., the first occurrence)
  3. Rename the duplicate fields to avoid a conflict
  4. Report an error
    a. and still do either 1, 2, or 3
    b. or simply reject the message and not log it.

With option 1, you risk the log parser later not being able to read the message entirely or accurately. Duplicate names result in undefined behavior where the parser might ignore duplicates, only use the first one, only use the last one, or use some merged form of all duplicates.

With option 2, you avoid any ambiguities when reading, but are subtly dropping information when writing.

With option 3, you avoid any ambiguities as to the meaning when reading, but are changing the meaning when writing. At least the meaning when reading is consistent across all parsers.

Option 4a is decent as it allows the caller to at least realize that they're possibly holding the API wrong, but I suspect most people will not check the error of logging calls.

Of all the bad options, option 4a with either 1 or 3 is probably the best approach.
Options 2 and 4b discard information, which seems antithetical to the point of a logging library.

@08d2

08d2 commented Apr 5, 2023

Option 2 is pretty clearly the correct choice.

Options 2 and 4b discard information, which seems antithetical to the point of a logging library.

I'm not sure why anyone would expect a structured logging library to never discard information. Emitting multiple structured log events with the same key is pretty clearly going to resolve to a single "winner" for that key, most likely the last event, but really any event would be acceptably valid.

@dsnet
Member

dsnet commented Apr 5, 2023

Option 2 is pretty clearly the correct choice.

That's a matter of opinion. Just as you find option 3 an odd position, I find option 2 an odd position. Every option has pros and cons.

I'm not sure why anyone would expect a structured logging library to never discard information.

Any value passed to LogAttrs by the programmer carries fairly clear intent that it be logged. Arbitrarily choosing some field to keep, while discarding all others goes against the intent. Thus, my expectation is for a logging library to try hard to never discard information.

@08d2

08d2 commented Apr 5, 2023

Values passed to a structured logging library necessarily include a key which is, by definition, understood as unique within the target log event. So, no: it's definitely not the case that calling a structured logging method like LogAttrs can be assumed to produce any specific log output. Structured logging is, hopefully obviously, not equivalent to log.Printf, or fmt.Printf, or anything like that.

@veqryn
Contributor Author

veqryn commented Apr 5, 2023

@dsnet For option 4, how would an error be reported? A separate log line at warning level, or some out-of-band channel or something?

Personally I would have probably expected option 2, where the latest value wins (except that the Message, Time, Level, and Source fields would always win over any Attribute values).

My preferences would probably be:
option 2, or 4a combined with 2
option 3, or 4a combined with 3

option 1 would break a lot of log aggregation systems
option 4b would seem like the worst option, as you lose the log line completely, making any error even harder to diagnose

@dsnet
Member

dsnet commented Apr 5, 2023

Values passed to a structured logging library necessarily include a key which is, by definition, understood as unique within the target log event

By technical definition, I agree. However, in practice, duplicates often occur due to human error. If I make a mistake, I would rather still have all duplicates logged (whether with option 1 or 3).

As an aside, we should definitely create a vet check that detects accidental duplicate names.

@dsnet For option 4, how would an error be reported? A separate log line at warning level, or some out-of-band channel or something?

I was thinking of an explicit error returned by slog.Handler.Handle. I guess a separate log line is a variation of option 4. However, a go vet check might be a better way to statically detect most of these cases.

@08d2

08d2 commented Apr 5, 2023

Values passed to a structured logging library necessarily include a key which is, by definition, understood as unique within the target log event

By technical definition, I agree. However, in practice, duplicates often occur due to human error. If I make a mistake, I would rather still have all duplicates logged (whether with option 1 or 3).

I'm not sure how to evaluate whether or not a particular mistake made by consumers of an API is common enough to warrant changing the API to accommodate that mistake. Especially so if that change would violate the intuitive expectations of the API, as is the case here.

@AndrewHarrisSPU

It would be nice if the API made it statically impossible to have duplicates, but I don't see how that can be done without taking in a map[string]any as the input, which would not be performant.

I agree, I think a map is at least very close to the ideal data structure here. Regarding performance, the current implementation of x/exp/slog/internal/buffer allocates output buffers as a sync.Pool resource.

type Buffer []byte

It's possible to also pool a map[string]bool for keys, and a []string for groups

type Buffer struct{
	buf []byte
	keys map[string]bool
	group []string
}

On freeing the Buffer, the keys and group fields, like the buf slice, can be emptied without releasing backing memory (a sketch of the pooling follows the list below). I used something similar when experimenting with string interpolation of attributes into log messages, which was fun and far less useful in practice than I anticipated, but I did benchmark it, and this Buffer approach seemed to:

  • not allocate
  • increase the time of logging proportional to the number of attributes (loosely and non-scientifically, each Attr was 20% more expensive - use 5 Attrs and pay for 6)
  • increase the size of pooled objects by a constant factor - the additional fields are small, but so is the buf slice.
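
A minimal sketch of that pooling scheme (hypothetical; not the actual x/exp/slog/internal/buffer code, and the clear builtin requires Go 1.21+):

package buffer

import "sync"

// Buffer pools the output bytes together with per-record key and group
// bookkeeping, per the struct above.
type Buffer struct {
	buf   []byte
	keys  map[string]bool
	group []string
}

var pool = sync.Pool{
	New: func() any {
		return &Buffer{buf: make([]byte, 0, 1024), keys: make(map[string]bool, 8)}
	},
}

func New() *Buffer { return pool.Get().(*Buffer) }

// Free empties every field without releasing its backing memory, so the
// next Get starts clean but allocation-free.
func (b *Buffer) Free() {
	b.buf = b.buf[:0]
	clear(b.keys)
	b.group = b.group[:0]
	pool.Put(b)
}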

A contrasting approach: I think a Deduplicate wrapper for ReplaceAttr would be possible without any change to slog, although tricky because deduplication after ReplaceAttr seems significant. Notably, with no change to slog, I don't think this allows for your option 4 of reporting an error, which I thought was intriguing. Something like:

func Deduplicate(func([]string, Attr) Attr) func([]string, Attr) Attr

(I really don't have any opinion on what's necessary here - I don't see much need for deduplication personally, and prior art in structured logging in Go doesn't deduplicate AFAIK, yet the arguments for deduplication seem pretty reasonable.)

@veqryn
Contributor Author

veqryn commented Apr 5, 2023

Values passed to a structured logging library necessarily include a key which is, by definition, understood as unique within the target log event

By technical definition, I agree. However, in practice, duplicates often occur due to human error. If I make a mistake, I would rather still have all duplicates logged (whether with option 1 or 3).

As an aside, we should definitely create a vet check that detects accidental duplicate names.

Would that be able to detect duplicates in sub-packages though? I feel like the most common occurrence of duplicates will be adding attributes and telemetry before passing the logger down to libraries or other sub-packages.

@dsnet For option 4, how would an error be reported? A separate log line at warning level, or some out-of-band channel or something?

I was thinking of an explicit error returned by slog.Handler.Handle. I guess a separate log line is a variation of option 4. However, a go vet check might be a better way to statically detect most of these cases.

Right now, errors coming from Handler are discarded by the most common slog methods (i.e. Info, WarnCtx, etc.). I certainly wouldn't want to change the signature of Info, WarnCtx, etc. to include returning the error. I suppose it is possible to use a middleware-type handler that captures and does something with errors, but that wouldn't be something people would use by default.
In any case, I would still want the log line output before any error is returned.
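
A sketch of such a middleware (hypothetical type and callback names; the wrapped handler still writes the line before the error propagates):

package logutil

import (
	"context"
	"log/slog"
)

// errorReportingHandler forwards to the wrapped handler and surfaces any
// Handle error via a callback, since slog.Logger.Info discards it.
type errorReportingHandler struct {
	slog.Handler
	onErr func(error)
}

func NewErrorReportingHandler(h slog.Handler, onErr func(error)) slog.Handler {
	return &errorReportingHandler{h, onErr}
}

func (h *errorReportingHandler) Handle(ctx context.Context, r slog.Record) error {
	err := h.Handler.Handle(ctx, r) // the log line is written first
	if err != nil && h.onErr != nil {
		h.onErr(err)
	}
	return err
}

// Override the derivation methods so derived handlers stay wrapped.
func (h *errorReportingHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return &errorReportingHandler{h.Handler.WithAttrs(attrs), h.onErr}
}

func (h *errorReportingHandler) WithGroup(name string) slog.Handler {
	return &errorReportingHandler{h.Handler.WithGroup(name), h.onErr}
}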

@dsnet
Member

dsnet commented Apr 5, 2023

Would that be able to detect duplicates in sub-packages though?

Depends on the complexity of call-tracking. This is a similar problem to the vet checks for fmt.Sprintf, which already have some degree of call tracking. I don't know if they descend through package boundaries, but they are able to cross function call boundaries.

@zephyrtronium
Contributor

One option might be to rename duplicated keys to !BADKEY(duplicateKey) as a sort of combination of @dsnet's options 3 and 4a while carrying some symmetry with the existing error-while-logging mechanism. Then again, there is also some asymmetry if it is only JSONHandler that checks duplicates.

@dsnet
Member

dsnet commented Apr 5, 2023

Another argument against option 2 (which discards data) is security.

If untrusted input is able to affect what names are present, then that can be used as a mechanism to hide malicious activity, knowing that it will not be logged. IMO, it seems better for readers to possibly choke on the malformed messages than for the malicious activity to not be properly logged. A human investigator looking at the malformed message can see the duplicate-key shenanigans at play.

@08d2

08d2 commented Apr 5, 2023

There are many additional (and unavoidable) security concerns in a structured logging API beyond this one, often related to how values are rendered. It is always an error to pass untrusted input to a structured logger directly.

@dsnet
Member

dsnet commented Apr 5, 2023

@08d2

There are many additional (and unavoidable) security concerns in a structured logging API beyond this one

Like what? The other security issue I can imagine is a DoS attack where you're tricked into logging something massive, but that's a distinctly different class of security issue from being able to hide activity or perform remote code execution.

@AndrewHarrisSPU

I think a map is at least very close to the ideal data structure here. Regarding performance, the current implementation of x/exp/slog/internal/buffer allocates output buffers as a sync.Pool resource.

I'm not sure I understand. Suppose we used a Go map such that you did something like:

slog.Info(map[string]any{
    "msg":   "my wonderful message",
    "other": "something else",
})

Yes, you get the benefit of static compilation preventing duplicate map keys (although it can't check the implicit "time" and "level" keys). However, I'm missing the part where the pool helps out.

When slog.Info calls the underlying slog.Handler.Handle, it goes through an interface method call, and all escape analysis goes out the window, so the input map is treated as having escaped (and thus heap-allocated). To ensure the input remains stack allocated, you would have to copy the input map to one of the pooled maps and then call Handler.Handle with it. It's not clear to me that the act of copying the map is going to be fast enough.

@AndrewHarrisSPU

AndrewHarrisSPU commented Apr 5, 2023

@dsnet

When slog.Info calls the underlying slog.Handler.Handle, it goes through an interface method call, and all escape analysis goes out the window, so the input map is treated as having escaped (and thus heap-allocated). To ensure the input remains stack allocated, you would have to copy the input map to one of the pooled maps and then call Handler.Handle with it. It's not clear to me that the act of copying the map is going to be fast enough.

I don't think we need map inputs to achieve option (2) or (3); let me clarify. This technique doesn't involve any change that isn't internal to the Handler implementation; everything else looks the same as before. There's no map copying; there is one blank, probably already-allocated map obtained from a pool when the output Buffer is obtained by Handler.Handle. (edit: some attention to pre-formatted attributes is needed here, but it can be done pretty efficiently)

This map can be populated with keys visited when the handler traverses the attributes of the record, writing to output. A couple of quirks are that the traversal is made more interesting and dynamic by things like group Attrs, LogValuers that expand into groups, or ReplaceAttrs, etc. ReplaceAttr needs to happen before a key goes into the map, but this is tractable. Also, precisely, it's not just key strings that need to be matched, it's also the group prefix - a little tricky but possible to do safely and without allocating.

If this map is populated with visited keys, we can detect whether some key has been committed previously in the log line, and trigger behavior option (2) or (3) as a result. Big picture, I'm not enthusiastic about this as a solution, but I was surprised that it seemed more performant than I would've guessed.
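
A self-contained sketch of that visited-keys check (hypothetical; a real traversal would avoid the Join allocation and apply ReplaceAttr first):

package main

import (
	"fmt"
	"strings"
)

// commitKey reports whether the fully-qualified key (group prefix plus
// attr key) is new for this record, marking it as seen.
func commitKey(seen map[string]bool, groups []string, key string) bool {
	qual := strings.Join(append(groups[:len(groups):len(groups)], key), ".")
	if seen[qual] {
		return false // duplicate: trigger behavior option (2) or (3)
	}
	seen[qual] = true
	return true
}

func main() {
	seen := map[string]bool{}
	fmt.Println(commitKey(seen, nil, "duplicated"))             // true
	fmt.Println(commitKey(seen, nil, "duplicated"))             // false
	fmt.Println(commitKey(seen, []string{"req"}, "duplicated")) // true: different group
}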

@08d2

08d2 commented Apr 6, 2023

Like what? The other security issue I can imagine is a DOS attack where you're tricked into logging something massive, but that's a distinctly different class of security issue from being able to hide activity or perform remote-code-execution.

Consider a type which renders in one way in some version X, and a different (and richer) way in a subsequent version Y. Consider the case where that change results in sensitive information from that type which was previously not logged becoming logged. This happens easily if the type grows a new sensitive field in a new version. Such changes are in practice never considered breaking in terms of versioning. A downstream application which logs a value of that type will have one behavior if it includes version X of the relevant package, and a different behavior when it includes version Y, and that version upgrade is in practice never going to be noticed, or tested, by the downstream application.

Printf-based logging does not generally have this problem, because it is not conventional to pass values directly as Printf log arguments.

@veqryn
Contributor Author

veqryn commented Apr 7, 2023

fwiw we once used a database driver that logged our database password in plain text when in debug mode, using regular old Printf (it was patched right away)

I don't really see any security issues with structured logging that don't already exist with regular logging.

And it's not like this is log4j, where we're executing the statements in the logs or trying to resolve remote XML or something. If the data is strings, the worst case I can think of is accidental logging of sensitive data.

@jba
Contributor

jba commented Apr 8, 2023

A few points:

@veqryn's specific problem, duplicate "source" keys, will be easy to resolve with a ReplaceAttr function once #59280 is approved. You can then distinguish a "source" attribute that comes from the handler by seeing if the value is a *slog.Source. It's also going to appear before a user-supplied attribute, a fact you can use right now. The design doesn't promise that that will always be true, but I can't see why we'd change it.
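
A sketch of that ReplaceAttr approach against the final log/slog API (the replacement key "caller" is an arbitrary choice):

package main

import (
	"log/slog"
	"os"
)

func main() {
	h := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
		AddSource: true,
		// Rename only the handler-generated "source" attribute, identified
		// by its *slog.Source value; user-supplied "source" attrs pass through.
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if len(groups) == 0 && a.Key == slog.SourceKey {
				if _, ok := a.Value.Any().(*slog.Source); ok {
					a.Key = "caller"
				}
			}
			return a
		},
	})
	slog.New(h).Info("hello", "source", "billing-service") // keys no longer collide
}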

In general, if you don't like duplicate keys, you can write a wrapping Handler. That can return an error if you like, but it's worth remembering that logging is often the last resort for debugging in the wild. If you've ever been in a position where you can't debug because you have a picky logger that refuses to output everything it has, you know how frustrating that can be, and how happily you'd jettison your principles of careful error detection and failing fast for just a few more characters on the screen.

This isn't a proposal, but if it were I'd vote to decline. We can always add dedup functionality later (though not as the default, it's true). As for json v2 compatibility, @dsnet has provided a good blueprint for how to achieve that when the time comes. But you go into the standard library with the packages you have, not the packages you wish you had. slog should behave like the current encoding/json.

@gopherbot gopherbot modified the milestones: Go1.21, Go1.22 Aug 8, 2023
@dbofmmbt

dbofmmbt commented Aug 21, 2023

I arrived here after seeing weird stuff in the logs. I thought that by doing something like this I would be able to override the previously set attribute:

package main

import (
	"log/slog"
	"os"
)

func main() {
	first := slog.New(slog.NewJSONHandler(os.Stdout, nil)).With("tag", "fooo")
	first.Info("hello")

	second := first.With("tag", "bar")
	second.Info("bye")
}

// {"time":"2009-11-10T23:00:00Z","level":"INFO","msg":"hello","tag":"fooo"}
// {"time":"2009-11-10T23:00:00Z","level":"INFO","msg":"bye","tag":"fooo","tag":"bar"}

In our logs, it looks like the fields are being "merged", with parts of words from different attributes. e.g. start-activityware from start-activity and logger-middleware. It may be something in our logging pipeline going crazy because of this duplication.

Is attribute overriding supported? I wanted to use it to change the "component" name in the logs. Something like "foo-handler", "foo-service" etc.

@AndrewHarrisSPU

AndrewHarrisSPU commented Aug 21, 2023

Scrambled text could result from racing writes to os.Stdout. IIRC this shouldn’t happen with the code you’ve listed(?) but just to toss out an idea quickly.

edit: NewJSONHandler puts a mutex on the writer, but that mutex is only checked by that handler and handlers derived from it. So a, b := slog.New(slog.NewJSONHandler(os.Stdout, nil)), slog.New(slog.NewJSONHandler(os.Stdout, nil)) can have races between a and b. Apologies if this is the wrong track, but might be worth checking.

As far as attribute overriding goes, there's no deduplication in the provided handlers. I think @jba was writing a Handler guide, which would be a great next step; it's definitely possible.

@dbofmmbt

@AndrewHarrisSPU indeed, I was creating multiple handlers in a middleware, so each request had its own mutex to os.Stdout. Thank you! It makes a lot of sense.

@jba
Contributor

jba commented Aug 29, 2023

Thanks @AndrewHarrisSPU for that quick and accurate diagnosis. @dbofmmbt, there's nothing we can do about that multiple-mutex issue, unfortunately. In fact the original log package has the same issue. I think the best solution is to wrap your io.Writer in one that acquires the mutex during Write, then install the wrapped io.Writer in multiple loggers. A well-behaved handler should call Write once per log event, and the built-in ones do, though that is not a hard requirement.
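
A minimal sketch of that wrapped io.Writer (the syncWriter name is illustrative):

package logutil

import (
	"io"
	"sync"
)

// syncWriter serializes Writes from every logger that shares one
// destination, so concurrent log lines can't interleave.
type syncWriter struct {
	mu sync.Mutex
	w  io.Writer
}

func (s *syncWriter) Write(p []byte) (int, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.w.Write(p)
}

Each middleware would then construct its handler with the same shared instance, e.g. slog.NewJSONHandler(shared, nil) where shared := &syncWriter{w: os.Stdout}.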

About deduping or overriding, I don't think the default handlers will add it. It's expensive, especially combined with pre-formatting. You can easily write your own wrapping Handler whose WithAttrs method keeps a map, so calls to Logger.With will override previous values. You'll give up pre-formatting (since you can't format anything in that map until Handle is called) but you may not need the extra speed.
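
A sketch of such a wrapping Handler (hypothetical; last-wins semantics, groups handled naively, attribute order not preserved):

package logutil

import (
	"context"
	"log/slog"
)

// dedupHandler keeps With-attributes in a map, so later calls to
// Logger.With override earlier values for the same key.
type dedupHandler struct {
	inner slog.Handler
	attrs map[string]slog.Attr
}

func NewDedupHandler(inner slog.Handler) slog.Handler {
	return &dedupHandler{inner: inner, attrs: map[string]slog.Attr{}}
}

func (h *dedupHandler) Enabled(ctx context.Context, l slog.Level) bool {
	return h.inner.Enabled(ctx, l)
}

func (h *dedupHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	m := make(map[string]slog.Attr, len(h.attrs)+len(attrs))
	for k, a := range h.attrs {
		m[k] = a
	}
	for _, a := range attrs {
		m[a.Key] = a // last wins
	}
	return &dedupHandler{inner: h.inner, attrs: m}
}

func (h *dedupHandler) WithGroup(name string) slog.Handler {
	return &dedupHandler{inner: h.inner.WithGroup(name), attrs: h.attrs}
}

func (h *dedupHandler) Handle(ctx context.Context, r slog.Record) error {
	// Record-level attrs override With-level ones. Nothing is
	// pre-formatted: all attrs are resolved here, at Handle time.
	m := make(map[string]slog.Attr, len(h.attrs)+r.NumAttrs())
	for k, a := range h.attrs {
		m[k] = a
	}
	r.Attrs(func(a slog.Attr) bool {
		m[a.Key] = a
		return true
	})
	nr := slog.NewRecord(r.Time, r.Level, r.Message, r.PC)
	for _, a := range m {
		nr.AddAttrs(a)
	}
	return h.inner.Handle(ctx, nr)
}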

@jba
Contributor

jba commented Sep 25, 2023

@codeliger, I'll let others speak to the utility of the built-in JSON handler. But duplicate keys are valid JSON, according to the RFC.

@dbofmmbt

dbofmmbt commented Sep 25, 2023

@jba would you mind pointing out the section where the RFC says duplicated fields are valid? I only found this:

When the names within an object are not unique, the behavior of software that receives such an object is unpredictable. Many implementations report the last name/value pair only. Other implementations report an error or fail to parse the object, and some implementations report all of the name/value pairs, including duplicates.

This piece is not encouraging with regard to generating duplicated fields in the stdlib's JSONHandler, IMO, since the behavior looks unspecified.

@zerkms

zerkms commented Sep 25, 2023

@dbofmmbt "unpredictable" is still standardised. The same way like - what if you have a 50 digit long integer. It's standard compliant, yet not every parser would be able to parse it or even when they do - they might interpret it differently.

Also this: https://github.com/nst/JSONTestSuite

@AndrewHarrisSPU

@dbofmmbt

This piece is not encouraging in regards to generating duplicated fields in the stdlib's JSONHandler, IMO, since its behavior looks unspecified.

That's exactly it, it's unspecified by JSON:

The RFC qualifies that:

The names within an object SHOULD be unique. ...

with SHOULD more precisely defined:

3. SHOULD   This word, or the adjective "RECOMMENDED", mean that there
   may exist valid reasons in particular circumstances to ignore a
   particular item, but the full implications must be understood and
   carefully weighed before choosing a different course.

Relative to other Go logging libraries: duplicate keys are possible with slog, zap, and zerolog, while logrus is last-seen-wins. This could be seen as driven by performance concerns: in contrast to the formers' (fast) data models, logrus uses (expensive) maps. But it's not the case that logrus is doing something more correct according to the JSON specification; the RFC identifies last-seen-wins as only one possible implementation, with no single correct one.

@veqryn
Contributor Author

veqryn commented Oct 3, 2023

Hey all, I've created a logging handler middleware that will deduplicate attribute keys, before they go to the json logger.
Please check it out: https://github.com/veqryn/slog-dedup
(I'm also looking for constructive criticism, but please do it as an issue in that repo, not here.)

go get github.com/veqryn/slog-dedup

package main

import (
	"log/slog"
	"os"

	slogdedup "github.com/veqryn/slog-dedup"
)

func main() {
	// OverwriteHandler
	overwriter := slogdedup.NewOverwriteHandler(slog.NewJSONHandler(os.Stdout, nil), nil)
	slog.SetDefault(slog.New(overwriter))

	/*
	  {
	    "time": "2023-10-03T01:30:00Z",
	    "level": "INFO",
	    "msg": "this is the dedup overwrite handler",
	    "duplicated": "two"
	  }
	*/
	slog.Info("this is the dedup overwrite handler",
		slog.String("duplicated", "zero"),
		slog.String("duplicated", "one"),
		slog.String("duplicated", "two"),
	)

	// IgnoreHandler
	ignorer := slogdedup.NewIgnoreHandler(slog.NewJSONHandler(os.Stdout, nil), nil)
	slog.SetDefault(slog.New(ignorer))

	/*
	  {
	    "time": "2023-10-03T01:30:00Z",
	    "level": "INFO",
	    "msg": "this is the dedup ignore handler",
	    "duplicated": "zero"
	  }
	*/
	slog.Info("this is the dedup ignore handler",
		slog.String("duplicated", "zero"),
		slog.String("duplicated", "one"),
		slog.String("duplicated", "two"),
	)

	// IncrementHandler
	incrementer := slogdedup.NewIncrementHandler(slog.NewJSONHandler(os.Stdout, nil), nil)
	slog.SetDefault(slog.New(incrementer))

	/*
	  {
	    "time": "2023-10-03T01:30:00Z",
	    "level": "INFO",
	    "msg": "this is the dedup incrementer handler",
	    "duplicated": "zero",
	    "duplicated#01": "one",
	    "duplicated#02": "two"
	  }
	*/
	slog.Info("this is the dedup incrementer handler",
		slog.String("duplicated", "zero"),
		slog.String("duplicated", "one"),
		slog.String("duplicated", "two"),
	)

	// AppendHandler
	appender := slogdedup.NewAppendHandler(slog.NewJSONHandler(os.Stdout, nil), nil)
	slog.SetDefault(slog.New(appender))

	/*
	  {
	    "time": "2023-10-03T01:30:00Z",
	    "level": "INFO",
	    "msg": "this is the dedup appender handler",
	    "duplicated": [
	      "zero",
	      "one",
	      "two"
	    ]
	  }
	*/
	slog.Info("this is the dedup appender handler",
		slog.String("duplicated", "zero"),
		slog.String("duplicated", "one"),
		slog.String("duplicated", "two"),
	)
}

@jba can I get this added to the slog wiki please? https://github.com/golang/go/wiki/Resources-for-slog
It would also be great if the stdlib JSON handler's docs mentioned somewhere that it does not prevent duplicate field/property keys in the JSON.

@prochac

prochac commented Oct 30, 2023

IMO it should be one of the JSONHandler's options. The JSON RFC says objects SHOULDN'T contain duplicate keys, but doesn't explicitly forbid them.
But imagine your log processor is written in Go using the encoding/json/v2 implementation, which does not allow duplicate keys: you will lose some logs due to the duplicates, and you will very probably find out at the worst possible moment, while looking for the important logs.

Because the ship has sailed and log/slog is in Go 1.21, and also because it has performance implications, it should be opt-in. Normally, I would be for opt-out, because it would mimic the behavior of github.com/sirupsen/logrus, but it's not an option now.

@jba
Contributor

jba commented Oct 30, 2023

encoding/json/v2 will have an option for allowing duplicates. See the discussion.

When/if encoding/json/v2 is in the standard library, we will find a way for slog to use it. Maybe we'll have a JSONHandlerV2, or maybe we'll add a list of options to the Options struct.

@prochac

prochac commented Oct 31, 2023

@jba Sorry, I wasn't clear.
What I meant is any log processor in any language that doesn't support duplicate keys.
Let's change our example to NodeJS, which also seems not to support duplicate keys:
https://stackoverflow.com/questions/30842675/how-to-get-the-json-with-duplicate-keys-completely-in-javascript

Most probably it will be written in C/C++/Rust/Go, but you know, it's just an example. And I'm also not very aware of the JSON lib situation in those ecosystems.

With the logger chaining pattern, like this go.dev/play example, it's easy to screw up.

I'm migrating our codebase from logrus to slog, and I'm really not sure if our codebase is "correct", so I'm adding the dedup.NewOverwriteHandler to achieve the same logrus-like behavior. At least from the start.

I'm not worried about the logs not reaching stdout, but about them not reaching our Grafana.

@jba
Contributor

jba commented Oct 31, 2023

@prochac Using one of @veqryn's dedup handlers is a good way to address your concerns.

@prochac

prochac commented Oct 31, 2023

Yes, that's definitely what I'm doing at this moment, along with tuning HandlerOptions.ReplaceAttr to catch some other quirks.

I just thought it would be worth having in the std lib. It's a good safety net, if people understand the performance implications.

@earthboundkid
Contributor

Duplicate keys can be lost if the processor doesn't accept duplicates, but they can also be lost if the processor expects mykey and gets mykey-1. There's no best option here.

@prochac

prochac commented Oct 31, 2023

@carlmjohnson Not just keys: the whole log line can be dropped as invalid JSON, if the JSON parser considers duplicate keys to make an object invalid.

Edit:
I wanted to do it later, after app-level migration, but it seems that our stack processes duplicate keys fine. The last occurrence wins.
stack: Vector+Loki

@dsnet
Member

dsnet commented Oct 31, 2023

@prochac: There are no good answers when it comes to duplicate names.

You mentioned that you view your logs through Grafana. What is the underlying data store that you're using with Grafana, and does that at least store JSON objects with duplicate names?

In particular, I'm trying to understand whether this is a storage problem or a reading problem:

  • A storage problem is one where Prometheus (a common data store for Grafana) refuses to store the log message and drops it entirely. Do we know whether Prometheus or other common log data stores (e.g., Splunk or DataDog) drop objects with duplicate names? In this situation, there is no hope for recovery, since the data is lost forever.
  • A reading problem is one where some application written in a programming language of choice can't read the data. The default behavior of v2 "json" will reject values with duplicate names, but this can be easily disabled with json.Unmarshal(b, &v, jsontext.AllowDuplicateNames(true)) (see the sketch below). That said, the semantic behavior of duplicate names will differ across implementations. This situation is bad, but at least there is hope of recovering the log data.
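
A sketch of that read-side toggle, assuming the current go-json-experiment module layout:

package main

import (
	"fmt"

	"github.com/go-json-experiment/json"
	"github.com/go-json-experiment/json/jsontext"
)

func main() {
	in := []byte(`{"a":1,"a":2}`)

	var strict map[string]int
	fmt.Println(json.Unmarshal(in, &strict)) // non-nil: duplicate names rejected by default

	var lax map[string]int
	if err := json.Unmarshal(in, &lax, jsontext.AllowDuplicateNames(true)); err == nil {
		fmt.Println(lax["a"]) // 2: the last value wins when decoding into a Go map
	}
}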

@prochac

prochac commented Oct 31, 2023

@dsnet I don't have a particular problem; what I have is a concern, in an imaginary scenario:
log/slog produces JSON logs that look how they SHOULDN'T (though not MUSTN'T) per RFC 8259.
I know it's the user's fault, but...
That's why I suggested it would be worth having in the std lib.
I know I can use the 3rd-party library, and I'm doing so. At least until we fully migrate and verify that everything is working just fine.

ECMA says that the latest wins

25.5.1 JSON.parse
https://262.ecma-international.org/14.0/#sec-json.parse

@08d2

08d2 commented Nov 13, 2023

Are there any other commonly-used JSON parsers that will fail to parse {"a":1,"a":2} by default?

@prochac

prochac commented Nov 13, 2023

Are there any other commonly-used JSON parsers that will fail to parse {"a":1,"a":2} by default?

JavaScript, the father of JSON, definitely doesn't.
Python's stdlib json takes the last value as well.

Maybe it's not a good strategy to be more Catholic than the Pope.
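
For comparison, today's v1 encoding/json is equally permissive; a quick check that the last value wins when decoding into a map:

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	var m map[string]int
	if err := json.Unmarshal([]byte(`{"a":1,"a":2}`), &m); err != nil {
		panic(err)
	}
	fmt.Println(m["a"]) // 2: v1 keeps the last duplicate
}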

@dsnet
Member

dsnet commented Nov 13, 2023

In "encoding/json/v2", we are aiming to reject duplicate names by default, but make it fairly easy to opt-out of the strict check.

The decision to reject duplicate names is driven by various security exploits that have occurred over the years that depended on abusing the unspecified behavior that occurs with duplicate names (across a variety of JSON implementations in various languages). See CVE-2017-12635 as one such example. This particular security stance doesn't benefit logging as much, since the content of a log message is rarely used to alter some critical piece of control flow. However, JSON is used for so much more than just logging (e.g., RPCs), that it is safer to be secure by default even if it is the more annoying default for logs analysis.

Also, even if various libraries allow duplicate names, the behavior of how duplicates are handled varies widely. It is not enough to just ask whether they are permitted. In the CVE mentioned above, Erlang and JavaScript disagreed about the semantic meaning of a JSON object with duplicate names. In a similar way, permitting duplicate names means that you may end up with logs-analysis results that differ (perhaps wildly) depending on which JSON implementation you happened to use.

@marcusperlick

I think this is not just a JSON topic. The log/slog package should be explicit about the semantics of duplicate keys, whether provided via log.Log(…) or introduced via log.With(…).Log(…). With well-defined semantics, each Handler must find its own strategy to implement them. Otherwise, the information in a log would depend on the choice of log Handler – not on what was provided by the application that uses the logger.

@madkins23

madkins23 commented Jan 8, 2024

I stumbled onto this problem while I was trying to add some slog/zerolog handlers to a forked copy of go-logging-benchmarks. I filed a ticket there because half of the systems they're benchmarking duplicate a lot of test attributes (and the other half don't) and I felt like the benchmarks might not be completely accurate. Now I'm working on a suite of slog.Handler tests, with the primary motivation to check for duplicate fields, as slogtest does not.

My personal feeling is that duplicate fields in JSON are bad (many arguments above) and should be documented as such. The greatest advantage of slog is that handlers can be swapped out without rewriting the code that does the logging. This advantage is lessened if the logging output changes when the handlers are swapped, particularly for situations involving downstream processing of log files.

The simplest solution is option 2 (latest wins), though there may be unpredictability in "latest" in some situations. Or perhaps, if a field such as "alpha" repeats, turn it into an array of all the values.

In any event, as mentioned earlier, the log/slog documentation should clarify the expected behavior, whatever it is supposed to be, sooner rather than later. I'm specifically thinking of the bullet items in the slog.Handler documentation.
