encoding/gob: optimize decoder for single object streams #46089
Comments
One idea would be to re-use a Decoder instance anyway, and ignore the duplicate type definitions. This is simple enough:

--- a/src/encoding/gob/decoder.go
+++ b/src/encoding/gob/decoder.go
@@ -57,7 +57,7 @@ func NewDecoder(r io.Reader) *Decoder {
// recvType loads the definition of a type.
func (dec *Decoder) recvType(id typeId) {
// Have we already seen this type? That's an error
- if id < firstUserId || dec.wireType[id] != nil {
+ if id < firstUserId /*|| dec.wireType[id] != nil*/ {
dec.err = errors.New("gob: duplicate type received")
return
}
This does yield the kind of performance improvement that I'm looking for.
However, I'm not sure how best to do this safely. Since type ids are generated on demand, and values are only sent when non-zero, it seems that two objects would be assigned overlapping type ids if different fields were empty. Is that correct? In that case, a comparison against the previously recorded type might be needed instead:

--- a/src/encoding/gob/decoder.go
+++ b/src/encoding/gob/decoder.go
@@ -57,7 +57,7 @@ func NewDecoder(r io.Reader) *Decoder {
// recvType loads the definition of a type.
func (dec *Decoder) recvType(id typeId) {
// Have we already seen this type? That's an error
- if id < firstUserId || dec.wireType[id] != nil {
+ if id < firstUserId {
dec.err = errors.New("gob: duplicate type received")
return
}
@@ -68,6 +68,12 @@ func (dec *Decoder) recvType(id typeId) {
if dec.err != nil {
return
}
+ if old := dec.wireType[id]; old != nil {
+ if same(old, wire) {
+ // Same as old type, ignore and use existing
+ return
+ }
+ }
// Remember we've seen this type.
dec.wireType[id] = wire
}

Would something like this be feasible, or is there other state inside the decoder that would also need to be updated to make this work reliably?
This assumption appears to be incorrect. From what I can see, type information is sent for every nested type regardless of value. If that's true in all cases, then it suggests that type identifiers would remain stable across encoder invocations as long as the underlying type has the same fields in the same order. If type ids are in fact stable in this case, then it becomes much safer to re-use decoder instances with the patch above.

Does this seem like a reasonable approach?
While this is a reasonable heuristic, it is not perfect. Specifically, by checking that the same type name and id are received, we can detect when fields are added or reordered, but only as long as their types also change. So if someone simply reordered two fields of the same type (i.e. two adjacent fields that are both strings), the change would go undetected. Perhaps a better checksum could be computed elsewhere, containing the names of the fields as well, to more effectively error out when the underlying structure differs from the compiled decoder?
Instead of using this for the sameness check, I tried a structural comparison of the wire types. Below is an updated version of my patch, which allows me to reuse a single decoder instance across streams. An error is only thrown when the same decoder is used to try to decode two different generations of an object, signaling to the caller that the fields have been altered in a way that requires the cached decoder to be recreated and retried. The performance is 4-5x better compared to creating a fresh decoder every time.

If this new behavior makes sense and is worth including in the stdlib (optionally under a new setting), then I can take some time to write tests and submit a CL. But if it's not likely to be accepted, I will just maintain a fork.
diff --git a/src/encoding/gob/decoder.go b/src/encoding/gob/decoder.go
index b52aabe54b..5957a7435d 100644
--- a/src/encoding/gob/decoder.go
+++ b/src/encoding/gob/decoder.go
@@ -56,9 +56,9 @@ func NewDecoder(r io.Reader) *Decoder {
// recvType loads the definition of a type.
func (dec *Decoder) recvType(id typeId) {
- // Have we already seen this type? That's an error
- if id < firstUserId || dec.wireType[id] != nil {
- dec.err = errors.New("gob: duplicate type received")
+ // Is this type id within allowed range?
+ if id < firstUserId {
+ dec.err = errors.New("gob: reserved type id received")
return
}
@@ -68,6 +68,13 @@ func (dec *Decoder) recvType(id typeId) {
if dec.err != nil {
return
}
+
+ // Have we already seen this type id for a different type? That's an error
+ if old := dec.wireType[id]; old != nil && !old.same(wire) {
+ dec.err = errors.New("gob: conflicting type id received")
+ return
+ }
+
// Remember we've seen this type.
dec.wireType[id] = wire
}
diff --git a/src/encoding/gob/type.go b/src/encoding/gob/type.go
index 31c0ef7af1..a41c90c99d 100644
--- a/src/encoding/gob/type.go
+++ b/src/encoding/gob/type.go
@@ -670,6 +670,53 @@ func (w *wireType) string() string {
return unknown
}
+func (w *wireType) same(v *wireType) bool {
+ if w == v {
+ return true
+ }
+ switch {
+ case w.ArrayT != nil &&
+ v.ArrayT != nil:
+ return w.ArrayT.Name == v.ArrayT.Name &&
+ w.ArrayT.Elem == v.ArrayT.Elem &&
+ w.ArrayT.Len == v.ArrayT.Len
+ case w.SliceT != nil &&
+ v.SliceT != nil:
+ return w.SliceT.Name == v.SliceT.Name &&
+ w.SliceT.Elem == v.SliceT.Elem
+ case w.StructT != nil &&
+ v.StructT != nil:
+ if w.StructT.Name != v.StructT.Name {
+ return false
+ }
+ if len(w.StructT.Field) != len(v.StructT.Field) {
+ return false
+ }
+ for i := range w.StructT.Field {
+ if w.StructT.Field[i].Name != v.StructT.Field[i].Name ||
+ w.StructT.Field[i].Id != v.StructT.Field[i].Id {
+ return false
+ }
+ }
+ return true
+ case w.MapT != nil &&
+ v.MapT != nil:
+ return w.MapT.Name == v.MapT.Name &&
+ w.MapT.Key == v.MapT.Key &&
+ w.MapT.Elem == v.MapT.Elem
+ case w.GobEncoderT != nil &&
+ v.GobEncoderT != nil:
+ return w.GobEncoderT.Name == v.GobEncoderT.Name
+ case w.BinaryMarshalerT != nil &&
+ v.BinaryMarshalerT != nil:
+ return w.BinaryMarshalerT.Name == v.BinaryMarshalerT.Name
+ case w.TextMarshalerT != nil &&
+ v.TextMarshalerT != nil:
+ return w.TextMarshalerT.Name == v.TextMarshalerT.Name
+ }
+ return false
+}
+
type typeInfo struct {
id typeId
encInit sync.Mutex // protects creation of encoder
FYI, wiring up this cacheable decoder to my app resulted in a 3x performance improvement in the gob unmarshal path. This is with a fairly complex struct with lots of nested types.

Though things looked great in my app benchmarks, with a real-world dataset there are enough variations across object generations that many of the advantages of caching are lost. I think it's fair to say that the design of encoding/gob makes it unsuitable for my use-case.
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

What did you do?

I'm using encoding/gob to encode and decode single objects and store them in a persistent object database.

I realize that encoding/gob is optimized for streams of objects. The documentation recommends using a single Encoder and Decoder to amortize the cost of compilation. That said, there are several advantages to encoding/gob which make it attractive in non-stream use-cases.

I am looking for ways to optimize the gob.Decoder implementation to improve performance and make it suitable for use in non-streaming contexts.

I created a benchmark to demonstrate the order-of-magnitude difference in performance between using a single Decoder and creating/compiling a new Decoder for every instance.

Profiling the decoder shows that, as expected, most of the time is spent compiling the decoder instance (encoding/gob.(*Decoder).compileDec).

I'm looking to understand what, if anything, could be done to optimize and cache the decoder to improve performance.
My first attempt was simply to re-use a single Decoder instance and feed in multiple streams. This doesn't work, and results in an "extra data in buffer" (or "duplicate type received" with b3235b7) error. The Decoder does not expect to receive type definitions multiple times.
Next, I tried to add more caching instead. One hotspot in compileDec is due to reflection, and I was able to cache the result of the reflect call with a small patch. This has only a minor impact, because the rest of compileDec is still quite slow and does a lot of repeated work.
is still quite slow and does a lot of repeated work.At this point, I am wondering:
Is it worth trying to make
encoding/gob
better suited for this use-case?I have explored other encoding libraries, but they seem to lack a lot of the bells and whistles that are already part of gob, making things like supporting interfaces very complicated
If encoding/gob can be improved, is it worth doing that via a new flag or feature, vs forking encoding/gob and modifying it for this new use-case?
How can the gob decoder compiler best be optimized? Are compiled decoders able to be cached, or does the design of the library make that impractical?
Would it be possible for instance to define more global types like the built-in ones, such that a decoder instance didn't have to recompile decOps for the same types over and over again?
cc @robpike