proposal: x/tools/diff: a package for computing text differences #58893

adonovan · 2023-03-06T15:05:11Z

Recent work in gopls resulted in the creation of an internal package for computing text differences in the manner of the UNIX diff command, for applying those differences to a file in the manner of the patch command, and for presenting line-oriented diffs using +/- prefix notation aka GNU "unified" diff format (diff -u). Diff functionality is invaluable for developer tools that transform source files, and for tests that compare expected and actual outputs. We propose to publish our diff package with the public API shown below.

// Package diff computes differences between text files or strings.
package diff // import "golang.org/x/tools/diff"

// -- diff --

// Strings computes the differences between two strings.
// The resulting edits respect rune boundaries.
func Strings(before, after string) []Edit

// Bytes computes the differences between two byte slices.
// The resulting edits respect rune boundaries.
func Bytes(before, after []byte) []Edit

// An Edit describes the replacement of a portion of a text file.
type Edit struct {
	Start, End int    // byte offsets of the region to replace
	New        string // the replacement
}

func (e Edit) String() string

// -- apply --

// Apply applies a sequence of edits to the src buffer and returns the
// result. Edits are applied in order of start offset; edits with the
// same start offset are applied in the order they were provided.
//
// Apply returns an error if any edit is out of bounds,
// or if any pair of edits is overlapping.
func Apply(src string, edits []Edit) (string, error)

// ApplyBytes is like Apply, but it accepts a byte slice.
// The result is always a new array.
func ApplyBytes(src []byte, edits []Edit) ([]byte, error)

// SortEdits orders a slice of Edits by (start, end) offset.
// This ordering puts insertions (end = start) before deletions
// (end > start) at the same point, but uses a stable sort to preserve
// the order of multiple insertions at the same point.
// (Apply detects multiple deletions at the same point as an error.)
func SortEdits(edits []Edit)

// -- unified --

// Unified returns a unified diff of the old and new strings.
// If the strings are equal, it returns the empty string.
// The old and new labels are the names of the old and new files.
func Unified(oldLabel, newLabel, old, new string) string

// ToUnified applies the edits to content and returns a unified diff.
// It returns an error if the edits are inconsistent; see [Apply].
// The old and new labels are the names of the content and result files.
func ToUnified(oldLabel, newLabel, content string, edits []Edit) (string, error)

@pjweinb @findleyr

The text was updated successfully, but these errors were encountered:

pjweinb · 2023-03-06T16:04:46Z

The idea of doing this seems fine, but perhaps the documentation should include a few caveats to cover the following points. (There's lots of ways to write diff, and sadly no best one. This one is for inputs that are fairly similar, or fairly short.)

When the edit distance between the inputs is fairly small, the algorithm might find a minimal edit sequence, but it might not. All that is claimed is that applying the Edits to the 'before' gets the 'after'.
There is an internal (inaccessible) parameter that controls the maximum length of the returned []Edit. If this bound is hit, the returned []Edit will have sensible edits at the beginning and end of the input, but a big replace in the middle.

bcmills · 2023-03-06T16:20:16Z

See previously #23113, #41980.

ianlancetaylor · 2023-03-06T18:17:15Z

See also #45200.

DeedleFake · 2023-03-06T18:52:38Z

Instead of SortEdits(), maybe func (Edit) Less(Edit) bool would be more general. It's pretty straightforward to plug in via slices.SortFunc(edits, diff.Edit.Less).

adonovan · 2023-03-06T18:54:23Z

I agree that we should guarantee only that the composition of diff.Strings and diff.Apply is the identity, and nothing about the specific edits that it returns. We should probably also define the Unified text form in more detail.

Thanks for the links to related proposals. There's a fair bit of interest in both the narrow concept of text diff as proposed here, and in richer kinds of structured value diff for use in testing. If some form of the latter is accepted into the standard library, then perhaps simple text diff, on which it would depend, would also need to be in the standard library, though not necessarily exposed. I'm going to resist the temptation to argue that this should be a standard package. We can always do that later.

adonovan · 2023-03-06T19:05:05Z

Instead of SortEdits(), maybe func (Edit) Less(Edit) bool would be more general. It's pretty straightforward to plug in via slices.SortFunc(edits, diff.Edit.Less).

I tried that initially, but it turns out to be incorrect: it's imperative that you use sort.Stable for edits since insertions at the same point must preserve their relative order.

DeedleFake · 2023-03-06T19:23:05Z

I tried that initially, but it turns out to be incorrect: it's imperative that you use sort.Stable for edits since insertions at the same point must preserve their relative order.

There is also a slices.SortStableFunc().

Alternatively, you could add a mechanism to Edit that could help them maintain their relative ordering outside of external context such as a numbered priority. Otherwise, if you need a function to sort a []Edit, the implicit assumption is that an unsorted slice is likely to be obtained from somewhere, but if the relative ordering of the elements of that slice is important than there's also an assumption that that slice will always already be partially ordered correctly. That seems kind of error-prone to me.

adonovan · 2023-03-06T20:14:14Z

the implicit assumption is that an unsorted slice is likely to be obtained from somewhere, but if the relative ordering of the elements of that slice is important than there's also an assumption that that slice will always already be partially ordered correctly. That seems kind of error-prone to me.

The definition of Apply makes clear that the slice of edits is a list, not a set: the relative ordering of insertions is important. But Apply can call SortEdits internally. Within gopls, we use SortEdits after merging lists of edits to the same file, but simple concatenation should suffice. It's also used to ensure to ensure a deterministic order, which some clients have mistakenly assumed.

Perhaps we should remove SortEdits from the API and let gopls implement its own copy of that function.

earthboundkid · 2023-03-06T21:04:16Z

Can the Strings and Bytes functions be unified behind [byteseq string|[]byte]? Perhaps diff.Of?

adonovan · 2023-03-06T21:07:11Z

Can the Strings and Bytes functions be unified behind [byteseq string|[]byte]? Perhaps diff.Of?

They could, but it seems like a lot of trouble just to achieve name overloading.

earthboundkid · 2023-03-06T21:31:10Z

The other way around, I feel like it's a lot of work to have duplicate Strings and Bytes functions that work the same way instead of having callers cast their []byte to string or having a single generic function.

adonovan · 2023-03-06T22:09:03Z

The other way around, I feel like it's a lot of work to have duplicate Strings and Bytes functions that work the same way instead of having callers cast their []byte to string or having a single generic function.

The allocation penalty of the bytes conversions for large files when the diff is empty (the common case, which needn't allocate any Edits) is a pretty significant multiple of the cost of the actual diff algorithm: 8x in one quick benchmark of a 10KB file.

DeedleFake · 2023-03-06T22:10:00Z

There's no need to maintain two versions of the functionality in either case. It's very easy as proposed to just use the same implementation since it doesn't perform any modification in Bytes():

func Strings(before, after string) []Edit {
  return Bytes(
    unsafe.Slice(unsafe.StringData(before), len(before)),
    unsafe.Slice(unsafe.StringData(after), len(after)),
  )
}

Or, so as to avoid the unsafe usage, you could have each of the exported functions call a third, unexported function that's generic like your suggestion. I don't think that's really necessary, though.

adonovan · 2023-03-06T23:05:04Z

it doesn't perform any modification in Bytes():

That's true, but the proof requires auditing a thousand lines of code across half a dozen files, so I hesitate.

The generics approach is safer and more appealing, and it should nicely handle even the intentional differences between bytes and strings. For example, string(x)==string(y) is an efficient way to compare bytestring values, and Edit{New: string(x[i:j])} allocates if and only if x is a []byte. (Sadly golang.org/x/tools still can't use generics quite yet, but soon.)

It may cause significant expansion of the object code though; I should measure.

adonovan · 2023-03-08T14:20:10Z

Chatting with @rsc yesterday he raised a number of interesting questions:

Should the Edit.New field be a string as proposed, potentially incurring allocation when the input is a []byte slice, or a pair of indices into the "new" text? The latter avoids allocation but requires that the new text be present in order to further interpret the slice of edits.
Should the order of arguments to Unified be (oldLabel string, oldContent []byte, newLabel string, newContent []byte), alternating labels and contents, so that it's impossible to switch a label and a content by mistake? Personally I'm not sure the risk is so great that it's worth doing. If you execute the statement even once with the arguments switched you'll notice immediately (e.g. your test for an empty diff will fail). And if we provided a string-content variant of Unified (as perhaps we should for both convenience and efficiency) then it would nullify the benefit.
What exactly is promised regarding the diff? I think we should promise only that the composition Apply(Diff(x, y), x) yields y. For unified diffs we don't have a parser; in practice the one that matters is GNU patch, so we could require that our output can be successfully parsed and applied by it, but GNU patch's "detailed description" of the unified diff format leaves some things unstated.
The Myers algorithm (without a depth bound) is quadratic. Our implementation does use a depth bound, but may yet generate very suboptimal diffs for large files with widespread edits. (Classic worst case: removing carriage returns before line feeds in Windows text files.) Is that acceptable? Russ added an implementation of Hunt-Szymanski internal to the standard library; it is O(n log n) and based on unique lines.
Our implementation returns sub-line edits (e.g. the diff for a fix to a single-letter typo may be minimal). Internally it merges and expands these to line granularity when printing unified output, but the function to do this is not exposed. Do real clients want sub-line edits? (gopls does)
Proposals to add a testing-specific diff function (e.g. testing.Diff) might have an easier time answering some of these questions because the domain is more constrained: typically, in a test, the inputs and outputs small, they are nearly always equal, and the desired output is just unified text, not a list-of-edits data structure. A general purpose API such as this one might need to consider more extreme performance outliers.

bcmills · 2023-03-08T14:36:49Z

For unified diffs we don't have a parser; in practice the one that matters is GNU patch, so we could require that our output can be successfully parsed and applied by it, but GNU patch's "detailed description" of the unified diff format leaves some things unstated.

That sounds like an excellent application for a fuzz test! 🙃

rsc · 2023-03-08T19:02:19Z

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

dsnet · 2023-03-08T19:16:45Z

On one hand, I would like this to be a generic Diff function that can compare any two []E, which is what cmp needs. However, I recognize that there are a lot of cool optimizations that only exist if we treat []E as essentially a sequence of runes (or bytes).

I think there's utility in providing a Diff that is specialized for text, but would still love to see a Diff that can find the difference between two lists of length n and m given only an equality function (similar to the internal Difference function used by cmp).

dolmen · 2023-04-12T10:32:56Z

This package doesn't seem to be specific to the Go language like others packages in x/tools. From README:

This repository provides the golang.org/x/tools module, comprising various tools and packages mostly for static analysis of Go programs

So why not x/text/diff?

adonovan · 2023-04-12T12:42:49Z

This package doesn't seem to be specific to the Go language like others packages in x/tools. [...] So why not x/text/diff?

The text module is primarily concerned with internationalization and localization, and deals with Unicode normalization, whereas this implementation computes differences over rune sequences without regard to such matters. Also, diff is an important component of typical code transformation tools, which is very much the domain of the x/tools module.

rsc · 2023-04-12T17:37:14Z

FWIW I think x/text/diff would be a fine home for a diff package - text doesn't have to be all about Unicode. For example text/template is not. But we probably still shouldn't commit to a diff API right now anyway.

pjweinb · 2023-04-25T14:30:26Z

I think this package is too special purpose. It's designed for an editor, where sub-line diffs are useful, but there's no point in producing a large number of them.

mpx · 2023-04-25T14:53:18Z

This issue is actively tracked through the proposal process, presumably it should be re-opened to allow that process to continue?

DeedleFake · 2023-04-25T15:05:53Z

@pjweinb doesn't seem to be a member of the Go team. How did he close the issue? Shouldn't that be something only the author of the issue or a moderator can do?

pjweinb · 2023-04-25T15:18:33Z

ah. well, i am on the Go team, and I think Alan meant this to be our proposal, so that's why I closed it. But we can reopen it if that would allow for a cleaner process.

…

On Tue, Apr 25, 2023 at 11:06 AM DeedleFake ***@***.***> wrote: @pjweinb <https://github.com/pjweinb> doesn't seem to be a member of the Go team. How did he close the issue? Shouldn't that be something only the author of the issue or a moderator can do? — Reply to this email directly, view it on GitHub <#58893 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABJIAI2BDYL6ORPPZAEXRMTXC7R6LANCNFSM6AAAAAAVRHW3Q4> . You are receiving this because you were mentioned.Message ID: ***@***.***>

adonovan · 2023-04-25T15:19:27Z

Hi, Peter is indeed a member of the Go team, and the main author of the package. The proposal was a joint one, and we are both happy to retract it in light of the various concerns raised. Sorry for the confusion.

earthboundkid · 2023-04-25T15:46:03Z

@pjweinb, I think there's a good avatar you could use for your Github account online somewhere, so people recognize you.

adonovan added the Proposal label Mar 6, 2023

gopherbot added this to the Proposal milestone Mar 6, 2023

This comment was marked as duplicate.

Sign in to view

pjweinb closed this as completed Apr 25, 2023

golang locked and limited conversation to collaborators Apr 24, 2024

gopherbot added the FrozenDueToAge label Apr 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

proposal: x/tools/diff: a package for computing text differences #58893

proposal: x/tools/diff: a package for computing text differences #58893

adonovan commented Mar 6, 2023 •

edited

pjweinb commented Mar 6, 2023 •

edited

bcmills commented Mar 6, 2023 •

edited

ianlancetaylor commented Mar 6, 2023

DeedleFake commented Mar 6, 2023

adonovan commented Mar 6, 2023

adonovan commented Mar 6, 2023

DeedleFake commented Mar 6, 2023

adonovan commented Mar 6, 2023

earthboundkid commented Mar 6, 2023

adonovan commented Mar 6, 2023

earthboundkid commented Mar 6, 2023

adonovan commented Mar 6, 2023

DeedleFake commented Mar 6, 2023

adonovan commented Mar 6, 2023 •

edited

adonovan commented Mar 8, 2023 •

edited

bcmills commented Mar 8, 2023

rsc commented Mar 8, 2023

dsnet commented Mar 8, 2023

This comment was marked as duplicate.

dolmen commented Apr 12, 2023 •

edited

adonovan commented Apr 12, 2023

rsc commented Apr 12, 2023

pjweinb commented Apr 25, 2023

mpx commented Apr 25, 2023

DeedleFake commented Apr 25, 2023

pjweinb commented Apr 25, 2023 via email

adonovan commented Apr 25, 2023

earthboundkid commented Apr 25, 2023

proposal: x/tools/diff: a package for computing text differences #58893

proposal: x/tools/diff: a package for computing text differences #58893

Comments

adonovan commented Mar 6, 2023 • edited

pjweinb commented Mar 6, 2023 • edited

bcmills commented Mar 6, 2023 • edited

ianlancetaylor commented Mar 6, 2023

DeedleFake commented Mar 6, 2023

adonovan commented Mar 6, 2023

adonovan commented Mar 6, 2023

DeedleFake commented Mar 6, 2023

adonovan commented Mar 6, 2023

earthboundkid commented Mar 6, 2023

adonovan commented Mar 6, 2023

earthboundkid commented Mar 6, 2023

adonovan commented Mar 6, 2023

DeedleFake commented Mar 6, 2023

adonovan commented Mar 6, 2023 • edited

adonovan commented Mar 8, 2023 • edited

bcmills commented Mar 8, 2023

rsc commented Mar 8, 2023

dsnet commented Mar 8, 2023

This comment was marked as duplicate.

dolmen commented Apr 12, 2023 • edited

adonovan commented Apr 12, 2023

rsc commented Apr 12, 2023

pjweinb commented Apr 25, 2023

mpx commented Apr 25, 2023

DeedleFake commented Apr 25, 2023

pjweinb commented Apr 25, 2023 via email

adonovan commented Apr 25, 2023

earthboundkid commented Apr 25, 2023

adonovan commented Mar 6, 2023 •

edited

pjweinb commented Mar 6, 2023 •

edited

bcmills commented Mar 6, 2023 •

edited

adonovan commented Mar 6, 2023 •

edited

adonovan commented Mar 8, 2023 •

edited

dolmen commented Apr 12, 2023 •

edited