Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: add golang.org/x/text/diff package #23113

Open
hirochachacha opened this issue Dec 13, 2017 · 12 comments
Open

proposal: add golang.org/x/text/diff package #23113

hirochachacha opened this issue Dec 13, 2017 · 12 comments

Comments

@hirochachacha
Copy link
Contributor

Many test code are something like:

func TestAAA(t *testing.T) {
  if got != want {
    t.Errorf("got %v, want %v")
  }
}

However, these kind of code are pretty useless when we expected multi-line values.
For example,

--- FAIL: TestExamples (0.00s)
	example_test.go:186: Import: got Play == "package main\n\nimport (\n\t\"fmt\"\n\t\"log\"\n\t\"os/exec\"\n)\n\nfunc main() {\n\tout, err := exec.Command(\"date\").Output()\n\tif err != nil {\n\t\tlog.Fatal(err)\n\t}\n\tfmt.Printf(\"The date is %s\\n\", out)\n}\n", want "package main\n\nimport (\n\t\"fmt\"\n\t\"log\"\n\t\"os/exec\"\n)\n\n// aaa\n\nfunc main() {\n    out, err := exec.Command(\"date\").Output()\n\tif err != nil {\n\t\tlog.Fatal(err)\n\t}\n\tfmt.Printf(\"The date is %s\\n\", out)\n}\n"
FAIL

So, I propose to support diff algorithm for making error messages human-readable.

The simplest implementation is executing the diff command. It doesn't work on windows.
But we can use fallback, so it can be a good starting point.

I'm not sure that supporting the "diff" package officially is worth doing.
But "internal" packages seem harmless.

@gopherbot gopherbot added this to the Proposal milestone Dec 13, 2017
@dsnet
Copy link
Member

dsnet commented Dec 13, 2017

We could vendor in github.com/google/go-cmp/cmp, which has a diff algorithm built in. When working on the standard library tests, I have wanted to reach for that package several times.

The cmp.Diff function was designed precisely for this purpose.

@hirochachacha
Copy link
Contributor Author

@dsnet I tried. but it seems to me that it doesn't help multi-line cases.

package main

import (
	"fmt"

	"github.com/google/go-cmp/cmp"
)

func main() {
	fmt.Println(cmp.Diff("a\nb\nc", "a\nb1\nc"))
}

output:

{string}:
	-: "a\nb\nc"
	+: "a\nb1\nc"

@hirochachacha
Copy link
Contributor Author

Oh, I understand.

package main

import (
	"fmt"
	"strings"

	"github.com/google/go-cmp/cmp"
)

func main() {
	fmt.Println(cmp.Diff(strings.Split("a\nb\nc", "\n"), strings.Split("a\nb1\nc", "\n")))
}

output:

{[]string}[1]:
	-: "b"
	+: "b1"

@rsc
Copy link
Contributor

rsc commented Feb 5, 2018

It seems to make more sense to encourage experimentation outside and then consider whether to bring a particular outside one in. There are also tests that run diff, and it would be nice to unify them.

Note that cmp is a lot bigger than just this one diff issue.

@ianlancetaylor
Copy link
Contributor

Putting on hold for a more specific proposal--should we use cmp, or introduce something new?

@dsnet
Copy link
Member

dsnet commented Mar 24, 2021

This issue might be superseded by #41980 and #45200.

@rsc
Copy link
Contributor

rsc commented Mar 24, 2021

I have made some progress on a text diff implementation that I think would be appropriate for the standard library.
Cmp seems related but not the same, since it is about structured data, not plain text files.

@dsnet
Copy link
Member

dsnet commented Mar 24, 2021

Cmp seems related but not the same, since it is about structured data, not plain text files.

Isn't a diffing algorithm primarily about determining how to conform one sequence of symbols into another sequence of symbols regardless of whether it is text or not? It seems to me that we should expose a higher level diffing algorithm for any sequence of symbols and make the "text diff implementation" just use that under the hood.

The API of github.com/google/go-cmp/cmp/internal/diff and github.com/pkg/diff/myers are similar in that the primary function operates on a signature like:

func(nx, ny int, f func(i, j) bool) []Operation

where nx and ny are the lengths of the list, and f reports whether two symbols at the provided index are equal and Operation is a enumeration of equal, insertion, and removal.

The diffing algorithm in github.com/google/go-cmp/cmp/internal/diff additionally makes use of a modify operator, which may require a signature like:

func(nx, ny int, f func(i, j) float64) []Operation

where f returning 0.0 could mean completely not equal, and 1.0 as completely equal, and any value in-between as a measure of how close to equal. The algorithm can use that information to discern when to use the modify operation or not.

I believe this issue, #41980, and #45200 should all ideally share the same diffing algorithm.

\cc @josharian

@rsc
Copy link
Contributor

rsc commented Mar 24, 2021

It's not necessarily true that arbitrary sequences and text files should use the same algorithms. They turn out to be somewhat different. How does cmp/internal/diff get used by diff itself?

@rsc rsc changed the title proposal: add package "internal/diff" proposal: add golang.org/x/text/diff package Mar 24, 2021
@rsc
Copy link
Contributor

rsc commented Mar 24, 2021

Retitling to make clear that we should add this in a way that everyone can import it. Still on hold.

@dsnet
Copy link
Member

dsnet commented Mar 25, 2021

It's not necessarily true that arbitrary sequences and text files should use the same algorithms. They turn out to be somewhat different. How does cmp/internal/diff get used by diff itself?

Is there a description of the algorithm that you're thinking of? The Eugene Myers algorithm that I believe @josharian implemented (and which inspired github.com/google/go-cmp/internal/diff) is defined in terms of "determining the differences between two sequences of symbols". Abstracting the difference algorithm to operate on "symbols" (regardless of whether the "symbol" is line of text, a Go value, or something else entirely) is I believe a powerful abstraction that would be unfortunate to lose.

How does cmp/internal/diff get used by diff itself?

The main way is to provide diffing for two Go slices, where the symbols being compared are two arbitrary Go values.

It is also used by cmp to provide specialized diffing of a Go string or []byte. Examples:

@josharian
Copy link
Contributor

Providing a diff for arbitrary text, where each symbol is a rune

And taking this even farther, for natural language text, it is better for each symbol to be a Unicode grapheme cluster.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants