pkg / diff Goto Github PK

View Code? Open in Web Editor NEW

85.0 85.0 10.0 188 KB

License: BSD 3-Clause "New" or "Revised" License

Go 100.00%

diff's People

Contributors

Stargazers

Watchers

Forkers

ymcvalu mvdan ibean isgasho cristaloleg haunt98 super-rain duanxu-dx isabella232 seanpm2001

diff's Issues

add an API to output an ed script

I used them for a project long ago, as a way to easily have portable and readable "edit scripts" to go from plaintext file A to B. You can see it in action via diff --ed.

This is a kind of diff that's unlike the others, since it's not really meant for humans. I still think it belongs here, though.

Writeable is confusing

I initially read WriteToA as "write to A", not as "write A to ...".

Assuming #1 is done, we could rewrite this to:

type Writeable interface {
    LeftWriteTo(w io.Writer, i int) (int, error)
    RightWriteTo(w io.Writer, i int) (int, error)
    LeftFilename() string
    RightFilename() string
}

Replace "filename" with "name" throughout

I think it's unnecessary to explicitly say filenames, even if 99% of the use cases will involve files on disk.

For example, one might want to diff streams, or objects in memory, or whatever else. These might still have very valid names, such as "before-optimisation" and "after-optimisation".

add a similarity-based postprocessor to interleave consecutive additions/deletions

Consider this diff:

@@ -441,8 +441,26 @@ func (v *ReferenceNameIterator) Next() (string, error) {
        return C.GoString(ptr), nil
 }
 
-// Next retrieves the next reference. If the iterationis over, the
-// returned error is git.ErrIterOver
+// Slice returns a slice of all reference names.
+// It is a convenience function similar to calling Next repeatedly.
+// If Slice encounters an error, it returns all reference names up to that error.
+func (v *ReferenceNameIterator) Slice() ([]string, error) {
+       // TODO: implement directly in terms of git_reference_next_name?
+       var all []string
+       for {
+               s, err := v.Next()
+               if IsErrorCode(err, ErrIterOver) {
+                       return all, nil
+               }
+               if err != nil {
+                       return all, err
+               }
+               all = append(all, s)
+       }
+}
+
+// Next retrieves the next reference. If the iteration is over, the
+// returned error is git.ErrIterOver.
 func (v *ReferenceIterator) Next() (*Reference, error) {
        var ptr *C.git_reference

This diff would be greatly improved if it looked like this:

@@ -441,8 +441,26 @@ func (v *ReferenceNameIterator) Next() (string, error) {
        return C.GoString(ptr), nil
 }
 
+// Slice returns a slice of all reference names.
+// It is a convenience function similar to calling Next repeatedly.
+// If Slice encounters an error, it returns all reference names up to that error.
+func (v *ReferenceNameIterator) Slice() ([]string, error) {
+       // TODO: implement directly in terms of git_reference_next_name?
+       var all []string
+       for {
+               s, err := v.Next()
+               if IsErrorCode(err, ErrIterOver) {
+                       return all, nil
+               }
+               if err != nil {
+                       return all, err
+               }
+               all = append(all, s)
+       }
+}
+
-// Next retrieves the next reference. If the iterationis over, the
-// returned error is git.ErrIterOver
+// Next retrieves the next reference. If the iteration is over, the
+// returned error is git.ErrIterOver.
 func (v *ReferenceIterator) Next() (*Reference, error) {
        var ptr *C.git_reference

Not only it is clearer, it is likely to be far more helpful if you want to stage/patch part but not all of such a change.

Every diff algorithm generates the original diff (at least among diff algorithms implemented in git).

We could add an optional post-processing step that attempts to interleave additions and deletions based on the similarity of the content.

Discussion: generic version

The README says that we're waiting for generics. This issue can serve as a place to discuss that, as the generic proposals roll in.

Unfortunately, the latest generic proposal (summer 2019) doesn't play nicely with the diff package. Quoting myself from some private correspondence with Ian:

The Myer's diff implementation requires input of the form (a, b []T), where some elements of a will be compared for equality with the elements of b. The most common T are string, []byte, and []rune, but I do also want it to be more fully generic. What should the contract for T be?

It can't just be contracts.comparable, because, among other reasons, []byte doesn't support ==.

It's a bit awkward for it to be T Equal(T) bool, because then you need to use a named type for strings just to provide an implementation of Equals, which is the sort of "you've just moved around the boilerplate" problem you mention with interfaces. (This is also reminiscent of the comment on the main thread about intermingling methods and basic types, and the consequent need for type switches that work on contracts.) And you also have to convert from []string to []namedString (or [][]byte to []namedByteSlice) just to do the diff, which is annoying and O(n) instead of O(1).

...

After more pondering, the best solution I've come up with using the current proposal is to accept an equality function and use generic adapters for those. Here's an example, typed straight into email (so use a lenient parser). It uses SameSlice rather than Diff, because it exhibits the same challenges and is much simpler to implement and understand.

func (type T) SameSlice(a, b []T, eq func(T, T) bool) bool {
  if len(a) != len(b) {
    return false
  }
  for i := range a {
    if !eq(a[i],  b[i]) {
      return false
    }
  }
  return true
}

func (type T comparable) builtinEqual(x, y T) bool {
  return x == y
}

Call sites look like:

a := []string{ ... some strings ... }
b := []string{ ... more strings ... }
same := SameSlice(a, b, builtinEqual(string))

Or for [][]byte arguments:

same := SameSlice(a, b, bytes.Equal)

And you could write a generic adapter to go from types with an Equal method to an eq function.

So at least for the moment, I'm going to press forward with a non-generic version. We can re-evaluate if/when the next generics proposal draft arrives.

diff.Text OOMs with large different inputs

$ go version
go version devel +dec3d00b28 Thu Mar 25 04:21:29 2021 +0000 linux/amd64

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/myitcv/.cache/go-build"
GOENV="/home/myitcv/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/myitcv/gostuff/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/myitcv/gostuff"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/myitcv/gos"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/myitcv/gos/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="devel +dec3d00b28 Thu Mar 25 04:21:29 2021 +0000"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/tmp/tmp.uHtLe5z1Zd/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build267893791=/tmp/go-build -gno-record-gcc-switches"

To reproduce:

cd $(mktemp -d)
curl -sL https://github.com/pkg/diff/files/6221280/repro.zip -o repro.zip
unzip repro.zip
go run main.go

Above, repro.zip is a reference to: repro.zip

get rid of the module dependency on sergi/go-diff

I understand that it's just used in a test, and that if a main package imports pkg/diff, it won't end up depending on sergi/go-diff at a package level.

However, it's still undesirable for pkg/diff to pull in such relatively heavy dependencies. Beyond go-diff itself, we also end up pulling testify and even a yaml library (!!!).

It's true that this will become less of a concern with lazy module loading. I don't think that's a satisfactory answer though, because it won't ship until at least 1.17 in six months, and people will keep using go 1.16 or earlier in their go.mod files for at least a year or two.

So, I think we should remove it. It seems to only be used in a single test, to show a diff when the unified diff tests fail. I get why we don't want to use our own code; if the diffs are broken, showing a broken diff in the output isn't good.

I see two reasonable options:

Use the host's diff tool in that case, like what gofmt -d does.
Just show the entire output as-is. The tests aren't large, and failures should be rare.

add intraline highlighting option

Make EditScript.IsIdentity use a non-pointer receiver

All of its other methods don't use pointer receivers, and this particular one never writes to any field, so it seems unnecessary to make this inconsistent.

Add a WriteOpt to color the output for a terminal

I think this will be a common need for user-facing tools; people generally expect diffs to be readable without having to look at the + or - prefixes of the unified diff.

I think we can begin with a TerminalColor that always uses the red/green colors. If anyone wants to control when this is used, e.g. via isatty, they can do that in the calling code. This way, if anyone requests other color formats such as HTML, we aren't using the confusing Color option name.

I've written this on my own code that uses pkg/diff, so I'm happy to send a PR if this seems like a good idea.

Is it worthwhile to separate Diffable and Writeable?

As I was reading the TODO: better name on DiffWriteable, I thought to myself - is there a good reason to separate the interfaces?

Unifying them would fix the naming issue, and make the simple use of the API simpler.

On the other hand, there's something I don't like about a "diffable" interface concerning io.Writer.

Add a Comparable helper

I was just thinking that most of the types one might want to diff can already be compared for equality, as per the language spec.

So, we could add func Comparable(a, b interface{}) DiffWriteable. It would ensure that both params are of the same type, that both are slice types, and that both element types are comparable.

For example, this would be useful with integers, arrays, custom struct types, et cetera.

I'd still keep Strings, because for the foreseeable future, avoiding interface parameters will make the code faster and easier.

Potentially, we could also use the Equal method as defined in https://github.com/google/go-cmp.

decide about output compatibility guarantees

The secondary consideration for cutting a v1.0 of this package is deciding about output compatibility.

Given exactly the same inputs, do we guarantee the same outputs over time?

Upsides to output compatibility:

Lets people use output in golden tests.
Acknowledges Hyrum's Law.

Downsides to output compatibility:

Severely limits bug fixes, optimizations, etc.
May force major version upgrades v2.0 much earlier than otherwise.
Slows down development by raising the threshold for releases.

I'm very torn. Opinions, @mvdan? (Or anyone else?)

make color output detect terminals?

Should we do terminal detection automatically? See e.g. tailscale/depaware#11.

It would add significant dependencies, which is unfortunate. Perhaps we should instead provide easily copy/paste-able code that does it? Or maybe with lazy module loading in 1.17 it is enough to put terminal detection in its own package?

Swap A/B with Left/Right

Are the names A and B common in diff algorithms or APIs? I would personally find Left and Right more intuitive. Diffs are very commonly displayed like this - some use < and > to display "removed" and "added", lots of code review UIs do side-by-side reviews, et cetera.

The names are also short enough to be spelled out in parameter names, avoiding the confusion of r with io.Reader.

rename master branch to main

need to check what impact this has on go modules

Rename Diffable to Interface?

We have sort.Interface in the standard library. Why go for diff.Diffable instead of diff.Interface? The former seems to stutter a bit.

On the other hand, we already have two interfaces here, while package sort only has the one.

expose means of working with EditScripts directly

I originally wrote this package because I wanted access to the guts of a diff, not just a printed representation of it. I hid the API for it for initial release.

I now find myself wanting access again. For the moment, I've locally exported some stuff as-is. See fae18f5 for details.

Question: Should we expose these nuts and bolts as-is? Should we provide accessors to (say) walk an edit script and emit a series of insertions, deletions, and equalities? (WriteUnified could be implemented using such a walk function.)

Opinions, @mvdan ?

Add a utility function with filenames

That is, func Files(left, right string) DiffWriteable. It would read the contents of the files up-front and diff the lines as []string or [][]byte, while including the original filenames. It would also do things that are common when one diffs plaintext files, such as treating windows line endings like unix line endings, or ignoring a trailing newline unless it appears/disappears.

In the rare case that someone needs this to support streaming, they can just easily implement the interface themselves.

reorganize into smaller packages

I have been thinking about making a push to get this to v1.0. The main problem is the API. The current API is unwieldy: it is a blend of high level helpers and low level data structures, and doing anything useful with it requires a half dozen lines.

I propose to break this up into multiple packages.

The top level package (github.com/pkg/diff) will contain at most a few functions that do the most commonly requested operations, such as read two files and emit a unified diff. It may also contain a bunch of options that can be passed to those functions. The implementation will be mostly glue.

The subpackages will contain lower level data structures, algorithms, and helpers. These can be used by people who want fine-grained control over everything or who want to do something unusual (like me).

Here's a very rough, incomplete sketch of packages and exported funcs/types/vars, to give a flavor:

diff

Files
Readers

diff/edit

Script
Range
Op

diff/myers

Diff
Pair

diff/write

Options
Color
Names
WriterToA etc

diff/unified

Write

diff/sidebyside (new, WIP)