paulfitz / daff

align and compare tables

Home Page: https://paulfitz.github.io/daff

License: MIT License

csv csv-diffs diff comparing-tables sqlite

daff's Introduction

daff: data diff

This is a library for comparing tables, producing a summary of their differences, and using such a summary as a patch file. It is optimized for comparing tables that share a common origin, in other words multiple versions of the "same" table.

For a live demo, see:

https://paulfitz.github.io/daff/

Install the library for your favorite language:

npm install daff -g  # node/javascript
pip install daff     # python
gem install daff     # ruby
composer require paulfitz/daff-php  # php
install.packages('daff') # R wrapper by Edwin de Jonge
bower install daff   # web/javascript

Other translations are available here:

https://github.com/paulfitz/daff/releases

Or use the library to view csv diffs on github via a chrome extension:

https://github.com/theodi/csvhub

The diff format used by daff is specified here:

http://paulfitz.github.io/daff-doc/spec.html

This library is a stripped down version of the coopy toolbox (see http://share.find.coop). To compare tables from different origins, or with automatically generated IDs, or other complications, check out the coopy toolbox.

The program

You can run daff/daff.py/daff.rb as a utility program:

$ daff
daff can produce and apply tabular diffs.
Call as:
  daff a.csv b.csv
  daff [--color] [--no-color] [--output OUTPUT.csv] a.csv b.csv
  daff [--output OUTPUT.html] a.csv b.csv
  daff [--www] a.csv b.csv
  daff parent.csv a.csv b.csv
  daff --input-format sqlite a.db b.db
  daff patch [--inplace] a.csv patch.csv
  daff merge [--inplace] parent.csv a.csv b.csv
  daff trim [--output OUTPUT.csv] source.csv
  daff render [--output OUTPUT.html] diff.csv
  daff copy in.csv out.tsv
  daff in.csv
  daff git
  daff version

The --inplace option to patch and merge will result in modification of a.csv.

If you need more control, here is the full list of flags:
  daff diff [--output OUTPUT.csv] [--context NUM] [--all] [--act ACT] a.csv b.csv
     --act ACT:     show only a certain kind of change (update, insert, delete, column)
     --all:         do not prune unchanged rows or columns
     --all-rows:    do not prune unchanged rows
     --all-columns: do not prune unchanged columns
     --color:       highlight changes with terminal colors (default in terminals)
     --context NUM: show NUM rows of context (0=none)
     --context-columns NUM: show NUM columns of context (0=none)
     --fail-if-diff: return status is 0 if equal, 1 if different, 2 if problem
     --id:          specify column to use as primary key (repeat for multi-column key)
     --ignore:      specify column to ignore completely (can repeat)
     --index:       include row/columns numbers from original tables
     --input-format [csv|tsv|ssv|psv|json|sqlite]: set format to expect for input
     --eol [crlf|lf|cr|auto]: separator between rows of csv output.
     --no-color:    make sure terminal colors are not used
     --ordered:     assume row order is meaningful (default for CSV)
     --output-format [csv|tsv|ssv|psv|json|copy|html]: set format for output
     --padding [dense|sparse|smart]: set padding method for aligning columns
     --table NAME:  compare the named table, used with SQL sources. If name changes, use 'n1:n2'
     --unordered:   assume row order is meaningless (default for json formats)
     -w / --ignore-whitespace: ignore changes in leading/trailing whitespace
     -i / --ignore-case: ignore differences in case

  daff render [--output OUTPUT.html] [--css CSS.css] [--fragment] [--plain] diff.csv
     --css CSS.css: generate a suitable css file to go with the html
     --fragment:    generate just a html fragment rather than a page
     --plain:       do not use fancy utf8 characters to make arrows prettier
     --unquote:     do not quote html characters in html diffs
     --www:         send output to a browser

Formats supported are CSV, TSV, Sqlite (with --input-format sqlite or the .sqlite extension), and ndjson.
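
The exit statuses documented for --fail-if-diff in the flag list above are convenient for scripting. Here is a minimal Python sketch, assuming a daff executable is available on the PATH. The helper names describe_status and tables_differ are made up for illustration and are not part of daff:

```python
import subprocess

# Meaning of the exit status under --fail-if-diff, per the flag list above.
STATUS_MEANING = {
    0: "tables are equal",
    1: "tables differ",
    2: "daff reported a problem",
}


def describe_status(code):
    """Translate a daff exit status into a human-readable message."""
    return STATUS_MEANING.get(code, "unexpected status %d" % code)


def tables_differ(a_path, b_path, daff_cmd="daff"):
    """Run daff with --fail-if-diff and return True if the tables differ.

    Hypothetical helper: assumes daff_cmd resolves to a daff executable.
    """
    result = subprocess.run(
        [daff_cmd, "diff", "--fail-if-diff", a_path, b_path],
        stdout=subprocess.DEVNULL,  # only the exit status matters here
    )
    if result.returncode not in (0, 1):
        raise RuntimeError(describe_status(result.returncode))
    return result.returncode == 1
```

Calling tables_differ("a.csv", "b.csv") in a regression test would then raise on a problem and otherwise report whether anything changed.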

Using with git

Run daff git csv to install daff as a diff and merge handler for *.csv files in your repository. Run daff git for instructions on doing this manually. Your CSV diffs and merges will get smarter, since git will then understand rows and columns, not just lines.

The library

You can use daff as a library from any supported language. We take here the example of Javascript. To use daff on a webpage, first include daff.js:

<script src="daff.js"></script>

Or if using node outside the browser:

var daff = require('daff');

For concreteness, assume we have two versions of a table, data1 and data2:

var data1 = [
    ['Country','Capital'],
    ['Ireland','Dublin'],
    ['France','Paris'],
    ['Spain','Barcelona']
];
var data2 = [
    ['Country','Code','Capital'],
    ['Ireland','ie','Dublin'],
    ['France','fr','Paris'],
    ['Spain','es','Madrid'],
    ['Germany','de','Berlin']
];

To make those tables accessible to the library, we wrap them in daff.TableView:

var table1 = new daff.TableView(data1);
var table2 = new daff.TableView(data2);

We can now compute the alignment between the rows and columns in the two tables:

var alignment = daff.compareTables(table1,table2).align();

To produce a diff from the alignment, we first need a table for the output:

var data_diff = [];
var table_diff = new daff.TableView(data_diff);

Using default options for the diff:

var flags = new daff.CompareFlags();
var highlighter = new daff.TableDiff(alignment,flags);
highlighter.hilite(table_diff);

The diff is now in data_diff in highlighter format, see specification here:

http://paulfitz.github.io/daff-doc/spec.html

[ [ '!', '', '+++', '' ],
  [ '@@', 'Country', 'Code', 'Capital' ],
  [ '+', 'Ireland', 'ie', 'Dublin' ],
  [ '+', 'France', 'fr', 'Paris' ],
  [ '->', 'Spain', 'es', 'Barcelona->Madrid' ],
  [ '+++', 'Germany', 'de', 'Berlin' ] ]
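
To make the first column of that output easier to read, here is a small Python sketch that tallies the rows of a highlighter-format diff by their action tag. It works on plain lists and does not call daff. The tag descriptions in ACTIONS reflect my reading of the spec and should be treated as assumptions, and summarize is a hypothetical helper name:

```python
from collections import Counter

# Assumed meanings of the action tags in the first column of the diff.
ACTIONS = {
    "!": "schema",
    "@@": "header",
    "+++": "row inserted",
    "---": "row deleted",
    "->": "row updated",
    "+": "cell added by new column",
    "...": "rows omitted",
    "": "unchanged",
}


def summarize(diff_rows):
    """Count the rows of a highlighter diff by the tag in their first cell."""
    counts = Counter()
    for row in diff_rows:
        tag = row[0] if row else ""
        counts[ACTIONS.get(tag, "unknown")] += 1
    return dict(counts)


# The diff shown above, as plain Python lists.
data_diff = [
    ["!", "", "+++", ""],
    ["@@", "Country", "Code", "Capital"],
    ["+", "Ireland", "ie", "Dublin"],
    ["+", "France", "fr", "Paris"],
    ["->", "Spain", "es", "Barcelona->Madrid"],
    ["+++", "Germany", "de", "Berlin"],
]

summary = summarize(data_diff)
print(summary)
```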

For visualization, you may want to convert this to an HTML table with appropriate classes on cells so you can color-code inserts, deletes, updates, etc. You can do this with:

var diff2html = new daff.DiffRender();
diff2html.render(table_diff);
var table_diff_html = diff2html.html();

For 3-way differences (that is, comparing two tables given knowledge of a common ancestor) use daff.compareTables3 (give ancestor table as the first argument).

Here is how to apply that difference as a patch:

var patcher = new daff.HighlightPatch(table1,table_diff);
patcher.apply();
// table1 should now equal table2

For other languages, you should find sample code in the packages on the Releases page.

Supported languages

The daff library is written in Haxe, which can be translated reasonably well into at least the following languages:

Javascript
Python
Java
C#
C++
Ruby (using an unofficial haxe target developed for daff)
PHP

Some translations are done for you on the Releases page. To make another translation, or to compile from source, first follow the Haxe language introduction for the language you care about. At the time of writing, if you are on OSX, you should install haxe using brew install haxe. Then do one of:

make js
make php
make py
make java
make cs
make cpp

For each language, the daff library expects to be handed an interface to tables you create, rather than creating them itself. This is to avoid inefficient copies from one format to another. You'll find a SimpleTable class you can use if you find this awkward.

Other possibilities:

There's a daff wrapper for R written by Edwin de Jonge, see https://github.com/edwindj/daff and http://cran.r-project.org/web/packages/daff
There's a hand-written ruby port by James Smith, see https://github.com/theodi/coopy-ruby

API documentation

You can browse the daff classes at http://paulfitz.github.io/daff-doc/

Reading material

https://specs.frictionlessdata.io/tabular-diff : a specification of the diff format we use.
http://theodi.org/blog/csvhub-github-diffs-for-csv-files : using this library with github.
ropensci/unconf15#19 : a thread about diffing data in which daff shows up in at least four guises (see if you can spot them all).
http://theodi.org/blog/adapting-git-simple-data : using this library with gitlab.
http://okfnlabs.org/blog/2013/08/08/diffing-and-patching-data.html : a summary of where the library came from.
http://blog.okfn.org/2013/07/02/git-and-github-for-data/ : a post about storing small data in git/github.
http://blog.ouseful.info/2013/08/27/diff-or-chop-github-csv-data-files-and-openrefine/ : counterpoint - a post discussing tracked-changes rather than diffs.
http://blog.byronjsmith.com/makefile-shortcuts.html : a tutorial on using make for data, with daff in the mix. "Since git considers changes on a per-line basis, looking at diffs of comma-delimited and tab-delimited files can get obnoxious. The program daff fixes this problem."

License

daff is distributed under the MIT License.

daff's Issues

Enable user-specified record separators

Would be great to add a -t command-line switch to do without unnecessary preprocessing of tab- and semicolon-separated files.

bug when everything is an added column?

I've been able to produce this bug, where the column names are repeated twice when everything in the table is an added column. Can you reproduce it?

daff assumes python to be python3, breaks if system default is python2

The first line of daff.py is: #!/usr/bin/python
In my debian "testing" system /usr/bin/python points to python2.7
daff.py (1.3.18 downloaded with pip) throws an error and aborts when it reaches line 10819:
...
File "/usr/local/bin/daff.py", line 10819, in println
python_lib_Sys.stdout.buffer.write((("" + HxOverrides.stringOrNull(unicode)) + "\n").encode("utf-8", "strict"))
AttributeError: 'file' object has no attribute 'buffer'

The input is ssv, but I did not specify --input-format ssv, so daff tries to interpret it as csv. When I run it with --input-format ssv it works fine. Still, I believe this error should not happen: it is a python2 versus python3 problem, because sys.stdout.buffer does not exist in python2 but does in python3.
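
One way to sidestep the missing attribute is to fall back to the stream itself when it has no buffer. The following Python sketch illustrates that idea only. It is not the code daff generates, and println here is a stand-in with an extra, hypothetical stream parameter:

```python
import io
import sys


def println(text, stream=None):
    """Write text plus a newline as UTF-8 bytes to a stream.

    Python 3 text streams expose a binary .buffer; Python 2 file objects
    (and raw binary streams) do not, so we write to them directly.
    """
    stream = sys.stdout if stream is None else stream
    data = (text + "\n").encode("utf-8", "strict")
    target = getattr(stream, "buffer", stream)
    target.write(data)


# Usage: a raw byte stream (no .buffer) and a text wrapper (with .buffer).
raw = io.BytesIO()
println("plain", raw)

wrapped = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
println("wrapped", wrapped)
```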

Compiling from sources

Related to issue #59, I'm trying to build daff. I installed binaries for haxe v3.2.1 from their website, since my distro (mint) provided a version that did not work, apparently because it was too old (v3.0.0).

I tried to run make and got the following:

#######################################################
## Set up directories
mkdir -p bin
mkdir -p lib
#######################################################
## Generate javascript
haxe language/js.hxml # produces lib/daff.js
#######################################################
## Make library version
cat env/js/fix_exports.js >> lib/daff.js
cat env/js/table_view.js >> lib/daff.js
cat env/js/ndjson_table_view.js >> lib/daff.js
cat env/js/sqlite_database.js >> lib/daff.js
cat env/js/util.js >> lib/daff.js
#######################################################
## Make executable version (just add shebang)
echo "#!/usr/bin/env node" > bin/daff.js
cat lib/daff.js >> bin/daff.js
chmod u+x bin/daff.js
#######################################################
## Check size
 10561  31095 260971 bin/daff.js
./scripts/run_tests.sh
=======================================================
== test_3way_alignment.js
make: *** [test] Error 1

For my particular purposes (using daff just as CLI diff, not as a library) I think I need the cpp target, so I tried make cpp and got:

haxe language/cpp.hxml
/bin/sh: 1: haxelib: not found
End_of_file
make: *** [cpp] Error 1

Note that I also had downloaded hxcpp v3.2.102 and tried to build per their instructions but at the second haxe compile.hxml I've got Error: /bin/sh: 1: haxelib: not found

So I tried to figure out what haxelib is and how to install it, with mixed success: everybody who installs it does what daff does in its travis file, namely using haxelib to install haxelib (line 23), so I have a bootstrap issue. Before chasing the rabbit down the rabbit hole, I'd like to:

suggest some clearer instructions on building daff, including all the needed parts of the haxe dependencies (I wish there were a single binary for all the needed parts of haxe instead of so many...)
ask you how best to proceed, since I may be barking up the wrong tree (see issue #59). As long as I get one target built, I'm probably fine and can rely on travis for testing the others I can't build on my machine.

Reset environment issue

I am not sure if this is only happening in php but when running multiple diffs in one request there are problems with the data because of previous diffs.

I found that doing the following solves this problem

$GLOBALS['%s'] = new _hx_array(array());
$GLOBALS['%e'] = new _hx_array(array());

Not sure if there is any better way or not, but probably good to have some kind of reset method that can be called to reset the environment to how it was before doing a diff.

C++ quickstart

First, thanks for daff!

I am trying to link daff into my existing C++ code but, having never used Haxe, I'm finding myself groping around a bit trying to get started.

Having cloned daff, and installed haxe 3.2.1 I have done make cpp and get a daff/bin/include directory. I am using the following simple C++ program, based on daff/packaging/cpp_recipe/main.cpp, to try and get started:

#include <iostream>

#include <coopy/SimpleTable.h>
#include <coopy/SimpleCell.h>
#include <coopy/Coopy.h>
#include <coopy/Alignment.h>
#include <coopy/CompareTable.h>
#include <coopy/CompareFlags.h>
#include <coopy/TableDiff.h>

int main() {
    ::coopy::SimpleTable t1 = ::coopy::SimpleTable_obj::__new(3,3);
    ::coopy::SimpleTable t2 = ::coopy::SimpleTable_obj::__new(3,3);
    ::coopy::SimpleTable table_diff = ::coopy::SimpleTable_obj::__new(0,0);
    ::coopy::CompareTable cmp = ::coopy::Coopy_obj::compareTables(t1,t2);

    ::coopy::Alignment alignment = cmp->align();
    ::coopy::CompareFlags flags = ::coopy::CompareFlags_obj::__new();
    ::coopy::TableDiff highlighter = ::coopy::TableDiff_obj::__new(alignment,flags);

    highlighter->hilite(table_diff);
    ::String tab = table_diff->tableToString(table_diff);
    std::cout<<tab.__CStr();

    return 0;
}

which I am trying to compile using:

g++ -I./daff/bin/include -I/usr/share/haxelib/hxcpp/3,2,205/include main.cpp

but get the error:

daff/bin/include/coopy/Table.h: In member function ‘void coopy::Table_delegate_<IMPL>::__Visit()’:
/usr/share/haxelib/hxcpp/3,2,205/include/hx/GC.h:369:16: error: ‘__inCtx’ was not declared in this scope
   { if (ioPtr) __inCtx->visitObject( (hx::Object **)&ioPtr); }

Clearly, my build command is missing some compiler flags or other config vars. Any daff C++ quickstart or examples would be much appreciated.

Composer support for php

Would it be possible to create a daff-php package which includes a composer.json file, and link it up in packagist? This would make including the lib much easier.

The file is something like this:

{
    "name": "paulfitz/daff-php",
    "description": "align and compare tables http://paulfitz.github.io/daff",
    "keywords": ["php","csv","diff"],
    "type": "library",
    "license": "MIT",
    "authors": [
        {
            "name": "Paul Fitz",
            "email": "[email protected]"
        }
    ],
    "require": {
        "php": ">=5.3.0"
    }
}

Not sure if it would be possible to do this with the current zip, but should be pretty easy to set up a repo and adjust the build to push the php to a repo and add the json file

Support floating point diffs?

I don't know if this request is out of scope for daff, but a really nice feature for a "scientific diff" would be to enable floating point support, perhaps with an appropriate CLI switch or function call argument (off by default, of course).

When enabled, text would be converted to floating point (when possible) and the diff would be performed on the floating point difference in relative terms, say something like d = 2 * (v1 - v2) / (v1 + v2). If d <= t for a given threshold t (which could be 0.1% by default), the numbers would be considered identical. Also, t could be set to 0, which would be convenient for treating things like 1.23e4 and 12300.0 as identical.
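
A minimal Python sketch of the comparison proposed above. It is not an existing daff feature, and nearly_equal is a hypothetical name. It differs from the formula as written in one respect: it uses absolute values in the numerator and denominator, so that the sign of the difference and values of opposite sign do not distort the result:

```python
def nearly_equal(a, b, threshold=0.001):
    """Compare two cells, numerically when both parse as floats.

    Uses the symmetric relative difference 2*|v1 - v2| / (|v1| + |v2|)
    and treats the cells as identical when it is <= threshold.
    Cells that are not both numeric are compared as-is.
    """
    try:
        v1, v2 = float(a), float(b)
    except (TypeError, ValueError):
        return a == b
    if v1 == v2:  # also covers 0 vs 0, avoiding division by zero
        return True
    return 2 * abs(v1 - v2) / (abs(v1) + abs(v2)) <= threshold


# With a zero threshold, only different spellings of the same number match.
same_number = nearly_equal("1.23e4", "12300.0", threshold=0)
# With the default 0.1% threshold, small relative changes are ignored.
small_change = nearly_equal("100", "100.05")
large_change = nearly_equal("100", "101")
```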

I'll be happy to write the implementation and unit tests for this feature, provided:

it's of general interest
you give me some minimal guidance about how to proceed since I'm not familiar with the development process of daff

Patch breaks in Java

I am seeing incorrect patching in Java for some of the files I am analyzing.

To be more specific, with these two files, for example, which come from OpenData portals, the daff output is perfect, but when applying coopy.Coopy.patch(table1, table2, null) the modified table1 is a complete mess.

I am not sure, but could it be due to the language/encoding? The problems I encountered happened with files that are not only in English.

20140812120302.txt
20140822230142.txt

And this is the output:

20140822230142.txt

license clarification

It appears that coopy is distributable under the GPL license while daff is distributable under the MIT license. However, daff appears to contain code from coopy. Does this imply that any coopy code in daff is dual licensed under the GPL and MIT licenses?

daff breaks horribly if file is not utf8

On Windows, tried both with cmd and a git bash shell:

csv_windows-1255.zip

$ daff.py version
1.3.18
$ daff.py 1.csv 2.csv
Traceback (most recent call last):
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 11304, in <module>
    Coopy.main()
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 3447, in main
    return coopy.coopyhx(io)
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 3333, in coopyhx
    return self.run(args,io)
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 3284, in run
    a = self.loadTable(aname)
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 2640, in loadTable
    txt = self.io.getContent(name)
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 9752, in getContent
    return sys_io_File.getContent(name)
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 11018, in getContent
    content = f.read(-1)
  File "C:\users\sonoflilit\.virtualenvs\analysts\lib\codecs.py", line 668, in read
    return self.reader.read(size)
  File "C:\users\sonoflilit\.virtualenvs\analysts\lib\codecs.py", line 474, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 4: invalid continuation byte

$ which daff
/c/Program Files/nodejs/daff
$ daff version
1.3.18
$ daff 1.csv 2.csv
@@,a,b

of course, the reason I care is that excel works notoriously badly with utf8 csvs, so my git repository is full of csvs in other encodings, and I can't convert them as part of git diff...
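
As a workaround sketch in Python, file contents can be decoded by trying a list of encodings in order, rather than assuming UTF-8. This is not what daff does. The function name decode_text and the default encoding list are assumptions chosen to match the windows-1255 attachment above:

```python
def decode_text(raw, encodings=("utf-8", "windows-1255")):
    """Decode bytes with the first encoding in the list that succeeds.

    Falls back to UTF-8 with replacement characters if none of them fit.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")


# Byte 0xe0 is invalid here as UTF-8, but is a Hebrew letter in windows-1255.
sample = b"a,b\n\xe0,1\n"
text = decode_text(sample)
```

The order of the list matters: single-byte encodings accept many byte sequences, so UTF-8 should be tried first.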

P.S. does anyone here know why git would accept my .gitattributes entry for *.tsv but would silently ignore the identical entry for *.csv?

License for coopyhx?

Hi! I just discovered http://growrows.com/ today and it looks very promising — diff two EtherCalc tables into a new diff-table, and applying such diff-tables as merges is very exciting, and it also provides a great way to visualize the SocialCalc "Audit trail".

While coopy is GPLv2, I noticed that coopyhx does not yet have a license file.

As EtherCalc is CC0 (public domain) and SocialCalc is CPAL (weak copyleft), I wonder if a similar permissive (e.g. MIT) or weak-copyleft license (e.g. LGPL) is acceptable to you, especially for the client-side JS context.

If you'd like to go with GPLv2 as well, that's awesome too, I'd still be happy to contribute and collaborate. :-)

display options

I'm looking for display options; for instance, to make every row show up, and to disable the header row. Are there such things? I'll look around myself, but just in case I've missed something obvious...

Stop after N diffs are found

Is there a parameter (or can one be added?) to abort after finding the first N diffs? This would be really helpful for tables with large diffs, where working on a few at a time makes more sense than seeing all the diffs at once.

Thanks!

python package for 1.3.22 not working in python 3

Hi Paul,

when trying to import daff lib 1.3.22 into python, we got this error:

In [1]: import daff
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-539af51566a0> in <module>()
----> 1 import daff

/Users/semio/.pyenv/versions/3.6.0/bin/daff.py in <module>()
     64 import functools as python_lib_Functools
     65 import json as python_lib_Json
---> 66 from itertools import imap
     67 from itertools import ifilter
     68 try:

ImportError: cannot import name 'imap'

imap is a python 2 only function, and the lib for 1.3.19 works on python 3. Please check if you can add back support for that. Let me know if you need more information/help from me :)
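
For reference, the usual way to keep such an import working on both major versions is a guarded import. This is a generic Python sketch, not the code daff ships:

```python
try:
    # Python 2: lazy versions live in itertools.
    from itertools import imap, ifilter
except ImportError:
    # Python 3: the built-in map and filter are already lazy.
    imap, ifilter = map, filter

doubled = list(imap(lambda x: x * 2, [1, 2, 3]))
evens = list(ifilter(lambda x: x % 2 == 0, [1, 2, 3, 4]))
```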

Configure `git show` to work as `git diff` does

git diff HEAD~1..HEAD works as expected, but git show shows a standard line-by-line diff. Can daff git tsv configure git show to use daff?

Diff large files

I am trying to diff CSV with around 10k lines and it seems like the code gets slower and slower the more rows there are.

I am using the PHP libs, and the code being used is as follows:

protected function _diff() {
    $GLOBALS['%s'] = new _hx_array(array());
    $GLOBALS['%e'] = new _hx_array(array());

    $csvOld = $this->loadCsv($this->old());
    $oldData = @new coopy_PhpTableView($csvOld);

    $csvNew = $this->loadCsv($this->new());
    $newData = @new coopy_PhpTableView($csvNew);
    if ($oldData == $newData) {
        return array();
    }

    $dataDiff = array();
    $tableDiff = new coopy_PhpTableView($dataDiff);
    $Compare = new coopy_CompareFlags();
    $Compare->unchanged_column_context = 3;
    $highlighter = new coopy_TableDiff(coopy_Coopy::compareTables($oldData, $newData)->align(), $Compare);
    $highlighter->hilite($tableDiff);

    return $tableDiff->data;
}

300 rows - 48s
1000 rows - 3m 24s
2500 rows - 20m8s
10k rows - left it running over the weekend, never finished.
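
To put a number on "slower and slower", the three finished timings above can be turned into a rough growth exponent k, assuming time grows like rows**k between consecutive measurements. This Python sketch is only a back-of-the-envelope estimate from three data points and says nothing about the cause:

```python
import math

# (rows, seconds) taken from the timings reported above.
timings = [(300, 48), (1000, 3 * 60 + 24), (2500, 20 * 60 + 8)]


def growth_exponents(samples):
    """Estimate k in time ~ rows**k between consecutive measurements."""
    exponents = []
    for (n1, t1), (n2, t2) in zip(samples, samples[1:]):
        exponents.append(math.log(t2 / t1) / math.log(n2 / n1))
    return exponents


exponents = growth_exponents(timings)
print(["%.2f" % k for k in exponents])
```

An exponent above 1 means the run time grows faster than the row count, and the estimate rises between the two intervals, which is consistent with the 10k-row run not finishing.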

Is there anything I can do to speed things up or a better way of getting a diff going?

It's getting stuck on the line $highlighter->hilite($tableDiff); if that makes any difference.

package daff for java/maven

Daff is written in a language (haxe) that can produce native code in several languages. This is handy, but even handier if that code gets packaged up and released for whatever package manager that language community uses. Daff has so far been packaged for javascript/npm, ruby/gem, python/pypi, and php/composer. For java, a .zip is available, but it'd be more useful if daff were packaged for maven. There's a starting point mentioned in #39.

Option to output only Diff columns

For tables that have many columns, the Coopy format (with a few columns before and after the actual diff column) makes the output very large and rather unusable. Is it possible to add a parameter to restrict the output to only the diff columns (in addition, maybe, to the ID columns that identify the row / row number)?

Thanks!

Shortcut command to combine daff, daff render and open

Thanks for daff! It's fantastic. I use the following alias often. Could it be a standard command? It would be helpful for daff render to accept two arguments and diff those tables.

# Daff a TSV/CSV file and open it in a browser
daff-render()
{
    daff $1 $2 >! $1-$2
    daff render --output $1-$2.html $1-$2
    open $1-$2.html
}

slow with large data

Hi, I was very excited to see this program as I work on a collaborative data project where merging in changes provided by different members has become very difficult and tricky. I was able to get it up and running with python on linux and it worked great for the example you provide on the site. However when I tried changing just one cell in my own dataset which is very large (117K rows), it took over five hours to complete the merge. I am wondering if this is expected behavior for such a large dataset or if there might be some way to speed this up.

Thanks.

surprising diffs

I'm trying to add "placeholder" lines in the compared CSV for where GitHub leaves out data that's not changed. It's not an ideal solution, but it's OK for now. Problem is, if I add a row with all elements set to "..." into both old and new datasets, coopyhx seems to think that this has changed in both. If I set just some fields, it's OK, but if all are the same, it thinks everything has changed. Ever seen anything like this? Any ideas? I'll try to replicate with the basic javascript shortly.

ruby_table_view.rb is not included in the ruby gem

With Daff 1.3.6 installed through ruby gems I don't have ruby_table_view.rb which is required by \lib\coopy\coopy.rb.

Noisy diffs

First off, awesome looking tool! I'm very excited that I stumbled upon daff and coopy today.

I wanted to report that it seems like daff can generate pretty noisy diffs where shared lines are not detected as shared and thus contribute to larger diff hunks.

Consider the two example files, a and b, in this gist.

git diff --no-index --color a b produces a sensibly simple diff of mostly lines added, with a few cell values changed in a handful of lines.

daff diff --color a b produces much more churn in the diff hunks, presumably because of some threshold for shared lines.

Is this a tunable behaviour in any way?

Daff crashes with: UnicodeDecodeError: 'utf8'

Similar to issue #31, which I reported earlier, but not the same trace.

Traceback (most recent call last):
  File "/usr/local/bin/daff.py", line 8626, in <module>
    Coopy.main()
  File "/usr/local/bin/daff.py", line 2797, in main
    return coopy.coopyhx(io)
  File "/usr/local/bin/daff.py", line 2621, in coopyhx
    b = tool.loadTable(python_internal_ArrayImpl._get(args, (1 + offset)))
  File "/usr/local/bin/daff.py", line 2015, in loadTable
    txt = self.io.getContent(name)
  File "/usr/local/bin/daff.py", line 7342, in getContent
    return sys_io_File.getContent(name)
  File "/usr/local/bin/daff.py", line 8405, in getContent
    content = f.read(-1)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 671, in read
    return self.reader.read(size)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte

Ruby wrapper

Unfortunately Haxe doesn't have a compiler to Ruby, but as you've got a C++ build, it might be possible to wrap this in a Ruby gem. Have you considered this at all? It would certainly be nice to remove the javascript requirement in my gitlab implementation and do it all server-side.

html output broken: 0 byte file produced

Hi,
I've been using daff for about a year now to do quick regression tests on ETL processes and it is a huge help.

I've recently done a fresh install of daff on two different machines (and in both python and ruby), but in all cases the most recent version of daff seems to break html output.

Input:
a.csv

id, attribute
1, hello

b.csv

id, attribute
1, goodbye

call:

# ruby
$ daff.rb --output-format html --output test.html a.csv b.csv

# python3
$ daff.py --output-format html --output test.html a.csv b.csv

with both calls test.html is created, but it is a 0 byte empty file.
However, if I call daff.py --output-format csv --output test.csv a.csv b.csv the diff is produced in csv just fine.

OS:
OSX Yosemite 10.10.4 (14E46)

Any help would be awesome, and thank you again for such a great tool.

Web Development Use Case

I am curious about your thoughts on the future of tabular diffs. Currently as a LAMP developer, if you use git to manage your code base you have to resort to some extra tricks to get your DB changes turned into code and back again when moving from development to production. For example, in Drupal there is a system called Features to help do this.

However, one could imagine a scenario where someone could run both types of diffs and push them directly. Do you think that would work, and do you have a plan to make this thing the git of data?

Changed line appears as added + removed

The daff comparison algorithm improperly marks a row with changed data as an added/removed pair.

For instance, comparing the CSV files 'iris.csv' and 'iris2.csv' (via the edwindj/daff R wrapper), I get the following diff:

@@	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
...	...	...	...	...	...
	5.7	2.8	4.1	1.3	versicolor
->	6.3	3.3	6	2.5	virginica->XXX
+++	5.8	2.7	5.1	1.9	XXX
---	5.8	2.7	5.1	1.9	virginica
->	7.1	3	5.9	2.1	virginica->XXX
->	6.3	2.9	5.6	1.8	virginica->XXX
->	6.5	3	5.8	2.2	virginica->XXX
->	7.6	3	6.6	2.1	virginica->XXX
->	4.9	2.5	4.5	1.7	virginica->XXX
->	7.3	2.9	6.3	1.8	virginica->XXX
->	6.7	2.5	5.8	1.8	virginica->XXX
->	7.2	3.6	6.1	2.5	virginica->XXX
->	6.5	3.2	5.1	2	virginica->XXX
->	6.4	2.7	5.3	1.9	virginica->XXX
->	6.8	3	5.5	2.1	virginica->XXX
->	5.7	2.5	5	2	virginica->XXX
->	5.8	2.8	5.1	2.4	virginica->XXX
->	6.4	3.2	5.3	2.3	virginica->XXX
->	6.5	3	5.5	1.8	virginica->XXX
->	7.7	3.8	6.7	2.2	virginica->XXX
->	7.7	2.6	6.9	2.3	virginica->XXX
->	6	2.2	5	1.5	virginica->XXX
->	6.9	3.2	5.7	2.3	virginica->XXX
->	5.6	2.8	4.9	2	virginica->XXX
->	7.7	2.8	6.7	2	virginica->XXX
->	6.3	2.7	4.9	1.8	virginica->XXX
->	6.7	3.3	5.7	2.1	virginica->XXX
->	7.2	3.2	6	1.8	virginica->XXX
->	6.2	2.8	4.8	1.8	virginica->XXX
->	6.1	3	4.9	1.8	virginica->XXX
->	6.4	2.8	5.6	2.1	virginica->XXX
->	7.2	3	5.8	1.6	virginica->XXX
->	7.4	2.8	6.1	1.9	virginica->XXX
->	7.9	3.8	6.4	2	virginica->XXX
->	6.4	2.8	5.6	2.2	virginica->XXX
->	6.3	2.8	5.1	1.5	virginica->XXX
->	6.1	2.6	5.6	1.4	virginica->XXX
->	7.7	3	6.1	2.3	virginica->XXX
->	6.3	3.4	5.6	2.4	virginica->XXX
->	6.4	3.1	5.5	1.8	virginica->XXX
->	6	3	4.8	1.8	virginica->XXX
->	6.9	3.1	5.4	2.1	virginica->XXX
->	6.7	3.1	5.6	2.4	virginica->XXX
->	6.9	3.1	5.1	2.3	virginica->XXX
+++	5.8	2.7	5.1	1.9	XXX
---	5.8	2.7	5.1	1.9	virginica
->	6.8	3.2	5.9	2.3	virginica->XXX
->	6.7	3.3	5.7	2.5	virginica->XXX
->	6.7	3	5.2	2.3	virginica->XXX
->	6.3	2.5	5	1.9	virginica->XXX
->	6.5	3	5.2	2	virginica->XXX
->	6.2	3.4	5.4	2.3	virginica->XXX
->	5.9	3	5.1	1.8	virginica->XXX

As you can see, the pair of lines

+++	5.8	2.7	5.1	1.9	XXX
---	5.8	2.7	5.1	1.9	virginica

are shown as an addition + deletion, when they are actually a change in a single column.
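
To make the expected result concrete, here is a Python sketch of a post-processing step that folds such a pair back into a single update row. It is a hypothetical illustration, not daff's matching algorithm, and merge_add_remove and max_changes are made-up names:

```python
def merge_add_remove(added, removed, max_changes=1):
    """Fold a '+++' row and a '---' row into one '->' row.

    Returns the merged row if the two rows differ in at least one and at
    most max_changes cells; otherwise returns None.
    """
    if added[0] != "+++" or removed[0] != "---" or len(added) != len(removed):
        return None
    cells, changes = [], 0
    for new, old in zip(added[1:], removed[1:]):
        if new == old:
            cells.append(new)
        else:
            changes += 1
            cells.append(old + "->" + new)
    if changes == 0 or changes > max_changes:
        return None
    return ["->"] + cells


# The pair of rows quoted above.
merged = merge_add_remove(
    ["+++", "5.8", "2.7", "5.1", "1.9", "XXX"],
    ["---", "5.8", "2.7", "5.1", "1.9", "virginica"],
)
print(merged)
```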

For some large files--but not in this file--I see trios or more complex patterns of added/deleted/modified lines where changes in the values in two or more rows are displayed as a mix of modifications to unmatched rows, combined with additions + deletions. Something like:

+++	5.8		2.7		5.1		1.9		XXX
->	6.8->5.8	3.2->2.7	5.9->5.1	2.3->1.9	virginica->XXX
---	5.8		3.2		5.1		1.9		virginica

CSS confusion

I'm getting a little confused applying styles to the diffs.

When a column is added, the add class is only applied to the added cell. When a whole row is added, the add class is applied only to the row tag, not the individual cells.

However, when a row is modified, the style is applied to both the cells and the row as a whole, like this:

<tr class="modify">
  <td class="modify">→</td>
  <td>Triton</td>
  <td class="add"></td>
  <td class="modify"> 0.779→0.779</td>
</tr>

Note that only the last column has been modified here. Applying the modify class to the row is therefore a bit confusing, as I wouldn't really want to highlight the whole row.

What are the rules regarding how the styles are applied? Do you get whole-row modify states? If not, I can maybe just ignore the tr.modify and only apply row styles for additions and deletions.

DiffRender.render doesn't seem to set class on added cells in modified rows

As found when working on gitlabhq/gitlabhq#4810:

Compare these two screenshots. The first is using handsontable to render, the second using DiffRender.render() to get html.

As you can see, in the first one, on the last two rows, fields have been modified, but the central column has also been added.

In the second case, using DiffRender.render() directly, no class is applied to the added column on the last two rows. The HTML looks like:

<tr>
  <td>+</td>
  <td>Neptune</td>
  <td class="add">4553946490</td>
  <td>11.28</td>
</tr>
<tr>
  <td class="modify">-&gt;</td>
  <td>Triton</td>
  <td></td>
  <td class="modify"> 0.779-&gt;0.779</td>
</tr>
<tr>
  <td class="modify">-&gt;</td>
  <td>Pluto</td>
  <td>7311000000</td>
  <td class="modify"> 0.61-&gt;0.61</td>
</tr>

I would expect the third td in each of the last two rows to have the add class as well.

Feature request: Ignore whitespace (-w) option

I don't know if this belongs in daff or coopy (and if in coopy, where), but it'd be great if daff diff supported an --ignore-whitespace / -w option which ignored whitespace changes when producing a diff. Among other uses, this would help with auditing whitespace-only changes such as stripping leading/trailing spaces from field values. I find GNU diff and git diff's ignore whitespace options to be invaluable!
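Until such an option exists, one possible workaround is to normalize whitespace in both tables before diffing them. The following is only a minimal sketch, assuming each table is a plain array of rows (arrays of cells); normalizeCell and normalizeTable are hypothetical helper names and not part of daff. It trims cells and collapses internal runs of whitespace, which is closer to diff -b than to diff -w.

```javascript
// Hypothetical helpers (not part of daff): strip leading/trailing whitespace
// and collapse internal runs so whitespace-only edits compare as equal.
function normalizeCell(cell) {
  if (typeof cell !== "string") return cell; // leave numbers and null untouched
  return cell.trim().replace(/\s+/g, " ");
}

function normalizeTable(rows) {
  return rows.map(function (row) {
    return row.map(normalizeCell);
  });
}

// Made-up sample data: the two versions differ only in whitespace.
var before = [["name", "value"], ["Triton ", " 0.779"]];
var after = [["name", "value"], ["Triton", "0.779"]];

// Once normalized, the two versions are identical, so a diff of the
// normalized tables would have nothing to report.
var same =
  JSON.stringify(normalizeTable(before)) ===
  JSON.stringify(normalizeTable(after));
console.log(same); // true
```

The normalized tables can then be handed to whatever comparison is already in use; the original files stay untouched.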

`daff git csv` needs testing by others

daff has grown a new command daff git csv that installs it as a csv diff handler for git (like @Floppy's csv-my-git) and also as a csv merge handler (like coopy). I think the code should work in any command-line-oriented environment, with git and daff in the user's path, but it could definitely do with some broader testing.

Option to not prune unchanged columns

--all prevents pruning of unchanged rows; however, it still prunes unchanged columns. Can we have an option to not prune unchanged columns?

API for summary of changes

Hi,

TLDR: Please add an API for summarizing changes

Details:

I've added code to the R interface (edwindj/daff via pulls from gwarnes-mdsol/daff) to generate a summary table of the number of added/removed/changed rows and columns in a TableView filled by TableDiff.hilite, e.g.:

Data diff:
 Comparison: ‘y’ vs. ‘x’ 
        #       Changed Removed Added
Rows    3 --> 2 1       1       0    
Columns 5       0       1       1

This is currently accomplished by identifying the row and column containing flags in Table.get_data() and then counting the occurrences of each flag type. This is inefficient and error-prone.

Please add appropriate API calls to obtain this information. Perhaps one or more methods to TableDiff? Something like:

TableDiff.getNumAddedRows(): Int
TableDiff.getNumAddedColumns(): Int

TableDiff.getNumDeletedRows(): Int
TableDiff.getNumDeletedColumns(): Int

TableDiff.getNumModifiedRows(): Int
TableDiff.getNumModifiedColumns(): Int

TableDiff.getNumReorderedRows(): Int
TableDiff.getNumReorderedColumns(): Int

TableDiff.getNumRows(): Array<Int>      /* length two:  ( local, remote ) */
TableDiff.getNumColumns(): Array<Int> /* length two:  ( local, remote ) */
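For reference, the flag-counting workaround described above can be sketched in JavaScript as follows. This is only an illustration: summarizeDiff is a hypothetical helper, not part of daff, and it assumes the diff is available as an array of rows in the highlighter format, with the action tag in the first column ("+++" for added rows, "---" for removed rows, an arrow tag for modified rows) and a "!" schema row flagging added or removed columns.

```javascript
// Hypothetical helper (not part of daff): tally row and column actions in a
// highlighter-format diff given as an array of rows, action tag in column 0.
function summarizeDiff(rows) {
  var summary = {
    rowsAdded: 0, rowsRemoved: 0, rowsModified: 0,
    columnsAdded: 0, columnsRemoved: 0
  };
  rows.forEach(function (row) {
    var tag = String(row[0]);
    if (tag === "!") {
      // Schema row: each cell flags an added or removed column.
      row.slice(1).forEach(function (cell) {
        if (cell === "+++") summary.columnsAdded++;
        else if (cell === "---") summary.columnsRemoved++;
      });
    } else if (tag === "+++") {
      summary.rowsAdded++;
    } else if (tag === "---") {
      summary.rowsRemoved++;
    } else if (tag.indexOf("->") >= 0 || tag.indexOf("\u2192") >= 0) {
      // "->", "-->" or the unicode arrow all mark a modified row.
      summary.rowsModified++;
    }
  });
  return summary;
}

// Made-up sample diff for illustration only.
var sampleDiff = [
  ["!", "", "+++", ""],
  ["@@", "name", "size", "score"],
  ["+", "alpha", "10", "1.5"],
  ["->", "beta", "", "2.0->2.5"],
  ["+++", "gamma", "30", "3.5"],
  ["---", "delta", "", "4.5"]
];
var sampleSummary = summarizeDiff(sampleDiff);
console.log(sampleSummary);
```

A built-in API would avoid this second pass over the rendered table, which is the point of the request.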

Diff corrupt if 2 columns are equal

Hello, your daff tool is very convenient to use (great work), but I have a problem with it:

Assume the following file:
A,B,C
a,b,c

Change the file to:
A,B,C,C
a,b,c,c

Then "git diff" shows that the last column has been removed.

Git Diff :: ENOENT error

I've just installed the npm version (1.3.17) of the app. When I try to run a git diff between two commits, I get the following error:

Error: ENOENT: no such file or directory, open 'C:\Users\chuiv\PhpstormProjects\
    at Error (native)
    at Object.fs.openSync (fs.js:584:18)
    at Object.fs.readFileSync (fs.js:431:33)
    at Object.tio.getContent (C:\Users\chuiv\AppData\Roaming\npm\node_modules\da
    at Object.coopy.Coopy.loadTable (C:\Users\chuiv\AppData\Roaming\npm\node_mod
    at Object.coopy.Coopy.run (C:\Users\chuiv\AppData\Roaming\npm\node_modules\d
    at Object.coopy.Coopy.coopyhx (C:\Users\chuiv\AppData\Roaming\npm\node_modul
    at run_daff_base (C:\Users\chuiv\AppData\Roaming\npm\node_modules\daff\bin\d
    at Object.exports.run_daff_main (C:\Users\chuiv\AppData\Roaming\npm\node_mod
bin\daff.js:8817:14)
fatal: external diff died, stopping at DiscountRates/Additional_parameters.csv

Do you know why?

Configure `git diff` to use a different format when piped into a pager

I often use MacVim as my pager, like so: git diff | gview -. The ANSI escape colour codes don't display well in Vim. With my current configuration (through zsh prezto), git diff displays pretty colours by default, and no colours when piped into a pager like git diff | less. Can daff be configured like that?
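One way a tool could approximate this behaviour is to enable colour only when its output is a terminal. Below is a minimal sketch for Node; shouldUseColor is a hypothetical helper, not part of daff. One caveat: when git itself starts the pager, the diff driver's output is also a pipe, so this check alone would switch colours off there too. The daff usage text lists --color and --no-color for forcing the choice either way.

```javascript
// Hypothetical helper (not part of daff): decide whether to emit ANSI colours.
// `stream` is anything with an isTTY property, e.g. process.stdout in Node.
// `forceColor` mimics an explicit --color (true) or --no-color (false) flag.
function shouldUseColor(stream, forceColor) {
  if (forceColor === true || forceColor === false) return forceColor;
  return Boolean(stream && stream.isTTY);
}

// Prints true in an interactive terminal and false when piped,
// e.g. `node example.js | less`.
console.log(shouldUseColor(process.stdout));
```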

Daff python version does not create a daff executable.

Running pip install daff creates a daff.py executable but not a daff executable. Missing the daff executable breaks the scripts.

Feature Request: Immutable Columns

Is there a way to specify that certain columns are immutable when comparing tables?

More concretely, this would result in a diff that deletes a row and then inserts a row, whenever a field seems to have been updated in a column specified as immutable.

I need this for a project of mine and don't mind coding it for everyone if it is missing.

daff.Table in javascript

Hey

I am trying to use daff.Ndjson, but it requires a Table instance to start with.

    var data1 = new daff.Table()
    var data2 = new daff.Table()
    var table1 = new daff.Ndjson(data1)
    var table2 = new daff.Ndjson(data2)

Gives me an error -- Table isn't defined. I've tried to use TableView and SimpleView, but I get

TypeError: Object #<Object> has no method 'get_width'

Thanks

daff.TerminalDiffRender in javascript

Hey paul,

sorry to keep buggin' ya :)

it looks like TerminalDiffRender isn't exposed in javascript. Can you expose it?

cheers,
Karissa

Display of unicode rightwards arrow in Terminal

I am using daff with git to diff tab-delimited files. I get the following output when diff-ing a simple table (instead of seeing a rightwards arrow):

@@, hello,world
,   1,    2
+++,8,    9
---,3,    4
<E2><86><92>,  5<E2><86><92>0,  6

Note that I can run echo -e "\xe2\x86\x92" from my terminal and get the expected rightwards arrow.

Any ideas what might cause this?

Here are the lines from my .gitconfig:

[diff "daff-tab"]
    command = daff diff --input-format tsv --git

option to set line endings

According to the RFC for CSV and the Tabular Data Package definition, both CRLF and LF should be allowed in CSV files. Currently daff only outputs CSV files with CRLF line endings, which may cause problems when working with files that use LF. (For example, a program I am working on broke recently because it assumes LF line endings, but they changed to CRLF after a git merge.)

I think it would be good to have an option for setting line endings, either letting the user choose or automatically picking the EOL for the operating system. What do you think?
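In the meantime, line endings can be rewritten after the fact. This is a rough sketch only; setLineEndings is a hypothetical post-processing helper, not a daff option, and it rewrites every line break in the text, including any that sit inside quoted cells.

```javascript
// Hypothetical post-processing helper (not part of daff): rewrite every line
// ending in a CSV string to the requested style ("\n" or "\r\n").
// Caveat: line breaks inside quoted cells are rewritten as well.
function setLineEndings(text, eol) {
  return text.replace(/\r\n|\r|\n/g, eol);
}

var crlfCsv = "a,b\r\n1,2\r\n";
var lfCsv = setLineEndings(crlfCsv, "\n");
console.log(JSON.stringify(lfCsv)); // "a,b\n1,2\n"
```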

git diff driver with renamed files dies with “Expected 7 parameters from git, but got 9.”

trsibley@mullion-13 reports (master +=) $ GIT_TRACE=1 git diff --cached
09:18:20.051692 git.c:349               trace: built-in: git 'diff' '--cached'
09:18:20.054729 run-command.c:351       trace: run_command: '/usr/bin/less'
09:18:20.055439 run-command.c:199       trace: exec: '/usr/bin/less'
09:18:20.061627 run-command.c:351       trace: run_command: 'daff diff --color --git' 'reports/autopsy_quality.csv' '/var/folders/9_/78fx8kdx6zg3pkrr7pg3lb7r0000gq/T//8KWu29_autopsy_quality.csv' '76ad6f051d89f2fcfbd75395d
rename from reports/autopsy_quality.csv
rename to reports/autopsy_quality_dna.csv
index 76ad6f0..22ec65a 100644
'
09:18:20.062293 run-command.c:199       trace: exec: '/bin/sh' '-c' 'daff diff --color --git "$@"' 'daff diff --color --git' 'reports/autopsy_quality.csv' '/var/folders/9_/78fx8kdx6zg3pkrr7pg3lb7r0000gq/T//8KWu29_autopsy_
rename from reports/autopsy_quality.csv
rename to reports/autopsy_quality_dna.csv
index 76ad6f0..22ec65a 100644
'
Expected 7 parameters from git, but got 9
fatal: external diff died, stopping at reports/autopsy_quality.csv
trsibley@mullion-13 reports [128] (master +=) $
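The trace suggests that git passes two extra arguments for a renamed file. A diff driver that tolerates both shapes might look like the sketch below. It rests on assumptions: that the seven standard arguments are path, old-file, old-hex, old-mode, new-file, new-hex and new-mode, and that the two extras are the new path and a text block with the rename details. parseGitDiffArgs is a hypothetical name, not daff's actual code, and the sample values are made up.

```javascript
// Sketch of parsing the arguments git hands to an external diff command.
// Assumption: 7 arguments normally, 9 when the file was renamed (the new
// path plus a text block with the rename/index details).
function parseGitDiffArgs(args) {
  if (args.length !== 7 && args.length !== 9) {
    throw new Error(
      "Expected 7 or 9 parameters from git, but got " + args.length
    );
  }
  var parsed = {
    path: args[0],
    oldFile: args[1], oldHex: args[2], oldMode: args[3],
    newFile: args[4], newHex: args[5], newMode: args[6],
    renamedTo: null, renameInfo: null
  };
  if (args.length === 9) {
    parsed.renamedTo = args[7];
    parsed.renameInfo = args[8];
  }
  return parsed;
}

// Made-up placeholder values, not taken from a real repository.
var renamed = parseGitDiffArgs([
  "old.csv", "/tmp/old_copy.csv", "aaaaaaa", "100644",
  "/tmp/new_copy.csv", "bbbbbbb", "100644",
  "new.csv", "rename from old.csv\nrename to new.csv\n"
]);
console.log(renamed.renamedTo);
```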

Update npm package

Currently, in the daff npm package (version 1.3.18):

// node_modules/daff/lib/daff.js:8080
/* $hx_exports = */ typeof window != "undefined" ? window : exports

// fix_exports.js
if (typeof exports != "undefined") {
  // ...
} else {
  // ...
}

This code causes a problem when both window and exports are defined:
coopy is exported to window by $hx_exports, but fix_exports.js looks up coopy in exports.

With Haxe compiler 3.3.0, however, coopy is exported to exports first:

/* $hx_exports = */ typeof exports != "undefined" ? exports : typeof window != "undefined" ? window : typeof self != "undefined" ? self : this

// fix_exports.js
if (typeof exports != "undefined") {
  // ...
} else {
  // ...
}

It would be better to rebuild the package with the new Haxe compiler and publish it again.

Error with null fields in modified rows using DiffRender.render

If a modified row includes a null column, the diff renderer throws an error: Uncaught TypeError: Cannot call method 'indexOf' of null on line 1346.

This is caused by trying to call indexOf on the txt variable, which in this case is null. Lines 1337 and 1338 seem to try to handle null values, but in my case the txt object is actually null, not "NULL" or "null", so it's not being modified here.

I fixed it by adding a similar check for null, shown in theodi/gitlabhq@430ce0a#L2R1339, but I'm not sure this is a great solution; you can probably come up with something neater, knowing exactly what's going on here!
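The kind of guard described above can be sketched as follows. This is an illustration only, not daff's actual code or the linked patch; cellContains is a hypothetical helper that treats null and undefined cells as containing nothing.

```javascript
// Hypothetical helper (not daff's actual code): coerce a cell to a string
// before searching it, so null or undefined cells no longer throw.
function cellContains(txt, needle) {
  if (txt === null || txt === undefined) return false;
  return String(txt).indexOf(needle) >= 0;
}

console.log(cellContains(null, "->"));         // false
console.log(cellContains("0.61->0.62", "->")); // true
```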

UnicodeEncodeError: 'ascii' codec can't encode character

Hey, I am facing an exception when using daff. Traceback:

Traceback (most recent call last):
  File "/usr/local/bin/daff.py", line 8626, in <module>
    Coopy.main()
  File "/usr/local/bin/daff.py", line 2797, in main
    return coopy.coopyhx(io)
  File "/usr/local/bin/daff.py", line 2650, in coopyhx
    tool.saveText(output,render.render(o))
  File "/usr/local/bin/daff.py", line 2011, in saveText
    self.io.writeStdout(txt)
  File "/usr/local/bin/daff.py", line 7352, in writeStdout
    python_lib_Sys.stdout.write(txt)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2192' in position 1782: ordinal not in range(128)

I already checked, and the file has no unicode characters. It is ASCII only.

Issue while converting daff library to java

Hi,
I'm trying to use the daff library in a Java project to diff certain objects and publish an HTML report. To do that, I need to convert the current project into a Java project / jar that can be included as a library.
However, the conversion of the daff project from Haxe to Java isn't working. I installed Haxe on my Windows system as described at http://old.haxe.org/download and followed the instructions for making a Java project at http://old.haxe.org/doc/start/java.
I get the following error: ./coopy/SqlHelper.hx:11: characters 4-92 : Class not found : Map
when I run this command: haxe -main coopy.Coopy -java <sys_path>\daff_java -cp src -D coopyhx_util
Any idea what I could be doing wrong, or whether something special is needed to use the library as a Java project?

table to visual

I wrote this thing to convert a couple of daff tables to a visual for convenience. How do you feel about putting it somewhere in daff core (or maybe even extending it to a version that is smart about 3-table diffs, too)?

If not, I'll open an npm module for it.

https://github.com/karissa/dat-visualdiff/blob/master/lib/dat2daff.js#L33

function tablesToVisual (tables, opts) {
  var flags = new daff.CompareFlags();

  var table1 = tables[0]
  var table2 = tables[1]

  var alignment = daff.compareTables(table1, table2, flags).align();
  var highlighter = new daff.TableDiff(alignment,flags);
  var table_diff = new daff.SimpleTable();
  highlighter.hilite(table_diff);

  if (opts.html) {
    var diff2html = new daff.DiffRender();
    diff2html.render(table_diff);
    var table_diff_html = diff2html.html();
    return table_diff_html
  }
  else {
    var diff2terminal = new daff.TerminalDiffRender();
    return diff2terminal.render(table_diff)
  }
}