thresh's Introduction

thresh (verb): to separate the wheat from the chaff.

Thresh aims to make the processing, manipulation, and analysis of tabular data easy and fun! It lets you get rid of what you don't want (the chaff) and keep what you do want (the wheat).

Examples of possible operations are: extracting columns, manipulating columns, generating columnar data, converting file formats, and making asserts about the data.

Quick Start Examples

See what columns are in a file in human-readable format:

thresh data_1.txt list

Print column names, one per line (useful for bash for-loops):

thresh data_2.csv headerlist

Print to stdout only the columns 'time' and 'stress':

thresh data_1.txt cat time stress

Print to stdout only the columns 'time' and 'stress' in CSV format:

thresh data_1.txt cat time stress print .csv

Save to CSV format only the columns 'time' and 'stress':

thresh data_1.txt cat time stress output data_out.csv

Read in from stdin (use -.csv for CSV format):

cat data_1.txt | thresh - cat time stress

Print the whole file and add a millisecond column called 'mtime':

thresh fizz_=data_1.txt cat fizz_ 'mtime=1000*time'

Print the whole file, minus column 'stress':

thresh A=data_1.txt cat A stress=None

Make an analytic solution with columns 'time' and 'wave':

thresh cat 'time=linspace(0,1,10)' 'wave=sin(time)'

Interpolate data:

thresh in_=data_1.txt cat \
    'time=linspace(min(in_time),max(in_time),100)' \
    'stress=interp(time,in_time,in_stress)'

Do a simple assert on the data (return code 0 if True, 1 if False):

thresh data_1.txt assert 'np.max(np.abs(stress)) < 2.0'

Reading in JSON and making an analytic solution:

thresh foo.json cat "time=[0,1,2]" "stress=stress_mag * np.sin(time)"

Reading in text, CSV, and JSON for an assert:

thresh JSON_=foo.json CSV_=bar.csv TXT_=baz.dat \
    assert "JSON_var1 + CSV_var2 + TXT_var3 == 1.23"

Listing Column Headers

Note: you cannot list more than one file at a time.

See all columns in a file in a simple list.

$ thresh column_data_1.txt headerlist
time
strain
stress

See all columns in a file with extra info in human-readable format.

thresh column_data_1.txt list
 col | length | header
----------------------
   0 |      4 | time
   1 |      4 | strain
   2 |      4 | stress

See the columns of the file you create.

thresh A=data_1.txt cat A 'mtime=1000*time' list
 col | length | header
----------------------
   0 |      4 | time
   1 |      4 | strain
   2 |      4 | stress
   3 |      4 | mtime

Listing a JSON file just gives a pretty-printed version of the file.

thresh data.json list
{'magnitude': 1.23}

Loop over headers (both lines are equivalent).

$ for COL in `thresh column_data_1.txt headerlist`; do echo Found column $COL; done
$ thresh column_data_1.txt headerlist | while read COL; do echo Found column $COL; done
Found column time
Found column strain
Found column stress

Extracting Columns: Rules

Aliases are included to allow disambiguation of columns with the same name in different files. For non-ambiguous column names, you can use the aliased name or the non-aliased name.

Rules governing setting aliases:

  • The alias must be a valid python identifier (variable name)
  • The alias must not be a python keyword ('for', 'while', etc)
  • The alias cannot conflict with a column name in any input file
  • The alias cannot conflict with another alias

Some of these rules can be broken and will not cause any problems unless you try to use an ambiguous name/alias. For example, if one file has a column named 't' and you try to alias a file to 't', you won't get an error unless you try to use the 't' descriptor.
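
The first two rules map onto checks in Python's standard library, so a rough validator is easy to sketch. The function below is illustrative only: `check_alias` and its arguments are hypothetical names, not part of thresh, and it reports every violation up front rather than reproducing the lenient behavior described above, where a conflict only matters once the ambiguous name is used.

```python
import keyword


def check_alias(alias, column_names, other_aliases):
    """Return a list of rule violations for a proposed file alias (empty if none)."""
    problems = []
    if not alias.isidentifier():
        problems.append("not a valid Python identifier")
    if keyword.iskeyword(alias):
        problems.append("is a Python keyword")
    if alias in column_names:
        problems.append("conflicts with a column name")
    if alias in other_aliases:
        problems.append("conflicts with another alias")
    return problems


columns = {"time", "strain", "stress", "t"}  # made-up column names
print(check_alias("A", columns, set()))    # no violations
print(check_alias("for", columns, set()))  # keyword
print(check_alias("t", columns, {"B"}))    # clashes with column 't'
```

Note that `"for".isidentifier()` is true in Python, which is why the keyword check has to be a separate rule.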

Extracting Columns with 'cat'

These are all equivalent and print all the columns.

thresh data_1.txt
thresh data_1.txt cat time strain stress
thresh A=data_1.txt cat A
thresh A=data_1.txt cat Atime Astrain Astress
thresh A=data_1.txt cat Atime strain stress

These are equivalent (concatenate both files together with no repeated column names).

thresh data_1.txt data_2.txt
thresh A=data_1.txt B=data_2.txt cat A B
thresh A=data_1.txt B=data_2.txt cat time Astrain stress Bt eps Bsig

These are equivalent (all of one file and one column of another).

thresh A=data_1.txt data_2.txt cat A sig
thresh A=data_1.txt B=data_2.txt cat A sig
thresh A=data_1.txt B=data_2.txt cat A Bsig
thresh A=data_1.txt B=data_2.txt cat Atime Astrain Astress Bsig
thresh A=data_1.txt data_2.txt cat Atime strain stress sig
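
The equivalences in this section suggest that each column is reachable as alias plus column name (for example `Atime`), and also by its bare name when that name is unique across the inputs. Below is a minimal sketch of such a lookup table, assuming plain dictionaries for the file contents. `build_namespace` is a hypothetical helper, the data is made up, and leaving ambiguous bare names out of the table is an assumption rather than documented thresh behavior.

```python
from collections import Counter


def build_namespace(files):
    """Map 'alias+column' (and unambiguous bare column names) to column data.

    files: dict of alias -> dict of column name -> list of values.
    """
    counts = Counter(col for cols in files.values() for col in cols)
    namespace = {}
    for alias, cols in files.items():
        for col, values in cols.items():
            namespace[alias + col] = values  # always available, e.g. 'Atime'
            if counts[col] == 1:
                namespace[col] = values      # bare name only if unique
    return namespace


files = {
    "A": {"time": [0.0, 1.0], "strain": [0.0, 0.1], "stress": [0.0, 2.0]},
    "B": {"t": [0.0, 1.0], "eps": [0.0, 0.1], "sig": [0.0, 2.1]},
}
ns = build_namespace(files)
print(sorted(ns))
```

With these inputs, `ns["Atime"]` and `ns["time"]` refer to the same list, which mirrors why `cat Atime strain stress` and `cat time strain stress` select the same data.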

Manipulating Columns

Create a new file with a single column called 'mtime', which is the time in milliseconds (all equivalent). The expressions are quoted so the shell does not try to expand the '*'.

thresh data_1.txt cat 'mtime=1000*time'
thresh A=data_1.txt cat 'mtime=1000*time'
thresh A=data_1.txt cat 'mtime=1000*Atime'

Create a new column based on data from a file and then use that new column to create another column.

thresh data_1.txt cat \
  'dstress=np.diff(stress)' \
  'dt=np.diff(time)' \
  'stress_rate=dstress / dt'
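
One detail worth keeping in mind with this example: `np.diff` returns first differences, so its result has one element fewer than its input, and the derived columns are one row shorter than the columns in the file. The sketch below shows the same arithmetic in plain Python with made-up numbers. The `diff` helper is a stand-in written for illustration, not NumPy itself, and how thresh prints columns of differing lengths is not covered here.

```python
def diff(values):
    """First differences, matching np.diff for a 1-D sequence."""
    return [b - a for a, b in zip(values, values[1:])]


time = [0.0, 1.0, 2.0, 4.0]    # made-up sample data
stress = [0.0, 2.0, 3.0, 7.0]

dstress = diff(stress)
dt = diff(time)
stress_rate = [ds / d for ds, d in zip(dstress, dt)]

print(len(time), len(stress_rate))
print(stress_rate)
```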

Creating New Files With No Input File

Create a new file with numbers and their squares.

thresh cat 't=arange(1,6,1)' 'squares=t**2'
WARNING: No files to read in.
                         t                   squares
  +1.00000000000000000e+00  +1.00000000000000000e+00
  +2.00000000000000000e+00  +4.00000000000000000e+00
  +3.00000000000000000e+00  +9.00000000000000000e+00
  +4.00000000000000000e+00  +1.60000000000000000e+01
  +5.00000000000000000e+00  +2.50000000000000000e+01
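
The layout of the output above can be reproduced with ordinary Python format specifiers. The sketch below assumes each field is right-aligned in 26 characters, with an explicit sign and 17 digits after the decimal point. That layout is inferred from the sample output, not taken from thresh's source, and `render_columns` is a hypothetical name.

```python
def render_columns(columns):
    """Format a dict of equal-length columns as whitespace-delimited text."""
    width = 26
    lines = ["".join(f"{name:>{width}}" for name in columns)]
    n_rows = len(next(iter(columns.values())))
    for row in range(n_rows):
        lines.append(
            "".join(f"{values[row]:+{width}.17e}" for values in columns.values())
        )
    return "\n".join(lines)


t = [float(i) for i in range(1, 6)]  # same values as arange(1,6,1)
squares = [x**2 for x in t]
print(render_columns({"t": t, "squares": squares}))
```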

Create a new file that has a sine wave and a noisy sine wave.

thresh cat \
  't=linspace(0.0,pi,100)' \
  'sine=sin(t)' \
  'noisey=sine+random.uniform(-1.0,1.0,len(sine))'

Performing an Assert

In some instances, you will want to make checks/asserts on the data and get feedback in the form of a return code (for example, in automated tests). One or more assert statements can be given, and compound statements are okay. Each result is cast to a boolean, and the program terminates with a return code of 0 if every assert evaluates to True and 1 otherwise.
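
As a rough illustration of that mapping from truth values to a return code, here is a minimal sketch. It is not thresh's implementation: `assert_return_code` is a hypothetical helper, the data is made up, and plain Python lists stand in for the NumPy columns used in the commands below.

```python
def assert_return_code(expressions, namespace):
    """Return 0 if every expression is truthy in the namespace, else 1."""
    failed = []
    for expr in expressions:
        # eval() is tolerable here only because the expressions come from the
        # person running the command; never use it on untrusted input.
        if not bool(eval(expr, dict(namespace))):
            failed.append(expr)
    for expr in failed:
        print(f"assert failed: {expr}")
    return 1 if failed else 0


columns = {"stress": [0.5, -1.5, 1.0], "strain": [0.0, 0.1, 0.2]}  # made-up data
code = assert_return_code(
    ["max(abs(s) for s in stress) < 2.0", "all(e >= 0 for e in strain)"],
    columns,
)
print(code)
```

Evaluating each expression separately, as above, makes it possible to report which check failed, whereas a single compound expression only yields one overall result.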

Do a simple assert on the data.

thresh data_1.txt assert "abs(max(a)-6.0) < 1.0e-6"

Do an assert on data in a JSON file.

thresh data.json assert "foo == 123"

Do a less simple assert.

thresh data_1.txt \
    cat 'stress_rate=np.diff(stress)/np.diff(time)' \
    assert 'np.max(np.abs(stress_rate)) < 2.0'

Use multiple asserts (all asserts must pass for 0 return code).

thresh data_1.txt \
    cat 'stress_rate=np.diff(stress)/np.diff(time)' \
    assert \
        'np.max(np.abs(stress_rate)) < 2.0' \
        'np.all(strain >= 0)'

Use a compound statement.

thresh data_1.txt \
    cat 'stress_rate=np.diff(stress)/np.diff(time)' \
    assert 'np.max(np.abs(stress_rate)) < 2.0 or np.all(strain >= 0)'

Saving output

Several different output formats are supported.

Regular whitespace-delimited output to stdout:

thresh data_1.txt print

CSV output to stdout:

thresh data_1.txt print .csv

Regular whitespace-delimited output to foo.txt:

thresh data_1.txt output foo.txt

CSV output to foo.csv:

thresh data_1.txt output foo.csv

Using columns with special characters

Some column names will have special characters that would make the column name invalid in python syntax. The unaliased bad name can be used for including the whole column but cannot be used in calculations.

thresh data.txt cat "-bad_name%"

The work-around for using the column in a calculation (cat or assert) requires that the file in question be aliased and then accessed via the special __aliases dictionary:

thresh foo_=data.txt assert \
  "max(__aliases['foo_']['-bad_name%']) > 1"

Note: While columns with special names may be accessed this way, they cannot be assigned in this way.
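
The reason for the work-around is easier to see in plain Python: a name like `-bad_name%` is not a valid identifier, so it can only be reached as a string key. The sketch below mimics the nested lookup with an ordinary dictionary. The namespace layout (alias, then column name, then values) is assumed from the command above, and the data is made up.

```python
bad_name = "-bad_name%"
print(bad_name.isidentifier())  # False: it cannot appear as a bare name

# Hypothetical namespace: alias -> column name -> values (made-up data).
namespace = {"__aliases": {"foo_": {bad_name: [0.5, 2.0, 1.5]}}}

expression = "max(__aliases['foo_']['-bad_name%']) > 1"
result = eval(expression, namespace)
print(result)  # True, since the largest value is 2.0
```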

thresh's People

Contributors

sswan

thresh's Issues

Automatically populate columns to assert

When you do:

thresh A=foo.txt B=bar.csv assert "np.allclose(Acol1,Bcol1)"

The columns are not automatically populated into the "assert" namespace, but they should be.

Allow multiple asserts

Right now, only a single assert is allowed. We want to be able to do:

thresh A=foo.txt B=bar.csv assert \
  "np.allclose(Acol1,Bcol1)" \
  "np.max(Acol2) == np.max(Bcol2)" \
  "len(Acol3) == 24"

This allows better feedback on what went wrong, instead of the previous approach where every check had to be strung together on one line with "and"s.

preload json file into the namespace

We need a way to get a dict of variables into the thresh namespace to help us in doing asserts. Maybe something like this:

$ cat vars.json
{
  "density_gold": 123.456
}
$ thresh preload=vars.json A=data.txt cat "errors=density_gold-Adensity" assert "max(errors) < 1.0e-6"

Make default output to have 17 digits of precision

Doubles require 17 digits of precision to exactly reproduce what was in memory. For example: (in-memory double) --> (print with 17 digits scientific notation) --> (read in text) --> (reproduce original in-memory double)
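
The round trip can be checked directly in Python. In the sketch below, 17 significant digits correspond to the format `.16e` (one digit before the decimal point plus 16 after it). The value `0.1 + 0.2` is just a convenient example of a double that does not survive with fewer digits.

```python
x = 0.1 + 0.2  # a double with no short exact decimal form

text_17 = f"{x:.16e}"  # 17 significant digits
text_16 = f"{x:.15e}"  # 16 significant digits

print(text_17, float(text_17) == x)  # round trip succeeds
print(text_16, float(text_16) == x)  # round trip fails for this value
```

The sample output elsewhere in this document prints 17 digits after the decimal point, which is one more than strictly required for an exact round trip.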

Single row txt and csv files are corrupted

It looks like the txt and csv parsers choke on single-row files:

mswan@cee-pp-ldrd02 $ cat bad_single_row.txt 
a b c
1 2 3
mswan@cee-pp-ldrd02 $ thresh bad_single_row.txt 
                         a
  +1.00000000000000000e+00
  +2.00000000000000000e+00
  +3.00000000000000000e+00
mswan@cee-pp-ldrd02 $ cat bad_single_row.csv 
a,b,c
1,2,3
mswan@cee-pp-ldrd02 $ thresh bad_single_row.csv
                         a
  +1.00000000000000000e+00
  +2.00000000000000000e+00
  +3.00000000000000000e+00
mswan@cee-pp-ldrd02 $ thresh cat "a=[1]" "b=[2]" "c=[3]"
WARNING: No files to read in.
                         a                         b                         c
  +1.00000000000000000e+00  +2.00000000000000000e+00  +3.00000000000000000e+00

Allow `output` and `assert` in the same command

Allow something like

thresh foo.txt cat u1=col2 u2=col1 output gen.csv assert "np.linalg.norm(u1-u2) < 1.0e-6"
thresh foo.txt cat u1=col2 u2=col1 assert "np.linalg.norm(u1-u2) < 1.0e-6" output gen.csv
