
torch-dataframe's Introduction

Licence: MIT | Build Status | codecov

Dataframe

Dataframe is a Torch7 class to load and manipulate tabular data (e.g. Kaggle-style CSVs) inspired by R's and pandas' data frames.

As of release 1.5 it fully supports the torchnet data structure. It also has custom iterators for convenient integration with torchnet's engines; see the mnist example. As of release 1.6 the internal storage has changed to tensors.

For a more detailed look at the changes between the versions have a look at the NEWS file.

Requirements

Installation

You can clone this repository or directly install it through luarocks:

git clone https://github.com/AlexMili/torch-dataframe
cd torch-dataframe
luarocks make rocks/torch-dataframe-scm-1.rockspec

The same in one line:

luarocks install torch-dataframe scm-1

or

luarocks install torch-dataframe

Changelog

Version: 1.7

  • Added faster torch.Tensor functions to the fill/stat functions
  • Added mutate function to Dataseries
  • __index__ access for Df_Array
  • More complete documentation for Df_Array and specs
  • Df_Dict elements can be accessed using myDict[index] or myDict["$colname"]
  • Df_Dict key property available. It lists the Df_Dict's keys
  • Df_Dict length property available. It lists, by key, the length of its content
  • Df_Dict check_length() checks if all elements have the same length
  • Df_Dict set_keys(table) replaces all keys with those in the given table (must be the same size)
  • More complete documentation for Df_Dict and specs
  • More complete documentation for Df_Tbl and specs
  • Internal methods _infer_csvigo_schema() and _infer_data_schema() renamed to _infer_schema()
  • Type inference is now based on type frequencies, but if a single double/float is encountered in an integer column the whole column is considered double/float
  • It is now possible to directly set a schema for a Dataframe, without any checks, with set_schema(). Use it wisely
  • Possibility to init a Dataframe with a schema, a column order and a number of rows with the internal method _init_with_schema()
  • Added bulk_load_csv() method which loads large CSV files using threads but without checking missing values or data integrity. Use with caution. See #28
  • Added load_threadcsv()
  • Added the possibility to create empty Dataseries
  • Added Dataseries load() method to directly load a tensor or tds.Vec in memory without any check
  • Added iris dataset in /specs/data
  • New specs structure
  • Fixed CSV loading when there is no header, and added a corresponding test case
  • Changed assert_is_index return value to true on success instead of self

See NEWS.md file for previous changes.

Usage

Named arguments

The Dataframe relies on argcheck for parsing arguments. This means that you can use named parameters with the function{arg_name=value} syntax. Named arguments are supported by all functions except the constructor, and for certain functions they are mandatory in order to avoid ambiguity.

The argcheck package also works as the API documentation. It checks arguments and if you happen to provide the function with invalid arguments it will automatically output the function documentation.

Important: Due to limitations in the Lua language, the package uses helper classes to separate plain Lua tables passed as data from the table syntax used for named arguments. The three classes are (a short usage sketch follows the list):

  • Df_Array - contains only values and no keys
  • Df_Dict - a dictionary table that has named keys that map to all values
  • Df_Tbl - a raw table wrapper that does a shallow argument copy
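
A minimal usage sketch of the three wrappers (the column names and values are made up for illustration; Df_Tbl is mostly used internally):

require 'Dataframe'

df = Dataframe()

-- Df_Dict: named keys mapping to the column values
df:load_table{data=Df_Dict{score={1,2,3}, label={4,5,6}}}

-- Df_Array: values only, no keys (here used for row selection)
subset = df[Df_Array(1,3)]

-- Df_Tbl: a raw table wrapper that performs a shallow copy
wrapped = Df_Tbl({1,2,3})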

Load data

Initiate the object:

require 'Dataframe'
df = Dataframe()

Load CSV file:

df:load_csv{path='./data/training.csv', header=true}

Load from table:

df:load_table{data=Df_Dict{firstColumn={1,2,3},
                           secondColumn={4,5,6}}}

You can also instantiate the object with a csv-filename or a table by passing the table or filename as an argument:

require 'Dataframe'
df = Dataframe('./data/training.csv')

Data inspection

You can discover your dataset with the following functions:

-- you can either view the data as a plain text output or itorch html table
df:output() -- prints html if in itorch otherwise prints plain table
df:output{html=true} -- forces html output

df:show() -- prints the head + tail of the table

-- You can also directly call print() on the object
-- and it will print the ascii-table
print(df)

General dataset information can be found using:

df:shape() -- returns {rows=3, cols=3}
#df -- gets the number of rows
df:size() -- returns a tensor with the sizes {rows, columns}
df.column_order -- table of column names
df:count_na() -- counts the missing values by column name

If you want to inspect random elements you can use get_random():

df:get_random(10):output()

Manipulate

You can manipulate it:

df:insert(Df_Dict({['first_column']={7,8,9},['second_column']={10,11,12}}))
df:remove_index(3) -- remove line 3 of the entire dataset

df:has_column('x') -- returns true if the column exists
df:get_column('y') -- returns column 'y' as a table
df["$y"] -- alias for get_column

df:add_column('z', 0) -- Add column with default value 0 at the end (right side of the table)
df:add_column('first_column', 1, 2) -- Add column with default value 2 at the beginning (left side of the table)
df:drop('x') -- delete column
df:rename_column('x', 'y') -- rename column 'x' to 'y'

df:reset_column('my_col', 0) -- reset the given column with 0
df:fill_na('x', 0) -- replace missing values in 'x' column with 0
df:fill_all_na(0) -- replace all missing values with the value 0

df:unique('col_name') -- return table with unique values of the given column
df:unique('col_name', true) -- return table with unique values of the given column as keys

df:where('column_name','my_value') -- find the first row where the column has the given value

-- Update all rows fulfilling the condition defined in the first lambda
df:update(function(row) return row['column'] == 'test' end,
          function(row) row['other_column'] = 'new_value' return row end)

Categorical variables

You can define categorical variables that are treated internally as numbers ranging from 1 to n levels while being displayed as strings. The numeric representation is retained when exporting with to_tensor, allowing a simpler understanding of a classifier's output:

df:as_categorical('my string column') -- converts a column to categorical
df:get_cat_keys('my string column') -- retrieves the keys used for the conversion
df:to_categorical(Df_Array({1,2,1}), 'my string column') -- converts numbers to the categories

Subsetting

You can subset your data using:

df:head(20) -- print 20 first elements (10 by default)
df:tail(5) -- print 5 last elements (10 by default)
df:show() -- print 10 first and 10 last elements

df[13] -- returns a table with the row values
df["13:17"] -- returns a Dataframe with values in that span
df["13:"] -- returns a Dataframe with values starting from index 13
df[Df_Array(1,3,4)] -- returns a Dataframe with values index 1,3 and 4

Exporting

Finally, you can save your dataset to a tensor (only numerical/categorical columns will be included):

df:to_tensor{filename = './data/train.th7'} -- saves data
data = df:to_tensor{columns = Df_Array('first_column', 'my string column')} -- Converts the two columns into tensor

or to CSV:

df:to_csv('data.csv')

Batch loading

The Dataframe provides a built-in system for handling batch loading. It also has an extensive set of samplers that you can use; see the API docs for which samplers are available.

The gist of it is:

  • The main Dataframe is initialized for batch loading by calling create_subsets. This creates random subsets that have their own samplers. The default split is 70% train, 20% validate, and 10% test, but you can choose any split and any names.
  • Each subset is a separate dataframe subclass that has two columns, (1) indexes with the corresponding index in the main dataframe, (2) labels that some of the samplers require.
  • When you want to retrieve a batch from a subset you call the subset using my_dataframe:get_subset('train'):get_batch(30) or my_dataframe['/train']:get_batch(30).
  • The batch returned is also a subclass with a custom to_tensor function that returns the data and corresponding label tensors. You can provide custom functions that receive the full row as an argument, allowing you to use e.g. a filename to load an external resource.

A simple example:

local df = Dataframe('my_csv'):
	create_subsets()

local batch = df["/train"]:get_batch(10)
local data, label = batch:to_tensor{
	load_data_fn = my_image_loader
}

As of version 1.5 you may also want to consider using the iterators that integrate with the torchnet infrastructure. Take a look at the iterator API and the mnist example for how an implementation may look.

Tests

The package contains an extensive test suite and tries to apply a behavior-driven development approach. All features should be accompanied by a test case.

To launch the tests you need to install busted (See: Olivine-Labs/busted) via luarocks:

luarocks install busted

then you can run all tests via command line:

cd specs/
./run_all.sh

Documentation

The package relies on self-documenting functions via the argcheck package; the generated documentation resides in the doc folder. The GitHub Wiki is intended for more extensive, in-depth documentation.

To generate the documentation please run:

th doc.lua > /dev/null

Contributing

See CONTRIBUTING.md for further details.


torch-dataframe's Issues

Improved manual

The README is currently lagging. We need to cover:

  • batch functionality
  • exemplify how to set up a network using the dataframe
  • the basic statistics available

... and more. I'm not sure if it's best to build a larger README or if we should start a doc-folder. Anyway the README requires a TOC to start with.

Code organization

I was thinking about a better way to organize the growing main file and tests. It's getting bigger and bigger after each of your great contributions, which is great. But it's also getting harder and harder to read and follow.
I was thinking about splitting each file into multiple ones, but I don't yet know what the splitting logic should be. Do you have any ideas?

Error showing df after dropping column.

This exhibits an issue I'm having working with the df after manipulating it. Perhaps 'firstColumn' is not being removed from a data structure after being dropped.

require 'Dataframe'
df = Dataframe()
df:load_table{data=Df_Dict{firstColumn={1,2,3}, secondColumn={4,5,6}}}
df:show() --Shows table as expected. 
df:drop('firstColumn')
df:show() --Error case.

The error:

/Users/tu/torch/install/bin/luajit: ...ch/install/share/lua/5.1/Dataframe/Extensions/column.lua:191: Could not find column: firstColumn
stack traceback:
        [C]: in function 'assert'
        ...ch/install/share/lua/5.1/Dataframe/Extensions/column.lua:191: in function 'get_column'
        ...ch/install/share/lua/5.1/Dataframe/Extensions/output.lua:145: in function '__tostring__'
        ...ch/install/share/lua/5.1/Dataframe/Extensions/output.lua:43: in function 'output'
        ...ch/install/share/lua/5.1/Dataframe/Extensions/output.lua:66: in function 'show'
        run.lua:7: in main chunk
        [C]: in function 'dofile'
        ...s/tu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x01005f5d20

Converting categorical values (strings) to integers while keeping a translation table

Quite frequently data in a CSV represents non-numeric variables, e.g. male/female or dog/cat/horse, and it would be useful to import these as strings and convert them into integers between 1 and #myDF:unique("String Col"), i.e. the behaviour of pandas' categorical dtype.

Keys

The keys for the conversion should be saved in a separate translator table initiated in __init(). There are generally two directions for factors: to and from the numeric value. I suggest that one table, the to-numeric one, is kept in a self.categorical that has keys according to column names.
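
A sketch of what such a to-numeric table could look like (structure and values are illustrative, not the actual implementation; in the proposal above it would be stored as self.categorical):

-- to-numeric translation table, keyed by column name
categorical = {
  gender = {male = 1, female = 2},
  animal = {cat = 1, dog = 2, horse = 3}
}

-- the numeric-to-string direction can be derived by inverting a column's table
from_numeric = {}
for level, num in pairs(categorical.gender) do
  from_numeric[num] = level
end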

New functions

Dataframe:as_categorical A factor conversion function for populating self.factors combined with converting the values to numerics. Also updates self.schema.
Dataframe:to_categorical Takes a number, a tensor or a table and converts it to a string value, or a table of strings if the length > 1
Dataframe:from_categorical Takes a string, or a table of strings, and converts them to a factor according to the key values
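
A hedged usage sketch of how the three functions could be called (as_categorical and to_categorical match the API shown earlier in the README; the from_categorical signature here is an assumption):

df:as_categorical('gender')                                       -- strings -> numbers internally
strs = df:to_categorical(Df_Array({1,2,1}), 'gender')             -- numbers -> strings
nums = df:from_categorical(Df_Array({'male','female'}), 'gender') -- strings -> numbers (assumed signature)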

Adaptation of functions

update/insert/set Any change must be checked. If a numeric value is entered in a categorical column, should this be accepted or should we force the input to match the key table?
load_csv/load_table should call __init() before running to clear all data including the categorical table
_refresh_metadata should also include a check if the categorical is still present in the dataset
get_column should have an option of getting categorical_as_string = true for forcing strings as default.
reset_column should delete the categorical data and issue a warning
rename_column must update the categorical table
fill_na and fill_all_na must respect the categories. Not sure how to approach the default case for fill_all_na
to_csv should create a copy with strings before passing the dataset onto csvigo
head/tail/show/unique need to have categorical_as_string = true option
where needs to check if the column is a categorical, perhaps add the categorical_as_string = true option

It is a rather extensive change that is needed but I would really love this feature. Any thoughts on the names or the approach?

__init fails with table input

Noticed that the Dataframe can't be initialized with a simple table:

function df_tests.__init()
   local a = Dataframe("simple_short.csv")
   tester:eq(a:shape(), {rows=4, cols=3},
     "The simple_short.csv is 4x3")

   local b = Dataframe({['first_column']=3,['second_column']={10,11,12}})
   tester:assertTableEq(b:get_column("first_column"), {3,3,3})
   tester:assertTableEq(b:get_column("second_column"), {10,11,12})
end

I added the ability to pass an argument to __init for quickly loading a CSV, but hadn't checked whether it works with tables, and unfortunately it seems that dok.unpack has issues with tables as arguments.

Internal storage

I think that it would be beneficial to use Torch's tensors as internal storage instead of tables. This would give us:

  • more efficient storage (the tables' flexibility probably has a cost)
  • a reduced risk of conversion issues in to_tensor()
  • a separation of float/double from integers, which would be beneficial in the output functions.

The API changes would probably mostly affect get_column(), where as_tensor should default to true. This isn't something that I plan to pursue at the moment, but I figured I'd add it here as it could be worthwhile considering.

Dataframe slow when loading large csv file

Hello, I'm trying to load a 1.5GB csv file called 'train.csv'.

Here's what I type:

o = Dataframe()
o:load_csv{'../data/train.csv',verbose=true}

Then it outputs

<csv>   parsing file: ../data/train.csv 
<csv>   parsing done    

and it hangs there.

Help please?

Batch loading

It would be neat if the Dataframe could deliver tensor pairs for training. Since I work with images my CSV files frequently look like this:

| file            | gender | weight |
|-----------------|--------|--------|
| a_pic.png       | male   | 73     |
| another_pic.png | female | 55     |
| ...             |        |        |

Now it would be great to get the weight and gender as the label tensor and the pics loaded as the data tensor, i.e.:
data, label = my_dataframe.loadBatch(batch_no, batch_size, shuffle, data_loader_function)

Where the data_loader_function expects a row and returns a tensor. The tensors are then appended at the end.
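
A minimal sketch of such a data_loader_function, assuming the torch image package and that the row arrives as a plain Lua table (the loadBatch call above is the proposal, not an existing API; the target size is made up):

local image = require 'image'

-- takes a row table and returns the data tensor for that sample
local function data_loader_function(row)
  local img = image.load(row.file, 3, 'float') -- 3 channels, float precision
  return image.scale(img, 224, 224)            -- resize to a fixed input size
end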

Solution structure

The data shuffle order should be a separate parameter, generated via torch.randperm(self.n_rows) and set in _refresh_metadata.
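
For reference, a standalone sketch of the proposed shuffle order (n_rows is illustrative):

local n_rows = 100
-- a random permutation of 1..n_rows, to be regenerated when the metadata is refreshed
local shuffle_order = torch.randperm(n_rows)

-- picking the first shuffled row index:
local row_idx = shuffle_order[1]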

Possible issues

It may be tricky to do mean subtraction; we could provide a data manipulation function, although that should probably be done outside the batch loader.

Changing dok.unpack to argcheck

The dok.unpack has been superseded by argcheck, which allows faster unpacking of arguments due to compilation during function construction. The package allows mixed inputs in two different ways: overloading or a custom istype. Below are implementations of both:

local argcheck = require 'argcheck'
addfive = argcheck{
  {name="x", type="number"},
  call = function(x) -- called in case of success
           print(string.format('%f + 5 = %f', x, x+5))
         end
}

addfive = argcheck{ -- overwrite it
  {name="str", type="string",
    help = "Must be a numeric string",
    check = function(x) return tonumber(x) ~= nil end},
  overload = addfive, -- overload the previous one
  call = function(str) -- called in case of success
           addfive(tonumber(str))
         end
}

addfive = argcheck{ -- overwrite it
  {name="str", type="torch.*Tensor"},
  overload = addfive, -- overload the previous one
  call = function(tnsr) -- called in case of success
           for i=1,tnsr:size(1) do
             addfive(tnsr[i])
           end
         end
}

env = require 'argcheck.env' -- retrieve argcheck environement
env.istype = function(obj, typename)
  if (typename == "number|string") then
    return torch.type(obj) == "number" or
      torch.type(obj) == "string"
  end
  if (typename == "torch.*Tensor") then
    -- regular expressions don't work therefore this
    return torch.type(obj) == "torch.IntTensor" or
      torch.type(obj) == "torch.FloatTensor" or
      torch.type(obj) == "torch.DoubleTensor"
  end
  return torch.type(obj) == typename
end
alt_addfive = argcheck{
  {name="x", type="number|string"},
  call = function(x) -- called in case of success
           if (type(x) == 'string') then
             assert(tonumber(x) ~= nil, ("String '%s' can't be converted to a number"):format(x))
             x = tonumber(x)
           end
           print(('alt %f + 5 = %f'):format(x, x+5))
         end
}

alt_addfive = argcheck{
  {name="x", type="torch.*Tensor"},
  overload = alt_addfive,
  call = function(tnsr) -- called in case of success
          for i=1,tnsr:size(1) do
            alt_addfive(tnsr[i])
          end
        end
}

addfive(2)
addfive ("2")
addfive(torch.Tensor({10,20}))

alt_addfive(2)
alt_addfive ("2")
alt_addfive(torch.Tensor({10,20}))

local status, err = pcall(function () addfive("2a") end)
print("\nOverload error")
print(err)

local status, err = pcall(function () alt_addfive("2a") end)
print("\nThe alternative assert  error")
print(err)

This prints as:

$ th test.lua 
2.000000 + 5 = 7.000000 
2.000000 + 5 = 7.000000 
10.000000 + 5 = 15.000000   
20.000000 + 5 = 25.000000   
alt 2.000000 + 5 = 7.000000 
alt 2.000000 + 5 = 7.000000 
alt 10.000000 + 5 = 15.000000   
alt 20.000000 + 5 = 25.000000   

Overload error  
[string "argcheck"]:67: 
Arguments:

({
   x = number  -- 
})


or

Arguments:

({
   str = string  -- Must be a numeric string
})


or

Arguments:

({
   str = torch.*Tensor  -- 
})


Got: string
invalid arguments!  

The alternative assert  error   
test.lua:47: String '2a' can't be converted to a number

As the torch-dataframe package relies heavily on the : operator, which provides the function with a hidden self argument, this has to be added through:

local argcheck = require 'argcheck'

test = torch.class('test')
test.__init =  argcheck{
  {name="self", type = "test"},
  call = function(self)
    self.data = "my_data"
  end
}

test.test = argcheck{
  {name="self", type = "test"},
  call = function(self) -- called in case of success
           print(self.data)
         end
}

a = test.new()
a:test()
a.test() -- should fail as this is an invalid call

that will print the following:

$ th test.lua 
my_data 
/home/max/tools/torch/install/bin/luajit: [string "argcheck"]:28: 
Arguments:

({
   self = test  -- 
})


Got:
invalid arguments!
stack traceback:
    [C]: in function 'error'
    [string "argcheck"]:28: in function 'test'
    test.lua:20: in main chunk
    [C]: in function 'dofile'
    ...ools/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00405d70

I'm not sure if a wrapper around argcheck should be built that always adds the {name="self", type="Dataframe"} - it could lead to confusion as the function must be defined with the self argument.
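
For illustration, a minimal sketch of such a wrapper, assuming argcheck's table specification form (none of this is part of the package, and get_first_value is a made-up method name):

local argcheck = require 'argcheck'

-- prepends the hidden self argument to every specification
local function df_argcheck(spec)
  table.insert(spec, 1, {name="self", type="Dataframe"})
  return argcheck(spec)
end

Dataframe.get_first_value = df_argcheck{
  {name="column_name", type="string"},
  call = function(self, column_name)
    return self:get_column(column_name)[1]
  end
}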

API - documentation

You have currently used a system similar to LDoc but it doesn't entirely follow its rules. It may be worth changing the syntax to something that provides automated documentation for at least the API. Coming from R, I'm used to roxygen2, which handles this entire problem in an excellent fashion.

I also like the dok.unpack functionality and it basically takes care of all the documentation (except the return value). Ideally I would like to combine the dok.unpack with LDoc but I guess that would require adding an LDoc extension. I'm pretty sure that there is some way of getting the torch/dok to generate documentation but for a documentation package it is surprisingly poorly documented.

Adding metatable functions

I think it would be nice to add some torch class metatable functionality in addition to the __tostring__. The available methods according to the docs are:

> for k, v in pairs(torch.getmetatable('torch.CharStorage')) do print(k, v) end

__index__       function: 0x1a4ba80
__typename      torch.CharStorage
write           function: 0x1a49cc0
__tostring__    function: 0x1a586e0
__newindex__    function: 0x1a4ba40
string          function: 0x1a4d860
__version       1
read            function: 0x1a4d840
copy            function: 0x1a49c80
__len__         function: 0x1a37440
fill            function: 0x1a375c0
resize          function: 0x1a37580
__index         table: 0x1a4a080
size            function: 0x1a4ba20

__index__

Single number

Calling __index__ with a number should naturally return a single row. The return object should be a dataframe and either we change the current get_row to use the _create_subset or we call that directly.

A table or tensor with integers

This should probably work just as the _create_subset

A string value

I'm not sure if it is a good idea but a string value could result in get_column since columns are always strings.

__newindex__

Single number

A simple call to _update_single_row after asserting that the index exists.

If index is self.n_rows + 1 then insert should be invoked.

String

If the column exists, it should drop the column if the argument is nil, otherwise throw an error.

If the string is non-existent, it should call add_column

__len__

Returns the self.n_rows

size

Should return shape()["rows"] when called with size(1) and shape()["cols"] when called with size(2). Called without an argument it returns an unnamed table with {no_rows, no_cols}; possibly it should return a tensor to stick with the Torch spirit

copy

A call to clone and _copy_meta should do it.

The `as_categorical` needs a more stable number-to-label connection

The as_categorical needs options for deciding the numerical order of the parameters. If a dataset changes and a user adds/removes a class, the numbers may change drastically.

The parameters should also be sorted in alphabetical order. Currently a category may receive a number depending on the Lua table hash function, which may be unpredictable.

Add compatibility to make integration into Torchnet simpler

Facebook's Torchnet (just released) has its own dataset solution. It lacks a CSV interface, handling of categories, and core statistics. From what I understand, it is mostly an approach to sampling that requires datasets to implement two functions:

  • dataset:size() which returns the size of the dataset.
  • dataset:get(idx) where idx is a number between 1 and the dataset size.

This would require changing the Dataframe:size() function in the develop branch to return only the number of rows instead of both rows and columns. The get() is synonymous with get_row() if I understand this correctly:

In torchnet, a sample returned by dataset:get() is supposed to be a Lua table. Fields of the table can be arbitrary, even though many datasets will only work with torch tensors.

The latter sentence suggests that changing the internal storage (issue #16) may be wise for optimal integration.
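
A minimal plain-Lua sketch of an adapter exposing the two required methods on top of a Dataframe (the wrapper name is made up; shape() and get_row() are the existing API):

-- a thin adapter exposing the torchnet-style dataset interface
local DfDataset = {}
DfDataset.__index = DfDataset

function DfDataset.new(df)
  return setmetatable({df = df}, DfDataset)
end

function DfDataset:size()
  return self.df:shape()["rows"]  -- number of rows only
end

function DfDataset:get(idx)
  return self.df:get_row(idx)     -- a Lua table, as torchnet expects
end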

Switch `load_batch` to use torch-dataset’s samplers

There are multiple batch loading structures that are worth considering. The current implementation only allows linear and permuted. This proposal is an enhancement that will extend the available sampling methods.

The torch-dataset

The excellent torch-dataset is Twitter’s solution to large datasets and has an extensive sampler solution:

  • LinearSampler: Linear walk - straight walk through the data, useful for non-training. Needs reset or the function only returns nil after 1 epoch
  • UniformSampler: Uniform never ending sampling
  • PermutationSampler: Permutations with shuffling after each epoch. Needs reset or the function only returns nil after 1 epoch
  • LabelDistributionSampler: label-uniform, per-label uniform sampling
  • LabelPermutationSampler: label-permutation, per-label permutations

There are also some adaptations to SlowFS, where the dataset is split over several files. This is beyond the scope of torch-dataframe, which frequently relies on knowing what's in the data.

Refactoring in torch-dataframe

Implementing the solution would require some refactoring. Currently we rely on init_batch for setting up the data split, and then we use load_batch for sampling the data and converting it into two tensors: the data and the labels that will be used for training/testing. For convenience the function also returns the names of the labels, making multi-label testing with feedback easier. The suggested refactoring would be:

  • The main Dataframe is initialized using create_subsets with a train, validate and test split, i.e. similar to current init_batch
  • Each subset is a separate dataframe subclass that has two columns, (1) indexes with the corresponding index in the main dataframe, (2) labels that some of the samplers require. This means that the labels are duplicated, which is perhaps a little waste of space, but I don't think torch-dataframe is the best solution for such huge datasets anyway
  • When you want to retrieve a batch from a subset you call the subset using my_dataframe:get_subset('train'):get_batch(30) or my_dataframe['/train']:get_batch(30). The subset will thus need to store a reference to the main dataframe, but since we're not doing any copying I think this is a better solution than my_dataframe:get_batch('train', 30). The get_subset is the previous get_batch. The batch returned is also a subclass that has a custom to_tensor function that returns the data and corresponding label tensors

Parallelization

A get_batch should be quick, while the to_tensor may be more time-consuming. This also allows the batch dataframe to be sent to a parallel thread without worrying about offsets etc.

Adaptations to samplers

There will be some need for sampler adaptation:

  • Add argcheck to samplers
  • Add samplers as an extension to the subset class
  • Change the index object and make everything more explicit
  • Move test cases into the busted framework

Column order

As the network expects labels to appear in a certain order, it is important that the tensor follows the exact structure of the CSV file. From my understanding the key order in Lua is undefined, and it may therefore be beneficial to add an order that is derived from the CSV file.

Implementation details

By adding a self.column_order to the file that is populated by load_csv/load_table we can make sure that the columns in the to_tensor are always in the same order. We could use the csvigo fromcsv function as a source of inspiration. The head, tail, and show functions should of course respect the column order.
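
For reference, the column_order table documented earlier in the README can already be iterated in CSV order; a minimal sketch:

-- iterate columns in the order they appeared in the CSV, not in hash order
for i, col_name in ipairs(df.column_order) do
  print(i, col_name)
end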

Problem with rockspec 1.0.0

I'm trying to build my nvidia-docker image with the package, but I run into this error:

RUN luarocks install https://raw.githubusercontent.com/AlexMili/torch-dataframe/master/torch-dataframe-1.0-0.rockspec
 ---> Running in 187ece69d55c

Error: Directory v1.0-0 not found inside archive v1.0-0.tar.gz
Using https://raw.githubusercontent.com/AlexMili/torch-dataframe/master/torch-dataframe-1.0-0.rockspec... switching to 'build' mode
The command '/bin/sh -c luarocks install https://raw.githubusercontent.com/AlexMili/torch-dataframe/master/torch-dataframe-1.0-0.rockspec' returned a non-zero code: 1

Using torch-dataframe for time-series

First of all I would like to congratulate you for this great project.

I would like to know if it is possible to use torch-dataframe for time-series studies.

Something similar to xts in R.

Example:

TimeSeries1:

| Date                | Values1 |
|---------------------|---------|
| 2016-12-27 21:00:00 | 10.00   |
| 2016-12-27 21:01:00 | 10.01   |
| 2016-12-27 21:02:00 | 10.02   |
| 2016-12-27 21:04:00 | 10.04   |
| 2016-12-27 21:07:00 | 10.07   |

TimeSeries2:

| Date                | Values2 |
|---------------------|---------|
| 2016-12-27 21:00:00 | 20.00   |
| 2016-12-27 21:01:00 | 20.01   |
| 2016-12-27 21:03:00 | 20.03   |
| 2016-12-27 21:05:00 | 20.05   |
| 2016-12-27 21:06:00 | 20.06   |
| 2016-12-27 21:07:00 | 20.07   |

Merge result of TimeSeries1 with TimeSeries2:

| Date                | Values1 | Values2 |
|---------------------|---------|---------|
| 2016-12-27 21:00:00 | 10.00   | 20.00   |
| 2016-12-27 21:01:00 | 10.01   | 20.01   |
| 2016-12-27 21:02:00 | 10.02   | NA      |
| 2016-12-27 21:03:00 | NA      | 20.03   |
| 2016-12-27 21:04:00 | 10.04   | NA      |
| 2016-12-27 21:05:00 | NA      | 20.05   |
| 2016-12-27 21:06:00 | NA      | 20.06   |
| 2016-12-27 21:07:00 | 10.07   | 20.07   |

Applying na.locf to the merged TimeSeries

| Date                | Values1 | Values2 |
|---------------------|---------|---------|
| 2016-12-27 21:00:00 | 10.00   | 20.00   |
| 2016-12-27 21:01:00 | 10.01   | 20.01   |
| 2016-12-27 21:02:00 | 10.02   | 20.01   |
| 2016-12-27 21:03:00 | 10.02   | 20.03   |
| 2016-12-27 21:04:00 | 10.04   | 20.03   |
| 2016-12-27 21:05:00 | 10.04   | 20.05   |
| 2016-12-27 21:06:00 | 10.04   | 20.06   |
| 2016-12-27 21:07:00 | 10.07   | 20.07   |
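
Dataframe has no built-in na.locf; a minimal plain-Lua sketch of the last-observation-carried-forward idea, assuming missing values are marked with the string "NA":

-- fill "NA" entries with the last observed value, in place
local function locf(values)
  local last
  for i = 1, #values do
    if values[i] == "NA" then
      if last ~= nil then values[i] = last end
    else
      last = values[i]
    end
  end
  return values
end

print(table.concat(locf({20.00, 20.01, "NA", 20.03, "NA"}), ", "))
-- 20, 20.01, 20.01, 20.03, 20.03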

Applying na.omit to the merged TimeSeries

| Date                | Values1 | Values2 |
|---------------------|---------|---------|
| 2016-12-27 21:00:00 | 10.00   | 20.00   |
| 2016-12-27 21:01:00 | 10.01   | 20.01   |
| 2016-12-27 21:07:00 | 10.07   | 20.07   |

Many thanks,

Danilo

source code indentation

I looked at the source code and then at your contribution guidelines. Is there any particular reason for the "indentation is a tabulation of size 2" rule?
I have never seen Lua code, or any other code for that matter, with the tabulation size set to 2.

load_table ex from README

I'm failing to run the following script from the README:

require 'Dataframe'
df = Dataframe()    
df:load_table{data={['firstColumn']={1,2,3},['secondColumn']={4,5,6}}} 

I see the load_table() printout followed by this error message:
invalid arguments!
stack traceback:
[C]: in function 'error'
[string "argcheck"]:152: in function 'load_table'
df.lua:3: in main chunk
[C]: in function 'dofile'
...s/tu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x0104f32d20

converting dataframe to tensor

Hi,

Assuming that df is initialised as a dataframe, and that I use df:load_csv(filepath) to open the .csv file, before running print(df:shape()) to return its dimensions, is there an obvious way to convert df to a tensor? I tried new_tensor = torch.Tensor(df) but this didn't work.

It would be nice if the best method to do this were in the README.
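
For reference, the export section earlier in this README suggests the likely answer; a minimal sketch assuming the same to_tensor API:

require 'Dataframe'

df = Dataframe()
df:load_csv{path=filepath, header=true} -- filepath as in the question above
print(df:shape())

-- builds a torch.Tensor from the numerical/categorical columns
new_tensor = df:to_tensor()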
