go-gota / gota
Gota: DataFrames and data wrangling in Go (Golang)
License: Other
I've started looking at reimplementing some of the functionality of https://github.com/tobgu/qcache in Go (currently python and pandas) and found gota. Looks good, keep it up!
I think it would be better to take a Reader as input to ReadCSV than a string. As it is right now I have to read all bytes from a reader into a byte buffer that then has to be converted to a string. That string is then immediately converted into a Reader in ReadCSV.
Do you agree?
How does one compare two DataFrames? Is the functionality of directly comparing two DataFrames for equality desirable?
A first check on dimensions, column types and column names could be useful to quickly reject equality. Column order is important, so two DataFrames that have the same data but whose column order is switched should be marked as not equal.
When comparing columns of different types we can compare element by element, row by row, or we could consider hashing the rows and/or columns and checking the hashes. If the hashes are stored, this approach would allow for faster comparisons when we compare a DataFrame multiple times.
func (d DataFrame) Eq(b DataFrame) bool { ... }
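As a minimal sketch of the reject-then-compare idea, here is the element-wise fallback operating on raw records rather than real DataFrames (the `recordsEqual` helper is hypothetical; a real `Eq` would also compare column types):

```go
package main

import "fmt"

// recordsEqual compares two record tables the way a DataFrame.Eq could:
// first reject on dimensions (and, via the header row, on column names
// in order), then fall back to an element-by-element comparison.
func recordsEqual(a, b [][]string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if len(a[i]) != len(b[i]) {
			return false
		}
		for j := range a[i] {
			if a[i][j] != b[i][j] {
				return false
			}
		}
	}
	return true
}

func main() {
	x := [][]string{{"A", "B"}, {"1", "2"}}
	y := [][]string{{"B", "A"}, {"2", "1"}} // same data, columns switched
	fmt.Println(recordsEqual(x, x), recordsEqual(x, y)) // true false
}
```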
Hello,
First of all I would like to show my appreciation for this library, it does a lot of redundant heavy-lifting.
For a machine learning project I'm using gota to load a CSV file and feed the data into an algorithm. The thing is, I need to cast a DataFrame to a [][]float64 slice of slices. I noticed there is a DataFrame.Records method to cast the DataFrame as a slice of slices of strings. Would it in any way be possible to do the same thing for float64? I think this would be really practical because it is a common use case for machine learning applications.
Regards.
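In the meantime, the conversion can be done on top of the existing Records output. A sketch, assuming a header row followed by purely numeric data rows (the `floatRecords` name is made up):

```go
package main

import (
	"fmt"
	"strconv"
)

// floatRecords converts DataFrame.Records()-style output (a header row
// followed by string data rows) into a [][]float64, as one might want
// for feeding a machine learning algorithm.
func floatRecords(records [][]string) ([][]float64, error) {
	if len(records) < 1 {
		return nil, nil
	}
	out := make([][]float64, 0, len(records)-1)
	for _, row := range records[1:] { // skip the header row
		vals := make([]float64, len(row))
		for i, cell := range row {
			f, err := strconv.ParseFloat(cell, 64)
			if err != nil {
				return nil, err // a non-numeric cell
			}
			vals[i] = f
		}
		out = append(out, vals)
	}
	return out, nil
}

func main() {
	recs := [][]string{{"A", "B"}, {"1.5", "2"}, {"3", "4.25"}}
	m, _ := floatRecords(recs)
	fmt.Println(m) // [[1.5 2] [3 4.25]]
}
```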
I'm doing this kind of thing:
fmt.Println("Reading csv...")
csv, err := os.Open(myfile) // myfile is 200M or so, takes a while to read
if err != nil {
	fmt.Print(err)
	os.Exit(1)
}
fmt.Println("Make it a df...")
df := dataframe.ReadCSV(csv)
fmt.Println("Sorting, filtering df...")
fil := df.Filter(
	dataframe.F{"colA", series.Eq, "VARIABLE"},
)
Would be very cool if my filtering could start happening as the initial lines are read.
The Split-Apply-Combine data analysis paradigm focuses on separating the data rows into groups, applying a function over each group's rows/cols and then combining the results into a single table.
The grouping could be done by first splitting the rows by a given factor Series:
func (d DataFrame) Split(factor Series) ([]DataFrame, error) { ... }
It could also be done by storing the grouping factor inside the DataFrame object and then delegating the responsibility of using this grouping to the functions that need it. This is more similar to what dplyr does and it could facilitate chained operations.
func (d DataFrame) GroupBy(factor Series) DataFrame { ... }
func (d DataFrame) Split() ([]DataFrame, error) { ... } // Uses the stored GroupBy groups
Maybe instead of passing GroupBy a Series, we could rely on the column name and use one of the columns instead. This will help a lot when subsetting.
We want to be able to apply functions to both rows and columns over a DataFrame. The dimension of the returned Series should be compatible with each other. Additionally, when applying functions over rows, since we can't expect the columns to be all of the same type, we will have to cast the types.
The API should be pretty straightforward:
func (d DataFrame) RApply(f func(Series) Series) DataFrame { ... }
func (d DataFrame) CApply(f func(Series) Series) DataFrame { ... }
With the implementation of Apply operations we will have a powerful aggregation mechanism that doesn't have to depend on data splitting but can work on its own.
The easiest of the bunch. The main decision is whether we want to try to preserve the original order or just concatenate the results of all DataFrame group operations.
Have you considered the addition of a delete/remove/drop type function? I've started to use your library and have come across a use case where it would be advantageous to be able to explicitly identify a column to be removed. Currently I would have to get all of the column names, find which index I want to remove, and then generate a series of indexes excluding that one to perform a Select operation.
I am happy to have a first pass at adding this type of function if there is a consensus it would be helpful.
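The manual steps described above can at least be wrapped up in a few lines today; a sketch of a hypothetical Drop built on top of name-based Select:

```go
package main

import "fmt"

// remaining returns all column names except the one to drop, ready to be
// handed to a name-based Select — the "find the index and exclude it"
// dance from the issue, wrapped up.
func remaining(names []string, drop string) []string {
	keep := make([]string, 0, len(names))
	for _, n := range names {
		if n != drop {
			keep = append(keep, n)
		}
	}
	return keep
}

func main() {
	names := []string{"A", "B", "C", "D"}
	fmt.Println(remaining(names, "C")) // [A B D]
	// df.Select(remaining(df.Names(), "C")) // hypothetical usage with gota
}
```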
Thanks! This repo is very good.
This project is simply great. I hope you can keep maintaining it!
A fundamental feature of dataframes is grouping by one or more columns and summarizing (mean, median, max, min, etc.) other columns. Are you thinking about implementing this functionality?
I was wondering how you feel about series of Integers supporting bitwise operations like shifts? I actually realized that I could benefit from this in my own work, where for example I have a whole bunch of numbers that are say in kilobytes, but I really want to operate on bytes instead of KBs.
Thanks!
Can you please describe or detail a method to get the data collected from SQL into gota?
Something like
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/lib/pq"
)

const (
	host     = "localhost"
	port     = 5432
	user     = "postgres"
	password = "Gurgaon@65"
	dbname   = "vikram"
)

func main() {
	psqlInfo := fmt.Sprintf("host=%s port=%d user=%s "+
		"password=%s dbname=%s sslmode=disable",
		host, port, user, password, dbname)
	db, err := sql.Open("postgres", psqlInfo)
	if err != nil {
		panic(err)
	}
	defer db.Close()
	if err = db.Ping(); err != nil {
		panic(err)
	}
	rows, err := db.Query("SELECT * from salesdata limit 10")
	if err != nil {
		panic(err)
	}
	defer rows.Close()
	fmt.Println(rows)
}
It looks like quite an interesting project and you may find some people willing to collaborate, but without a LICENSE specified it is pretty hard for people to decide to contribute.
The LICENSE can be any one; some people may not want to collaborate depending on which, but by at least specifying one everybody will be clear, rather than having to fall back to whatever the law says.
Thanks for considering
This is useful for me. In the roadmap, have you considered binary data (images etc.) and tensors?
I do a fair bit of ETL and FBP-style work in Go, and so might be able to contribute.
Having a look at the godoc, I couldn't find any API to add new data, i.e. a new row, to an existing DataFrame. Is it planned to be implemented, or is there a hack to deal with it?
Currently the DataFrame.Select method only accepts the column names to be selected. It could be interesting to be able to subset the columns using other array types or even Series.
When using automatic renaming of column names, the numeric suffix is not assigned sequentially if the column names that are repeated appear more than twice. This means that this works as expected:
b := LoadRecords(
	[][]string{
		[]string{"A", "B", "B", "C", "D"},
		[]string{"1", "4", "4", "5.1", "true"},
		[]string{"1", "4", "4", "6.0", "true"},
		[]string{"2", "3", "3", "6.0", "false"},
	},
)
fmt.Print(b.Names())
> [A B_0 B_1 C D]
But this won't:
b := LoadRecords(
	[][]string{
		[]string{"A", "B", "B", "B", "C", "D"},
		[]string{"1", "4", "4", "4", "5.1", "true"},
		[]string{"1", "4", "4", "4", "6.0", "true"},
		[]string{"2", "3", "3", "3", "6.0", "false"},
	},
)
fmt.Print(b.Names())
> [A B_1 B_3 B_5 C D] // Expected [A B_0 B_1 B_2 C D]
Does this support datetime data?
How do I handle datetime-type data?
Thanks!
In addition to the four main types (Strings, Int, Float, Bool), the next candidates for new types are:
The pros and cons of these additions have to be taken into account, since every new type increases the complexity of the library significantly.
Consider modifying DataFrame.String() by limiting the column length to a number of characters. Likewise, it could be interesting to wrap columns in separate lines if the combined length is too large.
Additionally, one might want to summarise this information if the number of rows is very high. We could use something like what dplyr or data.table are doing, showing only the first and last 10/20 rows instead of the whole table.
An alternative could be to leave DataFrame.String() as is and move these suggested modifications to a separate function. That way, if we want to print the entire table instead of a summary of it we will still be able to do so.
pandas has the ability to set your own data as a new index for a dataframe. gota seems not to have this?
Was using the project and noticed a weird situation on low-memory machines where data ended up missing from the bottom rows of CSV files on different runs.
I'm pretty sure this is the culprit - there's a nice tangential article on why ReadAll-style functions are considered bad.
Looking at the csv.ReadAll function, it will allocate up until max memory and then just drop records on the floor. Due to the interface provided by gota there's no way to pass a Reader-style interface which would allow us to work around it.
Any thoughts on fixing it?
Would it be possible to add a donation link? I would like to buy you a beer/coffee for all your hard work.
It would be good if a DataFrame could support:
- civil.Date (i.e. 1955-04-30)
- civil.DateTime (this is not the same as time.Time)
- civil.Time
- time.Time (already mentioned in numerous other issues, e.g. #22)
The function application example in the README:
mean := func(s series.Series) series.Series {
	floats := s.Float()
	sum := 0.0
	for _, f := range floats {
		sum += f
	}
	return series.Floats(sum / float64(len(floats)))
}
df.Cbind(mean)
df.Rbind(mean)
Cbind and Rbind seem to receive a DataFrame value rather than a function value. (?)
Right now I have to either specify all types or no types at all. Specifying the types in a map[string]string (column name -> type name) would add the possibility to specify types only for the columns you want and fall back to auto typing for the other columns.
It could possibly also shorten the code in ReadRecords by simply checking if the column name is in the map and, if not, falling back to findType.
What do you think?
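The lookup-with-fallback check is only a few lines; a sketch (the `typeFor` helper and the detector function stand in for gota's internal findType, and are hypothetical names):

```go
package main

import "fmt"

// typeFor picks the user-specified type for a column when present and
// falls back to auto-detection otherwise — the check that could shorten
// ReadRecords.
func typeFor(col string, types map[string]string, detect func(string) string) string {
	if t, ok := types[col]; ok {
		return t
	}
	return detect(col)
}

func main() {
	types := map[string]string{"Age": "int"} // only Age is pinned by the user
	detect := func(string) string { return "string" } // stand-in auto-detector
	fmt.Println(typeFor("Age", types, detect), typeFor("Name", types, detect)) // int string
}
```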
Are there any similar packages or forks of this package that help us work well with dataframes?
Please at least reply if you are planning to drop the package.
I'm currently having an issue when attempting to sort by multiple columns.
Given the following code (I'll explain the commented lines in a moment.):
package main

import (
	"fmt"

	"github.com/kniren/gota/dataframe"
)

func main() {
	df := dataframe.LoadRecords(
		[][]string{
			{"A", "B"},
			{"0.346", "662"},
			{"0.331", "725"},
			// {"0.33", "561"},
			// {"0.322", "593"},
			// {"0.322", "543"},
			// {"0.32", "707"},
			// {"0.32", "568"},
			// {"0.318", "671"},
			// {"0.318", "645"},
			// {"0.314", "540"},
			// {"0.312", "679"},
			{"0.31", "682"},
			{"0.309", "680"},
			{"0.308", "695"},
			{"0.307", "514"},
			{"0.306", "530"},
			// {"0.306", "507"},
			// {"0.305", "597"},
			{"0.304", "675"},
			{"0.304", "718"},
			// {"0.303", "576"},
			// {"0.303", "515"},
			// {"0.301", "605"},
			// {"0.3", "645"},
			// {"0.3", "566"},
			{"0.299", "564"},
			{"0.297", "665"},
			{"0.297", "689"},
			{"0.297", "507"},
			{"0.295", "665"},
			// {"0.295", "613"},
			{"0.294", "577"},
			{"0.293", "577"},
			{"0.293", "586"},
			{"0.293", "675"},
			{"0.29", "589"},
			{"0.288", "568"},
			{"0.288", "630"},
			{"0.288", "645"},
			{"0.288", "573"},
		},
	)
	fmt.Println(df.Arrange(dataframe.Sort("A"), dataframe.Sort("B")))
}
I get a correct output of:
[23x2] DataFrame
A B
0: 0.288000 568
1: 0.288000 573
2: 0.288000 630
3: 0.288000 645
4: 0.290000 589
5: 0.293000 577
6: 0.293000 675
7: 0.293000 586
8: 0.294000 577
9: 0.295000 665
... ...
<float> <int>
Now comes the reason for the commented out lines.
If I uncomment any of the commented lines, I get the following output.
[24x2] DataFrame
A B
0: 0.288000 645
1: 0.288000 568
2: 0.288000 573
3: 0.288000 630
4: 0.290000 589
5: 0.293000 577
6: 0.293000 675
7: 0.293000 586
8: 0.294000 577
9: 0.295000 665
... ...
<float> <int>
The order is no longer correct. Please note the "B" column.
Since I don't yet know what combination of values is causing the incorrect sorting, I've left them all commented out in the data. This is in the hopes of someone seeing something in the values that might trigger this incorrect behavior.
Any thoughts on what might be happening?
What's the easiest way to do a custom sort of a dataframe? For example, I have a column with string values like 10/04/2014 04:10:10 p.m. and I would like to sort them by the date each one represents (ascending).
If this is not easily possible, consider this a feature request. Thanks for a useful package.
One should be able to sort the DataFrame by one or several of its columns.
func (d DataFrame) Arrange(keys ...string) DataFrame { ... }
A possible implementation could start by enabling each Series to return an []int slice containing the rank of each of its elements. For example:
a := Strings("b", "c", "a")
var b []int = a.Order() // b == []int{2, 3, 1}
In case we have NA elements we should decide what to do with them. Maybe they all share the same order index and appear at the end?
a := Strings("b", nil, "c", nil, "a")
var b []int = a.Order() // b == []int{2, 4, 3, 4, 1}
In any case, once we have an []int slice for each key column we can calculate the new row order and use it to sort the DataFrame.
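A sketch of such a rank computation for a string Series, with nil (NA) elements all sharing the last rank as suggested above (the `ranks` name is hypothetical):

```go
package main

import (
	"fmt"
	"sort"
)

// ranks returns a 1-based rank for every element; nil elements all get
// the same rank, one past the largest non-nil rank, so they sort last.
func ranks(vals []*string) []int {
	var present []string
	for _, v := range vals {
		if v != nil {
			present = append(present, *v)
		}
	}
	sort.Strings(present)
	rank := make(map[string]int, len(present))
	for i, v := range present {
		if _, ok := rank[v]; !ok { // ties share the first occurrence's rank
			rank[v] = i + 1
		}
	}
	out := make([]int, len(vals))
	for i, v := range vals {
		if v == nil {
			out[i] = len(present) + 1
		} else {
			out[i] = rank[*v]
		}
	}
	return out
}

func main() {
	s := func(v string) *string { return &v }
	// The NA example from above: Strings("b", nil, "c", nil, "a")
	fmt.Println(ranks([]*string{s("b"), nil, s("c"), nil, s("a")})) // [2 4 3 4 1]
}
```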
Memory consumption is an issue when dealing with large data sets.
Similar in-memory columnar stores, like Python's pandas and Microsoft's proprietary VertiPaq engine for its SSAS products, have the ability to minimize memory usage by using techniques such as:
Value encoding - for numbers, VertiPaq will calculate a number it can subtract from every row to lower the number of bits needed.
Categorical encoding (dictionary encoding) - for strings, pandas and VertiPaq will create a lookup table and use integers to represent the data, therefore reducing the number of bits.
More info: https://www.microsoftpressstore.com/articles/article.aspx?p=2449192&seqNum=3
This is a feature request for similar functionality.
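A minimal sketch of the dictionary-encoding idea for a string column — each distinct value is stored once and the rows become small integer codes:

```go
package main

import "fmt"

// dictEncode replaces each string with an index into a deduplicated
// dictionary, the way pandas categoricals or VertiPaq dictionary
// encoding shrink repetitive string columns.
func dictEncode(col []string) (codes []int, dict []string) {
	index := make(map[string]int)
	for _, v := range col {
		code, ok := index[v]
		if !ok {
			code = len(dict)
			index[v] = code
			dict = append(dict, v)
		}
		codes = append(codes, code)
	}
	return codes, dict
}

func main() {
	codes, dict := dictEncode([]string{"red", "blue", "red", "red"})
	fmt.Println(codes, dict) // [0 1 0 0] [red blue]
}
```

The memory win comes from storing each distinct string once: a million-row column with a handful of distinct values collapses to a million small integers plus a tiny dictionary.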
We might want to change the type of a column or columns of a DataFrame. To do so, we could enable two methods, one for parsing a given column to the desired type and another one to change all of them at the same time.
func (d DataFrame) ChangeType(colname string, newtype SeriesType) DataFrame { ... }
func (d DataFrame) ChangeTypes(newtype []SeriesType) DataFrame { ... }
When comparing Series or filtering with df.F, the comparator should be moved to a string/int enum for maintainability, clarity and better type safety.
type Comparators string
const (
	Eq Comparators = "eq"
	In Comparators = "in"
	...
)
Sometimes one might want to use a DataFrame where the duplicated elements are removed or want to identify the indexes where they appear.
I stumbled upon your repo while searching around to see if anybody is using the Golang hooks for the Apache Arrow library.
I read an article recently from Wes McKinney about his involvement with Arrow and how he's stoked to provide a more flexible pandas API that would support parallelism out of the box with shared memory, rather than the default pickled approach.
I'd love to see this implemented in Go. I can imagine that, using the Plasma store API, it would be pretty easy.
Hi - I am trying to do a UNION ALL statement on 2 dataframes and am wondering if this is possible. I have loaded up 2 dataframes successfully but want a way to merge them together.
Thanks
--
Update - nevermind. just saw outerjoin :-)
Hello Kniren!
How do I add a record to an existing dataframe?
df := dataframe.LoadRecords(
	[][]string{
		[]string{"A", "B", "C", "D"},
		[]string{"a", "4", "5.1", "true"},
		[]string{"k", "5", "7.0", "true"},
		[]string{"k", "4", "6.0", "true"},
		[]string{"a", "2", "7.1", "false"},
	},
)
I want to add one record at the end of the df.
When there is a big integer in the JSON file, like 20180428, the encoder will convert it to a string as a float number, like '2.0180428e+07', and that string cannot be correctly converted back to an integer.
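On the stdlib side, this is what json.Decoder.UseNumber is for: numeric literals stay as json.Number strings instead of being converted to float64, so large integers survive intact. A sketch of the decoding step (how gota would plumb this through is an open question):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// decodeIntSafe decodes JSON while preserving big integers: UseNumber
// stops the decoder from converting every number to float64 first.
func decodeIntSafe(src string) (map[string]json.Number, error) {
	dec := json.NewDecoder(strings.NewReader(src))
	dec.UseNumber()
	var m map[string]json.Number
	err := dec.Decode(&m)
	return m, err
}

func main() {
	m, _ := decodeIntSafe(`{"date": 20180428}`)
	n, _ := m["date"].Int64()
	fmt.Println(n) // 20180428, not 2.0180428e+07
}
```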
So far there has not been a study of the performance of this library in terms of speed and memory consumption. I'm prioritising those features that impact users directly, since the API design is still in flux, but this should be addressed in the near future.
As it currently stands, when a Series is created, its elements should not be able to change other than through subsetting operations. However, we might want to modify elements of a Series once it has been initialised (not necessarily by modifying the Series in situ, but by returning a new Series with the updated values).
An appropriate API should be designed for this purpose, and Series modifications should then be integrated directly into DataFrame operations.
I have a data frame like this
df := dataframe.LoadRecords(
	[][]string{
		[]string{"A", "B", "C", "D"},
		[]string{"a", "4", "5.1", "true"},
		[]string{"k", "5", "7.0", "true"},
		[]string{"k", "4", "6.0", "true"},
		[]string{"a", "2", "7.1", "false"},
	},
)
I want to filter column B for the value 5, and I want the return value to be the int 2.
2 is the row index in df.
Sometimes it is really interesting to quickly get a summary of the data contained in a DataFrame, with the dimensions, counts or quartile information depending on the type of the column. In R this can be done with something like summary(df). Perhaps we should try to mimic this functionality and expand upon it for quick data summarization.
Hi alex, thank you for your great work.
I noticed that the column names get lost after calling Rapply during my tests, as do the detected types.
Test code:
package main

import (
	"log"

	"github.com/kniren/gota/dataframe"
	"github.com/kniren/gota/series"
)

func main() {
	df := dataframe.LoadRecords(
		[][]string{
			[]string{"A", "B", "C", "D"},
			[]string{"a", "4", "5.1", "true"},
			[]string{"k", "5", "7.0", "true"},
			[]string{"k", "4", "6.0", "true"},
			[]string{"a", "2", "7.1", "false"},
		},
	)
	applied := df.Rapply(func(s series.Series) series.Series {
		return s
	})
	log.Println(df)
	log.Println(applied)
}
output:
2017/11/01 17:38:32 [4x4] DataFrame
A B C D
0: a 4 5.100000 true
1: k 5 7.000000 true
2: k 4 6.000000 true
3: a 2 7.100000 false
<string> <int> <float> <bool>
2017/11/01 17:38:32 [4x4] DataFrame
X0 X1 X2 X3
0: a 4 5.100000 true
1: k 5 7.000000 true
2: k 4 6.000000 true
3: a 2 7.100000 false
<string> <string> <string> <string>
Error handling could be improved by using errors.Wrap and errors.Unwrap for more descriptive error messages. Also, error handling for Series should be managed in the same way as is done on DataFrames, by reading from the Series.Err() method to retrieve the error message.
Ideally if we have a pipe of DataFrame operations we want to be able to track at which point it failed. Maybe to do so we have to store some piping information inside the DataFrame structure to know what the pipe operation looks like.
As it stands on v0.6.0 there are too many memory allocations for intermediate entities. This should be reviewed and corrected.
So while this is intentional, I wanted to make sure I understand the reasoning before deciding how to work with this. Whereas Float(), for example, does not have two return parameters, Int() and Bool() do. What was the thinking there? Is the idea that some of these conversions are more fallible than others?
// Accessor/conversion methods
Copy() Element // FIXME: Returning interface is a recipe for pain
Val() ElementValue // FIXME: Returning interface is a recipe for pain
String() string
Int() (int, error)
Float() float64
Bool() (bool, error)
Thanks a lot, sorry if I am being dense...
Cbind and Rbind need to be updated to Capply and Rapply respectively.
The documentation clearly states it is Capply, but the readme might throw off newcomers.
Thanks
I have a dataframe
df := dataframe.LoadRecords(
	[][]string{
		[]string{"Name", "Total"},
		[]string{"ABC", "4"},
		[]string{"XYX", "5"},
		[]string{"MNK", "4"},
		[]string{"OPP", "2"},
	},
)
I filter the Name column for the OPP value and I want to change the Total column from 2 to 3.
Hoping for some help.
Thank you
The Go stdlib documentation states that sort.Sort does not guarantee stability of the sorted results (sort.Stable does). Isn't stability a requirement when sorting the dataframe according to the content of multiple columns, to guarantee correctness the way dataframe.Arrange is currently implemented?
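The classic pattern is to sort by the least-significant key first, then stable-sort by each more significant key: rows equal in the outer key keep the inner-key order established earlier. A sketch with sort.SliceStable, using rows shaped like the earlier multi-column example:

```go
package main

import (
	"fmt"
	"sort"
)

// arrange sorts rows by column 0 then column 1 via successive sorts.
// The second pass MUST be stable so that rows with equal column-0 values
// keep the column-1 order established by the first pass.
func arrange(rows [][]string) {
	// Least-significant key first: plain sort by column 1.
	sort.Slice(rows, func(i, j int) bool { return rows[i][1] < rows[j][1] })
	// Most-significant key last, with a stable sort.
	sort.SliceStable(rows, func(i, j int) bool { return rows[i][0] < rows[j][0] })
}

func main() {
	rows := [][]string{
		{"0.288", "645"},
		{"0.288", "568"},
		{"0.290", "589"},
		{"0.288", "573"},
	}
	arrange(rows)
	fmt.Println(rows) // the 0.288 rows come first, ordered 568, 573, 645
}
```

If the second pass used sort.Slice instead, the relative order of the equal 0.288 rows would be unspecified, which is exactly the misordering reported in the multi-column sorting issue above.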
Hi,
I am reading from csv files where all the values are double quoted, even if they do not contain a comma (or whatever the delimiter is).
I read these into a DataFrame and then I do some transformation on it and write it back out to a CSV file. The resulting CSV file's values are not double-quoted unless they contain a comma.
I'd like to be able to pass a WriteOption to WriteCSV() that would force the quoting of all values written, even if they do not contain a delimiter. Just to have consistency between my input and output files.
If this request sounds weird, I will explain my use case. My input files are medical study data that contains personally identifiable information such as name and birth date. My code basically takes this information and changes it to random strings of characters that resemble the original but is no longer identifiable as a specific person. I take the resulting csv files and use them as test fixture data to test another code base. This fixture data can be checked into a public GitHub repository because it no longer contains identifying information. I would like the files to be identical in all respects to the original files (except for the identifying information) so that I can have confidence that my passing tests mean the code will also work with real data. That's why I want the csv files to have all fields double-quoted even if it does not seem necessary or is not called for by the CSV spec.
Does that make sense?
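For context on why an option is needed: Go's encoding/csv writer only quotes fields when it has to, so an always-quote mode has to be written by hand. A stdlib sketch of such a writer (the `writeQuoted` helper is hypothetical):

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// writeQuoted writes records in CSV form with every field double-quoted,
// doubling embedded quotes per RFC 4180, regardless of whether quoting
// is strictly necessary.
func writeQuoted(records [][]string) string {
	var b bytes.Buffer
	for _, rec := range records {
		for i, f := range rec {
			if i > 0 {
				b.WriteByte(',')
			}
			b.WriteString(`"` + strings.ReplaceAll(f, `"`, `""`) + `"`)
		}
		b.WriteByte('\n')
	}
	return b.String()
}

func main() {
	out := writeQuoted([][]string{{"name", "dob"}, {"Jane", "1955-04-30"}})
	fmt.Print(out) // every value quoted, matching the quoted input files
}
```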
Thanks for a nice package.