go-gota / gota
Gota: DataFrames and data wrangling in Go (Golang)
License: Other
I've started looking at reimplementing some of the functionality of https://github.com/tobgu/qcache in Go (currently python and pandas) and found gota. Looks good, keep it up!
I think it would be better to take a Reader as input to ReadCSV than a string. As it is right now I have to read all bytes from a reader into a byte buffer that then has to be converted to a string. That string is then immediately converted into a Reader in ReadCSV.
Do you agree?
How does one compare two DataFrames? Is the functionality of directly comparing two DataFrames for equality desirable?
A first check on dimensions, column types and column names could be useful to quickly reject equality. Column order is important, so two DataFrames that have the same data but whose column order is switched should be marked as not equal.
When comparing columns of different types we can compare element by element, row by row, or we could consider hashing the rows and/or columns and checking the hashes. If the hashes are stored, this approach would allow for faster comparisons when we compare a DataFrame multiple times.
func (d DataFrame) Eq(b DataFrame) bool { ... }
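As a minimal sketch of the reject-then-compare idea, here is the element-wise fallback operating on raw records rather than real DataFrames (the `recordsEqual` helper is hypothetical; a real `Eq` would also compare column types):

```go
package main

import "fmt"

// recordsEqual compares two record tables the way a DataFrame.Eq could:
// first reject on dimensions (and, via the header row, on column names
// in order), then fall back to an element-by-element comparison.
func recordsEqual(a, b [][]string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if len(a[i]) != len(b[i]) {
			return false
		}
		for j := range a[i] {
			if a[i][j] != b[i][j] {
				return false
			}
		}
	}
	return true
}

func main() {
	x := [][]string{{"A", "B"}, {"1", "2"}}
	y := [][]string{{"B", "A"}, {"2", "1"}} // same data, columns switched
	fmt.Println(recordsEqual(x, x), recordsEqual(x, y)) // true false
}
```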
Hello,
First of all I would like to show my appreciation for this library, it does a lot of redundant heavy-lifting.
For a machine learning project I'm using gota to load a CSV file and feed the data into an algorithm. The thing is, I need to cast a DataFrame to a [][]float64 slice of slices. I noticed there is a DataFrame.Records method to cast the DataFrame as a slice of slices of strings. Would it in any way be possible to do the same thing for float64? I think this would be really practical because it is a common use case for machine learning applications.
Regards.
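In the meantime, the conversion can be done on top of the existing Records output. A sketch, assuming a header row followed by purely numeric data rows (the `floatRecords` name is made up):

```go
package main

import (
	"fmt"
	"strconv"
)

// floatRecords converts DataFrame.Records()-style output (a header row
// followed by string data rows) into a [][]float64, as one might want
// for feeding a machine learning algorithm.
func floatRecords(records [][]string) ([][]float64, error) {
	if len(records) < 1 {
		return nil, nil
	}
	out := make([][]float64, 0, len(records)-1)
	for _, row := range records[1:] { // skip the header row
		vals := make([]float64, len(row))
		for i, cell := range row {
			f, err := strconv.ParseFloat(cell, 64)
			if err != nil {
				return nil, err // a non-numeric cell
			}
			vals[i] = f
		}
		out = append(out, vals)
	}
	return out, nil
}

func main() {
	recs := [][]string{{"A", "B"}, {"1.5", "2"}, {"3", "4.25"}}
	m, _ := floatRecords(recs)
	fmt.Println(m) // [[1.5 2] [3 4.25]]
}
```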
I'm doing this kind of thing:
fmt.Println("Reading csv...")
csv, err := os.Open(myfile) // myfile is 200M or so, takes a while to read
if err != nil {
	fmt.Print(err)
	os.Exit(1)
}
fmt.Println("Make it a df...")
df := dataframe.ReadCSV(csv)
fmt.Println("Sorting, filtering df...")
fil := df.Filter(
	dataframe.F{"colA", series.Eq, "VARIABLE"},
)
Would be very cool if my filtering could start happening as the initial lines are read.
The Split-Apply-Combine data analysis paradigm focuses on separating the data rows into groups, applying a function over each group's rows/cols and then combining the results into a single table.
The grouping could be done by first splitting the rows by a given factor Series:
func (d DataFrame) Split(factor Series) ([]DataFrame, error) { ... }
It could also be done by storing the grouping factor inside the DataFrame object and then delegating the responsibility of using this grouping to the functions that need it. This is more similar to what dplyr does and it could facilitate chained operations.
func (d DataFrame) GroupBy(factor Series) DataFrame { ... }
func (d DataFrame) Split() ([]DataFrame, error) { ... } // Uses the stored GroupBy groups
Maybe instead of passing GroupBy a Series, we could rely on the column name and use one of the columns instead. This will help a lot when subsetting.
We want to be able to apply functions to both rows and columns over a DataFrame. The dimension of the returned Series should be compatible with each other. Additionally, when applying functions over rows, since we can't expect the columns to be all of the same type, we will have to cast the types.
The API should be pretty straightforward:
func (d DataFrame) RApply(f func(Series) Series) DataFrame { ... }
func (d DataFrame) CApply(f func(Series) Series) DataFrame { ... }
With the implementation of Apply operations we will have a powerful aggregation mechanism that doesn't have to depend on data splitting but can work on its own.
The easiest of the bunch. The main decision is whether we want to try to preserve the original order or just concatenate the results of all DataFrame group operations.
Have you considered the addition of a delete/remove/drop type function? I've started to use your library and have come across a use case where it would be advantageous to be able to explicitly identify a column to be removed. Currently I would have to get all of the column names, find which index I want to remove, and then generate a series of indexes excluding that one to perform a Select operation.
I am happy to have a first pass at adding this type of function if there is a consensus it would be helpful.
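The manual steps described above can at least be wrapped up in a few lines today; a sketch of a hypothetical Drop built on top of name-based Select:

```go
package main

import "fmt"

// remaining returns all column names except the one to drop, ready to be
// handed to a name-based Select — the "find the index and exclude it"
// dance from the issue, wrapped up.
func remaining(names []string, drop string) []string {
	keep := make([]string, 0, len(names))
	for _, n := range names {
		if n != drop {
			keep = append(keep, n)
		}
	}
	return keep
}

func main() {
	names := []string{"A", "B", "C", "D"}
	fmt.Println(remaining(names, "C")) // [A B D]
	// df.Select(remaining(df.Names(), "C")) // hypothetical usage with gota
}
```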
Thanks! This repo is very good.
This project is simply great. I hope you can keep maintaining it!
A fundamental feature of dataframes is grouping by one or more columns and summarizing (mean, median, max, min, etc.) other columns. Are you thinking about implementing this functionality?
I was wondering how you feel about series of Integers supporting bitwise operations like shifts? I actually realized that I could benefit from this in my own work, where for example I have a whole bunch of numbers that are say in kilobytes, but I really want to operate on bytes instead of KBs.
Thanks!
Can you please describe or detail a method to get the data collected from SQL into gota?
Something like
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/lib/pq"
)

const (
	host     = "localhost"
	port     = 5432
	user     = "postgres"
	password = "Gurgaon@65"
	dbname   = "vikram"
)

func main() {
	psqlInfo := fmt.Sprintf("host=%s port=%d user=%s "+
		"password=%s dbname=%s sslmode=disable",
		host, port, user, password, dbname)
	db, err := sql.Open("postgres", psqlInfo)
	if err != nil {
		panic(err)
	}
	defer db.Close()
	if err = db.Ping(); err != nil {
		panic(err)
	}
	rows, err := db.Query("SELECT * from salesdata limit 10")
	if err != nil {
		panic(err)
	}
	defer rows.Close()
	fmt.Println(rows)
}
It looks like quite an interesting project and you may find some people willing to collaborate, but without a LICENSE specified it is pretty hard for people to decide to contribute.
The LICENSE can be any one; some people may not want to collaborate depending on which, but by at least specifying one everybody will be clear, rather than having to fall back to whatever the law says.
Thanks for considering
This is useful for me. In the roadmap, have you considered binary data (images etc.) and tensors?
I do a fair bit of ETL and FBP-style work in Go, and so might be able to contribute.
Having a look at the godoc, I couldn't find any API to add new data, i.e. a new row, to an existing DataFrame. Is it planned to be implemented, or is there a hack to deal with it?
Currently the DataFrame.Select method only accepts the column names to be selected. It could be interesting to be able to subset the columns using other array types or even Series.
When using automatic renaming of column names, the numeric suffix is not assigned sequentially if the column names that are repeated appear more than twice. This means that this works as expected:
b := LoadRecords(
	[][]string{
		[]string{"A", "B", "B", "C", "D"},
		[]string{"1", "4", "4", "5.1", "true"},
		[]string{"1", "4", "4", "6.0", "true"},
		[]string{"2", "3", "3", "6.0", "false"},
	},
)
fmt.Print(b.Names())
> [A B_0 B_1 C D]
But this won't:
b := LoadRecords(
	[][]string{
		[]string{"A", "B", "B", "B", "C", "D"},
		[]string{"1", "4", "4", "4", "5.1", "true"},
		[]string{"1", "4", "4", "4", "6.0", "true"},
		[]string{"2", "3", "3", "3", "6.0", "false"},
	},
)
fmt.Print(b.Names())
> [A B_1 B_3 B_5 C D] // Expected [A B_0 B_1 B_2 C D]
Does this support datetime data?
How do I handle datetime-type data?
Thanks!
In addition to the four main types (Strings, Int, Float, Bool), the next candidates for new types are:
The pros and cons of these additions have to be taken into account, since every new type increases the complexity of the library significantly.
Consider modifying DataFrame.String() by limiting the column length to a number of characters. Likewise, it could be interesting to wrap columns in separate lines if the combined length is too large.
Additionally, one might want to summarise this information if the number of rows is very high. We could use something like what dplyr or data.table are doing, showing only the first and last 10/20 rows instead of the whole table.
An alternative could be to leave DataFrame.String() as is and move these suggested modifications to a separate function. That way, if we want to print the entire table instead of a summary of it we will still be able to do so.
pandas has the ability to set your own data as a new index for a dataframe. gota seems not to have this?
Was using the project and noticed a weird situation on low-memory machines where data ended up missing from the bottom rows of CSV files on different runs.
I'm pretty sure this is the culprit - there's a nice tangential article on why ReadAll-style functions are considered bad.
Looking at the csv.ReadAll function, it will allocate up until max memory and then just drop records on the floor. Due to the interface provided by gota there's no way to pass a Reader-style interface which would allow us to work around it.
Any thoughts on fixing it?
Would it be possible to add a donation link? I would like to buy you a beer/coffee for all your hard work.
It would be good if a DataFrame could support:
- civil.Date (i.e. 1955-04-30)
- civil.DateTime (this is not the same as time.Time)
- civil.Time
- time.Time (already mentioned in numerous other issues, e.g. #22)
The function application example in the README:
mean := func(s series.Series) series.Series {
	floats := s.Float()
	sum := 0.0
	for _, f := range floats {
		sum += f
	}
	return series.Floats(sum / float64(len(floats)))
}
df.Cbind(mean)
df.Rbind(mean)
Cbind and Rbind seem to receive a DataFrame value rather than a function value. (?)
Right now I have to either specify all types or no types at all. Specifying the types in a map[string]string (column name -> type name) would add the possibility to specify types only for the columns you want and fall back to auto typing for the other columns.
It could possibly also shorten the code in ReadRecords by simply checking if the column name is in the map and, if not, falling back to findType.
What do you think?
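The lookup-with-fallback check is only a few lines; a sketch (the `typeFor` helper and the detector function stand in for gota's internal findType, and are hypothetical names):

```go
package main

import "fmt"

// typeFor picks the user-specified type for a column when present and
// falls back to auto-detection otherwise — the check that could shorten
// ReadRecords.
func typeFor(col string, types map[string]string, detect func(string) string) string {
	if t, ok := types[col]; ok {
		return t
	}
	return detect(col)
}

func main() {
	types := map[string]string{"Age": "int"} // only Age is pinned by the user
	detect := func(string) string { return "string" } // stand-in auto-detector
	fmt.Println(typeFor("Age", types, detect), typeFor("Name", types, detect)) // int string
}
```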
Are there any similar packages or forks of this package that help us work well with dataframes?
Please at least reply if you are planning to drop the package.
I'm currently having an issue when attempting to sort by multiple columns.
Given the following code (I'll explain the commented lines in a moment.):
package main

import (
	"fmt"

	"github.com/kniren/gota/dataframe"
)

func main() {
	df := dataframe.LoadRecords(
		[][]string{
			{"A", "B"},
			{"0.346", "662"},
			{"0.331", "725"},
			// {"0.33", "561"},
			// {"0.322", "593"},
			// {"0.322", "543"},
			// {"0.32", "707"},
			// {"0.32", "568"},
			// {"0.318", "671"},
			// {"0.318", "645"},
			// {"0.314", "540"},
			// {"0.312", "679"},
			{"0.31", "682"},
			{"0.309", "680"},
			{"0.308", "695"},
			{"0.307", "514"},
			{"0.306", "530"},
			// {"0.306", "507"},
			// {"0.305", "597"},
			{"0.304", "675"},
			{"0.304", "718"},
			// {"0.303", "576"},
			// {"0.303", "515"},
			// {"0.301", "605"},
			// {"0.3", "645"},
			// {"0.3", "566"},
			{"0.299", "564"},
			{"0.297", "665"},
			{"0.297", "689"},
			{"0.297", "507"},
			{"0.295", "665"},
			// {"0.295", "613"},
			{"0.294", "577"},
			{"0.293", "577"},
			{"0.293", "586"},
			{"0.293", "675"},
			{"0.29", "589"},
			{"0.288", "568"},
			{"0.288", "630"},
			{"0.288", "645"},
			{"0.288", "573"},
		},
	)
	fmt.Println(df.Arrange(dataframe.Sort("A"), dataframe.Sort("B")))
}
I get a correct output of:
[23x2] DataFrame
A B
0: 0.288000 568
1: 0.288000 573
2: 0.288000 630
3: 0.288000 645
4: 0.290000 589
5: 0.293000 577
6: 0.293000 675
7: 0.293000 586
8: 0.294000 577
9: 0.295000 665
... ...
<float> <int>
Now comes the reason for the commented out lines.
If I uncomment any of the commented lines, I get the following output.
[24x2] DataFrame
A B
0: 0.288000 645
1: 0.288000 568
2: 0.288000 573
3: 0.288000 630
4: 0.290000 589
5: 0.293000 577
6: 0.293000 675
7: 0.293000 586
8: 0.294000 577
9: 0.295000 665
... ...
<float> <int>
The order is no longer correct. Please note the "B" column.
Since I don't yet know what combination of values is causing the incorrect sorting, I've left them all commented out in the data. This is in the hopes of someone seeing something in the values that might trigger this incorrect behavior.
Any thoughts on what might be happening?
What's the easiest way to do a custom sort of a dataframe? For example, I have a column with string values like 10/04/2014 04:10:10 p.m. and I would like to sort them by the date each one represents (ascending).
If this is not easily possible, consider this a feature request. Thanks for a useful package.
One should be able to sort the DataFrame by one or several of its columns.
func (d DataFrame) Arrange(keys ...string) DataFrame { ... }
A possible implementation could start by enabling each Series to return an []int slice containing the rank of each of its elements. For example:
a := Strings("b", "c", "a")
var b []int = a.Order() // b == []int{2, 3, 1}
In case we have NA elements we should decide what to do with them. Maybe they all share the same order index and appear at the end?
a := Strings("b", nil, "c", nil, "a")
var b []int = a.Order() // b == []int{2, 4, 3, 4, 1}
In any case, once we have an []int slice for each key column we can calculate the new row order and use it to sort the DataFrame.
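A sketch of such a rank computation for a string Series, with nil (NA) elements all sharing the last rank as suggested above (the `ranks` name is hypothetical):

```go
package main

import (
	"fmt"
	"sort"
)

// ranks returns a 1-based rank for every element; nil elements all get
// the same rank, one past the largest non-nil rank, so they sort last.
func ranks(vals []*string) []int {
	var present []string
	for _, v := range vals {
		if v != nil {
			present = append(present, *v)
		}
	}
	sort.Strings(present)
	rank := make(map[string]int, len(present))
	for i, v := range present {
		if _, ok := rank[v]; !ok { // ties share the first occurrence's rank
			rank[v] = i + 1
		}
	}
	out := make([]int, len(vals))
	for i, v := range vals {
		if v == nil {
			out[i] = len(present) + 1
		} else {
			out[i] = rank[*v]
		}
	}
	return out
}

func main() {
	s := func(v string) *string { return &v }
	// The NA example from above: Strings("b", nil, "c", nil, "a")
	fmt.Println(ranks([]*string{s("b"), nil, s("c"), nil, s("a")})) // [2 4 3 4 1]
}
```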
Memory consumption is an issue when dealing with large data sets.
Similar in-memory columnar stores, like Python's pandas and Microsoft's proprietary VertiPaq engine for its SSAS products, have the ability to minimize memory usage by using techniques such as:
Value encoding - for numbers, VertiPaq will calculate a number it can subtract from every row to lower the number of bits needed.
Categorical encoding (dictionary encoding) - for strings, pandas and VertiPaq will create a lookup table and use integers to represent the data, therefore reducing the number of bits.
More info: https://www.microsoftpressstore.com/articles/article.aspx?p=2449192&seqNum=3
This is a feature request for similar functionality.
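A minimal sketch of the dictionary-encoding idea for a string column — each distinct value is stored once and the rows become small integer codes:

```go
package main

import "fmt"

// dictEncode replaces each string with an index into a deduplicated
// dictionary, the way pandas categoricals or VertiPaq dictionary
// encoding shrink repetitive string columns.
func dictEncode(col []string) (codes []int, dict []string) {
	index := make(map[string]int)
	for _, v := range col {
		code, ok := index[v]
		if !ok {
			code = len(dict)
			index[v] = code
			dict = append(dict, v)
		}
		codes = append(codes, code)
	}
	return codes, dict
}

func main() {
	codes, dict := dictEncode([]string{"red", "blue", "red", "red"})
	fmt.Println(codes, dict) // [0 1 0 0] [red blue]
}
```

The memory win comes from storing each distinct string once: a million-row column with a handful of distinct values collapses to a million small integers plus a tiny dictionary.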
We might want to change the type of a column or columns of a DataFrame. To do so, we could enable two methods, one for parsing a given column to the desired type and another one to change all of them at the same time.
func (d DataFrame) ChangeType(colname string, newtype SeriesType) DataFrame { ... }
func (d DataFrame) ChangeTypes(newtype []SeriesType) DataFrame { ... }
When comparing Series or filtering with df.F, the comparator should be moved to a string/int enum for maintainability, clarity and better type safety.
type Comparators string
const (
	Eq Comparators = "eq"
	In Comparators = "in"
	...
)
Sometimes one might want to use a DataFrame where the duplicated elements are removed or want to identify the indexes where they appear.
I stumbled upon your repo while searching around to see if anybody is using the Golang hooks for the Apache Arrow library.
I read an article recently from Wes McKinney about his involvement with Arrow and how he's stoked to provide a more flexible pandas API that would support parallelism out of the box with shared memory, rather than the default pickled approach.
I'd love to see this implemented in Go. I can imagine that, using the Plasma store API, it would be pretty easy.
Hi - I am trying to do a UNION ALL statement on 2 dataframes and am wondering if this is possible. I have loaded up 2 dataframes successfully but want a way to merge them together.
Thanks
--
Update - nevermind. just saw outerjoin :-)
Hello Kniren!
How do I add a record to an existing dataframe?
df := dataframe.LoadRecords(
	[][]string{
		[]string{"A", "B", "C", "D"},
		[]string{"a", "4", "5.1", "true"},
		[]string{"k", "5", "7.0", "true"},
		[]string{"k", "4", "6.0", "true"},
		[]string{"a", "2", "7.1", "false"},
	},
)
I want to add one record at the end of the df.
When there is a big integer in the JSON file, like 20180428, the encoder will convert it to a string as a float number, like '2.0180428e+07', and that string cannot be correctly converted back to an integer.
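On the stdlib side, this is what json.Decoder.UseNumber is for: numeric literals stay as json.Number strings instead of being converted to float64, so large integers survive intact. A sketch of the decoding step (how gota would plumb this through is an open question):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// decodeIntSafe decodes JSON while preserving big integers: UseNumber
// stops the decoder from converting every number to float64 first.
func decodeIntSafe(src string) (map[string]json.Number, error) {
	dec := json.NewDecoder(strings.NewReader(src))
	dec.UseNumber()
	var m map[string]json.Number
	err := dec.Decode(&m)
	return m, err
}

func main() {
	m, _ := decodeIntSafe(`{"date": 20180428}`)
	n, _ := m["date"].Int64()
	fmt.Println(n) // 20180428, not 2.0180428e+07
}
```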
So far there has not been a study of the performance of this library in terms of speed and memory consumption. I'm prioritising those features that impact users directly, since the API design is still in flux, but this should be addressed in the near future.
As it currently stands, when a Series is created, its elements should not be able to change other than through subsetting operations. However, we might want to modify elements of a Series once it has been initialised (not necessarily by modifying the Series in situ, but by returning a new Series with the updated values).
An appropriate API should be designed for this purpose, and Series modifications should then be integrated directly into DataFrame operations.
I have a data frame like this
df := dataframe.LoadRecords(
	[][]string{
		[]string{"A", "B", "C", "D"},
		[]string{"a", "4", "5.1", "true"},
		[]string{"k", "5", "7.0", "true"},
		[]string{"k", "4", "6.0", "true"},
		[]string{"a", "2", "7.1", "false"},
	},
)
I want to filter column B for the value 5, and I want the return value to be the int 2.
2 is the row index in df.
Sometimes it is really interesting to quickly get a summary of the data contained in a DataFrame, with the dimensions, counts or quartile information depending on the type of the column. In R this can be done with something like summary(df). Perhaps we should try to mimic this functionality and expand upon it for quick data summarization.
Hi alex, thank you for your great work.
I noticed that the column names get lost after calling Rapply during my tests, as do the detected types.
Test code:
package main

import (
	"log"

	"github.com/kniren/gota/dataframe"
	"github.com/kniren/gota/series"
)

func main() {
	df := dataframe.LoadRecords(
		[][]string{
			[]string{"A", "B", "C", "D"},
			[]string{"a", "4", "5.1", "true"},
			[]string{"k", "5", "7.0", "true"},
			[]string{"k", "4", "6.0", "true"},
			[]string{"a", "2", "7.1", "false"},
		},
	)
	applied := df.Rapply(func(s series.Series) series.Series {
		return s
	})
	log.Println(df)
	log.Println(applied)
}
output:
2017/11/01 17:38:32 [4x4] DataFrame
A B C D
0: a 4 5.100000 true
1: k 5 7.000000 true
2: k 4 6.000000 true
3: a 2 7.100000 false
<string> <int> <float> <bool>
2017/11/01 17:38:32 [4x4] DataFrame
X0 X1 X2 X3
0: a 4 5.100000 true
1: k 5 7.000000 true
2: k 4 6.000000 true
3: a 2 7.100000 false
<string> <string> <string> <string>
Error handling could be improved by using errors.Wrap and errors.Unwrap for more descriptive error messages. Also, error handling for Series should be managed in the same way as is done on DataFrames, by reading from the Series.Err() method to retrieve the error message.
Ideally if we have a pipe of DataFrame operations we want to be able to track at which point it failed. Maybe to do so we have to store some piping information inside the DataFrame structure to know what the pipe operation looks like.
As it stands on v0.6.0 there are too many memory allocations for intermediate entities. This should be reviewed and corrected.
So while this is intentional, I wanted to make sure I understand the reasoning before deciding how to work with this. Whereas Float(), for example, does not have two return parameters, Int() and Bool() do. What was the thinking there? Is the idea that some of these conversions are more fallible than others?
// Accessor/conversion methods
Copy() Element // FIXME: Returning interface is a recipe for pain
Val() ElementValue // FIXME: Returning interface is a recipe for pain
String() string
Int() (int, error)
Float() float64
Bool() (bool, error)
Thanks a lot, sorry if I am being dense...
Cbind and Rbind need to be updated to Capply and Rapply respectively.
The documentation clearly states it is Capply, but the readme might throw off newcomers.
Thanks
I have a dataframe
df := dataframe.LoadRecords(
	[][]string{
		[]string{"Name", "Total"},
		[]string{"ABC", "4"},
		[]string{"XYX", "5"},
		[]string{"MNK", "4"},
		[]string{"OPP", "2"},
	},
)
I filter the Name column for the OPP value and I want to change the Total column from 2 to 3.
Hoping for some help.
Thank you
The Go stdlib documentation states that sort.Sort does not guarantee stability of the sorted results (sort.Stable does). Isn't stability a requirement when sorting the dataframe according to the content of multiple columns, to guarantee correctness the way dataframe.Arrange is currently implemented?
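The classic pattern is to sort by the least-significant key first, then stable-sort by each more significant key: rows equal in the outer key keep the inner-key order established earlier. A sketch with sort.SliceStable, using rows shaped like the earlier multi-column example:

```go
package main

import (
	"fmt"
	"sort"
)

// arrange sorts rows by column 0 then column 1 via successive sorts.
// The second pass MUST be stable so that rows with equal column-0 values
// keep the column-1 order established by the first pass.
func arrange(rows [][]string) {
	// Least-significant key first: plain sort by column 1.
	sort.Slice(rows, func(i, j int) bool { return rows[i][1] < rows[j][1] })
	// Most-significant key last, with a stable sort.
	sort.SliceStable(rows, func(i, j int) bool { return rows[i][0] < rows[j][0] })
}

func main() {
	rows := [][]string{
		{"0.288", "645"},
		{"0.288", "568"},
		{"0.290", "589"},
		{"0.288", "573"},
	}
	arrange(rows)
	fmt.Println(rows) // the 0.288 rows come first, ordered 568, 573, 645
}
```

If the second pass used sort.Slice instead, the relative order of the equal 0.288 rows would be unspecified, which is exactly the misordering reported in the multi-column sorting issue above.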
Hi,
I am reading from csv files where all the values are double quoted, even if they do not contain a comma (or whatever the delimiter is).
I read these into a DataFrame and then I do some transformation on it and write it back out to a CSV file. The resulting CSV file's values are not double-quoted unless they contain a comma.
I'd like to be able to pass a WriteOption to WriteCSV() that would force the quoting of all values written, even if they do not contain a delimiter. Just to have consistency between my input and output files.
If this request sounds weird, I will explain my use case. My input files are medical study data that contains personally identifiable information such as name and birth date. My code basically takes this information and changes it to random strings of characters that resemble the original but is no longer identifiable as a specific person. I take the resulting csv files and use them as test fixture data to test another code base. This fixture data can be checked into a public GitHub repository because it no longer contains identifying information. I would like the files to be identical in all respects to the original files (except for the identifying information) so that I can have confidence that my passing tests mean the code will also work with real data. That's why I want the csv files to have all fields double-quoted even if it does not seem necessary or is not called for by the CSV spec.
Does that make sense?
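For context on why an option is needed: Go's encoding/csv writer only quotes fields when it has to, so an always-quote mode has to be written by hand. A stdlib sketch of such a writer (the `writeQuoted` helper is hypothetical):

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// writeQuoted writes records in CSV form with every field double-quoted,
// doubling embedded quotes per RFC 4180, regardless of whether quoting
// is strictly necessary.
func writeQuoted(records [][]string) string {
	var b bytes.Buffer
	for _, rec := range records {
		for i, f := range rec {
			if i > 0 {
				b.WriteByte(',')
			}
			b.WriteString(`"` + strings.ReplaceAll(f, `"`, `""`) + `"`)
		}
		b.WriteByte('\n')
	}
	return b.String()
}

func main() {
	out := writeQuoted([][]string{{"name", "dob"}, {"Jane", "1955-04-30"}})
	fmt.Print(out) // every value quoted, matching the quoted input files
}
```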
Thanks for a nice package.