diskframe / disk.frame

Fast Disk-Based Parallelized Data Manipulation Framework for Larger-than-RAM Data

Home Page: https://diskframe.com

License: Other

R 99.13% C++ 0.83% Batchfile 0.04%

data data-science large-dataset manipulation-data medium-data r

disk.frame's People

bonnyc marcusklik kuzmenkov111 cderv dirtyearl ginberg fanjinfei zhouyonglong jingmouren rserran iqis b-rodrigues hansvomkreuz minghao2016 jackyp privefl caewok voltek62 shians romainfrancois mandixbw andthewings cule syedmfuad trsilva32 qezz thecodemasterk jimsforks younghai pherephobia econjoseph alekoure hyeyeankkim iago-noncontributedforks kol12303 shadrack-oo knutjaegersberg ben-schwen davisvaughan thisisnic

disk.frame's Issues

comprehensive test suite

TODO:

dplyr verbs are done; only minor things remain to go through

should focus on

mention the future of disk.frame is bright

http://datascience.la/big-ram-is-eating-big-data-size-of-datasets-used-for-analytics/

large graphs in tidy format - similar to "GraphFrames"

It would be great if disk.frame were able to support analysis of large networks via tibble forms of graphs (aka tidygraph). Since Spark started to support GraphFrames maybe disk.frame can step into that domain as well.

tidygraph allows tibble access to nodes and edges separately, but sharding a large network into chunks might be quite challenging.

Plan to submit to cran

Hi, are there any plans to put this package on CRAN?

API for saving to disk or load into memory; currently it loads into memory by default

implement rechunk

Allows the users to increase or decrease the number of chunks
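
The idea can be sketched in base R, assuming the chunks are held as a list of data.frames. `rechunk_list` is a hypothetical helper for illustration and not part of disk.frame; it also binds everything in memory, which a real larger-than-RAM implementation would have to avoid by streaming chunk by chunk.

```r
# Hypothetical helper (not disk.frame API): redistribute the rows of a list
# of data.frames into `nchunks` roughly equal chunks, keeping row order.
rechunk_list <- function(chunks, nchunks) {
  all_rows <- do.call(rbind, chunks)
  # Map row i to a chunk id in 1..nchunks
  chunk_id <- ceiling(seq_len(nrow(all_rows)) * nchunks / nrow(all_rows))
  unname(split(all_rows, chunk_id))
}

old_chunks <- list(data.frame(x = 1:4), data.frame(x = 5:10))  # 2 chunks
new_chunks <- rechunk_list(old_chunks, 5)                      # 5 chunks
```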

Overwrite = T can cause some incorrect behaviour

When overwrite = T and the folder already contains some chunks, the overwrite may replace too few files, leaving old chunks behind. Therefore it's better to move all the existing chunks to a separate folder before writing to the folder.
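
A minimal sketch of the "move the old chunks first" approach, using only base R file operations. The helper name `stash_old_chunks` and the `.old_chunks` backup folder are assumptions made for this example, not disk.frame behaviour.

```r
# Hypothetical helper: move existing chunk files out of `outdir` into a
# backup sub-folder so that a rewrite starts from an empty folder.
stash_old_chunks <- function(outdir, pattern = "\\.fst$") {
  old_files <- list.files(outdir, pattern = pattern, full.names = TRUE)
  backup_dir <- file.path(outdir, ".old_chunks")
  dir.create(backup_dir, showWarnings = FALSE)
  moved <- file.rename(old_files, file.path(backup_dir, basename(old_files)))
  stopifnot(all(moved))  # fail loudly if any file could not be moved
  backup_dir
}

# Demo with empty placeholder files standing in for real chunks
outdir <- tempfile("df_")
dir.create(outdir)
file.create(file.path(outdir, c("1.fst", "2.fst", "3.fst")))
backup_dir <- stash_old_chunks(outdir)
# New chunks can now be written to `outdir` without stale files remaining
```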

Restartable operations

Some operations take a long time, and if one fails midway through, one needs to restart from the beginning. To solve this we can make a detailed plan of what to do and execute each step, so that if it fails we can restart at any point. Perhaps the drake package can help with this.
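
One way to picture this without drake is a plan of named steps with a marker file per finished step. The sketch below is a hypothetical base R illustration (`run_plan` and the `.done` markers are invented for this example); it only shows the restart logic, not how intermediate results would be stored.

```r
# Hypothetical sketch: run named steps in order, skipping steps that
# already completed in an earlier run.
run_plan <- function(steps, checkpoint_dir) {
  dir.create(checkpoint_dir, showWarnings = FALSE, recursive = TRUE)
  ran <- character(0)
  for (name in names(steps)) {
    marker <- file.path(checkpoint_dir, paste0(name, ".done"))
    if (file.exists(marker)) next   # finished in an earlier run
    steps[[name]]()                 # an error here aborts the run
    file.create(marker)             # record success only after completion
    ran <- c(ran, name)
  }
  ran
}

ckpt <- tempfile("ckpt_")
fail_once <- TRUE
steps <- list(
  read      = function() NULL,
  transform = function() if (fail_once) stop("simulated failure"),
  write     = function() NULL
)
first <- tryCatch(run_plan(steps, ckpt), error = function(e) "failed")
fail_once <- FALSE
second <- run_plan(steps, ckpt)  # resumes at "transform", skips "read"
```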

overload the `%>%` from magrittr so that any function will be lazy

If you overload %>% then

df %>%
    some_func

can be made lazy for any function, as long as some_func isn't already implemented for disk.frame.
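
A rough sketch of the mechanism, using a separate operator `%>>%` rather than actually overloading magrittr's `%>%`. The `lazy_tbl` class, `lazy()` and `collect_lazy()` are hypothetical names for this illustration only.

```r
# Hypothetical lazy container: the data plus a queue of pending functions
lazy <- function(data) {
  structure(list(data = data, ops = list()), class = "lazy_tbl")
}

# Custom pipe: on a lazy_tbl, record the function instead of running it;
# on anything else, apply the function immediately.
`%>>%` <- function(lhs, f) {
  if (inherits(lhs, "lazy_tbl")) {
    lhs$ops <- c(lhs$ops, list(f))
    lhs
  } else {
    f(lhs)
  }
}

# Run all recorded functions in order
collect_lazy <- function(x) Reduce(function(acc, f) f(acc), x$ops, x$data)

pending <- lazy(data.frame(a = 1:6)) %>>%
  (function(d) d[d$a > 2, , drop = FALSE]) %>>%
  nrow
result <- collect_lazy(pending)  # nothing is computed until this call
```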

implement generalised form of `[.disk.frame`

Currently [.disk.frame can only accept i, j, by so not all features of data.table are supported. This should be generalised to [.disk.frame where all features of [ are supported.

memory limit for processing of fst files

While I was processing 12GB of data as mentioned in #77 , I'm barely making it in terms of RAM usage. I have 16GB of RAM and I close most programs (including Rstudio) and use R console to do group_by operations.

So, I was wondering, what is the limit of processing files with 16GB of RAM? Can I process 100GB file, for example with disk.frame in my computer?

Because I was expecting that if I split the files into many more chunks then RAM should not be a problem; it would just take longer to process so many files.

If I split files into even smaller sizes, can I process even larger files?

(again, sorry for so many issues today, this is the last one for today)

removing `dtplyr` as a dependency

finalise Fannie Mae tutorial

Allow data.frames to be treated as disk.frame

`collect(parallel= T)`, `collect(parallel= F)` and `[` might return rows in a different order

Be able to read sas

discrepancy in sharding

This case is related to #77, where reading multiple csv files was discussed.

When I use shardby I get discrepant results for counts. Below is the summary of my findings

UniqueCarrier	disk.frame shard	disk.frame not_sharded	actual
AA	5263987	6655735	6655735
AS	795393	994764	994764
CO	3187682	4017862	4017862
DL	6247363	7884309	7884309
EA	701362	919785	919785
HP	1434364	1809250	1809250
ML (1)	56231	70622	70622
NW	3502394	4410734	4410734
PA (1)	240546	316167	316167
PI	672802	873957	873957
PS	70563	83617	83617
TW	1918283	2421955	2421955
UA	4731504	5938506	5938506
US	5752829	7295919	7295919
WN	3428952	4231882	4231882

A more detailed analysis of the sharded and not-sharded approaches is available.
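
A check of this kind can be reproduced on toy data: sharding should only move rows between chunks, so per-key counts summed over shards must equal the unsharded counts. The sketch below uses base R and an invented sharding rule; it does not call disk.frame's sharding code.

```r
set.seed(1)
flights <- data.frame(
  UniqueCarrier = sample(c("AA", "DL", "UA"), 1000, replace = TRUE)
)

# Toy sharding rule (an assumption for this sketch): every row of a given
# carrier goes to the same shard.
shard_id <- as.integer(factor(flights$UniqueCarrier)) %% 2L + 1L
shards <- split(flights, shard_id)

counts_plain <- table(flights$UniqueCarrier)
# Count per shard on a common set of levels, then add the shard tables up
counts_sharded <- Reduce(`+`, lapply(shards, function(s) {
  table(factor(s$UniqueCarrier, levels = names(counts_plain)))
}))

# Any carrier listed here has lost or gained rows during sharding
discrepant <- names(counts_plain)[as.vector(counts_plain) != as.vector(counts_sharded)]
```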

show a glimpse of the underlying data when printing

Scaling out to clusters

Hello,
I like disk.frame a lot, which compares well to dask in Python. Isn't it possible to include functionality to scale out across multiple machines? i.e.
https://stackoverflow.com/questions/43952001/how-to-do-parallel-computing-inside-a-cluster-with-the-r-future-package

Fixed width files

This is an enhancement request, but I can't see how to designate it as such.

disk.frame looks to be wonderfully valuable. Many thanks in advance.

It would be helpful if the csv reading capacity could be extended to fixed-width files, as these files (often in the form of logs, etc) are typically massive.

The readr::read_fwf() is a nice implementation of fwf input, and might be a model for work on something comparable for this package.

Many thanks

more descriptive error for get_chunk(n) for n not exists

fix the bug with group_by when comparing keys

Currently, even if the group_by keys are the same as the keys the data was sharded by (shard_by), it will still show an error message.

add `add_chunk` function to add a chunk to a disk.frame

this is the same as append except it will perform type checking and produce warnings.
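
The type check could look roughly like the following base R sketch. `check_chunk_types` is a hypothetical helper, not the proposed `add_chunk` itself; it only compares column names and classes against a reference chunk and warns.

```r
# Hypothetical check: compare column names and classes of an incoming chunk
# against a reference chunk, warning about any differences.
check_chunk_types <- function(reference, new_chunk) {
  ref_types <- vapply(reference, function(col) class(col)[1], character(1))
  new_types <- vapply(new_chunk, function(col) class(col)[1], character(1))
  common <- intersect(names(ref_types), names(new_types))
  mismatched <- common[ref_types[common] != new_types[common]]
  missing_cols <- setdiff(names(ref_types), names(new_types))
  if (length(mismatched) > 0) {
    warning("type mismatch in: ", paste(mismatched, collapse = ", "))
  }
  if (length(missing_cols) > 0) {
    warning("missing columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(list(mismatched = mismatched, missing = missing_cols))
}

ref <- data.frame(id = 1:3, code = c("01", "02", "03"), stringsAsFactors = FALSE)
incoming <- data.frame(id = 4:6, code = c(1, 2, 3))  # code is numeric here
issues <- suppressWarnings(check_chunk_types(ref, incoming))
```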

enhance `rbindlist.disk.frame` algorithm for faster merging many fst folders

Sorry for submitting many issues, I already mentioned this case in #77 , but I really liked the package and wanted to share my wishes.

rbindlist.disk.frame function is quite handy for merging data, but it replicates the data. For rbindlist.disk.frame function, I wanted to propose the following. Let's say we have two folders, data1 and data2, with the following contents:

├── data1
│   ├── 11.fst
│   ├── 15.fst
│   └── 9.fst
└── data2
    ├── 11.fst
    ├── 13.fst
    ├── 16.fst
    └── 7.fst

In that case, row binding the data should consist of following steps:

create new folder merged
move non-clashing fst files to merged: data1/9.fst, data1/15.fst, data2/7.fst, data2/13.fst, data2/16.fst
for clashing data, move data1/11.fst to merged then row_bind data2/11.fst to it
update the .metadata folder contents for merged folder

My fst knowledge is limited, maybe this is not possible, but rbindlist.disk.frame takes long time when data gets bigger.
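
The planning part of the steps above (deciding which files can simply be moved and which clash) can be sketched in base R. `plan_merge` is a hypothetical helper; the actual row binding of clashing fst files and the .metadata update are not shown, and the demo uses empty placeholder files.

```r
# Hypothetical planning step: files with a unique name can be moved as-is,
# files whose name exists in both folders need a row bind.
plan_merge <- function(dir1, dir2) {
  f1 <- list.files(dir1, pattern = "\\.fst$")
  f2 <- list.files(dir2, pattern = "\\.fst$")
  clashing <- intersect(f1, f2)
  list(
    move = c(file.path(dir1, setdiff(f1, clashing)),
             file.path(dir2, setdiff(f2, clashing))),
    row_bind = clashing
  )
}

# Recreate the folder layout from the example
root <- tempfile("merge_")
dir.create(file.path(root, "data1"), recursive = TRUE)
dir.create(file.path(root, "data2"))
file.create(file.path(root, "data1", c("9.fst", "11.fst", "15.fst")))
file.create(file.path(root, "data2", c("7.fst", "11.fst", "13.fst", "16.fst")))

plan <- plan_merge(file.path(root, "data1"), file.path(root, "data2"))
```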

reading multiple csv files

First of all, kudos for this package, I hope it becomes as good as dask one day..

I was wondering if it's possible to read multiple large csv files in parallel with disk.frame? This is the initial demonstration of dask in many talks/presentations. However, in many sparklyr or similar demonstrations, flights data is copied to database or backend and reading multiple csv files in parallel is ignored.

As far as I can tell sparklyr's read csv function allows wildcard so that many csv files can be imported at once (e.g. "200*.csv"). I checked the vignette of disk.frame and couldn't find a hint about this. What is the easiest way to process multiple large csv files?

I can think of a workaround in which each file is processed individually to generate fst files then (maybe) those files can be merged, resharded? If I run the following code, there'll be fst files in separate folders. Is it possible to merge those fst files in single folder later (or during import)?

flights6.df <- csv_to_disk.frame("2006.csv", outdir="tmp2006.df",overwrite=T)
flights7.df <- csv_to_disk.frame("2007.csv", outdir="tmp2007.df",overwrite=T)
flights8.df <- csv_to_disk.frame("2008.csv", outdir="tmp2008.df",overwrite=T)
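
For comparison, the wildcard idea itself can be sketched in base R: expand the pattern to a vector of files, read each file as one chunk, then combine. This is only an illustration of the workflow with small made-up files; it does not use disk.frame and holds everything in memory.

```r
# Write three tiny stand-in CSV files (made-up data)
csv_dir <- tempfile("csv_")
dir.create(csv_dir)
for (year in 2006:2008) {
  write.csv(data.frame(Year = year, DepDelay = c(5, 10)),
            file.path(csv_dir, paste0(year, ".csv")), row.names = FALSE)
}

# Expand the wildcard, then read one chunk per input file
csv_files <- Sys.glob(file.path(csv_dir, "200*.csv"))
chunks <- lapply(csv_files, read.csv)
combined <- do.call(rbind, chunks)
```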

Allow CSVs to be in disk.frame

Allow CSVs to be a backend for disk.frame.

distributed data processing

However, disk.frame currently cannot distribute data processes over many computers, and is, therefore, single machine focused.

Just wanted to drop a working example of distributed processing: https://github.com/jangorecki/big.data.table

comprehensive coverage of all dplyr verbs e.g. group_if and tally

Wait for dplyr v1 to settle all verbs
Implement them

move all test data to test_data folder

mechanism to report granular timing

Ability to record timing based on computation time and writing to fst time
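
A small base R sketch of what separate timings could look like. `timed` is a hypothetical helper, and `saveRDS` stands in for writing an fst file so the example needs no extra packages.

```r
# Hypothetical helper: evaluate an expression and report elapsed seconds
timed <- function(expr) {
  start <- proc.time()[["elapsed"]]
  value <- expr  # lazy evaluation: the expression actually runs here
  list(value = value, seconds = proc.time()[["elapsed"]] - start)
}

# Time the computation and the write separately
compute_step <- timed(aggregate(mpg ~ cyl, data = mtcars, FUN = mean))
out_file <- tempfile(fileext = ".rds")
write_step <- timed(saveRDS(compute_step$value, out_file))

timings <- c(compute = compute_step$seconds, write = write_step$seconds)
```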

Fuzzy Join at the chunk level

Great package, still learning how to work with it.

It would be great to allow for non-equi joins at the chunk level. Currently my code looks something like this for data.frame:

library(disk.frame)
library(data.table)  # data.table() is used below
library(lubridate)
library(dplyr)
library(tidyverse)   # provides tidyr::crossing

prices <- data.table(strike = seq(0, 1000, by = 1))
dates <- data.table(date = seq(ymd('2017-01-01'), ymd('2017-09-01'), by = '1 day'))
ticker <- data.frame(ticker = letters[1:26])

# simulate a mid price per ticker and date
t = ticker %>%
  crossing(dates) %>%
  group_by(ticker) %>%
  mutate(mid = 50 + cumsum(rnorm(n = n(), mean = 0.05)))

# non-equi join: keep the strikes that lie within 2% of mid
RICs = t %>%
  group_by(ticker) %>%
  mutate(midd = mid * 0.98, midu = mid * 1.02) %>%
  fuzzyjoin::fuzzy_left_join(prices, by = c("midd" = "strike", "midu" = "strike"), match_fun = list(`<=`, `>=`))

In the above example, both t and prices are data frames, and when joined generate a larger than ram data.frame. If I could do the join chunk by chunk, it would be much faster as it is currently single thread CPU bound.

Ideally, t would be a disk.frame, with each chunk fed in separately.

Is there currently a better way of doing this?
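
A chunk-by-chunk version of the join can be sketched in base R by cross joining each chunk with prices and filtering on the range condition. This is an illustration on tiny made-up data, not a disk.frame feature: `join_chunk` is a hypothetical helper, and it is an inner variant (rows without a matching strike are dropped, unlike fuzzy_left_join).

```r
# Hypothetical helper: non-equi join of one chunk against `prices`
join_chunk <- function(chunk, prices) {
  crossed <- merge(chunk, prices, by = NULL)  # cross join within the chunk only
  crossed[crossed$midd <= crossed$strike & crossed$midu >= crossed$strike, ]
}

quotes <- data.frame(ticker = c("a", "a", "b", "b"),
                     mid = c(50.5, 51.5, 100.5, 102.5))
quotes$midd <- quotes$mid * 0.98
quotes$midu <- quotes$mid * 1.02
prices <- data.frame(strike = 0:1000)

# One chunk per ticker; each chunk is joined independently, so the
# intermediate cross join never covers the whole table at once
chunks <- split(quotes, quotes$ticker)
joined <- do.call(rbind, lapply(chunks, join_chunk, prices = prices))
```

Because each chunk is processed independently, the `lapply` call is also the natural place to run chunks in parallel.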

Investigate if vroom's performance gain is worth it

vroom is a package for loading CSVs. Supposedly faster than data.table

the interchange between disk.frame and data.frame

Dear Xiaodai,
Thanks for the great work of releasing R from the constraint of RAM. As a heavy user of data.table, I am wondering how I can seamlessly work with data.table (DT) and disk.frame (DF) together.
I tried a few times and found the following. After converting a dataset into DF format, the common DT syntax still works, but when I assign the result to a new variable, the new variable automatically becomes a DT.

DF <- as.disk.frame(
  data.table(X=rep(letters[1:5],2), Y=1:10),
  outdir = "temp",
  overwrite = T
)

class(DF)   #[1] "disk.frame"        "disk.frame.folder"
class(DF[,.N, by = .(X)]) #[1] "data.table" "data.frame"

It seems that the disk.frame serves as a storage keeper. It saves a lot of space by keeping everything in a warehouse, and gets things out for others to process.
Am I right?
Thanks for the great work again,
Eric

Errors when chunk_size is set but column names are not set

Separate out xgboost and other modelling functions into a separate package

Unclear which step caused an error in a sequence of lazy functions

If a disk.frame is lazied a few times, and one of the lazy steps doesn't work, it's unclear which, which makes debugging hard.

Need clearer error messages that identify which step caused the error and give helpful guidance.
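
One possible shape for this, sketched in base R: run the recorded steps one at a time and re-raise any error with the step's position and name attached. `run_steps` and the step names are hypothetical and only illustrate the error reporting.

```r
# Hypothetical runner: apply named steps in order and label any failure
run_steps <- function(data, steps) {
  for (i in seq_along(steps)) {
    data <- tryCatch(
      steps[[i]](data),
      error = function(e) {
        stop(sprintf("step %d (%s) failed: %s",
                     i, names(steps)[i], conditionMessage(e)),
             call. = FALSE)
      }
    )
  }
  data
}

steps <- list(
  filter_rows = function(d) d[d$mpg > 20, ],
  bad_step    = function(d) stop("column not found"),
  count       = function(d) nrow(d)
)
msg <- tryCatch(run_steps(mtcars, steps), error = conditionMessage)
# msg now names the failing step instead of only the low-level error
```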

add benchmarks vs other systems

Dask
Spark
SAS
JuliaDB.jl

Be able to sort the disk.frames

Easy way to select options (via Shiny?)

There are many hidden options in disk.frame. Provide an easy way for the user to select these options via a Shiny interface, e.g. number of workers, and lazy by default.

Data can be saved as different formats for different chunks

Some string columns can have values like 01 and so can appear as integer when saved as CSV.

Need a step at the end to ensure consistency of data types
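
The problem and one possible remedy can be shown with base R's read.csv: type guessing turns 01 into an integer, while declaring column classes up front keeps it as a string. This only illustrates the issue on an inline CSV; it is not how disk.frame reads files.

```r
csv_text <- "id,code\n1,01\n2,02\n"

# Type guessing: the leading zero is lost and `code` becomes an integer
guessed <- read.csv(text = csv_text)

# Declaring the classes keeps `code` as a string in every chunk
fixed <- read.csv(text = csv_text,
                  colClasses = c(id = "integer", code = "character"))
```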

can't find object when comparing in data.frame or data.table.

Hi, XiaoDai,
Me again. Just found a critical issue: data stored in DF format can't be matched against a variable.

#--------- Disk.Frame way, doesn't work.
DT <- as.disk.frame(
  data.table(A = letters[1:24], B = 1:48),
  outdir = "TEST",
  overwrite = TRUE
)

X <- "x"
DT[A == X]
# Error in eval(stub[[3L]], x, enclos) : object 'X' not found

DT[A %in% X] # still can't find object

#---- Ordinary way, it works.

DX <- data.table(A = letters[1:24], B = 1:48)
DX[A == X]
#    A  B
# 1: x 24
# 2: x 48

Please advise if I have to change something to do matching or searching.
Thanks
Eric

obtain environment variables in dplyr

The below will not work. Need to allow filter to know what x is.

x = 1
df  %>% filter(id == x)
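
The general technique for making a variable like x visible is to evaluate the captured condition with the data's columns first and the caller's environment second. The base R sketch below shows that lookup order with a hypothetical `filter_rows`; it is not disk.frame's or dplyr's implementation. With parallel workers, the variable's value would presumably also have to be sent to each worker.

```r
# Hypothetical filter: look names up in the data first, then in the
# environment the function was called from.
filter_rows <- function(data, condition) {
  keep <- eval(substitute(condition), data, parent.frame())
  data[keep, , drop = FALSE]
}

df <- data.frame(id = c(1, 2, 1), value = c("a", "b", "c"))
x <- 1
matched <- filter_rows(df, id == x)  # `id` comes from df, `x` from the caller
```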

Remove .fst from underlying file name to allow merging for disk.frame with different underlying file formats

The file format can be stored in the .metadata folder. Currently, only fst allows for random access to columns and rows. So there isn't another format that can be used.

Store compression level in metadata

When loading CSV files the individual chunks' column type may differ

Sharding Example

I'd like to understand how to create a disk.frame using shards, but couldn't really find an example.

Talked about here

My use case is a dataset covering many years (already a single large fst file!). I would like to shard by year, so that each chunk just contains data from a single year.
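
The desired layout, one chunk per year, can be sketched in base R on made-up data. `saveRDS` stands in for writing fst chunks so the example needs no extra packages; this is not disk.frame's sharding API.

```r
sales <- data.frame(year = c(2017, 2018, 2017, 2019),
                    amount = c(10, 20, 30, 40))

# One shard per year: every row of a given year lands in the same chunk
shards <- split(sales, sales$year)

# Write each shard to its own file, named after the year
outdir <- tempfile("by_year_")
dir.create(outdir)
for (yr in names(shards)) {
  saveRDS(shards[[yr]], file.path(outdir, paste0(yr, ".rds")))
}
```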