Wonderful work so far! Just dreaming of the future...

Pandas DataFrame and R data.frame translation about reticulate HOT 16 CLOSED

rstudio commented on July 23, 2024 1

Pandas DataFrame and R data.frame translation

from reticulate.

Comments (16)

terrytangyuan commented on July 23, 2024 3

@jjallaire pandas.DataFrame.as_matrix is the trick
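A note for readers on a current pandas: `DataFrame.as_matrix` was deprecated in pandas 0.23 and removed in 1.0, and `DataFrame.to_numpy()` now plays the same role. A minimal sketch with a made-up, all-numeric frame (the column names are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical all-numeric frame; the column names are illustrative only.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

# to_numpy() returns the frame's values as one 2-D NumPy array,
# which is the job as_matrix() used to do.
matrix = df.to_numpy()

print(matrix.shape)
print(matrix.dtype)
```

This only works cleanly when every column shares a numeric dtype; mixed columns fall back to an object array.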

saurfang commented on July 23, 2024 3

Feather / Apache Arrow being a data frame serialization framework that supports both R and Python could be useful here. https://github.com/wesm/feather

I took a stab. Any suggestions are welcomed. I have not extensively tested it yet. https://github.com/saurfang/reticulate.df

jjallaire commented on July 23, 2024 2

Hopefully by the end of this year.

…

On Sun, Jul 16, 2017 at 6:24 PM, Forest Fang wrote:

> That's very exciting. Specifically, are you talking about converting pandas DataFrame to Apache Arrow format (or something similar) in a memory buffer, and reading that into R via Rcpp (to avoid disk serialization/compression and memory copy)? or would this be a more ambitious implementation of a data.frame backend that lives in external memory entirely?
>
> Any rough timeline that you might be working and releasing this?

jjallaire commented on July 23, 2024 2

Correct, these are all very fast memory to memory copies of contiguous vectors.

jjallaire commented on July 23, 2024

If Pandas data frames can be decomposed into NumPy arrays then it should be possible to do this without too much more invention.
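The decomposition idea can be sketched on the Python side as follows. This is only an illustration, not reticulate's implementation; `decompose` and `recompose` are hypothetical helper names, and the frame is made up:

```python
import numpy as np
import pandas as pd

def decompose(df):
    """Split a DataFrame into a dict of column name -> 1-D NumPy array."""
    return {name: df[name].to_numpy() for name in df.columns}

def recompose(columns):
    """Rebuild a DataFrame from a dict of equal-length 1-D arrays."""
    return pd.DataFrame(columns)

# Hypothetical numeric frame; the column names are illustrative only.
df = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 0.25, 0.125]})

columns = decompose(df)         # one NumPy array per column
roundtrip = recompose(columns)  # back to a DataFrame
```

Each array in `columns` is the kind of contiguous vector that can be handed across the language boundary one column at a time.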

eddelbuettel commented on July 23, 2024

Well ... when I once needed to get NumPy into R, the only way to do it was to wrap an external (C) library as part of RcppCNPy. Maybe there are better ways now; I'd be eager to learn about them either way.

jjallaire commented on July 23, 2024

As has been pointed out, right now this is possible by decomposing the R and/or Pandas data frame into vectors / matrices. I think this is all we will do for the foreseeable future, as fully handling data frames will involve dealing with character vectors, dates, list columns, etc. and end up being too large of a project.
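To make the distinction concrete, here is a small sketch that separates the columns that map directly onto numeric vectors from the ones that would need type-specific handling. The frame and its column names are made up for illustration:

```python
import pandas as pd

# Hypothetical frame mixing simple and awkward column types.
df = pd.DataFrame({
    "n": [1, 2, 3],
    "x": [0.1, 0.2, 0.3],
    "name": ["a", "b", "c"],
    "when": pd.to_datetime(["2017-07-14", "2017-07-15", "2017-07-16"]),
    "items": [[1], [2, 3], []],
})

# Numeric and boolean columns decompose into plain vectors.
simple = df.select_dtypes(include=["number", "bool"]).columns.tolist()

# Strings, dates and list columns are the cases that need extra work.
needs_care = [c for c in df.columns if c not in simple]

print(simple)
print(needs_care)
```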

terrytangyuan commented on July 23, 2024

Makes sense. The handling for those would be very hard to maintain.

jjallaire commented on July 23, 2024

Yes, feather definitely has all of the bits required to do this sorted out. Our plan is to add data frame support to reticulate using the same techniques as feather (sharing code if possible), but not to require a full serialize/deserialize to disk to do the conversion.

saurfang commented on July 23, 2024

That's very exciting. Specifically, are you talking about converting pandas DataFrame to Apache Arrow format (or something similar) in a memory buffer, and reading that into R via Rcpp (to avoid disk serialization/compression and memory copy)? or would this be a more ambitious implementation of a data.frame backend that lives in external memory entirely?

Any rough timeline that you might be working and releasing this?

shearerpmm commented on July 23, 2024

Did this end up working? I'm interested in whether R-Python communication can be done with large data frames without expensive ser/de.

jjallaire commented on July 23, 2024

Yes, this is now available: https://rstudio.github.io/reticulate/articles/calling_python.html#data-frames

shearerpmm commented on July 23, 2024

Neat! Am I to understand that the discussion around arrays (no copies needed) also applies to dataframes?

jjallaire commented on July 23, 2024

No, Pandas data frames created from NumPy arrays automatically make copies of the arrays. So there is "no copy" going from R vector to NumPy array, but there is ultimately a copy made by Pandas.
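The Pandas-side copy can be checked from Python. This sketch assumes the frame is built from a dict of arrays; other construction paths and pandas versions may behave differently:

```python
import numpy as np
import pandas as pd

# Stand-in for a NumPy array that wraps memory owned elsewhere.
arr = np.arange(5, dtype="float64")

# Building a frame from a dict of arrays copies the data into pandas.
df = pd.DataFrame({"a": arr})

# True would mean the frame still points at the original buffer.
shares = np.shares_memory(arr, df["a"].to_numpy())

# Mutating the source afterwards does not reach the frame.
arr[0] = 99.0

print(shares)
print(df["a"].iloc[0])
```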

shearerpmm commented on July 23, 2024

But that Pandas copy is memory-to-memory, so at least the disk is never involved?

ParissaM commented on July 23, 2024

Hello,

In my case I was reading the data from an S3 bucket and I had the same issue. What helped me was adding the following snippet to the Python function:

for c in df.columns:
    df[c] = np.array(df[c].values)

So the Python function would look like this:

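import awswrangler as wr
import numpy as np

def get_data_from_db(db_name, query):
    # Run the Athena query and load the result into a pandas DataFrame.
    df = wr.athena.read_sql_query(
        sql=query,
        database=db_name,
        ctas_approach=False,
    )
    # Rebuild every column from a plain NumPy array.
    for c in df.columns:
        df[c] = np.array(df[c].values)
    return df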

Hope this helps,
Regards
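One possible reading of why this helps, which is an assumption and not something confirmed in this thread: the query result may carry pandas nullable extension dtypes (such as `Int64` or `boolean`), and rebuilding each column from `np.array(...)` replaces them with plain NumPy-backed columns. A small sketch with a made-up frame standing in for the query result:

```python
import numpy as np
import pandas as pd

# Hypothetical frame using pandas nullable extension dtypes.
df = pd.DataFrame({
    "n": pd.array([1, 2, None], dtype="Int64"),
    "flag": pd.array([True, None, False], dtype="boolean"),
})

# Is each column backed by a plain NumPy dtype before the loop?
before = {c: isinstance(df[c].dtype, np.dtype) for c in df.columns}

# The loop from the comment above.
for c in df.columns:
    df[c] = np.array(df[c].values)

# And after the loop?
after = {c: isinstance(df[c].dtype, np.dtype) for c in df.columns}

print(before)
print(after)
```

The exact NumPy dtype each column ends up with depends on the pandas version, so the sketch only checks that the extension dtype is gone.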
