Comments (16)

terrytangyuan commented on July 23, 2024 3

@jjallaire pandas.DataFrame.as_matrix is the trick
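Note for readers on newer pandas: `DataFrame.as_matrix` was deprecated in pandas 0.23 and removed in pandas 1.0; `DataFrame.to_numpy()` (or the older `.values` attribute) does the same job. A minimal sketch of the trick, with made-up column names and data:

```python
import numpy as np
import pandas as pd

# Hypothetical all-numeric frame; the column names are illustrative only.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

# Current replacement for the removed DataFrame.as_matrix():
# returns a 2-D NumPy array with one row per DataFrame row.
matrix = df.to_numpy()

print(matrix.shape)  # (3, 2)
print(matrix.dtype)  # float64
```

This only gives a clean numeric matrix when all columns share a numeric type; mixed-type frames come back as an object array.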

from reticulate.

saurfang commented on July 23, 2024 3

Feather / Apache Arrow being a data frame serialization framework that supports both R and Python could be useful here. https://github.com/wesm/feather

I took a stab. Any suggestions are welcomed. I have not extensively tested it yet. https://github.com/saurfang/reticulate.df

jjallaire commented on July 23, 2024 2

jjallaire commented on July 23, 2024 2

Correct, these are all very fast memory to memory copies of contiguous vectors.

jjallaire commented on July 23, 2024

If Pandas data frames can be decomposed into NumPy arrays then it should be possible to do this without too much more invention.

eddelbuettel commented on July 23, 2024

Well ... when I once needed to get NumPy into R the only way to do it was to wrap an external (C) library as part of RcppCNPy. Maybe there are better ways now, I'd be eager to learn about them either way.

jjallaire commented on July 23, 2024

As has been pointed out, right now this is possible by decomposing the R and/or Pandas data frame into vectors / matrices. I think this is all we will do for the foreseeable future, as fully handling data frames will involve dealing with character vectors, dates, list columns, etc. and end up being too large of a project.
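On the Python side, the decomposition described here can be sketched roughly as follows. The helper names `decompose` and `recompose` and the sample data are hypothetical and are not part of reticulate or pandas; this only illustrates the idea for numeric columns.

```python
import numpy as np
import pandas as pd

def decompose(df):
    """Split a DataFrame into a dict of column name -> 1-D NumPy array."""
    return {name: df[name].to_numpy() for name in df.columns}

def recompose(columns):
    """Rebuild a DataFrame from a dict of equal-length 1-D arrays."""
    return pd.DataFrame(columns)

# Hypothetical numeric frame; names and values are illustrative only.
df = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 0.25, 0.75]})

parts = decompose(df)       # plain NumPy arrays, one per column
rebuilt = recompose(parts)  # back to a DataFrame on the other side
```

Numeric columns round-trip cleanly this way; character vectors, dates, and list columns are the cases that need extra handling.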

terrytangyuan commented on July 23, 2024

Makes sense. The handling for those would be very hard to maintain.

jjallaire commented on July 23, 2024

Yes, feather definitely has all of the bits required to do this sorted out. Our plan is to add data frame support to reticulate using the same techniques as feather (sharing code if possible), but not to require a full serialize/deserialize to disk to do the conversion.

saurfang commented on July 23, 2024

That's very exciting. Specifically, are you talking about converting a pandas DataFrame to Apache Arrow format (or something similar) in a memory buffer, and reading that into R via Rcpp (to avoid disk serialization/compression and memory copies)? Or would this be a more ambitious implementation of a data.frame backend that lives entirely in external memory?

Is there any rough timeline for when you might be working on and releasing this?

shearerpmm commented on July 23, 2024

Did this end up working? I'm interested in whether R-Python communication can be done with large data frames without expensive ser/de.

jjallaire commented on July 23, 2024

Yes, this is now available: https://rstudio.github.io/reticulate/articles/calling_python.html#data-frames

shearerpmm commented on July 23, 2024

Neat! Am I to understand that the discussion around arrays (no copies needed) also applies to dataframes?

jjallaire commented on July 23, 2024

No, Pandas data frames created from NumPy arrays automatically make copies of the arrays. So there is "no copy" going from R vector to NumPy array, but there is ultimately a copy made by Pandas.
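The copy on the pandas side can be observed from plain Python. This is a small sketch using the dict-of-arrays constructor with made-up data; other constructors and pandas versions may behave differently, so treat it as an illustration rather than a guarantee.

```python
import numpy as np
import pandas as pd

arr = np.array([1.0, 2.0, 3.0])

# Building a DataFrame from a dict of arrays copies the data into
# pandas-managed storage (dict constructor only; see the caveat above).
df = pd.DataFrame({"x": arr})

# Mutate the original array after the frame has been built.
arr[0] = 99.0

# If pandas had kept a view, the first value would now be 99.0.
print(df["x"].iloc[0])

# Check directly whether the column and the array share a buffer.
print(np.shares_memory(arr, df["x"].to_numpy()))
```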

shearerpmm commented on July 23, 2024

But that Pandas copy is memory-to-memory, so at least the disk is never involved?

ParissaM commented on July 23, 2024

Hello,

In my case I was reading the data from an S3 bucket and had the same issue. What helped was adding the following snippet to the Python function:

for c in df.columns:
    df[c] = np.array(df[c].values)

So the Python function would look like this:

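import awswrangler as wr  # assumed: `wr` refers to awswrangler
import numpy as np

def get_data_from_db(db_name, query):
    # Run the Athena query and load the result into a pandas DataFrame
    df = wr.athena.read_sql_query(
        sql=query,
        database=db_name,
        ctas_approach=False,
    )
    # Re-wrap every column as a plain NumPy array
    for c in df.columns:
        df[c] = np.array(df[c].values)
    return df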
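To see what that loop does to a column without needing AWS access, here is a small self-contained sketch. It assumes (the thread does not say) that the columns came back with pandas extension dtypes such as nullable `Int64`; the frame and column name are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical frame using a pandas nullable extension dtype.
df = pd.DataFrame({"n": pd.array([1, 2, 3], dtype="Int64")})
before = df["n"].dtype

# Same loop as in the function above: re-wrap every column as a NumPy array.
for c in df.columns:
    df[c] = np.array(df[c].values)

after = df["n"].dtype

# The column is now backed by a NumPy dtype instead of an extension dtype;
# which NumPy dtype you get can depend on the pandas version.
print(before, "->", after)
```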

Hope this helps,
Regards
