RcppPyArrow

This project helps Python users and R users exchange data by standardizing on the C++ Arrow library.

Python developers building engineering systems may need access to R and its large collection of libraries. rpy2 makes this possible by embedding an R process inside the Python process, so Python code can invoke R functions and pass Python objects into them. However, passing large datasets into the R process can incur significant overhead.

Arrow specifies a language-agnostic columnar memory format for data, and its core is written in C++. Its Table class is the closest analogue to a dataframe. PyArrow is a Python library that integrates with Arrow and exposes a PyArrow Table type, which can wrap memory that was allocated by the C++ library. R can also wrap memory that was allocated in C++, through Rcpp. Given a pointer to an Arrow Table object, R can therefore construct a dataframe.
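To make "columnar" concrete, here is a minimal standard-library sketch that stores each column as its own contiguous, typed buffer. It is only an illustration of the layout idea and does not use Arrow; the real Arrow format has more structure than this (validity bitmaps for nulls, for example). The column names and values are made up for the example.

```python
from array import array

# Row-oriented records, as they might arrive from an application.
rows = [(1, 0.5), (2, 1.5), (3, 2.5)]

# Columnar layout: one contiguous, typed buffer per column.
# "q" is a signed 64-bit integer, "d" is a 64-bit float.
columns = {
    "id": array("q", (row[0] for row in rows)),
    "score": array("d", (row[1] for row in rows)),
}

# A whole column can be scanned without touching the other columns.
total_score = sum(columns["score"])
```

Because each column is a single block of memory with a fixed element type, a consumer written in another language only needs the block's address, length, and type to read it.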

Arrow provides a convenient mechanism to exchange data between Python and R without writing anything to disk and without copying any memory. Python developers who need to pass data through rpy2 more efficiently can create a PyArrow Table object, then pass the address of the underlying Arrow Table object to R. R can receive the pointer and instantiate a data frame from it using RcppPyArrow::RcppReceiveArrowTableFromPython. The transfer is efficient because both Python and R reuse the memory allocated by Arrow, so data moves from Python to R without serialization and without copying.
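The idea of handing over an address instead of the data can be sketched with the standard library alone. The example below is an analogy, not RcppPyArrow or Arrow code: an array.array stands in for a buffer owned by a producer, and ctypes stands in for a consumer that is given only the address and length.

```python
import array
import ctypes

# Producer side: a contiguous buffer of float64 values. This stands in
# for memory owned by Arrow; it is NOT an Arrow Table.
column = array.array("d", [1.0, 2.0, 3.0])
address, length = column.buffer_info()

# Consumer side: given only the address and element count, build a view
# over the same memory. No values are copied or serialized.
view = (ctypes.c_double * length).from_address(address)

# A write through the view is visible to the producer, which shows that
# both names refer to the same underlying memory.
view[0] = 10.0
shared = column[0] == 10.0
values = list(view)
```

The same property is what makes the pointer handoff cheap: the cost does not grow with the size of the data. It also means the producer must keep the buffer alive for as long as the consumer uses it.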

Installation

Your development environment must have access to libarrow.so and libarrow_python.so. To get these dependencies, build the Arrow C++ project from source, making sure to pass the flag -DARROW_PYTHON=ON when running cmake.

You will also need the Python headers. On Ubuntu, install them with sudo apt-get install python-dev.

RcppPyArrow uses a configure script to help compile and link the Rcpp code with libarrow.so and libarrow_python.so. The configure script needs access to four directories: the location of the Arrow headers, the location of libarrow.so and libarrow_python.so, the location of the Python headers, and the location of libpython2.7.so. These are passed to configure through the variables ARROW_INCLUDE_DIR, ARROW_LIB_DIR, PYTHON_INCLUDE_DIR, and PYTHON_LIB_DIR. On Ubuntu 16.04 these directories might be:

  • ARROW_INCLUDE_DIR=/usr/local/include
  • ARROW_LIB_DIR=/usr/local/lib
  • PYTHON_INCLUDE_DIR=/usr/include/python2.7
  • PYTHON_LIB_DIR=/usr/lib/x86_64-linux-gnu/
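Before running the installer, it can help to confirm that a candidate library directory really contains the shared libraries named above. The helper below is a hypothetical convenience, not part of RcppPyArrow; only the two file names come from this README. It runs against a throwaway directory so the sketch is self-contained.

```python
from pathlib import Path
import tempfile

# File names taken from the text above; the helper itself is hypothetical.
REQUIRED_LIBS = ("libarrow.so", "libarrow_python.so")


def missing_libraries(lib_dir, required=REQUIRED_LIBS):
    """Return the required library file names not found in lib_dir."""
    directory = Path(lib_dir)
    return [name for name in required if not (directory / name).is_file()]


# Demonstration against a temporary directory holding only one library.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "libarrow.so").touch()
    demo_missing = missing_libraries(tmp)
```

On a real machine you would call missing_libraries with the directory you intend to use for ARROW_LIB_DIR; an empty list means both files were found.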

If that is the case, then RcppPyArrow can be installed with the command

  R CMD INSTALL ./ --configure-vars='ARROW_INCLUDE_DIR=/usr/local/include ARROW_LIB_DIR=/usr/local/lib/ PYTHON_INCLUDE_DIR=/usr/include/python2.7 PYTHON_LIB_DIR=/usr/lib/x86_64-linux-gnu/'
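If the four directories differ between machines, the install command can be assembled programmatically rather than edited by hand. This is a minimal sketch: build_install_command is a hypothetical helper, the paths are the Ubuntu 16.04 examples from this README, and shlex.join requires Python 3.8 or later.

```python
import shlex


def build_install_command(configure_vars, package_dir="./"):
    """Assemble the R CMD INSTALL argument list for the given variables."""
    joined = " ".join(f"{key}={value}" for key, value in configure_vars.items())
    return ["R", "CMD", "INSTALL", package_dir, f"--configure-vars={joined}"]


# Example values only; adjust them to match your own system.
example_vars = {
    "ARROW_INCLUDE_DIR": "/usr/local/include",
    "ARROW_LIB_DIR": "/usr/local/lib",
    "PYTHON_INCLUDE_DIR": "/usr/include/python2.7",
    "PYTHON_LIB_DIR": "/usr/lib/x86_64-linux-gnu/",
}

command = build_install_command(example_vars)

# A shell-safe, copy-pasteable rendering of the same command.
printable = shlex.join(command)
```

The argument list can be passed to subprocess.run(command, check=True) from the package's source directory. Keeping it as a list avoids shell quoting problems with the space-separated variables.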

Usage

Python developers can make a PyArrow Table from parquet files, Arrow files, or Pandas dataframes. rpy2.rinterface.SexpExtPtr then provides an external pointer to the PyArrow Table object. This pointer can be passed to RcppPyArrow::RcppReceiveArrowTableFromPython, which unwraps the underlying Arrow Table object and converts it to an R tibble.

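import pandas as pd
import pyarrow as pa
import rpy2.rinterface as rinterface
import rpy2.robjects as robjects

# Build a PyArrow Table from a small Pandas dataframe.
df = pd.DataFrame({"a": [1, 2, 3]})
table = pa.Table.from_pandas(df)

# Start the embedded R process.
rinterface.initr()

# Define an R function that converts the received pointer to a tibble,
# then prints its dimensions and first rows.
func = robjects.r(
    """
    f <- function(inputs) {
      require(RcppPyArrow)
      require(arrow)
      df <- RcppReceiveArrowTableFromPython(inputs)
      print(dim(df))
      print(head(df))
    }
    """
)

# Wrap the PyArrow Table in an R external pointer and pass it to R.
param = rinterface.SexpExtPtr(table)
response = func(param)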

Contributors

jeffwong-nflx
