danielavdar / pandas-pyarrow

Seamlessly switch Pandas DataFrame backend to PyArrow.

License: MIT License

Python 100.00%
arrow backend dtypes pandas pandas-dataframe pandas-tricks-for-data-manipulation pyarrow python db-dtypes pandas-pyarrow pandas-arrow

pandas-pyarrow's Introduction

pandas-pyarrow

pandas-pyarrow simplifies converting a pandas DataFrame's backend to PyArrow, so you can switch to the PyArrow backend seamlessly.

Get started:

Installation

To install the package, use pip:

pip install pandas-pyarrow

Usage

import pandas as pd

from pandas_pyarrow import convert_to_pyarrow

# Create a pandas DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c'],
    'C': [1.1, 2.2, 3.3],
    'D': [True, False, True]
})

# Convert the pandas DataFrame dtypes to arrow dtypes
adf: pd.DataFrame = convert_to_pyarrow(df)

print(adf.dtypes)

outputs:

A     int64[pyarrow]
B    string[pyarrow]
C    double[pyarrow]
D      bool[pyarrow]
dtype: object

Furthermore, it's possible to add mappings or override existing ones:

import pandas as pd

from pandas_pyarrow import PandasArrowConverter

# Create a pandas DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c'],
    'C': [1.1, 2.2, 3.3],
    'D': [True, False, True]
})

# Instantiate a PandasArrowConverter object
pandas_pyarrow_converter = PandasArrowConverter(
    custom_mapper={'int64': 'int32[pyarrow]', 'float64': 'float32[pyarrow]'})

# Convert the pandas DataFrame dtypes to arrow dtypes
adf: pd.DataFrame = pandas_pyarrow_converter(df)

print(adf.dtypes)

outputs:

A     int32[pyarrow]
B    string[pyarrow]
C     float[pyarrow]
D      bool[pyarrow]
dtype: object
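The override behaviour shown above can be pictured as a dictionary merge in which entries from custom_mapper take precedence over the built-in ones. The following is a minimal sketch under that assumption, not the library's actual implementation: DEFAULT_MAPPER and build_mapper are hypothetical names, and the real default table is likely larger and may use different keys.

```python
from typing import Dict, Optional

# Hypothetical default mapping, for illustration only.
DEFAULT_MAPPER: Dict[str, str] = {
    "int64": "int64[pyarrow]",
    "float64": "double[pyarrow]",
    "object": "string[pyarrow]",
    "bool": "bool[pyarrow]",
}


def build_mapper(custom_mapper: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Merge the defaults with user overrides; custom entries win on key clashes."""
    merged = dict(DEFAULT_MAPPER)      # copy so the defaults are never mutated
    merged.update(custom_mapper or {})
    return merged


mapper = build_mapper({"int64": "int32[pyarrow]", "float64": "float32[pyarrow]"})
print(mapper["int64"])   # overridden entry
print(mapper["object"])  # default entry, left untouched
```

Copying the defaults before updating keeps one converter's overrides from leaking into another.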

pandas-pyarrow also supports db-dtypes, which are used by the BigQuery Python SDK:

pip install pandas-gbq

or

pip install pandas-pyarrow[bigquery]

import pandas_gbq as gbq

from pandas_pyarrow import PandasArrowConverter

# Specify the public project and table you want to query
project_id = "bigquery-public-data"
table_name = "austin_311.311_service_requests"

# Construct the query string
query = f"""
    SELECT * FROM `{project_id}.{table_name}` LIMIT 1000
"""

# Use pandas_gbq to read the data from BigQuery
df = gbq.read_gbq(query)

# Convert the pandas DataFrame dtypes to arrow dtypes
pandas_pyarrow_converter = PandasArrowConverter()
adf = pandas_pyarrow_converter(df)

# Print the dtypes before and after conversion
print(df.dtypes)
print(adf.dtypes)

outputs:

unique_key                               object
complaint_description                    object
source                                   object
status                                   object
status_change_date          datetime64[us, UTC]
created_date                datetime64[us, UTC]
last_update_date            datetime64[us, UTC]
close_date                  datetime64[us, UTC]
incident_address                         object
street_number                            object
street_name                              object
city                                     object
incident_zip                              Int64
county                                   object
state_plane_x_coordinate                 object
state_plane_y_coordinate                float64
latitude                                float64
longitude                               float64
location                                 object
council_district_code                     Int64
map_page                                 object
map_tile                                 object
dtype: object
unique_key                         string[pyarrow]
complaint_description              string[pyarrow]
source                             string[pyarrow]
status                             string[pyarrow]
status_change_date          timestamp[us][pyarrow]
created_date                timestamp[us][pyarrow]
last_update_date            timestamp[us][pyarrow]
close_date                  timestamp[us][pyarrow]
incident_address                   string[pyarrow]
street_number                      string[pyarrow]
street_name                        string[pyarrow]
city                               string[pyarrow]
incident_zip                        int64[pyarrow]
county                             string[pyarrow]
state_plane_x_coordinate           string[pyarrow]
state_plane_y_coordinate           double[pyarrow]
latitude                           double[pyarrow]
longitude                          double[pyarrow]
location                           string[pyarrow]
council_district_code               int64[pyarrow]
map_page                           string[pyarrow]
map_tile                           string[pyarrow]
dtype: object

Purposes

  • Simplify conversion between the pandas PyArrow and NumPy backends.
  • Allow a seamless switch to the PyArrow pandas backend, even for problematic dtypes such as float16 or db-dtypes.
  • Standardize dtypes for the db-dtypes used by the BigQuery Python SDK.

example:

import pandas as pd

# Create a pandas DataFrame
df = pd.DataFrame({
    'C': [1.1, 2.2, 3.3],
}, dtype='float16')

df.convert_dtypes(dtype_backend='pyarrow')

With plain pandas, this raises an error:

pyarrow.lib.ArrowNotImplementedError: Unsupported cast from halffloat to double using function cast_double

With pandas-pyarrow, the same conversion succeeds:

import pandas as pd

from pandas_pyarrow import convert_to_pyarrow

# Create a pandas DataFrame
df = pd.DataFrame({
    'C': [1.1, 2.2, 3.3],
}, dtype='float16')
adf = convert_to_pyarrow(df)
print(adf.dtypes)

outputs:

C    halffloat[pyarrow]
dtype: object

Additional Information

When converting from higher-precision numerical dtypes (like float64) to lower-precision ones (like float32), some precision may be lost.
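To get a feel for the size of that loss, the sketch below round-trips a Python float (which is a float64) through the IEEE 754 float32 and float16 encodings. It uses only the standard library struct module rather than pandas or pyarrow, so it illustrates the rounding itself and not the library's conversion path; the helper names are made up for this example.

```python
import struct


def round_trip_float32(value: float) -> float:
    """Store a float64 value as float32 and read it back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def round_trip_float16(value: float) -> float:
    """Store a float64 value as float16 (half precision) and read it back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


original = 1.1
as_f32 = round_trip_float32(original)
as_f16 = round_trip_float16(original)

# Show each stored value and its absolute error relative to the float64 input.
print("float64:", repr(original))
print("float32:", repr(as_f32), "error:", abs(original - as_f32))
print("float16:", repr(as_f16), "error:", abs(original - as_f16))
```

Values that are exactly representable in the narrower type, such as 0.5, survive the round trip unchanged; most decimal fractions do not.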

pandas-pyarrow's People

Contributors

danielavdar, dependabot[bot]

pandas-pyarrow's Issues

pyarrow dtype mapped to string[pyarrow].

Describe the bug
A clear and concise description of what the bug is.

To Reproduce

@Parametrization.case(
    name="float32 case",
    df_data=pd.DataFrame({"test_column": [1.0, 2.0, 3.0, None]}, dtype="float32[pyarrow]"),
    expected_dtype="float32[pyarrow]",
)

produces dtype "string[pyarrow]".

Expected behavior
expected_dtype="float32[pyarrow]"

system
All.
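One way to picture the expected behavior is a lookup that leaves dtypes that are already arrow-backed untouched. The sketch below is purely illustrative and is not the project's fix: target_dtype, the "[pyarrow]" suffix check, and the fallback to string[pyarrow] for unknown names are all assumptions made for this example.

```python
from typing import Dict

PYARROW_SUFFIX = "[pyarrow]"


def target_dtype(
    dtype_name: str,
    mapper: Dict[str, str],
    fallback: str = "string[pyarrow]",  # assumed fallback, not confirmed by the source
) -> str:
    """Return the arrow dtype name for a column dtype name."""
    # Already arrow-backed: pass through instead of falling back to string.
    if dtype_name.endswith(PYARROW_SUFFIX):
        return dtype_name
    return mapper.get(dtype_name, fallback)


mapper = {"float64": "double[pyarrow]"}  # hypothetical mapping for the demo
print(target_dtype("float32[pyarrow]", mapper))
print(target_dtype("float64", mapper))
```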

Unintended Installation of pandas-gbq Package as Dependency

Describe the bug
A clear and concise description of what the bug is.

Expected behavior
The pandas-gbq package should not be included in the regular dependencies list and should only be installed upon explicit request by the user.

system
all
