Allow more than parquet files in redshift copy from files function about aws-sdk-pandas HOT 3 OPEN

m1hawkgsm commented on July 22, 2024

Allow more than parquet files in redshift copy from files function

from aws-sdk-pandas.

Comments (3)

LeonLuttenberger commented on July 22, 2024

Hey,

One of the issues with formats like CSV is that, unlike Parquet or ORC, they don't store metadata on things such as column types. In order to infer types and transform them to the corresponding Redshift types, we need to load the whole table.

As such, if redshift.copy_from_files supported CSV files, it would be equivalent to just loading the CSV data using s3.read_csv and then invoking redshift.copy with the DataFrame. This also presents a simple workaround for your issue:

import awswrangler as wr

df = wr.s3.read_csv("s3://...", ...)
wr.redshift.copy(df=df, path=temp_path, con=con, table=table_name, schema=schema_name)

Let me know if this helps,
Leon


m1hawkgsm commented on July 22, 2024

Yeah, I was thinking about that, and it makes sense. On the other hand, a Postgres unload (or any other operation that produces CSV data) often yields very large files (> 20 GB). That makes loading them locally infeasible in some cases and inefficient in others (after all, that is the point of large, parallel bulk operations, right?).
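One way to at least keep memory bounded with the read-then-copy workaround is to stream the CSV in chunks instead of loading it whole. Below is a minimal sketch, assuming `wr.s3.read_csv` is called with `chunksize` (so it yields DataFrames one at a time) and `wr.redshift.copy` is called with `mode="append"` for every chunk after the first. The helper `copy_in_chunks` and its `copy_chunk` callback are hypothetical names for illustration, not part of aws-sdk-pandas.

```python
from typing import Any, Callable, Iterable


def copy_in_chunks(
    chunks: Iterable[Any],
    copy_chunk: Callable[[Any, str], None],
    first_mode: str = "overwrite",
) -> int:
    """Pass each chunk to `copy_chunk`, switching to "append" after the first.

    `chunks` is any iterable of DataFrames, for example (assumption) the
    iterator returned by wr.s3.read_csv(path, chunksize=...).
    `copy_chunk(df, mode)` is a hypothetical callback that would wrap
    wr.redshift.copy(df=df, ..., mode=mode).
    Returns the number of chunks that were copied.
    """
    count = 0
    for count, chunk in enumerate(chunks, start=1):
        # Only the first chunk may replace the table; later ones add rows.
        mode = first_mode if count == 1 else "append"
        copy_chunk(chunk, mode)
    return count
```

In practice the callback would be something like `lambda df, mode: wr.redshift.copy(df=df, path=temp_path, con=con, table=table_name, schema=schema_name, mode=mode)`. This still routes every byte through the client, so it addresses the memory problem but not the inefficiency, and column types inferred per chunk may disagree unless dtypes are pinned when reading.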

Would it make sense to allow CSVs so long as you pass in the schema manually? The benefit here is enabling the reuse of how the package does merge/upserts behind the scenes (which I am ending up implementing on my own otherwise).
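To make the "pass in the schema manually" idea concrete, here is a rough sketch of the SQL such a feature would need to issue: a `CREATE TABLE` built from user-supplied column types, followed by a Redshift `COPY ... FORMAT AS CSV` that reads straight from S3. The function `build_csv_copy_sql` and all of its parameters are hypothetical and are not how aws-sdk-pandas implements `copy_from_files`. The statements rely only on standard Redshift `COPY` options (`IAM_ROLE`, `FORMAT AS CSV`, `IGNOREHEADER`).

```python
from typing import Mapping, Tuple


def build_csv_copy_sql(
    table: str,
    schema: str,
    columns: Mapping[str, str],
    s3_path: str,
    iam_role: str,
    ignore_header: int = 1,
) -> Tuple[str, str]:
    """Return (CREATE TABLE, COPY) statements for a CSV load.

    `columns` maps column name -> Redshift type, supplied by the caller
    because CSV files carry no type metadata.
    Note: values are interpolated as-is, so inputs must be trusted.
    """
    col_defs = ", ".join(f'"{name}" {dtype}' for name, dtype in columns.items())
    col_names = ", ".join(f'"{name}"' for name in columns)
    target = f'"{schema}"."{table}"'
    create_sql = f"CREATE TABLE IF NOT EXISTS {target} ({col_defs})"
    copy_sql = (
        f"COPY {target} ({col_names}) "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS CSV IGNOREHEADER {ignore_header}"
    )
    return create_sql, copy_sql
```

Both statements could then be run on an existing Redshift connection, with the load itself executed by the cluster rather than the client. Merge/upsert handling (staging table, delete-then-insert on the primary keys) is not covered by this sketch.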
