zypp-io / df_to_azure

Repository for automatically creating pipelines with a Copy activity from blob to SQL. The main functionality is fast uploading of pandas DataFrames to a SQL database, using Azure Data Factory.

License: Apache License 2.0

Python 100.00%
azure blob-storage bulk-insert dataframe incremental sql sql-table upsert

df_to_azure's People

Contributors

erfannariman, jrnkng, melvinfolkers, timvdheijden


df_to_azure's Issues

ENH: Export to parquet instead of .csv

In df_to_azure/export.py, on line 200, we export the data to a .csv file:

data = self.df.to_csv(index=False, sep="^", quotechar='"', lineterminator="\n")

We should change this to .parquet for optimization. It also requires a change to the Linked Service in adf.py on line 173.

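A minimal sketch of the change, assuming pyarrow (or fastparquet) is installed; the CSV serialization above is replaced with parquet bytes:

import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Current behaviour: serialize the DataFrame to CSV text.
data = df.to_csv(index=False, sep="^", quotechar='"', lineterminator="\n")

# Proposed behaviour: serialize to parquet bytes instead.
buffer = io.BytesIO()
df.to_parquet(buffer, index=False)
data = buffer.getvalue()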

ENH: move auth azure out of module

Currently the database connection is created inside the module, which means users have to define even more environment variables for the module. I think it would be better if we accepted the connection as an argument in the functions that interact with the database, and left that part outside this module.

@melvinfolkers what do you think?
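A hypothetical sketch of what that could look like; the con parameter and the connection string are illustrative, not the package's current signature:

import pandas as pd
from sqlalchemy import create_engine

from df_to_azure import df_to_azure

df = pd.DataFrame({"col_a": [1, 2]})

# The caller builds the connection and passes it in, so the module
# no longer needs its own database environment variables.
engine = create_engine("mssql+pyodbc://user:password@server/database?driver=ODBC+Driver+17+for+SQL+Server")
df_to_azure(df=df, tablename="table_name", schema="schema", method="create", con=engine)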

BUG: invalid characters in resource name

Output from the console:

azure.core.exceptions.HttpResponseError: The specifed resource name contains invalid characters.
RequestId:1fe04a1b-501e-0019-2190-221067000000
Time:2022-02-15T17:20:16.7288425Z
ErrorCode:InvalidResourceName

Output from pip list, for the azure packages:

azure-common           1.1.26
azure-core             1.11.0
azure-identity         1.5.0
azure-keyvault-secrets 4.2.0
azure-mgmt-core        1.2.2
azure-mgmt-datafactory 1.0.0
azure-mgmt-resource    15.0.0
azure-storage-blob     12.7.1

BUG: container name does not exist

Problem: When the container "dftoazure" does not exist, df_to_azure exits with a ContainerNotFound error.

Code from adf.py:

import logging

def create_blob_container():
    blob_service_client = create_blob_service_client()
    try:
        blob_service_client.create_container("dftoazure")
    except:  # bare except: swallows every error, not just "container already exists"
        logging.info("CreateContainerError: Container already exists.")

Suggested solution:

Check in Azure whether the container "dftoazure" is present; if not, create it.

container_name = "dftoazure"
blob_service_client = create_blob_service_client()
containers = [c.get("name") for c in blob_service_client.list_containers()]
if container_name not in containers:
    blob_service_client.create_container(name=container_name)
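An alternative sketch that avoids listing all containers, assuming a recent azure-storage-blob release where ContainerClient.exists() is available:

container_client = blob_service_client.get_container_client("dftoazure")
if not container_client.exists():
    container_client.create_container()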

ENH: add method to upload dataframe as parquet file storage

The main functionality now is to upload a pandas dataframe to a SQL database through Data Factory. In many cases we make use of parquet files instead of a database.

The proposal is to add a method to export a dataframe to parquet and upload it to Azure Storage. When method="append" is used, we need to download the existing parquet file and append the newest data; see the sketch after the example call.

df_to_azure(df=df, tablename="table_name", folder="schema_name", container="containername", method="create", parquet=True)
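A minimal sketch of the proposed method="append" flow for parquet, assuming azure-storage-blob and pyarrow are installed; the function name and parameters are illustrative:

import io

import pandas as pd
from azure.storage.blob import BlobServiceClient

def append_parquet(df_new, conn_str, container, blob_name):
    client = BlobServiceClient.from_connection_string(conn_str)
    blob = client.get_blob_client(container=container, blob=blob_name)

    # Download the current parquet file and read it into a DataFrame.
    existing = pd.read_parquet(io.BytesIO(blob.download_blob().readall()))

    # Append the newest data and write the combined file back.
    combined = pd.concat([existing, df_new], ignore_index=True)
    out = io.BytesIO()
    combined.to_parquet(out, index=False)
    blob.upload_blob(out.getvalue(), overwrite=True)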

ENH: create container

Add functionality for creating a container if it does not exist. The only problem is that you need the account_key for the container.

[FIX] fix relative import

In order to use this repository as a submodule, changes have to be made to the import statements.
Currently scripts get imported as follows:

from src.functions import my_function

In order for it to work in a submodule, the references have to be changed to:

from .functions import my_function

[ENH] create first version

There are already some scripts created in another repository. The plan is to copy these scripts and adjust them here and there.

BUG: add categorical dtype for type conversion

When there's a Categorical dtype column present in the dataframe, the type conversion will error. The categorical dtype needs to be handled in the type conversion dictionary:

type_conversion = {
    dtype("O"): string,
    StringDtype(): string,
    dtype("int64"): Integer(),
    dtype("int32"): Integer(),
    dtype("int16"): Integer(),
    dtype("int8"): Integer(),
    Int64Dtype(): Integer(),
    dtype("float64"): numeric,
    dtype("float32"): numeric,
    dtype("float16"): numeric,
    dtype("<M8[ns]"): DateTime(),
    dtype("bool"): Boolean(),
    BooleanDtype(): Boolean(),
}
col_types = {col_name: type_conversion[col_type] for col_name, col_type in self.df.dtypes.to_dict().items()}
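A minimal sketch of a possible fix. A plain dict entry will not work for categorical columns, because each CategoricalDtype instance hashes on its concrete categories; an isinstance fallback handles them instead (the helper name is illustrative, and string is the same string type used above):

from pandas import CategoricalDtype

def resolve_sql_type(col_type, type_conversion, string_type):
    # CategoricalDtype instances hash on their concrete categories,
    # so they cannot be looked up in the dict directly.
    if isinstance(col_type, CategoricalDtype):
        return string_type
    return type_conversion[col_type]

col_types = {
    col_name: resolve_sql_type(col_type, type_conversion, string)
    for col_name, col_type in self.df.dtypes.to_dict().items()
}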

BUG: Number of characters is different between the Copy action CSV and dataframe

When writing a dataframe to a SQL database, ADF gives an error because the maximum number of characters is exceeded (Operation on target Copy all_boards to SQL failed: Failure happened on 'Sink' side. ErrorCode=SqlBulkCopyInvalidColumnLength, ...). This was the case with a column containing a string version of a list of dicts.
The max characters in Python were 1870, so the schema definition in the database became varchar(1870); in the CSV the max length of the values was 1870 as well, after checking it as a txt file and after loading it again in Python.

However, in ADF this causes an error during the copy action. It does not give the error if we manually set the text_length parameter in df_to_azure() to 1873.
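One way to check the maximum character length of a column on the Python side (the column name is illustrative):

# Longest string in the column as pandas measures it (1870 in this case).
max_len = df["boards"].astype(str).str.len().max()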

Documentation method and id_field

Your documentation states that id_field should only be used with upsert, but the example on your page shows id_field with the create method:

from df_to_azure import df_to_azure

df_to_azure(df=df, tablename="table_name", schema="schema", method="create", id_field="col_a")

BUG: error for wrong method

We have to add an exception for the case where a method is used that is not supported by this package.

For instance, this method does not exist:

method="insert"

Instead of throwing an error, it appends the dataset to the SQL database.
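A minimal sketch of the proposed check, assuming the supported methods are "create", "append" and "upsert":

SUPPORTED_METHODS = {"create", "append", "upsert"}

def check_method(method: str) -> None:
    # Fail fast on unsupported values like method="insert",
    # instead of silently appending to the table.
    if method not in SUPPORTED_METHODS:
        raise ValueError(f"Method {method!r} is not supported, choose one of: {', '.join(sorted(SUPPORTED_METHODS))}")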

ENH: add upsert method for parquet

Now that we have parquet support, we should add logic to do upsert on parquet files.

In general this will be DataFrame.combine_first. We probably need some good testing for this, for example how it deals with NaN and datetime indices:

df1 = pd.DataFrame(
    {
        "id": ["A", "B", "C"],
        "val1": [10, 20, 30],
        "val2": [40, 50, 60]
    }
)

df2 = pd.DataFrame(
    {
        "id": ["A", "B", "C", "D"],
        "val1": [15, 20, 30, 35],
        "val2": [40, 52, 60, 70]
    }
)

print(df1, "\n")
print(df2, "\n")
print(df2.set_index("id").combine_first(df1.set_index("id")).reset_index())

  id  val1  val2
0  A    10    40
1  B    20    50
2  C    30    60 

  id  val1  val2
0  A    15    40
1  B    20    52
2  C    30    60
3  D    35    70 

  id  val1  val2
0  A    15    40
1  B    20    52
2  C    30    60
3  D    35    70
