Coder Social home page Coder Social logo

rbuerki / compare_data_from_the_command_line Goto Github PK

View Code? Open in Web Editor NEW
3.0 2.0 0.0 142 KB

An application to compare data from CSV files. Either to be used form the CLI or as importable library. Handles many edge cases, returns detailed feedback.

License: MIT License

Python 76.84% Jupyter Notebook 23.16%
pandas compare-data comparison csv-files tabular-data

compare_data_from_the_command_line's Introduction

Compare DataFrames

(Side project, Summer 2020, actual version: 0.3.0)

Intro

This application compares tabular data from two files that are loaded into Pandas DataFrames. It is especially helpful when a full comparison with Pandas' built-in df.equals(df) function is not possible or not usefull (e.g. because only subsets of the dataframes should be compared or because column names differ etc.). In such cases the app provides interactive step-by-step preprocessing and tries to handle many edge cases that I have encountered in my daily work.

The goal is to prepare the data such that a final boolean matching using Pandas' df.ne(df) function is as usefull as possible.

Features

Contrary to the project's title, the package can now be used either as:

  1. an importable library, for example to be used within a jupyter notebook
  2. a handy stand-alone CLI tool for super-quick data checks

The accepted input formats are:

  • Two CSV files
  • Two XLSX files
  • Two Pandas DataFrames

(Both have to be the same, you can not use different formats.)

Results are:

  • Standard-out process report and summary with the count of differing values per column (CLI and library versions)
  • Option to save a boolean dataframe to excel, indicating the exact locations of these differing values (CLI and library versions)
  • (Library version only) Return of 3 dataframes: The boolean 'df_diff' and the final states of the two processed input tables

Special features for processing are (same for both versions):

  • Possiblity to define specific load parameters for each file that will be passed to Pandas' read_csv or read_excel functions
  • Possiblity to enforce the same column names if these differ but the width of the 2 dataframes is the same
  • Handling of different shapes by finding matching subsets in the columns / indexes for the comparison
  • As far as possible: Handling of different dtypes as long as they are not of object type

Data prerequisites

  • Index values and index names of the two tables have to be consistent respective to the values they represent. Else a comparison is not possible. (One consequence: If for example new datapoints are added to a table, they have to be assigned to new index values while the existing index values are not allowed to change. Else they can not be compared with an earlier version of the table.)
  • If you pass a specific column name to be used as index (you can do this with the load_params, see example below), make sure it exists in both dataframes (if not, use the load_params to rename the columns). Also make sure the index columns have no duplicate values.

Usage

Command line Version

compare_df [options] [path_1] [path_2]

Available options are:

Prefix Description
-l_1, --load_params_1 Load params for file at path_1
-l_2, --load_params_2 Load params for file at path_2

Note: The optional load params have to be passed as single key-value-pairs in string format, each of them separatly for the respective dataframe. You can pass all the args that are accepted by pandas.read_csv or alternatively pandas.read_excel.

A full example could look as follows:

compare_df "data/file_manual.csv" "data/file_auto.csv" -l_1 "engine"="python" -l_1 "sep"=";" -l_1 "index_col"="customer_ID" -l_2 "encoding"="UTF-8" -l_2 "sep"=";" -l_2 "index_col"="customer_ID"

Library Version

>>> import compare_df
>>> df_diff, df_1, df_2 = compare_df.main(
    'path_1',
    'path_2',
    ['load_params_1'],
    ['load_params_2'],
)

Note: Contrary to the CLI version the optional load params are passed as dicts with key-value-pairs in string format. Again, you can pass all the args that are accepted by pandas.read_csv or alternatively pandas.read_excel.

A full example of calling the main() function could look as follows:

df_diff, df_1, df_2 = compare_df.main(
    "data/file_manual.csv",
    "data/file_auto.csv",
    load_params_1={"engine": "python", "sep": ";", "index_col": "customer_ID"},
    load_params_2={"encoding": "UTF-8", "sep": ";", "index_col": "customer_ID"},
)

Installation

Quick & dirty:

  • Clone or fork this repo to your machine
  • Activate a virtual envirenment of your choice (make sure you have pip installed)
  • Open a terminal, navigate to the repo's top-level folder (where setup.py is located)
  • Run the following command:
pip install .

Dependencies: Python >= 3.6, Pandas, xlsx_writer and openpyxl.

(Note: The last of them is used as default option for reading XLSX files. You could also pass another package in the load_params if desired.)

Aknowledgements / Resources

This project was essentially a little playground for experimenting with test driven development, working with a CLI and making a locally installable package (in development mode). The following resources got me started:

TODO

Possible new features for future versions:

  • Enable proper installation, add build / dist
  • Add a simple GUI (using a separate setuptools entry point)
  • Make an executable with PyInstaller (see here)

compare_data_from_the_command_line's People

Contributors

rbuerki avatar

Stargazers

 avatar  avatar  avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.