omardbaa / data-splitter

Data-Splitter is a Python script designed to split a large CSV file containing data into three different formats: JSON, a database table, and another CSV file. The script ensures a random distribution of data across the three output formats based on custom-defined ratios.

License: MIT License

Jupyter Notebook 100.00%
csv-processing csv-to-json data-analysis data-conversion data-engineering data-integration data-manipulation data-splitting data-transformation database-connection

Data Splitting Project

Introduction

This project is a Python script that splits a CSV file into three destinations: a JSON file, a database table, and another CSV file. You can customize the percentage of rows sent to each destination. The primary goal is to help users efficiently split large datasets into various formats for different purposes.

Features

  • Randomly shuffles the data for even distribution among the output files.
  • Supports custom distribution percentages for JSON, database, and CSV.
  • Saves the data into a JSON file, a CSV file, and a specified database table.
  • Displays statistics after data insertion, such as the number of rows in each output file and the percentage of data retained.
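
The shuffle-and-split behaviour listed above can be sketched roughly as follows. This is not the script's actual code, only an illustration under assumptions: the function names `split_by_distribution` and `summarize`, the `seed` parameter, and the rounding rule for the cut points are all hypothetical.

```python
import numpy as np
import pandas as pd


def split_by_distribution(df, distribution, seed=None):
    """Shuffle df, then cut it into (json, db, csv) parts by percentage.

    Hypothetical helper: `distribution` holds three percentages that
    sum to at most 100, in the order JSON, database, CSV.
    """
    if len(distribution) != 3 or sum(distribution) > 100:
        raise ValueError("expected three percentages summing to at most 100")

    # Shuffle the rows so that each part is a random sample.
    rng = np.random.default_rng(seed)
    shuffled = df.iloc[rng.permutation(len(df))].reset_index(drop=True)

    # Turn cumulative percentages into row boundaries.
    n = len(shuffled)
    json_end = int(round(n * distribution[0] / 100))
    db_end = int(round(n * (distribution[0] + distribution[1]) / 100))
    csv_end = int(round(n * sum(distribution) / 100))

    return (
        shuffled.iloc[:json_end],
        shuffled.iloc[json_end:db_end],
        shuffled.iloc[db_end:csv_end],
    )


def summarize(original, parts):
    """Return row counts per part and the percentage of rows retained."""
    kept = sum(len(part) for part in parts)
    retained = 100.0 * kept / len(original) if len(original) else 0.0
    return {"rows": [len(part) for part in parts], "retained_pct": retained}


# Small demonstration with made-up data.
demo = pd.DataFrame({"id": range(10)})
json_part, db_part, csv_part = split_by_distribution(demo, (30, 30, 40), seed=0)
stats = summarize(demo, (json_part, db_part, csv_part))
```

Computing the boundaries from cumulative percentages, rather than rounding each part separately, keeps the parts contiguous and non-overlapping even when the row count is not a multiple of 100.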

Requirements

  • Python 3.x
  • pandas
  • numpy
  • sqlalchemy

How to Use

  1. Ensure you have Python 3.x installed.

  2. Install the required libraries by running the following command: pip install pandas numpy sqlalchemy

  3. Prepare your CSV file and place it in the same directory as the Python script.

  4. Update the script variables according to your configuration:

  • csv_file_path: The filename of the CSV file to be split.
  • json_file_path: The filename for the JSON output file.
  • db_server: The server name or address of your SQL Server.
  • db_name: The name of the database where the table will be created.
  • db_driver: The ODBC driver for the SQL Server.
  • db_table: The name of the table in the database.
  • trusted_connection: Set to "yes" if using trusted connection; otherwise, set to "no".
  • distribution: A tuple representing the distribution percentages for JSON, database, and CSV, respectively.
  • header: Set to True if the CSV file has a header; otherwise, set to False.

  5. Run the Python script: python script_name.py
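
As a rough illustration of step 4, the variables might be filled in as shown here. Every value is a placeholder, and `build_connection_url` is a hypothetical helper, not part of the script. It assumes the script connects to SQL Server through SQLAlchemy's `mssql+pyodbc` dialect with a pass-through ODBC connection string, which the README does not confirm.

```python
from urllib.parse import quote_plus

# Placeholder configuration; replace each value with your own.
csv_file_path = "cleaned_repositoriesV2.csv"
json_file_path = "output.json"                 # hypothetical filename
db_server = "localhost"                        # hypothetical server
db_name = "my_database"                        # hypothetical database
db_driver = "ODBC Driver 17 for SQL Server"    # assumed installed driver
db_table = "repositories"                      # hypothetical table name
trusted_connection = "yes"
distribution = (30, 30, 40)                    # JSON, database, CSV
header = True


def build_connection_url(server, database, driver, trusted):
    """Build a SQLAlchemy URL that passes a raw ODBC string to pyodbc."""
    odbc = (
        f"DRIVER={{{driver}}};"
        f"SERVER={server};"
        f"DATABASE={database};"
        f"Trusted_Connection={trusted};"
    )
    # The ODBC string must be URL-encoded before it goes into the URL.
    return "mssql+pyodbc:///?odbc_connect=" + quote_plus(odbc)


url = build_connection_url(db_server, db_name, db_driver, trusted_connection)
```

The resulting `url` could then be passed to `sqlalchemy.create_engine(url)`, which additionally requires the `pyodbc` package and the named ODBC driver to be installed.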

Example

Suppose you have a large CSV file named cleaned_repositoriesV2.csv that contains data on repositories. You can use this script to write part of the data to a JSON file, part to a CSV file (without a header), and part to a database table.
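
A minimal sketch of the writing step for such an example is shown here. It makes several assumptions: `write_outputs` is a hypothetical helper, the sample rows are invented, and an in-memory SQLite database stands in for SQL Server so that the snippet runs without a server.

```python
import os
import sqlite3
import tempfile

import pandas as pd


def write_outputs(json_part, db_part, csv_part, json_path, csv_path, connection, table):
    """Write the three parts to a JSON file, a database table, and a CSV file."""
    # One JSON object per row.
    json_part.to_json(json_path, orient="records")
    # Create (or replace) the table and insert the rows.
    db_part.to_sql(table, connection, if_exists="replace", index=False)
    # CSV output without a header row, as in the example above.
    csv_part.to_csv(csv_path, index=False, header=False)


# Invented sample data standing in for cleaned_repositoriesV2.csv.
df = pd.DataFrame({"name": ["a", "b", "c", "d"], "stars": [1, 2, 3, 4]})

out_dir = tempfile.mkdtemp()
json_path = os.path.join(out_dir, "part.json")
csv_path = os.path.join(out_dir, "part.csv")

# SQLite is only a stand-in here; the script itself targets SQL Server.
conn = sqlite3.connect(":memory:")
write_outputs(df.iloc[:1], df.iloc[1:2], df.iloc[2:], json_path, csv_path, conn, "repositories")
db_rows = conn.execute("SELECT COUNT(*) FROM repositories").fetchone()[0]
```

With a real SQL Server, a SQLAlchemy engine would take the place of `conn` in the `to_sql` call.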

Notes

  • Make sure the CSV file has a header in the first row.
  • Before running the script, ensure that your SQL Server is running and accessible from your Python environment.
