Coder Social home page Coder Social logo

vincentlaucsb / pgreaper Goto Github PK

View Code? Open in Web Editor NEW
12.0 3.0 1.0 36.48 MB

A Python library for loading data from various formats into PostgreSQL databases.

License: MIT License

Python 94.93% HTML 2.26% CSS 0.24% C++ 1.33% PLpgSQL 1.23%
sql-table sql-database sqlite3-database convert-data postgresql sql csv-converter html-parser python

pgreaper's Introduction

pgreaper

Build Status Coverage Status

PGReaper is the easy Pythonic way to upload data from CSV, JSON, HTML, and SQLite sources to PostgreSQL databases. Interally, it uses the fast COPY streaming protocol, but wraps it in a way that makes it more flexible, robust and easier to use. Features include:

  • Automatic schema inference
  • Ability to parse and normalize <table>s in HTML
  • Ability to flatten and/or extract nested keys from JSON before copying
  • Ability to copy files in .zip archives without decompressing them
  • Ability to copy over networks

Lastly, but certainly not least, PGReaper ships with its own Table data structure while also being able to copy pandas DataFrames. These allows programmatic creation or updating of SQL tables without the verbosity associated with traditional ORMs. Furthermore, when uploading Python data structures, PGReaper automatically infers the schema, including for JSONB (from dict or list) and timestamp (from datetime) objects.

Benchmarks

Speed is one of pgreaper's main design goals. A list of benchmarks may be found under the benchmarks subdirectory.

Installation

I've been using PGReaper heavily for my own projects, such as Twitter and web scraping, but I have only recently started polishing up the documentation and API for public consumption. PGReaper will be released on PyPI when I feel it is mature enough, but if you would like to use it now, you can clone this repository and run the following command where the files are extracted.

pip install .

Currently, you may need Cython to build this project.

Dependencies

PGReaper requires a minimum of dependencies. Namely, these are Python 3.5+, psycopg2, and obviously PostgreSQL (use 9.3+ for JSON features).

Documentation

Full documentation (or at least that's the goal!) is available here. As a user of software, I frequently get frustrated with inconsistent, inaccurate or unclear documentation. However, as a writer of software I also realize writing documentation isn't always fun and what is obvious to you isn't always obvious to others. If you see something that needs improvement, feel free to submit an issue or pull request.

pgreaper's People

Contributors

vincentlaucsb avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Forkers

zahidsqldba

pgreaper's Issues

To Do/Improvements

Deploy to PyPI

  • Complete bolded tasks below before releasing first version

Features

  • Add ability to use Python's csv sniffer to auto-detect CSV format
  • Add Excel support (Easy)
  • Add JSON support (Hard) (Probably trivial in PostgreSQL)
  • Add ability to connect to PostgresSQL databases not on localhost
  • Add ability to take subsets of columns
  • Add a mutate()-like function for Tables
  • SQLite to PostgresSQL conversion

HTML Parsing

  • Add ability to parse HTML <table>s
  • Write unit tests for basic <table> layouts
  • Write documentation

Jupyter Notebook Integration

  • Add _repr_html_() method for Table

Aesthetics

  • Make Tables repr() prettier and more useful, e.g. showing a summary of number of columns
  • Consistent text width throughout the package

Performance

  • Add feature so that if file is formatted "correctly", use Postgres COPY instead of bulk INSERTs for better performance
  • Look into more database specific optimizations
  • Implement faster column guessing algorithm
  • Start benchmarking SQL uploads
  • Memory profiling for SQL uploads
  • Start benchmarking HTML parsing

User Friendliness

  • Add functions which allow a user to preview a file and how it will look before uploading
  • Add aliases for keyword arguments (e.g. db --> database, user --> username)
    • Add unit tests for aliases
    • Document them
  • If a row has more columns than others and this causes a DataError, report a more specific error message

Code

  • Add tests for N/A value replacement
  • Add tests using real world data sets
  • Add tests for PostgresSQL
  • Add tests for command line interface

Bugs

...

ImportError: No module named settings with Python 2

(sqlify2) dave@asr:/tmp$ python -V
Python 2.7.12
(sqlify2) dave@asr:/tmp$ pip list
Package    Version
---------- -------
click      6.7    
pip        9.0.1  
psycopg2   2.7.1  
setuptools 36.2.0 
sqlify     1.0.0a1
wheel      0.29.0 
(sqlify2) dave@asr:/tmp$ python
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqlify
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dave/.virtualenvs/sqlify2/local/lib/python2.7/site-packages/sqlify/__init__.py", line 1, in <module>
    from sqlify.sqlify import *
  File "/home/dave/.virtualenvs/sqlify2/local/lib/python2.7/site-packages/sqlify/sqlify.py", line 1, in <module>
    from sqlify.settings import *
ImportError: No module named settings
>>>

requests not installed and ModuleNotFoundError error with python 3

(sqlify3) dave@asr:/tmp$ python -V
Python 3.5.2
(sqlify3) dave@asr:/tmp$ pip list
Package    Version
---------- -------
pip        9.0.1  
psycopg2   2.7.1  
setuptools 36.2.0 
sqlify     1.0.0b2
wheel      0.29.0 
(sqlify3) dave@asr:/tmp$ python
Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqlify
/home/dave/.virtualenvs/sqlify3/lib/python3.5/site-packages/sqlify/config.py:108: UserWarning: No default Postgres settings found. Use sqlify.settings(username='', password='', database='', hostname='') to set them.
  warnings.warn("No default Postgres settings found. Use sqlify.settings(username='', password='', database='', hostname='') to set them.")
Traceback (most recent call last):
  File "/home/dave/.virtualenvs/sqlify3/lib/python3.5/site-packages/sqlify/html/parser.py", line 2, in <module>
    import requests
ImportError: No module named 'requests'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dave/.virtualenvs/sqlify3/lib/python3.5/site-packages/sqlify/__init__.py", line 55, in <module>
    from .html import from_file, from_url
  File "/home/dave/.virtualenvs/sqlify3/lib/python3.5/site-packages/sqlify/html/__init__.py", line 1, in <module>
    from .parser import get_tables_from_file as from_file, \
  File "/home/dave/.virtualenvs/sqlify3/lib/python3.5/site-packages/sqlify/html/parser.py", line 4, in <module>
    except ModuleNotFoundError:
NameError: name 'ModuleNotFoundError' is not defined
>>> quit()

To Do/Improvements Pt. 2

Deploy First Non-Beta Version

  • 80%+ code coverage

Features

  • Add ability to use Python's csv sniffer to auto-detect CSV format
  • Add Excel support
  • Add LaTeX output

HTML Parsing

  • Add JSON output

Performance

  • Memory profiling for SQL uploads
  • Start benchmarking HTML parsing

User Friendliness

  • Add functions which allow a user to preview a file and how it will look before uploading
  • Add aliases for keyword arguments (e.g. db --> database, user --> username)
    • Add unit tests for aliases
    • Document them
  • If a row has more columns than others and this causes a DataError, report a more specific error message

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.