
wildlifetradenetworks / pycites


A Python package to download and interact with the CITES trade database

License: MIT License

Python 93.37% Makefile 6.63%
python wildlife dataset hacktoberfest cites

pycites's People

Contributors

ltirrell


pycites's Issues

Reorganize to more closely match other data interface packages

citesdb, the package that originally motivated this one, assembles the CITES data, releases it on GitHub, and then provides the means to download it and access its metadata.

Another Python data interface based on an R package, nhanes, includes the data directly in the repo, along with metadata.

The CITES dataset is quite large, so following the approach of citesdb makes sense here as well. Necessary tasks are to:

  • create a makefile that handles the creation and GitHub release of a combined CITES CSV file. pycites users could rebuild their own dataset, but the recommended usage will be to pull the file from releases and load it
  • add metadata access functions
  • include code to access the CITES website directly (like nhanes)
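The "pull the file from releases" step could look roughly like the sketch below. It only builds the download URL, following GitHub's usual `releases/download/<tag>/<file>` layout. The tag `v0.0.0` and the file name `cites_trade.csv.gz` are placeholders made up for illustration, not names this repository is known to use.

```python
from urllib.parse import quote


def release_asset_url(owner: str, repo: str, tag: str, filename: str) -> str:
    """Build the download URL of a file attached to a GitHub release."""
    # Percent-encode each path segment so unusual tags or file names stay valid.
    parts = [quote(p, safe="") for p in (owner, repo, tag, filename)]
    return "https://github.com/{}/{}/releases/download/{}/{}".format(*parts)


# Hypothetical tag and file name, used only for illustration.
url = release_asset_url(
    "wildlifetradenetworks", "pycites", "v0.0.0", "cites_trade.csv.gz"
)
print(url)
```

A URL like this can be handed straight to `pandas.read_csv`, which accepts URLs and by default infers gzip compression from a `.gz` extension.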

Use vaex instead of pandas for dataframes

With 20 million rows, pandas is quite slow at reading in the data and manipulating it. After a quick assessment of modin and vaex, vaex seems like an easy-to-use and fast solution. modin was a bit slow for my use case. dask is another option, but based on benchmarks posted online, it also seems unlikely to give much of a speedup over raw pandas (though lazy evaluation would probably lead to less swapping).

Infrastructure improvements

  • reorganize to separate downloading/assembling the raw data (from CITES) from loading the assembled data (stored as a compressed CSV on GitHub)
  • add additional columns to supplement missing data
    • use the ITIS taxonomy database to get Class, Family and Order classifications based on the Taxon column
  • add a makefile to build and assemble the data from CITES, as well as to create a taxonomically supplemented dataset
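As a rough illustration of the taxonomic supplementation, the sketch below fills empty Class, Order and Family fields from a lookup keyed on Taxon. The `TAXONOMY` dict is a small hand-written stand-in for whatever an ITIS export would provide, `supplement_row` is a hypothetical helper, and the sample rows are made up. None of this reflects how pycites actually does it.

```python
import csv
import io

# Hand-written stand-in for an ITIS lookup: scientific name -> higher ranks.
TAXONOMY = {
    "Loxodonta africana": {
        "Class": "Mammalia", "Order": "Proboscidea", "Family": "Elephantidae",
    },
    "Panthera leo": {
        "Class": "Mammalia", "Order": "Carnivora", "Family": "Felidae",
    },
}
RANKS = ("Class", "Order", "Family")


def supplement_row(row, taxonomy=TAXONOMY):
    """Return a copy of row with empty Class/Order/Family filled from taxonomy."""
    filled = dict(row)
    known = taxonomy.get(row.get("Taxon", ""), {})
    for rank in RANKS:
        if not filled.get(rank):  # only fill gaps, never overwrite existing values
            filled[rank] = known.get(rank, "")
    return filled


# Made-up sample rows with missing classifications.
sample = io.StringIO(
    "Taxon,Class,Order,Family\n"
    "Loxodonta africana,,,\n"
    "Panthera leo,Mammalia,,\n"
    "Unknown taxon,,,\n"
)
rows = [supplement_row(r) for r in csv.DictReader(sample)]
for r in rows:
    print(r)
```

Taxa missing from the lookup are left blank rather than guessed, so they can be reviewed separately.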

out of memory

Unfortunately the program exits with an out of memory error:
pandas.errors.ParserError: Error tokenizing data. C error: out of memory
My machine has 16 GB of RAM, which I hoped would be sufficient for the script to run.
Is there anything I can do to make it run? I am not as proficient in Python as you.
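One possible workaround, if only a summary or a subset of the data is needed, is to stream the file row by row instead of parsing it all in one call. Below is a minimal sketch using only the standard library. `count_by_column` is a hypothetical helper, and the sample rows are made up to stand in for the real CSV.

```python
import csv
import io
from collections import Counter


def count_by_column(lines, column):
    """Count rows per value of column, holding only one row in memory at a time."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row[column]] += 1
    return counts


# Tiny in-memory stand-in for the real CSV; the rows are made up.
sample = io.StringIO(
    "Year,Taxon\n"
    "2015,Panthera leo\n"
    "2016,Panthera leo\n"
    "2016,Loxodonta africana\n"
)
counts = count_by_column(sample, "Taxon")
print(counts.most_common())
```

For a file on disk, pass an open file object instead, e.g. `with open(path, newline="") as f: count_by_column(f, "Taxon")`. If a DataFrame is still wanted, `pandas.read_csv` also takes a `chunksize` argument that yields the data in pieces, and `usecols` to load only selected columns.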
