wildlifetradenetworks / pycites

A Python package to download and interact with the CITES trade database

License: MIT License

Python 93.37% Makefile 6.63%

Topics: python, wildlife, dataset, hacktoberfest, cites

pycites's Issues

Reorganize to more closely match other data interface packages

citesdb, the package that originally motivated this one, assembles the CITES data, releases it on GitHub, and provides functions to download it and access its metadata.

nhanes, another Python data interface based on an R package, includes the data directly in the repo, along with metadata info.

CITES is quite large, so following the approach of citesdb makes sense here as well. Necessary tasks are to:

- Create a makefile that handles the creation and GitHub release of a combined CITES CSV file. pycites users could rebuild their own dataset, but the recommended usage will be to pull the file from releases and load it.
- Add metadata access functions.
- Include code to access the CITES website directly (like nhanes).
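
The recommended usage described above (pull the released file, then load it) could look roughly like the following. This is only a sketch: `RELEASE_URL`, the `fetch_dataset` and `column_names` helpers, and the `pycites_cache` directory are hypothetical names, not part of pycites, and the released file is assumed to be a gzip-compressed CSV.

```python
import csv
import gzip
import shutil
import urllib.request
from pathlib import Path

# Hypothetical release asset URL; the real location is not given in this issue.
RELEASE_URL = "https://example.org/pycites/cites_trade.csv.gz"


def fetch_dataset(url=RELEASE_URL, cache_dir="pycites_cache"):
    """Download the combined CSV once and return the cached local path."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / url.rsplit("/", 1)[-1]
    if not target.exists():
        # Copy the response to disk in blocks rather than reading it all at once.
        with urllib.request.urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
    return target


def column_names(path):
    """Metadata helper: read only the header row of the gzipped CSV."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle))
```

Keeping the download step separate from the loading step means users who rebuild their own dataset can point the loader at a local file instead.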

Use vaex instead of pandas for dataframes

With 20 million rows, pandas is quite slow at reading in the data and manipulating it. After a quick assessment of modin and vaex, vaex seems like an easy-to-use and fast solution; modin was a bit slow for my use case. dask is another option, but based on benchmarks posted online, it does not seem likely to give much of a speedup over raw pandas (though lazy evaluation would probably lead to less swapping).
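
A comparison like this can be repeated with a small timing harness. The sketch below is an assumption about how one might do it, not the benchmark actually used here; `time_loaders` is a hypothetical helper that takes a mapping of names to callables, each of which accepts a file path.

```python
import time


def time_loaders(loaders, path, repeats=3):
    """Return the best wall-clock time in seconds for each named loader."""
    results = {}
    for name, load in loaders.items():
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            load(path)
            timings.append(time.perf_counter() - start)
        # The minimum is less sensitive to background noise than the mean.
        results[name] = min(timings)
    return results
```

Because some libraries evaluate lazily, each callable should include a representative operation (for example a filter and a count) rather than the load alone, otherwise the lazy library will look faster than it is in practice.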

Infrastructure improvements

- Reorganize to separate downloading and assembling the raw data (from CITES) from loading the assembled data (stored as a compressed CSV on GitHub).
- Add columns to supplement missing data:
  - Use the ITIS taxonomy database to get Class, Family, and Order classifications based on the Taxon column.
- Add a makefile to build and assemble data from CITES, and to create a taxonomically supplemented dataset.
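
The taxonomic supplement could be sketched as a lookup keyed on the Taxon column. Everything here is illustrative: `supplement_taxonomy` is a hypothetical helper, and the two-entry `TAXONOMY` table stands in for a lookup that would in practice be built from ITIS. Rows are assumed to be dictionaries, as produced by `csv.DictReader`.

```python
# Illustrative lookup keyed on Taxon; in practice this would be built from ITIS.
TAXONOMY = {
    "Loxodonta africana": {"Class": "Mammalia", "Order": "Proboscidea", "Family": "Elephantidae"},
    "Panthera leo": {"Class": "Mammalia", "Order": "Carnivora", "Family": "Felidae"},
}

RANKS = ("Class", "Order", "Family")


def supplement_taxonomy(rows, taxonomy=TAXONOMY):
    """Yield copies of rows with blank Class/Order/Family filled from the lookup."""
    for row in rows:
        filled = dict(row)  # copy so the input rows are left untouched
        known = taxonomy.get(filled.get("Taxon", ""), {})
        for rank in RANKS:
            # Only fill ranks that are missing or empty; keep existing values.
            if not filled.get(rank):
                filled[rank] = known.get(rank, "")
        yield filled
```

Taxa that are absent from the lookup keep empty ranks, so unmatched names remain easy to find afterwards.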

out of memory

Unfortunately, the program exits with an out-of-memory error:
pandas.errors.ParserError: Error tokenizing data. C error: out of memory
My machine has 16 GB of RAM, which I hoped would be sufficient for the script to run.
Is there anything I can do to make it run? I am not as proficient in Python as you.
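
One possible workaround, offered as a sketch rather than as how pycites itself works, is to avoid parsing the whole file at once and stream it instead. pandas supports this through the `chunksize` argument of `read_csv`, which returns an iterator of smaller DataFrames. The dependency-free version below does the same thing row by row with the standard library; `count_by_column` is a hypothetical helper, and the file is assumed to be an uncompressed CSV with a header row.

```python
import csv
from collections import Counter


def count_by_column(path, column, encoding="utf-8"):
    """Tally the values of one column while reading a single row at a time."""
    counts = Counter()
    with open(path, newline="", encoding=encoding) as handle:
        for row in csv.DictReader(handle):
            counts[row[column]] += 1
    return counts
```

Memory use then depends on the number of distinct values in the chosen column, not on the number of rows in the file.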
