Search Online Collections: Abstracting away datasets for data science and machine learning.
The goal of this project is to make it easy to create a dataset and plug it into another Python package such as Keras, without having to worry about the sticky parts. In most data science work, getting the data into the right format is 80% of the effort. PySOC aims to make grabbing a dataset easy. For example, the AskReddit module can scrape data from the AskReddit subreddit through the Reddit API and store it in a serialized format, without your having to download any data files yourself. The stored data can then be loaded, iterated through, or otherwise manipulated using helper abstraction methods.
Each dataset is stored as a module. Data is cached in a user-specified folder, and for datasets that are scraped from public sources, like the AskReddit dataset, a command-line interface is provided for updating the cached data. The modules provide helper methods for interacting with different data types, like text.
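The caching behavior described above can be illustrated with a small sketch. This is not PySOC's actual API: the function `load_or_build`, the `.pkl` file naming, and the stand-in `fake_scrape` builder are all hypothetical, and `pickle` is assumed as the serialization format purely for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(cache_dir, name, build_fn):
    """Return the cached dataset if it exists, else build it and cache it.

    Hypothetical helper: not part of PySOC.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{name}.pkl"

    # Reuse the serialized copy when one is already in the cache folder.
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)

    # Otherwise build the dataset (e.g. by scraping) and serialize it.
    data = build_fn()
    with path.open("wb") as f:
        pickle.dump(data, f)
    return data


def fake_scrape():
    # Stand-in for a real scraper; returns (question, answer) pairs.
    return [("What is your favorite book?", "Dune."),
            ("Cats or dogs?", "Both.")]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        pairs = load_or_build(folder, "ask_reddit_demo", fake_scrape)
        for question, answer in pairs:
            print(question, "->", answer)
```

Calling `load_or_build` a second time with the same folder and name reads the pickled file instead of invoking the builder again, which is the point of keeping the cache in a user-specified folder.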
Pip can be used to install the package as follows:
pip install git+https://github.com/codekansas/soc
The modules come with command-line tools for downloading data.
>>> pysoc --help
Usage: pysoc <command>
SOC: Data management system.
Options:
-h, --help Show this message and exit.
Commands:
ask_reddit AskReddit command-line interface.
mnist MNIST command-line interface.
nietzsche Nietzsche command-line interface.
>>> pysoc ask_reddit --help
Usage: pysoc ask_reddit download [OPTIONS]
Options:
--fname TEXT
--num_results INTEGER
--override TEXT
--num_comments INTEGER
--time_filter [hour|day|week|month|year|all]
--wait_time FLOAT
-h, --help Show this message and exit.
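As a rough sketch, the options listed above could be assembled into a command from Python. The helper `build_download_command` is hypothetical and not part of PySOC; it uses only flag names that appear in the help output, and the meaning assigned to each flag (number of posts to fetch, delay between requests) is an assumption based on the flag names.

```python
import shlex

# Choices copied from the --time_filter option in the help output.
TIME_FILTERS = {"hour", "day", "week", "month", "year", "all"}


def build_download_command(num_results, time_filter="week", wait_time=None):
    """Build an argument list for `pysoc ask_reddit download`.

    Hypothetical helper: not part of PySOC.
    """
    if time_filter not in TIME_FILTERS:
        raise ValueError(f"time_filter must be one of {sorted(TIME_FILTERS)}")

    cmd = [
        "pysoc", "ask_reddit", "download",
        "--num_results", str(num_results),
        "--time_filter", time_filter,
    ]
    # --wait_time takes a float; only pass it when explicitly requested.
    if wait_time is not None:
        cmd += ["--wait_time", str(wait_time)]
    return cmd


if __name__ == "__main__":
    cmd = build_download_command(100, "week", wait_time=1.0)
    # Print the command rather than running it; pass `cmd` to
    # subprocess.run(cmd, check=True) to execute it with pysoc installed.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems if a value such as a file name passed to `--fname` contains spaces.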
See the Python notebook here. This example illustrates how to build a sequence-to-sequence neural network and train it on AskReddit question-answer pairs.
See the Contributing Guide.
This project was created for Hack Illinois 2017.