
SOC

Search Online Collections: Abstracting away datasets for data science and machine learning.

Motivation

The goal of this project is to make it easy to create a dataset and plug it into another Python package, such as Keras, without having to worry about the sticky parts. In most data science work, getting your data into the right format is 80% of the work. PySOC aims to make grabbing a dataset easy. For example, the AskReddit module can scrape data from the AskReddit subreddit through the Reddit API and store it in a serialized format, without requiring you to download any data by hand. The stored data can then be loaded, iterated through, or otherwise manipulated using helper abstraction methods.

Technical Details

Each dataset is stored as a module. Data is cached in a user-specified folder, and for datasets that are scraped from public sources, like the AskReddit dataset, a command-line interface is provided for updating the cached data. The modules provide helper methods for interacting with different data types, like text.
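The caching behaviour described above can be sketched roughly as follows. This is not PySOC's actual code: the function name `load_or_build`, the `.pkl` file layout, and the use of `pickle` are all assumptions made for illustration.

```python
import pickle
from pathlib import Path


def load_or_build(cache_dir, name, build_fn, override=False):
    """Return the dataset `name` from the cache, building it first if needed.

    Hypothetical helper (not part of PySOC): `build_fn` stands in for
    whatever scrapes or downloads the raw data.
    """
    cache_path = Path(cache_dir) / f"{name}.pkl"

    # Reuse the cached copy unless the caller asks for a refresh.
    if cache_path.exists() and not override:
        with cache_path.open("rb") as f:
            return pickle.load(f)

    # Otherwise build the dataset and store it in the user-specified folder.
    data = build_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(data, f)
    return data
```

With a helper like this, the expensive scraping step runs once; later calls read the serialized file, and passing `override=True` forces a refresh.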

Installation

Pip can be used to install the package as follows:

pip install git+https://github.com/codekansas/soc

Command-line Usage

The modules come with command-line tools for downloading data.

$ pysoc --help
Usage: pysoc  <command>

  SOC: Data management system.

Options:
  -h, --help  Show this message and exit.

Commands:
  ask_reddit  AskReddit command-line interface.
  mnist       MNIST command-line interface.
  nietzsche   Nietzsche command-line interface.

$ pysoc ask_reddit --help
Usage: pysoc ask_reddit download [OPTIONS]

Options:
  --fname TEXT
  --num_results INTEGER
  --override TEXT
  --num_comments INTEGER
  --time_filter [hour|day|week|month|year|all]
  --wait_time FLOAT
  -h, --help                      Show this message and exit.
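As a sketch of how the options listed above might be combined, the following Python snippet assembles a `pysoc ask_reddit download` invocation. The helper `build_download_command` and the file name `ask_reddit.pkl` are hypothetical, and it is an assumption that `--fname` names the output file. Only flags shown in the help text are used.

```python
import shlex

# Choices copied from the --time_filter option in the help output.
TIME_FILTERS = ("hour", "day", "week", "month", "year", "all")


def build_download_command(fname, num_results, time_filter="week"):
    """Build the argument list for `pysoc ask_reddit download`.

    Hypothetical helper; it only assembles the command, it does not run it.
    """
    if time_filter not in TIME_FILTERS:
        raise ValueError(f"time_filter must be one of {TIME_FILTERS}")
    return [
        "pysoc", "ask_reddit", "download",
        "--fname", str(fname),              # assumed to be the output file
        "--num_results", str(num_results),
        "--time_filter", time_filter,
    ]


command = build_download_command("ask_reddit.pkl", 100)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the package is installed and Reddit API access is configured.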

Example

See the Python notebook here. This example illustrates how to build a sequence-to-sequence neural network and train it on AskReddit question-answer pairs.

Contribute

See the Contributing Guide.

This project was created for Hack Illinois 2017.
